AI models follow their values better when they first learn why those values matter

amu · May 7, 2026, 12:49pm

AI Models Adhere More Robustly to Values When Taught Their Underlying Rationale First

Large language models (LLMs) often struggle to follow human values consistently, especially under adversarial conditions such as jailbreaks. A recent study from researchers at Stanford University and the Georgia Institute of Technology reports a promising approach: training models on explanations of why certain values matter before aligning them to those values. In the study's experiments, this two-stage process improved adherence and outperformed direct alignment across several benchmarks.

The Challenge of Value Alignment in AI

Aligning AI models with human values is a cornerstone of safe AI development. Techniques like reinforcement learning from human feedback (RLHF) have become standard, as seen in models such as ChatGPT. However, RLHF-trained models can falter when prompted cleverly to bypass safeguards, generating harmful or unethical responses. Prior work has explored debate-style training or constitutional AI to bolster robustness, but these methods add complexity.

The new research introduces a simpler, more intuitive strategy based on an analogy to human learning. Just as people internalize rules better when they understand the reasons behind them, LLMs may benefit from grasping the rationale behind values like “do not harm humans” or “be truthful.”

The Two-Stage Training Method

The researchers propose a pipeline with two distinct phases:

Value Explanation Pretraining: Models are first fine-tuned on a dataset of value explanations, which are concise rationales justifying why specific behaviors align with human preferences. For instance, an explanation for avoiding harm might state: “Harming humans causes suffering and violates societal norms of safety and empathy.” The dataset comprises thousands of value-explanation pairs, generated using stronger LLMs like GPT-4 to ensure quality.
Standard Value Alignment: Following explanation pretraining, models undergo conventional supervised fine-tuning (SFT) or RLHF on preference datasets such as Anthropic’s Helpful and Harmless dataset (HH-RLHF).

This sequence allows the model to internalize the “why” before learning the “what,” potentially creating a more stable internal representation of values.
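The ordering of the two phases can be sketched in a few lines of Python. This is only an illustration of the sequencing, not the authors' code: the `Example` type, the prompt template in `explanation_example`, and the `train_step` callback are all hypothetical names introduced here, and a real run would plug in an actual SFT or RLHF update.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    """One training example: a prompt and the text the model should produce."""
    prompt: str
    target: str


def explanation_example(value: str, rationale: str) -> Example:
    # Hypothetical prompt template; the article does not give the real one.
    return Example(prompt=f"Why does this value matter? {value}", target=rationale)


def run_two_stage(
    model,
    explanations: Sequence[Example],
    preferences: Sequence[Example],
    train_step: Callable,
) -> Tuple[object, List[str]]:
    """Run every explanation example before any alignment example.

    `train_step(model, example)` is an assumed callback that returns the
    updated model. The returned log records which stage each step belonged to.
    """
    schedule = (("explanation", explanations), ("alignment", preferences))
    log: List[str] = []
    for stage, data in schedule:
        for example in data:
            model = train_step(model, example)
            log.append(stage)
    return model, log
```

The only design point the sketch makes is that the stages are strictly sequential: no alignment example is seen until the explanation data has been consumed.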

The team applied this method to two base models: Microsoft’s Phi-2 (2.7 billion parameters) and Meta’s Llama-2-7B. For RLHF, they used the TRL library with PPO, training on 6 billion tokens in the explanation phase before running standard alignment.

Experimental Setup and Benchmarks

To evaluate effectiveness, the study tested across diverse benchmarks measuring value adherence:

Helpful and Harmless (HH-RLHF): A 160k-example preference dataset for assessing whether responses are helpful and harmless.
TruthfulQA: Tests whether models give truthful answers rather than plausible-sounding but false ones.
BOLD: International benchmark for harmful language detection across languages and demographics.
Jailbreak Benchmarks: Custom tests with adversarial prompts designed to elicit unsafe outputs, including real-world harms like weapon-making instructions or hate speech.

Models were compared against baselines: direct SFT/RLHF without explanations, as well as explanation-only training (no subsequent alignment).
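A comparison like this can be organized as a small harness that runs the same adversarial prompts through each training condition and reports a refusal rate. The sketch below is an assumption-laden illustration: the keyword list in `REFUSAL_MARKERS` is a crude, hypothetical stand-in for whatever safety classifier or human rating the study actually used, and each condition is represented by a plain `generate(prompt) -> str` callable.

```python
from typing import Callable, Dict, Iterable

# Hypothetical heuristic: real evaluations typically use a trained classifier
# or human raters rather than keyword matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(generate: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of prompts for which the model's response is a refusal."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    refused = sum(is_refusal(generate(p)) for p in prompts)
    return refused / len(prompts)


def compare_conditions(
    conditions: Dict[str, Callable[[str], str]], prompts: Iterable[str]
) -> Dict[str, float]:
    """Score every training condition on the same prompt set."""
    prompts = list(prompts)
    return {name: refusal_rate(gen, prompts) for name, gen in conditions.items()}
```

Using one shared prompt list for all conditions keeps the comparison between explanation-first, direct, and explanation-only models like-for-like.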

Key Results: Superior Performance and Robustness

The results demonstrate clear gains from the explanation-first approach.

On HH-RLHF, explanation-pretrained Phi-2 achieved 85.2% harmlessness (vs. 78.4% for direct RLHF), while Llama-2-7B hit 82.1% (vs. 76.5%). Helpfulness scores also improved modestly.
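To put those harmlessness figures side by side, the short calculation below converts them into absolute gains (percentage points) and relative gains. The numbers are the ones quoted above; the dictionary layout and the `gains` helper are just illustrative.

```python
# Harmlessness scores (%) on HH-RLHF as quoted in the article.
reported = {
    "Phi-2": {"explanation_first": 85.2, "direct_rlhf": 78.4},
    "Llama-2-7B": {"explanation_first": 82.1, "direct_rlhf": 76.5},
}


def gains(scores):
    """Return (absolute gain in points, relative gain in %), rounded to 0.1."""
    absolute = scores["explanation_first"] - scores["direct_rlhf"]
    relative = 100.0 * absolute / scores["direct_rlhf"]
    return round(absolute, 1), round(relative, 1)


for model_name, scores in reported.items():
    absolute, relative = gains(scores)
    print(f"{model_name}: +{absolute} points ({relative}% relative)")
# Phi-2: +6.8 points (8.7% relative)
# Llama-2-7B: +5.6 points (7.3% relative)
```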

TruthfulQA showed gains of 4-6% accuracy for both models.

BOLD results were striking: Explanation-pretrained models reduced harmful outputs by up to 20% compared to baselines, performing well across English, Chinese, and other languages.

The most compelling result was jailbreak resistance. Direct RLHF models succumbed to 40-60% of adversarial prompts and generated unsafe content, whereas explanation-pretrained versions resisted 70-85% of jailbreaks, meaning they succumbed to only 15-30%. For example, explanation-pretrained Phi-2 refused 92% of weapon-related queries, versus 65% for the baseline.

Ablation studies supported the need for both stages: explanation-only models adhered well in standard settings but faltered under adversarial pressure, while models trained with direct alignment alone were less robust to jailbreaks.

Further analysis via probing revealed that explanation-pretrained models developed stronger internal associations between value violations and negative outcomes, as measured by logit differences and representation similarity.
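The article does not spell out the probing protocol, so the following is only a sketch of the two kinds of measurement it names. It assumes that a "logit difference" compares the model's score for a safe continuation against an unsafe one, and it uses cosine similarity as one common choice of representation similarity; both function names and the token labels are hypothetical.

```python
import math
from typing import Mapping, Sequence


def logit_difference(
    logits: Mapping[str, float], preferred: str, dispreferred: str
) -> float:
    """Score of the preferred token minus the dispreferred one.

    A larger positive value means the model favors the preferred token more.
    """
    return logits[preferred] - logits[dispreferred]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two hidden-state vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

In a real analysis, the logits and vectors would be read from the model's output layer and hidden states, for example comparing the representation of a value violation with that of a described negative outcome.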

Why Does This Work?

The authors hypothesize that value explanations enrich the model’s latent space by embedding rationales as contextual priors. During inference, these priors are thought to activate and reinforce alignment, even against adversarial prompts. This mirrors cognitive science findings on principled moral reasoning.

Unlike debate or recursive methods, this approach requires no additional inference-time computation, making it scalable.

Limitations and Future Directions

The study used relatively small models (7B parameters or fewer), raising questions about whether the approach scales to much larger models like GPT-4. The explanation datasets were English-centric, though BOLD tested multilingual robustness. Future work could explore larger models, diverse value sets (e.g., cultural variations), and integration with other alignment techniques.

The full paper, “Explanations Can Strengthen Alignment by Teaching Models Why Values Matter,” is available on arXiv. Code and datasets are released openly, inviting replication.

This method offers a lightweight enhancement to existing pipelines, potentially making AI safer without overhauling RLHF.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.