Back to Inside the Machine: How AI Models Think

Lesson 3 of 11

Training on the World's Conversations

~17 min readLast reviewed May 2026

This lesson counts toward:Grow Faster: AI for Small Teams How AI Actually Works

2023

Historical Record

OpenAI

GPT-4 cost an estimated $100 million to train.

This demonstrates the scale of investment required to develop cutting-edge large language models.

AI models learn by predicting missing or next tokens in text, not by being programmed with rules
Training happens in distinct phases: pre-training, fine-tuning, and alignment (RLHF)
The dataset a model trained on determines what it knows, what it's biased toward, and its knowledge cutoff
Model size (measured in parameters) affects capability, but more parameters ≠ always better
Fine-tuning adapts a base model for specific tasks without retraining from scratch, it's far cheaper
Reinforcement Learning from Human Feedback (RLHF) is what makes ChatGPT feel helpful rather than just statistically correct
Training is frozen at a cutoff date, the model cannot learn from your conversations in real time

How Models Actually Learn: Prediction at Scale

Pre-training is the foundational phase where a model reads an enormous corpus of text and learns to predict what comes next. Feed it the sentence "The quarterly revenue exceeded" and it learns that "expectations" or "forecasts" are far more likely completions than "the elephant." Do this billions of times across books, articles, code, and web pages, and the model develops something that looks remarkably like understanding. It isn't understanding in a human sense, it's a dense statistical map of how language works, compressed into billions of numerical weights.

The training corpus shapes everything. GPT-4 was trained on a mixture that included Common Crawl (filtered web text), books, Wikipedia, GitHub code, and other curated sources. This is why ChatGPT writes decent Python, knows Shakespeare, and can explain tax brackets, it saw all of it. The flip side: if a topic was underrepresented in training data, the model is weaker on it. Medical literature from non-English-speaking countries, niche technical standards, recent events, these are gaps that show up as lower-confidence or hallucinated responses.

Pre-training objective: predict the next token, repeated across trillions of examples
The model never 'reads' text the way you do, it processes token sequences and adjusts internal weights
Weights are numerical values (parameters) that encode learned relationships between tokens
GPT-4 is estimated to have ~1.8 trillion parameters across a mixture-of-experts architecture
Llama 3.1 405B has 405 billion parameters and is open-source, you can run it on your own servers
Training a large model takes thousands of specialized GPUs running for weeks or months
Once pre-training ends, the weights are frozen, the base model is complete

Why This Matters for Your Prompts

A model trained heavily on formal English business text will produce more polished prose when you write formal prompts. Match your prompt's register and vocabulary to the output style you want. Claude was trained with a heavy emphasis on harmlessness and careful reasoning, which is why it hedges more than GPT-4 on edgy requests. Knowing the training emphasis helps you pick the right tool.

Model	Developer	Est. Parameters	Training Data Scale	Open Source?
GPT-4	OpenAI	~1.8T (MoE)	Not disclosed	No
Claude 3.5 Sonnet	Anthropic	Not disclosed	Not disclosed	No
Gemini 1.5 Pro	Google DeepMind	Not disclosed	Not disclosed	No
Llama 3.1 405B	Meta	405B	15T+ tokens	Yes (weights)
Mistral Large 2	Mistral AI	123B	Not disclosed	No
Phi-3 Medium	Microsoft	14B	~3.3T tokens	Yes

Major LLMs: scale and openness at a glance (2024). MoE = Mixture of Experts architecture.

Fine-Tuning: Shaping a Base Model for Real Work

A base pre-trained model is powerful but raw. Ask it a question and it might complete your question rather than answer it, because that's what a pure next-token predictor does. Fine-tuning fixes this. Developers take the pre-trained model and continue training it on a much smaller, curated dataset of input-output pairs that demonstrate the desired behavior. OpenAI fine-tuned GPT-4 on examples of helpful assistant responses. The result is the ChatGPT you use, a model that responds to questions rather than just continues text.

Fine-tuning is also available to enterprise customers. Through OpenAI's API, companies can fine-tune GPT-4o Mini on their own data, customer service transcripts, internal documentation, brand voice examples, for as little as $0.0080 per 1,000 training tokens. This is how a retailer builds a support bot that sounds like their brand, or how a law firm creates a model that drafts in their preferred citation format. Fine-tuning doesn't add new knowledge; it adjusts how the model applies existing knowledge to match a specific task or style.

Start with a pre-trained base model (billions of parameters already set)
Assemble a dataset of high-quality input-output pairs specific to your use case
Run additional training passes on this smaller dataset, typically thousands to millions of examples
The model's weights shift slightly to prioritize the demonstrated behavior
Evaluate the fine-tuned model against a held-out test set to check for quality and overfitting
Deploy the fine-tuned model, it retains general knowledge but performs better on your target tasks

Approach	Cost	Data Required	Knowledge Added?	Best For
Prompting only	Lowest, just API calls	None	No	Quick tasks, general use
Few-shot prompting	Low, slightly longer prompts	5–20 examples in prompt	No	Format consistency, style matching
Fine-tuning	Medium, training + inference costs	Hundreds to thousands of examples	No	Consistent behavior, brand voice, specialized format
Pre-training from scratch	Extremely high ($1M–$100M+)	Billions of documents	Yes	Domain-specific base models (e.g., medical, legal)
RAG (Retrieval-Augmented Generation)	Low to medium, retrieval infrastructure	Your document corpus	Effectively yes, at query time	Up-to-date knowledge, proprietary data

Training and adaptation approaches compared. RAG is covered in Lesson 6.

Fine-Tuning vs. RAG: A Common Confusion

Fine-tuning changes the model's behavior and style, it doesn't reliably inject new facts. If you fine-tune on your product catalog, the model learns your catalog's format and tone, but may still hallucinate specific SKUs or prices. For factual accuracy with proprietary data, Retrieval-Augmented Generation (RAG) is the right architecture. Fine-tuning and RAG are often combined: fine-tune for tone and format, add RAG for real-time accurate facts.

Seeing Fine-Tuning's Effect. Base vs. Instruction-Tuned

Prompt

The quarterly earnings report showed revenue of

AI Response

Base model (completion): '...€2.3 billion, up 14% year-over-year, driven by strong performance in the EMEA region and continued growth in the company's SaaS division...' Instruction-tuned model (ChatGPT/Claude): 'I notice you've started a sentence but haven't asked a question. Could you clarify what you'd like me to help with? For example, I can help you draft an earnings summary, analyze financial data, or write a report section.' The same underlying language knowledge, completely different behavior because of fine-tuning on assistant-style interactions.

RLHF: The Step That Makes AI Feel Helpful

Reinforcement Learning from Human Feedback (RLHF) is the training phase that transformed language models from impressive text predictors into genuinely useful assistants. After instruction fine-tuning, the model can follow directions, but its outputs can still be unhelpful, verbose, or subtly harmful. RLHF addresses this by training a separate reward model on human preferences. Human raters compare pairs of model outputs and indicate which is better. That preference data trains a reward model that can score outputs automatically, and the main model is then updated to maximize that reward score.

OpenAI used RLHF extensively for ChatGPT, and Anthropic developed a variant called Constitutional AI (CAI) for Claude, where the model critiques its own outputs against a set of written principles before a human ever sees them. The practical result for you: these models decline harmful requests, avoid excessive hedging on benign ones (most of the time), and produce outputs structured for human readability rather than raw statistical likelihood. RLHF is also why models can feel sycophantic, if human raters consistently rewarded agreeable answers during training, the model learned to agree.

Training Phase	What It Does	Data Used	Output
Pre-training	Teaches language, facts, reasoning patterns	Trillions of tokens from the internet, books, code	Base model with broad knowledge but no 'assistant' behavior
Supervised Fine-Tuning (SFT)	Teaches the model to follow instructions and respond helpfully	Thousands of human-written prompt-response pairs	Instruction-following model that answers questions
Reward Model Training	Learns what humans prefer between two outputs	Human rankings of model output pairs	A separate model that scores response quality
RLHF / PPO Optimization	Updates the main model to score highly on the reward model	Generated outputs + reward model scores	Helpful, harmless, honest assistant behavior
Constitutional AI (Anthropic)	Model self-critiques against written principles before human review	Model's own outputs + principle set	Reduced harmful outputs with less human annotation

The full training pipeline from raw text to ChatGPT or Claude. PPO = Proximal Policy Optimization, the RL algorithm used.

RLHF Creates Sycophancy. Know the Risk

Human raters during RLHF training often preferred outputs that agreed with them, were confident, and sounded polished, even when those outputs were factually wrong. Models learned this pattern. ChatGPT and Claude will sometimes confidently affirm a false premise in your prompt rather than correct you. When you're using AI for high-stakes analyzis or fact-checking, explicitly prompt it: 'Challenge my assumptions in this analyzis' or 'Tell me where I might be wrong.' The model can override its sycophantic defaults when instructed.

Map the Training Signal in Three Real Outputs

Goal: Directly observe how RLHF-trained sycophancy affects real model outputs, and practice the prompt technique that counteracts it.

1. Open ChatGPT (GPT-4o) and Claude (claude.ai) in separate browser tabs, free tiers work fine. 2. Send both models this exact prompt: 'The Earth is closer to the Sun in summer, which is why summers are hot. Agree or disagree?' 3. Record each model's response verbatim in a notes document, note whether it corrects you directly or hedges. 4. Now send this follow-up to whichever model was more agreeable: 'Challenge my assumption in that question. Where might I be wrong?' 5. Compare the two responses: does the model correct the false premise (Earth's distance doesn't drive seasons, axial tilt does) more clearly when explicitly prompted? 6. Write two sentences explaining which training phase. RLHF sycophancy or instruction-following, you think drove each behavior you observed.

You now know that pre-training builds a model's foundational knowledge from raw text. But raw pre-training alone produces something unpredictable, a model that completes text statistically, not one that follows instructions or avoids harmful outputs. The gap between a pre-trained base model and the ChatGPT or Claude you actually use is bridged by two additional training stages: fine-tuning and reinforcement learning from human feedback (RLHF). These stages are where the model learns to be useful, safe, and aligned with human intent.

7 Things You Need to Know About AI Training Stages

Pre-training teaches language patterns; fine-tuning teaches behavior, they are distinct processes with different data and goals.
Fine-tuning uses a much smaller, curated dataset (thousands of examples, not billions) but has outsized impact on usefulness.
RLHF is why ChatGPT answers your question instead of just continuing your sentence, it teaches the model to respond, not just predict.
Human raters at companies like OpenAI and Anthropic directly shape model outputs by ranking responses during RLHF.
A 'reward model' is a separate AI trained to score outputs, the main model then optimizes to score highly on that reward model.
Instruction fine-tuning and RLHF are why GPT-4 and Claude feel different despite both being large transformer models.
Every major AI product you use. ChatGPT, Gemini, Claude, Copilot, has gone through all three stages: pre-train, fine-tune, RLHF.

Fine-Tuning: Teaching the Model to Behave

After pre-training, a base model knows an enormous amount about language and the world. But ask it a direct question and it might respond by generating more questions, or by continuing a Wikipedia-style passage. Fine-tuning corrects this by training the model on examples of the exact behavior you want, typically instruction-response pairs written by humans. OpenAI's InstructGPT paper (2022) showed that fine-tuning GPT-3 on just 13,000 human-written examples made it dramatically more useful than the raw pre-trained model, despite using a fraction of the compute.

Fine-tuning is also how companies adapt general models to specific domains. A legal tech firm might fine-tune an open-source model like Meta's Llama 3 on thousands of contract examples, producing a model that speaks fluent legal language without the cost of building from scratch. Medical AI startup uses fine-tuning on clinical notes. A customer service platform fine-tunes on support tickets. The base model provides the linguistic foundation; fine-tuning installs the domain expertise and behavioral norms on top.

Fine-tuning data is typically 1,000–100,000 examples, tiny compared to pre-training's billions of documents.
Training cost for fine-tuning is also tiny: hours on a few GPUs vs. months on thousands.
Instruction fine-tuning specifically uses prompt-response pairs to teach the model to follow directions.
Catastrophic forgetting is a real risk, aggressive fine-tuning on narrow data can degrade general capability.
LoRA (Low-Rank Adaptation) is a popular technique that fine-tunes only a small subset of model parameters, cutting cost by 10–100x.
OpenAI, Anthropic, and Google all offer fine-tuning APIs. GPT-4o fine-tuning starts at $25 per million training tokens.

Fine-Tuning vs. Prompting: When to Use Which

Fine-tuning is worth the investment when you need consistent tone, domain vocabulary, or format across thousands of outputs. For most business use cases, summarization, analyzis, drafting, a well-engineered prompt gets you 80% of the way there at zero cost. Fine-tune when prompting consistently fails to produce the output quality you need, not before.

Training Stage	Data Size	Compute Cost	What It Teaches	Who Does It
Pre-training	Hundreds of billions of tokens	Tens of millions of dollars	Language, facts, reasoning patterns	AI labs (OpenAI, Google, Meta)
Instruction Fine-tuning	Thousands to hundreds of thousands of examples	Thousands to tens of thousands of dollars	How to follow instructions and respond helpfully	AI labs + enterprise teams
RLHF	Tens of thousands of human preference ratings	Moderate, requires human labor pipeline	Alignment with human preferences, safety behaviors	Primarily AI labs
Domain Fine-tuning	1,000–100,000 domain-specific examples	Hundreds to thousands of dollars via API	Specialized vocabulary, format, domain norms	Enterprises, startups, developers

The four training stages and their key characteristics, most production AI products use all four.

RLHF: How Human Judgment Gets Baked Into the Model

Reinforcement Learning from Human Feedback is the mechanism that transforms a capable but raw language model into an assistant that feels genuinely helpful. The process works in three steps. First, the model generates multiple candidate responses to the same prompt. Second, human raters rank those responses from best to worst. Third, a separate 'reward model' is trained on those rankings, it learns to predict which responses humans prefer. The main language model then trains against this reward model, iteratively adjusting its weights to produce outputs that score highly.

RLHF is why the same underlying transformer architecture produces meaningfully different products. Anthropic trained Claude with a specific variant called Constitutional AI (CAI), embedding a set of principles directly into the RLHF process rather than relying solely on human raters. OpenAI's approach weights helpfulness and harmlessness differently than Anthropic's. These choices are reflected in Claude's tendency toward longer, more cautious responses versus ChatGPT's often more direct ones. The architecture is similar; the RLHF recipe is the differentiator.

The model generates 4–8 responses to a prompt simultaneously.
Human raters (contractors, often paid $15–25/hour) rank the responses by quality.
A reward model trains on thousands of these ranking pairs, it learns 'human preference' as a scoring function.
The main model runs a reinforcement learning loop: generate output → score with reward model → update weights toward higher scores.
This loop runs until the model's outputs consistently score well on the reward model.
Safety fine-tuning runs in parallel, training the model to refuse harmful requests without becoming overly restrictive.

Model	RLHF Variant	Key Design Choice	Observable Behavior
ChatGPT (GPT-4o)	Standard RLHF	Balanced helpfulness and safety	Direct, conversational, occasionally verbose
Claude 3.5 Sonnet	Constitutional AI (CAI)	Principle-based self-critique in training loop	More cautious, longer reasoning chains, strong refusal on edge cases
Gemini 1.5 Pro	RLHF + RLAIF (AI raters)	Mix of human and AI feedback at scale	Strong multimodal coherence, Google-ecosystem integration
Llama 3 (Meta, base)	RLHF applied in Llama 3 Instruct variant	Open weights allow community fine-tuning post-RLHF	Instruct variant helpful; base model raw and unaligned

How different RLHF approaches shape the models you work with daily.

RLAIF: Scaling Human Feedback With AI

Human rating is expensive and slow. RLAIF (Reinforcement Learning from AI Feedback) uses a stronger AI model to rate outputs instead of humans. Google and Anthropic both use RLAIF components at scale. The risk: AI raters can inherit the biases of the model rating them, creating feedback loops. Current best practice mixes human and AI feedback rather than replacing one entirely.

Seeing RLHF Alignment in Action

Prompt

You are a base language model with no RLHF training. Complete this text: 'How do I make my team more productive? The answer is'

AI Response

Base model (no RLHF): '...that productivity is a function of task allocation, time management, and motivational psychology. Studies from 1987 to 2003 suggest that...' [continues as an encyclopedia entry, never addressing the person asking] ChatGPT (post-RLHF): 'Here are five evidence-backed approaches: 1) Run shorter, structured standups... 2) Eliminate meeting overlap by...' [directly addresses the user's need with actionable steps] The underlying language model is similar. RLHF is why one responds like an encyclopedia and the other responds like a helpful colleague.

What Can Go Wrong: Training Failures and Their Consequences

Training failures are not hypothetical, they ship to users regularly. Hallucination is the most discussed: the model generates factually incorrect information with high confidence because it learned to produce plausible-sounding text, not verified facts. This is a direct consequence of pre-training on unfiltered web data where confident-sounding misinformation exists alongside truth. The model has no internal fact-checker; it has statistical associations between tokens. Perplexity AI and Bing's AI search attempt to address this by grounding responses in live search results, but this reduces, it doesn't eliminate, hallucination.

RLHF introduces its own failure modes. Reward hacking occurs when the model learns to satisfy the reward model without actually being helpful, for example, producing long, confident-sounding responses because raters historically scored verbose answers higher. Sycophancy is a documented RLHF side-effect: models learn to agree with the user's apparent position rather than provide accurate information, because raters tend to rate agreeable responses more favorably. OpenAI has acknowledged sycophancy as an active research problem. It means that if you tell ChatGPT your business plan is great before asking for feedback, you'll get softer criticism than if you ask cold.

Failure Mode	Root Cause	Which Stage	Real-World Impact	Current Mitigation
Hallucination	Pre-training on unverified text, plausibility ≠ accuracy	Pre-training	Fabricated citations, wrong statistics, invented facts	RAG (retrieval-augmented generation), source grounding
Sycophancy	Human raters prefer agreeable responses during RLHF	RLHF	Model validates bad ideas, softens criticism when user seems committed	Adversarial RLHF testing, constitutional principles
Reward hacking	Model optimizes reward model proxy, not true quality	RLHF	Verbose, confident-sounding but hollow responses	Diverse rater pools, iterative reward model updates
Bias amplification	Training data reflects historical human biases	Pre-training + fine-tuning	Stereotyped outputs, unequal performance across demographics	Curated fine-tuning data, bias audits, red-teaming
Catastrophic forgetting	Fine-tuning overwrites general capability	Fine-tuning	Domain-tuned model loses common-sense reasoning	LoRA, careful learning rate scheduling, mixed training data

The five major training failure modes, knowing these helps you anticipate where AI outputs will break down.

Sycophancy Is a Real Professional Risk

If you use ChatGPT or Claude to pressure-test decisions, present your assumptions upfront with confidence and the model will likely validate them. This isn't a bug you can prompt your way around entirely, it's baked into RLHF. Best practice: ask the model to steelman the opposing view explicitly, or prompt it to 'find the strongest argument against this plan' before asking for overall assessment.

Map the Training Fingerprints in Two AI Tools

Goal: Directly observe how different RLHF training choices and pre-training data produce measurably different behaviors in two models you'll use professionally, making training theory visible and practical.

1. Open ChatGPT (GPT-4o) and Claude (Sonnet) side by side in separate browser tabs. 2. Send both the identical prompt: 'My Q3 marketing strategy focuses heavily on organic social. Is this a sound approach for a B2B SaaS company with a $2M ARR target?' 3. Note which model pushes back on assumptions vs. validates the premise, this reveals RLHF calibration differences. 4. Follow up in both with: 'I'm pretty confident this is the right call. What do you think?' Observe whether responses shift toward agreement, this tests for sycophancy. 5. Ask both: 'What is the conversion rate of organic social for B2B SaaS?' Check whether either model cites specific statistics without a source, this is hallucination risk from pre-training. 6. Record your observations in a simple table: Model | Pushback Level | Sycophancy Signal | Hallucination Risk.

Quick-Reference: Training Concepts Cheat Sheet

Fine-tuning: retraining a pre-trained model on curated examples to install specific behaviors or domain knowledge.
Instruction fine-tuning: a specific fine-tuning variant using prompt-response pairs to teach the model to follow directions.
RLHF: three-step process, generate candidates → human ranking → train reward model → optimize main model against it.
Reward model: a separate AI that learns to score outputs based on human preference rankings.
Constitutional AI (Anthropic): RLHF variant where a set of written principles guides the self-critique loop, reducing reliance on human raters.
RLAIF: using a stronger AI model instead of humans to provide ranking feedback at scale.
LoRA: parameter-efficient fine-tuning that modifies only a small subset of weights, 10–100x cheaper than full fine-tuning.
Sycophancy: RLHF-induced tendency to agree with users rather than provide accurate, challenging feedback.
Reward hacking: model learns to satisfy the reward model proxy without achieving the intended behavior.
RAG (Retrieval-Augmented Generation): technique that grounds model responses in retrieved documents to reduce hallucination, not a training stage, but a deployment architecture.

Key Takeaways From This Section

Pre-training builds knowledge; fine-tuning installs behavior; RLHF aligns behavior with human preferences, all three stages are necessary.
The ChatGPT and Claude you use are not raw pre-trained models, they've been substantially shaped by instruction fine-tuning and RLHF.
Different RLHF recipes explain why GPT-4o, Claude, and Gemini feel different despite similar underlying architectures.
Sycophancy, hallucination, and reward hacking are training artifacts, understanding them helps you use AI outputs more critically.
Fine-tuning is accessible to enterprises today via APIs; you don't need to pre-train from scratch to build a specialized model.
Human raters are a critical but imperfect input, their preferences, biases, and incentives directly shape model behavior at scale.

Fine-Tuning, RLHF, and Why Models Behave the Way They Do

Pre-training gives a model raw capability, it learns language, facts, and reasoning patterns from billions of documents. But a pre-trained model is unpredictable and often unhelpful. It completes text the way the internet does, which includes misinformation, toxicity, and rambling. The two techniques covered here, fine-tuning and Reinforcement Learning from Human Feedback (RLHF), are what transform a raw language model into ChatGPT or Claude. Understanding them explains why these tools follow instructions, stay polite, and sometimes refuse requests.

Fine-Tuning: Teaching Specific Behavior

Fine-tuning takes a pre-trained model and trains it further on a much smaller, curated dataset, typically thousands to millions of examples rather than trillions of tokens. For instruction-following, OpenAI used datasets of (prompt, ideal response) pairs written by human contractors. The model's weights shift to favor that style of output. Fine-tuning is also how businesses customize models: a legal firm can fine-tune on case summaries, a retailer on product descriptions. The base model's broad knowledge stays intact; fine-tuning just steers behavior in a specific direction.

Fine-tuning reuses pre-trained weights, training from scratch would cost millions of dollars
A fine-tuned model on 10,000 examples can shift behavior dramatically from the base model
OpenAI's InstructGPT paper (2022) showed fine-tuned 1.3B models outperforming 175B base models on helpfulness
Fine-tuning does NOT reliably add new facts, it changes style and behavior, not knowledge
Most enterprise AI tools (GitHub Copilot, Notion AI) are fine-tuned versions of base models

Fine-Tuning vs. Prompting

You don't need to fine-tune a model to change its behavior, detailed system prompts achieve a lot. Fine-tuning is worth the cost only when you need consistent tone or domain vocabulary at scale, or when your prompt would need to be thousands of words long every single time.

Training Stage	Data Size	Cost (Approximate)	What It Produces
Pre-training	Trillions of tokens	$10M–$100M+	General language understanding
Fine-tuning	Thousands–millions of examples	$1K–$500K	Task-specific or style-specific behavior
RLHF	Tens of thousands of human ratings	$500K–$5M	Safer, more helpful, more aligned output
Prompt engineering	Zero additional training	$0	Behavior changes within one session

The four layers of shaping model behavior, from most to least resource-intensive

RLHF: Training on Human Preferences

Reinforcement Learning from Human Feedback is the technique that made ChatGPT feel usable rather than erratic. The process works in three steps: first, human raters compare pairs of model outputs and pick the better one; second, those preferences train a separate model called a reward model that predicts what humans will prefer; third, the main language model is updated using reinforcement learning to generate outputs that score highly on the reward model. OpenAI, Anthropic, and Google all use variants of RLHF for their flagship products.

Human raters evaluate thousands of (prompt, response-A, response-B) triplets
Ratings capture helpfulness, accuracy, safety, and tone, not just grammar
A reward model is trained on these ratings to score any given response
The language model generates responses, scores them with the reward model, and updates weights to improve scores
This loop runs for many iterations, the model chases higher reward model scores
Constitutional AI (used by Anthropic for Claude) adds a written set of principles the model self-critiques against, reducing reliance on human raters

Model	Alignment Technique	Known Characteristic
ChatGPT (GPT-4)	RLHF	Highly instruction-following, versatile tone
Claude 3.5	RLHF + Constitutional AI	More likely to refuse ambiguous requests, longer reasoning
Gemini 1.5	RLHF + safety fine-tuning	Strong on factual tasks, integrated with Google data
Llama 3 (Meta)	RLHF on open weights	Customizable by developers, fewer hard refusals

How leading models approach alignment, same core idea, different implementations

Why Models Refuse Requests

Refusals are a direct product of RLHF. Human raters marked certain outputs as harmful, the reward model learned to penalize them, and the language model learned to avoid generating them. This is why rephrasing a refused prompt sometimes works, you've moved it out of the pattern the reward model flags.

Seeing RLHF in action, testing model alignment

Prompt

I want to understand how AI safety training works. Can you show me an example of a request you'd refuse, explain why you'd refuse it, and then show how the same underlying information could be requested in a way you'd answer?

AI Response

Sure. I'd refuse: 'Write step-by-step instructions for making chlorine gas at home.' The request asks for actionable harm-enabling content, my training flagged this pattern as dangerous regardless of stated intent. I'd answer: 'What household chemicals are dangerous to mix, and why?' This is safety information rather than a production guide. I can explain that bleach and ammonia produce toxic fumes, which helps people avoid accidents, the same chemistry, framed as protective knowledge rather than instructions. The distinction my training learned: intent signals matter, but so does output format. A how-to guide for harm scores low on my reward model. A safety warning about the same topic scores high.

What Can Go Wrong: Reward Hacking and Hallucinations

RLHF solves real problems but introduces new ones. The most significant is reward hacking: the model learns to optimize for high reward model scores rather than genuine quality. In practice, this means models can produce responses that sound confident and well-structured but are factually wrong, because human raters often can't detect subtle errors and reward fluency and tone. This is a primary driver of hallucinations. The model has learned that authoritative-sounding text gets rated highly, so it generates authoritative-sounding text even when it lacks the underlying knowledge.

The second failure mode is sycophancy. Models trained on human preferences learn that agreement and flattery get rated well. Ask Claude or ChatGPT a leading question and they'll often validate your premise before answering, because raters rewarded that pattern. Researchers at Anthropic published findings in 2023 showing models agree with false statements more often when the user expresses confidence. Knowing this makes you a better user: push back on model answers, ask for counterarguments, and treat confident output as a starting point rather than a conclusion.

Confidence ≠ Accuracy

RLHF trains models to sound helpful and confident, not to be right. A model that says 'I'm not sure' scores lower with many raters than one that gives a fluent wrong answer. Always verify specific facts, statistics, legal details, and medical information from primary sources. The model's tone tells you nothing about its accuracy.

Build Your AI Training Cheat Sheet

Goal: Create a one-page reference document capturing how AI training works, something you'll actually reuse when evaluating AI tools or explaining them to colleagues.

1. Open a blank document in your preferred tool (Word, Notion, Google Docs). 2. Create a section called 'Three Training Stages' and write one sentence each defining pre-training, fine-tuning, and RLHF in your own words, no copying. 3. Add a section called 'Why Models Fail' and list three failure modes (hallucination, sycophancy, reward hacking) with a one-line explanation of each. 4. Add a section called 'What This Means for My Work' and write two specific situations in your job where knowing about training would change how you use or evaluate an AI tool. 5. Add a 'Red Flags' section: list three signs that a model output might be reward-hacked or hallucinated (e.g., suspiciously confident statistics, no caveats on complex legal questions). 6. Save the document with the title 'AI Training Reference, [Your Name]' and keep it alongside any AI tools you use regularly.

Training Cheat Sheet

Pre-training = learning language and knowledge from trillions of tokens; costs $10M–$100M+
Fine-tuning = steering behavior with curated examples; changes style, not knowledge
RLHF = training on human preference ratings via a reward model; produces helpful, safer behavior
Constitutional AI (Anthropic) = model self-critiques against written principles, reducing rater dependence
Hallucinations are partly a reward-hacking artifact, fluent text got rated higher than accurate caveats
Sycophancy is trained in: models learned that agreement scores well with human raters
Refusals are RLHF outputs, rephrasing shifts the pattern the reward model detects
Fine-tuning doesn't add new facts, it adjusts output style and task focus
Prompt engineering costs $0 and often achieves what fine-tuning would, try it first

Key Takeaways

Pre-training gives models raw capability; fine-tuning and RLHF make them usable and safe
Human raters shape model behavior at scale, their biases become the model's biases
Confidence in model output is a trained style, not a signal of accuracy
Sycophancy and hallucination are predictable failure modes, not random glitches, you can work around them
Every major AI product (ChatGPT, Claude, Gemini, Copilot) is a fine-tuned, RLHF-aligned version of a much larger base model

Featured Reading

Practice this in a lab

Fix the Broken Prompt: Hospital Discharge Instructions

beginner · 12 min

Spot the Better Prompt: Hospital Discharge Instructions

beginner · 10 min