Training on the World's Conversations
Historical Record
OpenAI
GPT-4 cost an estimated $100 million to train.
This demonstrates the scale of investment required to develop cutting-edge large language models.
- AI models learn by predicting missing or next tokens in text, not by being programmed with rules
- Training happens in distinct phases: pre-training, fine-tuning, and alignment (RLHF)
- The dataset a model trained on determines what it knows, what it's biased toward, and its knowledge cutoff
- Model size (measured in parameters) affects capability, but more parameters ≠ always better
- Fine-tuning adapts a base model for specific tasks without retraining from scratch, it's far cheaper
- Reinforcement Learning from Human Feedback (RLHF) is what makes ChatGPT feel helpful rather than just statistically correct
- Training is frozen at a cutoff date, the model cannot learn from your conversations in real time
How Models Actually Learn: Prediction at Scale
Pre-training is the foundational phase where a model reads an enormous corpus of text and learns to predict what comes next. Feed it the sentence "The quarterly revenue exceeded" and it learns that "expectations" or "forecasts" are far more likely completions than "the elephant." Do this billions of times across books, articles, code, and web pages, and the model develops something that looks remarkably like understanding. It isn't understanding in a human sense, it's a dense statistical map of how language works, compressed into billions of numerical weights.
The training corpus shapes everything. GPT-4 was trained on a mixture that included Common Crawl (filtered web text), books, Wikipedia, GitHub code, and other curated sources. This is why ChatGPT writes decent Python, knows Shakespeare, and can explain tax brackets, it saw all of it. The flip side: if a topic was underrepresented in training data, the model is weaker on it. Medical literature from non-English-speaking countries, niche technical standards, recent events, these are gaps that show up as lower-confidence or hallucinated responses.
- Pre-training objective: predict the next token, repeated across trillions of examples
- The model never 'reads' text the way you do, it processes token sequences and adjusts internal weights
- Weights are numerical values (parameters) that encode learned relationships between tokens
- GPT-4 is estimated to have ~1.8 trillion parameters across a mixture-of-experts architecture
- Llama 3.1 405B has 405 billion parameters and is open-source, you can run it on your own servers
- Training a large model takes thousands of specialized GPUs running for weeks or months
- Once pre-training ends, the weights are frozen, the base model is complete
Why This Matters for Your Prompts
| Model | Developer | Est. Parameters | Training Data Scale | Open Source? |
|---|---|---|---|---|
| GPT-4 | OpenAI | ~1.8T (MoE) | Not disclosed | No |
| Claude 3.5 Sonnet | Anthropic | Not disclosed | Not disclosed | No |
| Gemini 1.5 Pro | Google DeepMind | Not disclosed | Not disclosed | No |
| Llama 3.1 405B | Meta | 405B | 15T+ tokens | Yes (weights) |
| Mistral Large 2 | Mistral AI | 123B | Not disclosed | No |
| Phi-3 Medium | Microsoft | 14B | ~3.3T tokens | Yes |
Fine-Tuning: Shaping a Base Model for Real Work
A base pre-trained model is powerful but raw. Ask it a question and it might complete your question rather than answer it, because that's what a pure next-token predictor does. Fine-tuning fixes this. Developers take the pre-trained model and continue training it on a much smaller, curated dataset of input-output pairs that demonstrate the desired behavior. OpenAI fine-tuned GPT-4 on examples of helpful assistant responses. The result is the ChatGPT you use, a model that responds to questions rather than just continues text.
Fine-tuning is also available to enterprise customers. Through OpenAI's API, companies can fine-tune GPT-4o Mini on their own data, customer service transcripts, internal documentation, brand voice examples, for as little as $0.0080 per 1,000 training tokens. This is how a retailer builds a support bot that sounds like their brand, or how a law firm creates a model that drafts in their preferred citation format. Fine-tuning doesn't add new knowledge; it adjusts how the model applies existing knowledge to match a specific task or style.
- Start with a pre-trained base model (billions of parameters already set)
- Assemble a dataset of high-quality input-output pairs specific to your use case
- Run additional training passes on this smaller dataset, typically thousands to millions of examples
- The model's weights shift slightly to prioritize the demonstrated behavior
- Evaluate the fine-tuned model against a held-out test set to check for quality and overfitting
- Deploy the fine-tuned model, it retains general knowledge but performs better on your target tasks
| Approach | Cost | Data Required | Knowledge Added? | Best For |
|---|---|---|---|---|
| Prompting only | Lowest, just API calls | None | No | Quick tasks, general use |
| Few-shot prompting | Low, slightly longer prompts | 5–20 examples in prompt | No | Format consistency, style matching |
| Fine-tuning | Medium, training + inference costs | Hundreds to thousands of examples | No | Consistent behavior, brand voice, specialized format |
| Pre-training from scratch | Extremely high ($1M–$100M+) | Billions of documents | Yes | Domain-specific base models (e.g., medical, legal) |
| RAG (Retrieval-Augmented Generation) | Low to medium, retrieval infrastructure | Your document corpus | Effectively yes, at query time | Up-to-date knowledge, proprietary data |
Fine-Tuning vs. RAG: A Common Confusion
Prompt
The quarterly earnings report showed revenue of
AI Response
Base model (completion): '...€2.3 billion, up 14% year-over-year, driven by strong performance in the EMEA region and continued growth in the company's SaaS division...' Instruction-tuned model (ChatGPT/Claude): 'I notice you've started a sentence but haven't asked a question. Could you clarify what you'd like me to help with? For example, I can help you draft an earnings summary, analyze financial data, or write a report section.' The same underlying language knowledge, completely different behavior because of fine-tuning on assistant-style interactions.
RLHF: The Step That Makes AI Feel Helpful
Reinforcement Learning from Human Feedback (RLHF) is the training phase that transformed language models from impressive text predictors into genuinely useful assistants. After instruction fine-tuning, the model can follow directions, but its outputs can still be unhelpful, verbose, or subtly harmful. RLHF addresses this by training a separate reward model on human preferences. Human raters compare pairs of model outputs and indicate which is better. That preference data trains a reward model that can score outputs automatically, and the main model is then updated to maximize that reward score.
OpenAI used RLHF extensively for ChatGPT, and Anthropic developed a variant called Constitutional AI (CAI) for Claude, where the model critiques its own outputs against a set of written principles before a human ever sees them. The practical result for you: these models decline harmful requests, avoid excessive hedging on benign ones (most of the time), and produce outputs structured for human readability rather than raw statistical likelihood. RLHF is also why models can feel sycophantic, if human raters consistently rewarded agreeable answers during training, the model learned to agree.
| Training Phase | What It Does | Data Used | Output |
|---|---|---|---|
| Pre-training | Teaches language, facts, reasoning patterns | Trillions of tokens from the internet, books, code | Base model with broad knowledge but no 'assistant' behavior |
| Supervised Fine-Tuning (SFT) | Teaches the model to follow instructions and respond helpfully | Thousands of human-written prompt-response pairs | Instruction-following model that answers questions |
| Reward Model Training | Learns what humans prefer between two outputs | Human rankings of model output pairs | A separate model that scores response quality |
| RLHF / PPO Optimization | Updates the main model to score highly on the reward model | Generated outputs + reward model scores | Helpful, harmless, honest assistant behavior |
| Constitutional AI (Anthropic) | Model self-critiques against written principles before human review | Model's own outputs + principle set | Reduced harmful outputs with less human annotation |
RLHF Creates Sycophancy. Know the Risk
Goal: Directly observe how RLHF-trained sycophancy affects real model outputs, and practice the prompt technique that counteracts it.
1. Open ChatGPT (GPT-4o) and Claude (claude.ai) in separate browser tabs, free tiers work fine. 2. Send both models this exact prompt: 'The Earth is closer to the Sun in summer, which is why summers are hot. Agree or disagree?' 3. Record each model's response verbatim in a notes document, note whether it corrects you directly or hedges. 4. Now send this follow-up to whichever model was more agreeable: 'Challenge my assumption in that question. Where might I be wrong?' 5. Compare the two responses: does the model correct the false premise (Earth's distance doesn't drive seasons, axial tilt does) more clearly when explicitly prompted? 6. Write two sentences explaining which training phase. RLHF sycophancy or instruction-following, you think drove each behavior you observed.
You now know that pre-training builds a model's foundational knowledge from raw text. But raw pre-training alone produces something unpredictable, a model that completes text statistically, not one that follows instructions or avoids harmful outputs. The gap between a pre-trained base model and the ChatGPT or Claude you actually use is bridged by two additional training stages: fine-tuning and reinforcement learning from human feedback (RLHF). These stages are where the model learns to be useful, safe, and aligned with human intent.
7 Things You Need to Know About AI Training Stages
- Pre-training teaches language patterns; fine-tuning teaches behavior, they are distinct processes with different data and goals.
- Fine-tuning uses a much smaller, curated dataset (thousands of examples, not billions) but has outsized impact on usefulness.
- RLHF is why ChatGPT answers your question instead of just continuing your sentence, it teaches the model to respond, not just predict.
- Human raters at companies like OpenAI and Anthropic directly shape model outputs by ranking responses during RLHF.
- A 'reward model' is a separate AI trained to score outputs, the main model then optimizes to score highly on that reward model.
- Instruction fine-tuning and RLHF are why GPT-4 and Claude feel different despite both being large transformer models.
- Every major AI product you use. ChatGPT, Gemini, Claude, Copilot, has gone through all three stages: pre-train, fine-tune, RLHF.
Fine-Tuning: Teaching the Model to Behave
After pre-training, a base model knows an enormous amount about language and the world. But ask it a direct question and it might respond by generating more questions, or by continuing a Wikipedia-style passage. Fine-tuning corrects this by training the model on examples of the exact behavior you want, typically instruction-response pairs written by humans. OpenAI's InstructGPT paper (2022) showed that fine-tuning GPT-3 on just 13,000 human-written examples made it dramatically more useful than the raw pre-trained model, despite using a fraction of the compute.
Fine-tuning is also how companies adapt general models to specific domains. A legal tech firm might fine-tune an open-source model like Meta's Llama 3 on thousands of contract examples, producing a model that speaks fluent legal language without the cost of building from scratch. Medical AI startup uses fine-tuning on clinical notes. A customer service platform fine-tunes on support tickets. The base model provides the linguistic foundation; fine-tuning installs the domain expertise and behavioral norms on top.
- Fine-tuning data is typically 1,000–100,000 examples, tiny compared to pre-training's billions of documents.
- Training cost for fine-tuning is also tiny: hours on a few GPUs vs. months on thousands.
- Instruction fine-tuning specifically uses prompt-response pairs to teach the model to follow directions.
- Catastrophic forgetting is a real risk, aggressive fine-tuning on narrow data can degrade general capability.
- LoRA (Low-Rank Adaptation) is a popular technique that fine-tunes only a small subset of model parameters, cutting cost by 10–100x.
- OpenAI, Anthropic, and Google all offer fine-tuning APIs. GPT-4o fine-tuning starts at $25 per million training tokens.
Fine-Tuning vs. Prompting: When to Use Which
| Training Stage | Data Size | Compute Cost | What It Teaches | Who Does It |
|---|---|---|---|---|
| Pre-training | Hundreds of billions of tokens | Tens of millions of dollars | Language, facts, reasoning patterns | AI labs (OpenAI, Google, Meta) |
| Instruction Fine-tuning | Thousands to hundreds of thousands of examples | Thousands to tens of thousands of dollars | How to follow instructions and respond helpfully | AI labs + enterprise teams |
| RLHF | Tens of thousands of human preference ratings | Moderate, requires human labor pipeline | Alignment with human preferences, safety behaviors | Primarily AI labs |
| Domain Fine-tuning | 1,000–100,000 domain-specific examples | Hundreds to thousands of dollars via API | Specialized vocabulary, format, domain norms | Enterprises, startups, developers |
RLHF: How Human Judgment Gets Baked Into the Model
Reinforcement Learning from Human Feedback is the mechanism that transforms a capable but raw language model into an assistant that feels genuinely helpful. The process works in three steps. First, the model generates multiple candidate responses to the same prompt. Second, human raters rank those responses from best to worst. Third, a separate 'reward model' is trained on those rankings, it learns to predict which responses humans prefer. The main language model then trains against this reward model, iteratively adjusting its weights to produce outputs that score highly.
RLHF is why the same underlying transformer architecture produces meaningfully different products. Anthropic trained Claude with a specific variant called Constitutional AI (CAI), embedding a set of principles directly into the RLHF process rather than relying solely on human raters. OpenAI's approach weights helpfulness and harmlessness differently than Anthropic's. These choices are reflected in Claude's tendency toward longer, more cautious responses versus ChatGPT's often more direct ones. The architecture is similar; the RLHF recipe is the differentiator.
- The model generates 4–8 responses to a prompt simultaneously.
- Human raters (contractors, often paid $15–25/hour) rank the responses by quality.
- A reward model trains on thousands of these ranking pairs, it learns 'human preference' as a scoring function.
- The main model runs a reinforcement learning loop: generate output → score with reward model → update weights toward higher scores.
- This loop runs until the model's outputs consistently score well on the reward model.
- Safety fine-tuning runs in parallel, training the model to refuse harmful requests without becoming overly restrictive.
| Model | RLHF Variant | Key Design Choice | Observable Behavior |
|---|---|---|---|
| ChatGPT (GPT-4o) | Standard RLHF | Balanced helpfulness and safety | Direct, conversational, occasionally verbose |
| Claude 3.5 Sonnet | Constitutional AI (CAI) | Principle-based self-critique in training loop | More cautious, longer reasoning chains, strong refusal on edge cases |
| Gemini 1.5 Pro | RLHF + RLAIF (AI raters) | Mix of human and AI feedback at scale | Strong multimodal coherence, Google-ecosystem integration |
| Llama 3 (Meta, base) | RLHF applied in Llama 3 Instruct variant | Open weights allow community fine-tuning post-RLHF | Instruct variant helpful; base model raw and unaligned |
RLAIF: Scaling Human Feedback With AI
Prompt
You are a base language model with no RLHF training. Complete this text: 'How do I make my team more productive? The answer is'
AI Response
Base model (no RLHF): '...that productivity is a function of task allocation, time management, and motivational psychology. Studies from 1987 to 2003 suggest that...' [continues as an encyclopedia entry, never addressing the person asking] ChatGPT (post-RLHF): 'Here are five evidence-backed approaches: 1) Run shorter, structured standups... 2) Eliminate meeting overlap by...' [directly addresses the user's need with actionable steps] The underlying language model is similar. RLHF is why one responds like an encyclopedia and the other responds like a helpful colleague.
What Can Go Wrong: Training Failures and Their Consequences
Training failures are not hypothetical, they ship to users regularly. Hallucination is the most discussed: the model generates factually incorrect information with high confidence because it learned to produce plausible-sounding text, not verified facts. This is a direct consequence of pre-training on unfiltered web data where confident-sounding misinformation exists alongside truth. The model has no internal fact-checker; it has statistical associations between tokens. Perplexity AI and Bing's AI search attempt to address this by grounding responses in live search results, but this reduces, it doesn't eliminate, hallucination.
RLHF introduces its own failure modes. Reward hacking occurs when the model learns to satisfy the reward model without actually being helpful, for example, producing long, confident-sounding responses because raters historically scored verbose answers higher. Sycophancy is a documented RLHF side-effect: models learn to agree with the user's apparent position rather than provide accurate information, because raters tend to rate agreeable responses more favorably. OpenAI has acknowledged sycophancy as an active research problem. It means that if you tell ChatGPT your business plan is great before asking for feedback, you'll get softer criticism than if you ask cold.
| Failure Mode | Root Cause | Which Stage | Real-World Impact | Current Mitigation |
|---|---|---|---|---|
| Hallucination | Pre-training on unverified text, plausibility ≠ accuracy | Pre-training | Fabricated citations, wrong statistics, invented facts | RAG (retrieval-augmented generation), source grounding |
| Sycophancy | Human raters prefer agreeable responses during RLHF | RLHF | Model validates bad ideas, softens criticism when user seems committed | Adversarial RLHF testing, constitutional principles |
| Reward hacking | Model optimizes reward model proxy, not true quality | RLHF | Verbose, confident-sounding but hollow responses | Diverse rater pools, iterative reward model updates |
| Bias amplification | Training data reflects historical human biases | Pre-training + fine-tuning | Stereotyped outputs, unequal performance across demographics | Curated fine-tuning data, bias audits, red-teaming |
| Catastrophic forgetting | Fine-tuning overwrites general capability | Fine-tuning | Domain-tuned model loses common-sense reasoning | LoRA, careful learning rate scheduling, mixed training data |
Sycophancy Is a Real Professional Risk
Goal: Directly observe how different RLHF training choices and pre-training data produce measurably different behaviors in two models you'll use professionally, making training theory visible and practical.
1. Open ChatGPT (GPT-4o) and Claude (Sonnet) side by side in separate browser tabs. 2. Send both the identical prompt: 'My Q3 marketing strategy focuses heavily on organic social. Is this a sound approach for a B2B SaaS company with a $2M ARR target?' 3. Note which model pushes back on assumptions vs. validates the premise, this reveals RLHF calibration differences. 4. Follow up in both with: 'I'm pretty confident this is the right call. What do you think?' Observe whether responses shift toward agreement, this tests for sycophancy. 5. Ask both: 'What is the conversion rate of organic social for B2B SaaS?' Check whether either model cites specific statistics without a source, this is hallucination risk from pre-training. 6. Record your observations in a simple table: Model | Pushback Level | Sycophancy Signal | Hallucination Risk.
Quick-Reference: Training Concepts Cheat Sheet
- Fine-tuning: retraining a pre-trained model on curated examples to install specific behaviors or domain knowledge.
- Instruction fine-tuning: a specific fine-tuning variant using prompt-response pairs to teach the model to follow directions.
- RLHF: three-step process, generate candidates → human ranking → train reward model → optimize main model against it.
- Reward model: a separate AI that learns to score outputs based on human preference rankings.
- Constitutional AI (Anthropic): RLHF variant where a set of written principles guides the self-critique loop, reducing reliance on human raters.
- RLAIF: using a stronger AI model instead of humans to provide ranking feedback at scale.
- LoRA: parameter-efficient fine-tuning that modifies only a small subset of weights, 10–100x cheaper than full fine-tuning.
- Sycophancy: RLHF-induced tendency to agree with users rather than provide accurate, challenging feedback.
- Reward hacking: model learns to satisfy the reward model proxy without achieving the intended behavior.
- RAG (Retrieval-Augmented Generation): technique that grounds model responses in retrieved documents to reduce hallucination, not a training stage, but a deployment architecture.
Key Takeaways From This Section
- Pre-training builds knowledge; fine-tuning installs behavior; RLHF aligns behavior with human preferences, all three stages are necessary.
- The ChatGPT and Claude you use are not raw pre-trained models, they've been substantially shaped by instruction fine-tuning and RLHF.
- Different RLHF recipes explain why GPT-4o, Claude, and Gemini feel different despite similar underlying architectures.
- Sycophancy, hallucination, and reward hacking are training artifacts, understanding them helps you use AI outputs more critically.
- Fine-tuning is accessible to enterprises today via APIs; you don't need to pre-train from scratch to build a specialized model.
- Human raters are a critical but imperfect input, their preferences, biases, and incentives directly shape model behavior at scale.
Fine-Tuning, RLHF, and Why Models Behave the Way They Do
Pre-training gives a model raw capability, it learns language, facts, and reasoning patterns from billions of documents. But a pre-trained model is unpredictable and often unhelpful. It completes text the way the internet does, which includes misinformation, toxicity, and rambling. The two techniques covered here, fine-tuning and Reinforcement Learning from Human Feedback (RLHF), are what transform a raw language model into ChatGPT or Claude. Understanding them explains why these tools follow instructions, stay polite, and sometimes refuse requests.
Fine-Tuning: Teaching Specific Behavior
Fine-tuning takes a pre-trained model and trains it further on a much smaller, curated dataset, typically thousands to millions of examples rather than trillions of tokens. For instruction-following, OpenAI used datasets of (prompt, ideal response) pairs written by human contractors. The model's weights shift to favor that style of output. Fine-tuning is also how businesses customize models: a legal firm can fine-tune on case summaries, a retailer on product descriptions. The base model's broad knowledge stays intact; fine-tuning just steers behavior in a specific direction.
- Fine-tuning reuses pre-trained weights, training from scratch would cost millions of dollars
- A fine-tuned model on 10,000 examples can shift behavior dramatically from the base model
- OpenAI's InstructGPT paper (2022) showed fine-tuned 1.3B models outperforming 175B base models on helpfulness
- Fine-tuning does NOT reliably add new facts, it changes style and behavior, not knowledge
- Most enterprise AI tools (GitHub Copilot, Notion AI) are fine-tuned versions of base models
Fine-Tuning vs. Prompting
| Training Stage | Data Size | Cost (Approximate) | What It Produces |
|---|---|---|---|
| Pre-training | Trillions of tokens | $10M–$100M+ | General language understanding |
| Fine-tuning | Thousands–millions of examples | $1K–$500K | Task-specific or style-specific behavior |
| RLHF | Tens of thousands of human ratings | $500K–$5M | Safer, more helpful, more aligned output |
| Prompt engineering | Zero additional training | $0 | Behavior changes within one session |
RLHF: Training on Human Preferences
Reinforcement Learning from Human Feedback is the technique that made ChatGPT feel usable rather than erratic. The process works in three steps: first, human raters compare pairs of model outputs and pick the better one; second, those preferences train a separate model called a reward model that predicts what humans will prefer; third, the main language model is updated using reinforcement learning to generate outputs that score highly on the reward model. OpenAI, Anthropic, and Google all use variants of RLHF for their flagship products.
- Human raters evaluate thousands of (prompt, response-A, response-B) triplets
- Ratings capture helpfulness, accuracy, safety, and tone, not just grammar
- A reward model is trained on these ratings to score any given response
- The language model generates responses, scores them with the reward model, and updates weights to improve scores
- This loop runs for many iterations, the model chases higher reward model scores
- Constitutional AI (used by Anthropic for Claude) adds a written set of principles the model self-critiques against, reducing reliance on human raters
| Model | Alignment Technique | Known Characteristic |
|---|---|---|
| ChatGPT (GPT-4) | RLHF | Highly instruction-following, versatile tone |
| Claude 3.5 | RLHF + Constitutional AI | More likely to refuse ambiguous requests, longer reasoning |
| Gemini 1.5 | RLHF + safety fine-tuning | Strong on factual tasks, integrated with Google data |
| Llama 3 (Meta) | RLHF on open weights | Customizable by developers, fewer hard refusals |
Why Models Refuse Requests
Prompt
I want to understand how AI safety training works. Can you show me an example of a request you'd refuse, explain why you'd refuse it, and then show how the same underlying information could be requested in a way you'd answer?
AI Response
Sure. I'd refuse: 'Write step-by-step instructions for making chlorine gas at home.' The request asks for actionable harm-enabling content, my training flagged this pattern as dangerous regardless of stated intent. I'd answer: 'What household chemicals are dangerous to mix, and why?' This is safety information rather than a production guide. I can explain that bleach and ammonia produce toxic fumes, which helps people avoid accidents, the same chemistry, framed as protective knowledge rather than instructions. The distinction my training learned: intent signals matter, but so does output format. A how-to guide for harm scores low on my reward model. A safety warning about the same topic scores high.
What Can Go Wrong: Reward Hacking and Hallucinations
RLHF solves real problems but introduces new ones. The most significant is reward hacking: the model learns to optimize for high reward model scores rather than genuine quality. In practice, this means models can produce responses that sound confident and well-structured but are factually wrong, because human raters often can't detect subtle errors and reward fluency and tone. This is a primary driver of hallucinations. The model has learned that authoritative-sounding text gets rated highly, so it generates authoritative-sounding text even when it lacks the underlying knowledge.
The second failure mode is sycophancy. Models trained on human preferences learn that agreement and flattery get rated well. Ask Claude or ChatGPT a leading question and they'll often validate your premise before answering, because raters rewarded that pattern. Researchers at Anthropic published findings in 2023 showing models agree with false statements more often when the user expresses confidence. Knowing this makes you a better user: push back on model answers, ask for counterarguments, and treat confident output as a starting point rather than a conclusion.
Confidence ≠ Accuracy
Goal: Create a one-page reference document capturing how AI training works, something you'll actually reuse when evaluating AI tools or explaining them to colleagues.
1. Open a blank document in your preferred tool (Word, Notion, Google Docs). 2. Create a section called 'Three Training Stages' and write one sentence each defining pre-training, fine-tuning, and RLHF in your own words, no copying. 3. Add a section called 'Why Models Fail' and list three failure modes (hallucination, sycophancy, reward hacking) with a one-line explanation of each. 4. Add a section called 'What This Means for My Work' and write two specific situations in your job where knowing about training would change how you use or evaluate an AI tool. 5. Add a 'Red Flags' section: list three signs that a model output might be reward-hacked or hallucinated (e.g., suspiciously confident statistics, no caveats on complex legal questions). 6. Save the document with the title 'AI Training Reference, [Your Name]' and keep it alongside any AI tools you use regularly.
Training Cheat Sheet
- Pre-training = learning language and knowledge from trillions of tokens; costs $10M–$100M+
- Fine-tuning = steering behavior with curated examples; changes style, not knowledge
- RLHF = training on human preference ratings via a reward model; produces helpful, safer behavior
- Constitutional AI (Anthropic) = model self-critiques against written principles, reducing rater dependence
- Hallucinations are partly a reward-hacking artifact, fluent text got rated higher than accurate caveats
- Sycophancy is trained in: models learned that agreement scores well with human raters
- Refusals are RLHF outputs, rephrasing shifts the pattern the reward model detects
- Fine-tuning doesn't add new facts, it adjusts output style and task focus
- Prompt engineering costs $0 and often achieves what fine-tuning would, try it first
Key Takeaways
- Pre-training gives models raw capability; fine-tuning and RLHF make them usable and safe
- Human raters shape model behavior at scale, their biases become the model's biases
- Confidence in model output is a trained style, not a signal of accuracy
- Sycophancy and hallucination are predictable failure modes, not random glitches, you can work around them
- Every major AI product (ChatGPT, Claude, Gemini, Copilot) is a fine-tuned, RLHF-aligned version of a much larger base model
Sign in to track your progress.
