Prediction One Token at a Time
How AI Generates Text: One Word at a Time
Every time ChatGPT answers a question, writes an email, or explains a concept, it's doing one thing repeatedly: predicting the next most likely word. That's it. No understanding, no consciousness, no lookup table of facts, just a sophisticated probability engine trained on hundreds of billions of words. This sounds reductive until you realize the prediction process requires modeling grammar, logic, context, tone, and world knowledge simultaneously. Knowing exactly how this works makes you a better prompter, helps you spot AI errors before they cost you, and tells you which tasks AI genuinely excels at versus where it reliably fails.
7 Things You Need to Know About AI Text Generation
- AI generates text token by token, not word by word; a token is roughly 3-4 characters or 0.75 words on average.
- Each token is chosen based on probability scores across the entire model vocabulary (roughly 100,000 possible tokens in GPT-4).
- The model reads your full prompt before generating a single output token, so context shapes every prediction.
- Temperature controls randomness: low temperature (0.1) = predictable and precise, high temperature (0.9) = creative and varied.
- Models have a context window, a hard limit on how much text they can 'see' at once. GPT-4 Turbo: 128,000 tokens. Claude 3.5: 200,000 tokens.
- AI has no memory between separate conversations unless the product explicitly stores and re-injects it.
- Hallucinations happen because the model is optimizing for plausible text, not verified truth; it doesn't know what it doesn't know.
Tokens: The Real Unit of AI Language
Before a model reads a single character of your prompt, it converts everything into tokens. A token is a chunk of text, sometimes a full word, sometimes part of one. 'Running' is one token. 'Unbelievable' might split into 'Un', 'believ', 'able', three tokens. Numbers, punctuation, and spaces each consume tokens too. This matters practically: OpenAI charges by token, and GPT-4 costs $0.03 per 1,000 input tokens and $0.06 per 1,000 output tokens as of early 2024. A 500-word document is roughly 650-700 tokens. A long legal contract at 10,000 words runs around 13,000 tokens, about $0.39 just to read it.
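The pricing arithmetic above can be sketched in a few lines of Python. This is a minimal illustration that assumes the early-2024 GPT-4 prices quoted in the text; the function name `estimate_cost` is hypothetical, and current prices should be checked against the provider's pricing page.

```python
# Prices quoted in the text (GPT-4, early 2024), in USD per 1,000 tokens.
# These are assumptions for illustration and will go out of date.
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06


def estimate_cost(input_tokens: int, output_tokens: int = 0) -> float:
    """Return the estimated USD cost of one request, rounded to 4 decimals."""
    input_cost = input_tokens / 1000 * INPUT_PRICE_PER_1K
    output_cost = output_tokens / 1000 * OUTPUT_PRICE_PER_1K
    return round(input_cost + output_cost, 4)


# The 10,000-word contract from the text, at roughly 13,000 tokens.
contract_cost = estimate_cost(input_tokens=13_000)
print(f"Reading the contract: ${contract_cost:.2f}")
```

Output tokens are priced separately, so a long generated summary of the same contract would add to this figure.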
Tokenization also explains some quirky AI behavior. Ask ChatGPT to count the letter 'r' in 'strawberry' and it often gets it wrong, because 'strawberry' may tokenize in a way that obscures individual characters from the model's perspective. The model never sees raw letters; it sees token IDs. Rare words, technical jargon, and non-English text tend to tokenize inefficiently, meaning they consume more tokens and cost more to process. Writing prompts in plain, common language isn't just good communication practice; it's also more economical and often produces better results.
- 1 token ≈ 4 characters or 0.75 English words
- 100 tokens ≈ 75 words ≈ a short paragraph
- 1,000 tokens ≈ 750 words ≈ a 1.5-page memo
- Common English words = 1 token each ('the', 'and', 'is')
- Long or rare words = 2-4 tokens ('cryptocurrency', 'pharmaceutical')
- Non-Latin scripts (Chinese, Arabic) = 1-2 characters per token, making them 2-3x more expensive to process
- Code is token-dense: a 50-line Python script can easily hit 400-600 tokens
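These rules of thumb can be turned into a rough estimator. The sketch below assumes plain English text and uses only the two ratios listed above; the function names are hypothetical. Exact counts depend on the model's tokenizer, so for billing-sensitive work a real tokenizer library (for example OpenAI's tiktoken) is the safer choice.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb from the list above
WORDS_PER_TOKEN = 0.75   # rule of thumb from the list above


def tokens_from_text(text: str) -> int:
    """Rough token estimate from character count (English text only)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def tokens_from_word_count(word_count: int) -> int:
    """Rough token estimate from a word count (English text only)."""
    return round(word_count / WORDS_PER_TOKEN)


print(tokens_from_word_count(750))   # a 1.5-page memo
print(tokens_from_text("Summarize this report in five bullet points."))
```

Code and non-Latin scripts will run well above these estimates, as the list notes.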
Check Your Token Count Before Sending Long Prompts
| Model | Context Window | Approximate Page Limit | Best For |
|---|---|---|---|
| GPT-3.5 Turbo | 16,385 tokens | ~22 pages | Fast, cheap tasks, summaries, drafts |
| GPT-4 Turbo | 128,000 tokens | ~170 pages | Long documents, complex reasoning |
| GPT-4o | 128,000 tokens | ~170 pages | Multimodal tasks, speed + quality balance |
| Claude 3.5 Sonnet | 200,000 tokens | ~270 pages | Book-length analysis, legal/financial docs |
| Gemini 1.5 Pro | 1,000,000 tokens | ~1,350 pages | Entire codebases, very long research corpora |
| Llama 3 (Meta) | 8,192 tokens | ~11 pages | Local/private deployment, shorter tasks |
Probability, Temperature, and Why AI Isn't Deterministic
After tokenizing your prompt, the model assigns a probability score to every possible next token, roughly 100,000 of them in GPT-4's case. Take 'The sky is ___': 'blue' might score 34%, 'clear' 18%, 'dark' 12%, and so on down to absurd options with 0.001% probability. The model doesn't always pick the top scorer. A parameter called temperature adjusts how the model samples from this distribution. At temperature 0, it always picks the highest-probability token: maximally predictable, sometimes robotic. At temperature 1.0, lower-probability tokens get a real chance, producing more varied, creative output. Most production applications set temperature between 0.3 and 0.7.
This is why identical prompts produce different outputs on different runs, unless temperature is explicitly set to zero. Consumer chat products do not publish their sampling settings, but ChatGPT's default is commonly estimated at around 0.7, which would explain why it 'feels' natural and varied rather than mechanical. Claude often feels more precise and measured, which may reflect a lower setting as well as differences in training. When you need consistency (extracting data from invoices, classifying customer feedback, generating structured JSON), lower temperature is your friend. When you need creative ideation, higher temperature surfaces unexpected angles. You can't adjust temperature directly in ChatGPT's interface, but you can prompt for a similar effect: 'Give me 5 distinct, varied approaches' pushes the model toward diversity.
- Model tokenizes your full prompt into token IDs
- Token IDs pass through the neural network's attention layers (more on this in Part 2)
- Network outputs a probability distribution over all vocabulary tokens (roughly 100,000 for GPT-4)
- Temperature parameter reshapes this distribution, flattening it (high temp) or sharpening it (low temp)
- One token is sampled from the distribution
- That token is appended to the context, and the process restarts
- Generation continues until the model produces a stop token or hits the maximum output length
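The steps above can be sketched as a small loop. This is a toy illustration, not how any production model is implemented: `toy_logits` is a hypothetical hand-written stand-in for the neural network, and the six-token vocabulary is invented for the example.

```python
import math
import random

VOCAB = ["the", "sky", "is", "blue", "clear", "<stop>"]


def toy_logits(context: list[str]) -> list[float]:
    """Hypothetical stand-in for the network: one raw score per vocabulary token."""
    last = context[-1] if context else ""
    table = {
        "": [3.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        "the": [0.0, 3.0, 0.0, 0.0, 0.0, 0.0],
        "sky": [0.0, 0.0, 3.0, 0.0, 0.0, 0.0],
        "is": [0.0, 0.0, 0.0, 2.0, 1.4, 0.0],
    }
    # After any other token, the toy model strongly prefers to stop.
    return table.get(last, [0.0, 0.0, 0.0, 0.0, 0.0, 3.0])


def softmax(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, reshaped by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_tokens, temperature=0.7, max_new_tokens=10, seed=None):
    """Sample one token at a time, appending each to the context."""
    rng = random.Random(seed)
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = toy_logits(context)
        if temperature == 0:
            # Temperature 0: always take the top-scoring token.
            index = max(range(len(logits)), key=logits.__getitem__)
        else:
            probs = softmax(logits, temperature)
            index = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        token = VOCAB[index]
        if token == "<stop>":
            break
        context.append(token)
    return context


print(generate(["the", "sky", "is"], temperature=0))
```

At temperature 0 the loop always continues with 'blue'; at higher temperatures 'clear' is chosen on some runs, which is the run-to-run variation described earlier.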
| Temperature Setting | Behavior | Use Case | Risk |
|---|---|---|---|
| 0.0 – 0.2 | Highly deterministic, repetitive phrasing | Data extraction, classification, structured output | Formulaic, may miss nuance |
| 0.3 – 0.5 | Consistent but natural-sounding | Summarization, professional emails, reports | Occasionally flat tone |
| 0.6 – 0.8 | Balanced creativity and coherence | Marketing copy, explanations, brainstorming | Occasional unexpected tangents |
| 0.9 – 1.0 | High variance, surprising outputs | Creative writing, ideation, poetry | May produce incoherent results |
| Above 1.0 | Chaotic, unreliable | Rarely used in production | High hallucination risk |
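To make the table concrete, the sketch below reshapes a small distribution at three temperatures. It is a simplified illustration: the three probabilities come from the 'The sky is ___' example, the rest of the vocabulary is ignored, and `apply_temperature` is a hypothetical helper name.

```python
def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Rescale a probability distribution: p ** (1 / T), then renormalize."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as 'pick the top token'")
    weights = {token: p ** (1.0 / temperature) for token, p in probs.items()}
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


# Toy distribution: only the three candidates named in the text.
sky = {"blue": 0.34, "clear": 0.18, "dark": 0.12}

for t in (0.2, 1.0, 2.0):
    reshaped = apply_temperature(sky, t)
    summary = ", ".join(f"{tok} {p:.0%}" for tok, p in reshaped.items())
    print(f"T={t}: {summary}")
```

Low temperature concentrates almost all of the probability on 'blue', while high temperature pulls the three options closer together.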
Temperature vs. Top-P: Two Knobs, One Goal
Prompt
You are a precise data analyst. Extract the following information from this invoice and return it ONLY as a JSON object with no additional commentary. Be exact; do not infer or estimate any values not explicitly stated. Invoice text: 'Invoice #4421, dated March 3 2024, from Apex Supplies Ltd. Total due: $3,847.50. Payment terms: Net 30. Line items: Office furniture $2,200, Delivery $147.50, Installation $1,500.'
AI Response
{ "invoice_number": "4421", "date": "2024-03-03", "vendor": "Apex Supplies Ltd", "total_due": 3847.50, "payment_terms": "Net 30", "line_items": [ { "description": "Office furniture", "amount": 2200.00 }, { "description": "Delivery", "amount": 147.50 }, { "description": "Installation", "amount": 1500.00 } ] }
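Structured output like this is easy to check in code before it reaches a downstream system. The sketch below is one possible validation step, assuming the response arrives as a plain string; `validate_invoice` and `REQUIRED_KEYS` are hypothetical names, and the checks shown (required keys, line items summing to the total) are examples rather than a complete schema.

```python
import json

response_text = """
{ "invoice_number": "4421", "date": "2024-03-03", "vendor": "Apex Supplies Ltd",
  "total_due": 3847.50, "payment_terms": "Net 30",
  "line_items": [
    { "description": "Office furniture", "amount": 2200.00 },
    { "description": "Delivery", "amount": 147.50 },
    { "description": "Installation", "amount": 1500.00 } ] }
"""

REQUIRED_KEYS = {"invoice_number", "date", "vendor", "total_due", "payment_terms", "line_items"}


def validate_invoice(raw: str) -> dict:
    """Parse the model's reply and run basic sanity checks on it."""
    data = json.loads(raw)  # raises an error if the reply is not valid JSON
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    line_total = sum(item["amount"] for item in data["line_items"])
    if abs(line_total - data["total_due"]) > 0.005:
        raise ValueError(f"line items sum to {line_total}, not {data['total_due']}")
    return data


invoice = validate_invoice(response_text)
print(invoice["vendor"], invoice["total_due"])
```

A check like this catches the case where the model returns well-formed JSON with a wrong or invented number.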
Hallucinations: Why AI Confidently Gets Things Wrong
Hallucination is the term for when an AI model generates text that is fluent, confident, and factually wrong. It's not a bug in the traditional sense; it's a direct consequence of how text generation works. The model is always optimizing for the most plausible next token, not the most truthful one. It has no internal fact-checker. When asked about an obscure topic, a real person, or a specific statistic, the model fills gaps with whatever pattern fits the context. A 2023 study by Stanford found hallucination rates in GPT-3.5 as high as 27% on medical question-answering tasks. GPT-4 reduced this significantly, but never to zero.
Hallucinations cluster around specific failure modes. Fake citations are notorious: ask for academic sources and the model will generate plausible-looking author names, journal titles, and DOIs that don't exist. Specific numbers (revenue figures, dates, statistics) are high-risk. Names of real people combined with specific claims ('In her 2019 Forbes interview, [executive] said...') frequently contain fabricated quotes. The model has no reliable way to signal uncertainty; it generates 'I'm not sure' as a learned phrase, not as an actual epistemic flag. Perplexity AI partially addresses this by grounding responses in live web search results, which is why it's often a better choice than ChatGPT for factual research tasks.
Historical Record
A New York lawyer was sanctioned in 2023 after submitting a court brief containing six AI-generated case citations, all fabricated. ChatGPT had invented the case names, courts, and legal holdings with complete confidence.
This incident demonstrates how AI hallucinations can have real legal consequences when models generate plausible-looking but entirely false information.
| Content Type | Hallucination Risk | Why | Mitigation |
|---|---|---|---|
| Creative writing, fiction | Very Low | No factual standard to violate | None needed |
| Summarizing text you provide | Low | Model works from given context | Check for omissions |
| General explanations of concepts | Low–Medium | Well-represented in training data | Verify key claims |
| Specific statistics and numbers | High | Numbers are easy to plausibly fabricate | Cross-check primary sources |
| Academic or legal citations | Very High | Model pattern-matches citation formats | Never use without verification |
| Recent events (post-training cutoff) | Very High | Model has no data; must confabulate | Use Perplexity or Bing AI instead |
| Niche professional knowledge | High | Sparse training data = more guesswork | Consult domain expert |
Temperature, Sampling, and Why the Same Prompt Gets Different Answers
You now know that AI picks the next token by calculating probabilities across its entire vocabulary. But there's a second layer of decision-making happening simultaneously, one that controls how adventurous or conservative those picks are. This is where temperature and sampling strategies come in. They don't change what the model knows; they change how it chooses from what it knows. Understanding this distinction separates professionals who get consistent, reliable outputs from those who wonder why ChatGPT seems unpredictable.
Temperature: The Randomness Dial
Temperature is a numerical setting, typically between 0 and 2, applied after the model calculates its token probabilities. At temperature 0, the model always picks the highest-probability token. Deterministic, predictable, occasionally repetitive. At temperature 1, it samples from the probability distribution as-is. At temperature 2, it flattens the distribution, making low-probability tokens far more likely to get selected. Think of it like a volume dial for creativity. Low temperature produces precise, consistent outputs. High temperature produces varied, sometimes surprising, sometimes incoherent, results.
- Temperature 0: Same output every time, best for code, data extraction, structured tasks
- Temperature 0.3–0.5: Slightly varied but still focused, good for summarization, Q&A
- Temperature 0.7–1.0: Balanced creativity and coherence, default for most ChatGPT interactions
- Temperature 1.5–2.0: High variance, experimental, useful for brainstorming, risky for accuracy
- OpenAI's API defaults to temperature 1; consumer apps such as ChatGPT and Claude do not publish the values they use
- You cannot set temperature directly in ChatGPT's free interface, but you can approximate it with prompt instructions like 'be precise and consistent'
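For readers who want the formula behind the dial, the standard way to write temperature scaling is shown below. The symbols are introduced here for illustration: z_i is the raw score (logit) the model gives token i, T is the temperature, and p_i is the resulting probability of choosing token i.

$$
p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}
$$

With T = 1 the scores are used as-is. As T shrinks toward 0, the largest score dominates and the choice approaches 'always pick the top token'. As T grows above 1, the differences between scores are compressed and the distribution flattens.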
Control Temperature Through Your Prompt
| Task Type | Ideal Temperature | Why | Example Tool Setting |
|---|---|---|---|
| Extract data from text | 0–0.2 | Needs deterministic, repeatable output | API: temperature=0 |
| Write a product description | 0.6–0.8 | Creative but on-brand | ChatGPT default |
| Brainstorm campaign ideas | 1.0–1.2 | Maximize variety of ideas | API: temperature=1.0–1.2 |
| Generate code | 0–0.3 | Syntax errors spike at high temps | GitHub Copilot: low temp |
| Write a poem or story | 1.0–1.5 | Originality requires variance | API: temperature=1.0–1.5 |
| Summarize a legal document | 0–0.2 | Accuracy over creativity | API: temperature=0 |
Top-P Sampling: The Other Control Nobody Talks About
Alongside temperature, models use a second filter called top-p sampling (also called nucleus sampling). Instead of considering all possible tokens, top-p restricts selection to the smallest set of tokens whose combined probability adds up to p. At top-p 0.9, the model only samples from tokens that together account for 90% of the probability mass, cutting off the long tail of unlikely, potentially nonsensical choices. This is why even at higher temperatures, ChatGPT rarely produces complete gibberish: top-p quietly prunes the worst options before a token is ever sampled.
Temperature and top-p work together, not independently. OpenAI recommends adjusting one or the other, not both simultaneously, when using the API. In practice, most professionals using consumer tools like ChatGPT or Claude never touch these settings directly. But knowing they exist helps explain why products feel different: Claude often seems more measured than GPT-4 Turbo, and Perplexity AI gives consistent factual answers while creative tools feel more alive. Different products ship with different defaults baked in.
- Model calculates raw probability scores for all ~100,000 tokens in its vocabulary
- Top-p filter removes low-probability tokens, keeping only the top probability mass
- Temperature scaling is applied to the remaining token probabilities (the order of these two steps varies by implementation; many apply temperature first)
- The model samples one token from this filtered, scaled distribution
- That token is appended to the context window
- The entire process repeats, now with the new token included as context
- This continues until a stop token is reached or the max token limit is hit
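The top-p filtering step can be sketched on its own. This is a minimal illustration with an invented five-token distribution; `top_p_filter` is a hypothetical name, and real implementations work on tensors over the full vocabulary rather than a small dictionary.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1 again.
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


# Invented distribution for 'The sky is ___'.
candidates = {"blue": 0.50, "clear": 0.30, "dark": 0.15, "purple": 0.04, "banana": 0.01}

print(top_p_filter(candidates, p=0.9))
```

With p = 0.9 the two least likely options are dropped entirely, so no amount of sampling luck can select them.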
| Parameter | What It Controls | Low Value Effect | High Value Effect | Who Sets It |
|---|---|---|---|---|
| Temperature | Randomness of token selection | Predictable, focused | Creative, inconsistent | API users; approximated via prompts |
| Top-P | Size of token candidate pool | Only top likely tokens considered | Wider range of tokens eligible | API users; product teams |
| Max Tokens | Maximum output length | Short, possibly cut-off responses | Longer outputs, higher cost | API users; some UI settings |
| Frequency Penalty | Penalizes repeated tokens | Model may repeat phrases | Forces more varied word choice | API users |
| Presence Penalty | Penalizes any token that has already appeared, regardless of how often | Model stays on original topic | Model explores new angles | API users |
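The two penalties in the table can be sketched as adjustments to the raw token scores. The version below follows the formula OpenAI has described in its API documentation (subtract the frequency penalty once per prior occurrence, and the presence penalty once if the token has appeared at all); other providers may differ, and `apply_penalties` is a hypothetical name.

```python
from collections import Counter


def apply_penalties(
    logits: dict[str, float],
    generated: list[str],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
) -> dict[str, float]:
    """Lower the score of tokens that already appear in the generated text."""
    counts = Counter(generated)
    adjusted = {}
    for token, score in logits.items():
        count = counts[token]
        appeared = 1.0 if count > 0 else 0.0
        adjusted[token] = score - count * frequency_penalty - appeared * presence_penalty
    return adjusted


# 'great' has been used twice already, 'good' not at all.
scores = {"great": 2.0, "good": 1.5}
print(apply_penalties(scores, ["great", "great"], frequency_penalty=0.5, presence_penalty=0.3))
```

In this example the repeated word drops below the fresh one, which is how a positive penalty pushes the model toward more varied wording.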
Why Claude Feels Different From ChatGPT
The Context Window: Your AI's Working Memory
Every token the model generates gets added back into what's called the context window, the total text the model can 'see' at any given moment. This includes your original prompt, any conversation history, and every token the model has already generated in the current session. The model has no memory outside this window. When the conversation exceeds the window limit, older content gets dropped, silently, without warning. This is why long ChatGPT conversations sometimes feel like the model 'forgot' what you said earlier. It literally did.
Context window sizes have grown dramatically. GPT-3.5 shipped with a 4,096-token window (roughly 3,000 words). GPT-4 Turbo expanded this to 128,000 tokens, about 96,000 words, or a short novel. Claude 3 supports up to 200,000 tokens. Gemini 1.5 Pro reached 1 million tokens in testing. Larger windows mean the model can hold more of your document, your conversation history, or your codebase in view simultaneously. But larger windows also cost more per API call, because every token in the window gets processed at inference time.
- GPT-3.5 Turbo: 16,385 tokens (~12,000 words), current default for free ChatGPT
- GPT-4 Turbo: 128,000 tokens (~96,000 words), available on ChatGPT Plus ($20/month)
- Claude 3 Opus: 200,000 tokens (~150,000 words), Anthropic's flagship
- Gemini 1.5 Pro: 1,000,000 tokens in testing, Google's long-context leader
- Tokens ≠ words: English averages ~0.75 words per token; code and some languages use more tokens per word
- Filling the context window doesn't guarantee the model pays equal attention to all of it; early and late content get more weight than middle content in very long windows
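The 'older content gets dropped' behavior can be sketched as a simple trimming function. This is one hypothetical strategy (keep the newest messages that fit), not a description of how any particular product manages its window; the 4-characters-per-token estimate is also an assumption.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return math.ceil(len(text) / 4)


def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages that fit in the budget; drop the oldest."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = [
    "Always answer in formal English.",        # early instruction
    "Here is the first draft of the memo...",
    "Please shorten the second paragraph.",
]
print(fit_to_window(history, max_tokens=20))
```

With a tight budget the early instruction is the first thing to go, which matches the 'forgotten instruction' effect described above.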
The 'Lost in the Middle' Problem
Prompt
I'm going to paste a 15-page market research report. Before I do, here are your instructions: 1. Focus on competitive positioning data only 2. Ignore methodology sections 3. Extract exactly 5 key findings as bullet points 4. Flag any claims that lack supporting data [REPORT STARTS HERE] [...15 pages of content...] Remember: apply the 4 instructions I gave you at the top of this prompt.
AI Response
Understood. Applying your four instructions to the report: **Key Findings (Competitive Positioning Only):** • Market share data shows Brand X holds 34% in the 25–34 demographic, up from 28% in 2022 • Brand Y's pricing strategy undercuts the category average by 18%, no supporting survey data provided ⚠️ • Distribution advantage: Brand X present in 3x more retail locations than nearest competitor • Customer loyalty scores favor Brand X by 12 points (NPS), sourced from 2,400-respondent survey ✓ • Brand Z's product refresh has not yet impacted purchase intent metrics, based on pre-launch data only ⚠️ **Flagged Claims (Insufficient Data):** Brand Y pricing claim (no source cited); Brand Z impact assessment (pre-launch data may not reflect actual market response).
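The prompt pattern in this example, with instructions stated before the document and recalled after it, can be assembled programmatically. The sketch below is a hypothetical helper; the `[REPORT ENDS HERE]` marker is an addition for clarity and does not appear in the prompt above.

```python
def build_sandwich_prompt(instructions: list[str], document: str) -> str:
    """Place instructions before the document and a reminder after it."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(instructions, start=1))
    return (
        "Before I paste the report, here are your instructions:\n"
        f"{numbered}\n\n"
        "[REPORT STARTS HERE]\n"
        f"{document}\n"
        "[REPORT ENDS HERE]\n\n"
        f"Remember: apply the {len(instructions)} instructions "
        "I gave you at the top of this prompt."
    )


prompt = build_sandwich_prompt(
    [
        "Focus on competitive positioning data only",
        "Ignore methodology sections",
        "Extract exactly 5 key findings as bullet points",
        "Flag any claims that lack supporting data",
    ],
    document="...15 pages of content...",
)
print(prompt)
```

Keeping the reminder in a helper like this means it is never forgotten when the document changes.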
Why Output Quality Degrades in Long Conversations
Two things happen as a conversation grows longer. First, older messages start getting truncated to stay within the context window, so the model loses access to early instructions or context you established. Second, even within the window, the model's attention mechanism has to process more tokens on every single generation step, which can dilute focus. What feels like the AI 'getting lazy' or 'going off-track' in long sessions is often a mechanical consequence of context saturation, not a drop in model intelligence.
- Start a new chat session for genuinely new tasks; don't carry unrelated threads
- Re-state critical constraints every 10–15 exchanges in long working sessions
- Use system prompts (in the API, or in tools like ChatGPT's custom instructions) to anchor persistent rules so they are sent with every request instead of scrolling out of the conversation
- Summarize long prior context yourself and paste the summary at the start of a new session
- For document analysis, paste the document fresh in each new session rather than relying on previous session memory
Goal: Observe firsthand how context window saturation causes early instructions to decay, and discover whether anchoring instructions at both ends of a prompt extends their persistence.
1. Open ChatGPT (free or Plus) and start a new conversation.
2. Paste the following text at the top: 'For this entire conversation, always end every response with the phrase CONTEXT STABLE.'
3. Have a normal 5-message exchange on any work topic: ask follow-up questions, request revisions.
4. After 5 exchanges, check: is the model still appending CONTEXT STABLE to responses?
5. Continue the conversation for 10 more exchanges, deliberately asking long questions with detailed context.
6. Check again, and note at which point the model stops following the original instruction.
7. Record the approximate exchange number where the instruction dropped out, then start a fresh session and re-test with the instruction placed at both the beginning AND end of your first message.
Controlling the Output: Temperature, Tokens, and Practical Limits
You now know that AI generates text token by token, sampling from probability distributions shaped by training. That mechanism has direct, practical consequences for how you use tools like ChatGPT, Claude, and Gemini. Three variables (temperature, token limits, and stopping behavior) determine whether you get a tight executive summary or a rambling hallucination. Understanding them turns you from a passive user into someone who actually controls the output.
Temperature: The Creativity Dial
Temperature is a number, typically between 0 and 2, that scales the probability distribution before the model samples. At temperature 0, the model always picks the highest-probability token, so output is deterministic and repetitive. At temperature 1, it samples naturally from the distribution the model learned. Above 1.5, low-probability tokens get a real chance, producing surprising, sometimes incoherent results. Most professional use cases live between 0.2 and 0.9. ChatGPT's default is not published but is commonly estimated at around 0.7. You rarely set this directly in the chat UI, but API users control it precisely.
- Temperature 0: Deterministic, consistent, best for data extraction and classification tasks
- Temperature 0.3–0.5: Focused creativity, good for summaries, reports, structured analysis
- Temperature 0.7–0.9: Balanced, default range for most conversational and writing tasks
- Temperature 1.2+: High variance, useful for brainstorming, risky for factual accuracy
- Temperature does NOT affect what the model knows, only how it samples from what it knows
Mimic Temperature Control in the Chat UI
| Setting | Temperature Range | Best For | Risk |
|---|---|---|---|
| Precise / Factual | 0 – 0.3 | Extraction, classification, Q&A | Repetitive, rigid phrasing |
| Balanced | 0.4 – 0.8 | Summaries, emails, analysis | Occasional mild drift |
| Creative | 0.9 – 1.2 | Brainstorming, copywriting, ideation | Inconsistency, tangents |
| Experimental | 1.3 – 2.0 | Novelty exploration only | Incoherence, hallucination |
Token Limits: Why Outputs Get Cut Off
Every model has a context window measured in tokens: the combined budget for your input and the model's output. GPT-4o supports up to 128,000 tokens of context. Claude 3.5 Sonnet handles 200,000. Gemini 1.5 Pro reaches 1 million. But the output limit is a separate constraint: most models cap generated responses at 4,096 to 8,192 tokens by default, regardless of context window size. When a response gets cut off mid-sentence, you've hit the output token ceiling; it isn't a bug.
- Context window = input tokens + output tokens combined; they share the same budget
- Longer prompts leave less room for the model's response within that shared window
- A 1,000-word document you paste in consumes roughly 1,300–1,500 tokens
- If output cuts off, type 'continue' and the model will usually resume from where it stopped
- For very long tasks, break them into chunks rather than one massive prompt
- API users can set max_tokens explicitly to control response length and cost
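The shared-budget rule in this list can be sketched as a quick check to run before sending a long prompt. The function name is hypothetical, and the example numbers (a 128,000-token window with a 4,096-token output cap) are taken from the figures quoted in this section.

```python
def remaining_output_budget(context_window: int, input_tokens: int, output_cap: int) -> int:
    """Tokens available for the reply: the output cap, or whatever the window has left."""
    if input_tokens >= context_window:
        return 0  # the prompt alone already fills the window
    return min(output_cap, context_window - input_tokens)


# A short prompt: the output cap is the binding limit.
print(remaining_output_budget(128_000, input_tokens=1_500, output_cap=4_096))

# A very long prompt: the window itself becomes the binding limit.
print(remaining_output_budget(128_000, input_tokens=126_000, output_cap=4_096))
```

When the second case applies, splitting the task into chunks is usually more reliable than hoping the reply fits.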
| Model | Context Window | Typical Output Limit | Approx. Cost per 1M Input Tokens |
|---|---|---|---|
| GPT-4o | 128K tokens | 4,096 tokens default | $5.00 |
| GPT-4o mini | 128K tokens | 4,096 tokens default | $0.15 |
| Claude 3.5 Sonnet | 200K tokens | 8,192 tokens | $3.00 |
| Gemini 1.5 Pro | 1M tokens | 8,192 tokens | $3.50 |
| Gemini 1.5 Flash | 1M tokens | 8,192 tokens | $0.075 |
Tokens ≠ Words
Prompt
Summarize the following meeting transcript in exactly 5 bullet points, each under 20 words. Do not exceed this format. [Transcript pasted here]
AI Response
• Client wants pricing proposal by Friday EOD. Sarah owns delivery. • Q3 budget approved at $240K, no further sign-off needed. • Integration timeline pushed to October, dev team capacity constraint. • Competitor pricing flagged as risk. Marcus to run analysis by Wednesday. • Next sync scheduled for Thursday 10am EST, full team required.
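When a prompt sets a strict format like this, the reply can be checked automatically. The sketch below is a hypothetical validator that assumes bullets are marked with '•', as in the response above; it reports problems rather than fixing them.

```python
def check_bullets(response: str, expected_count: int = 5, max_words: int = 20) -> list[str]:
    """Return a list of format problems; an empty list means the reply complies."""
    bullets = [part.strip() for part in response.split("•") if part.strip()]
    problems = []
    if len(bullets) != expected_count:
        problems.append(f"expected {expected_count} bullets, found {len(bullets)}")
    for number, bullet in enumerate(bullets, start=1):
        word_count = len(bullet.split())
        if word_count >= max_words:  # the prompt asks for *under* max_words
            problems.append(f"bullet {number} has {word_count} words")
    return problems


sample_response = (
    "• Client wants pricing proposal by Friday EOD. Sarah owns delivery. "
    "• Q3 budget approved at $240K, no further sign-off needed. "
    "• Integration timeline pushed to October, dev team capacity constraint. "
    "• Competitor pricing flagged as risk. Marcus to run analysis by Wednesday. "
    "• Next sync scheduled for Thursday 10am EST, full team required."
)
print(check_bullets(sample_response))
```

If the list of problems is not empty, the usual next step is to re-send the prompt or ask the model to correct the specific issue.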
Stopping Behavior and Hallucination Risk
The model generates tokens until it hits a stopping condition: a special end-of-sequence token, a max token limit, or a stop string you define. It has no built-in sense of 'I'm done and this is accurate.' It stops because it ran out of budget or reached a termination signal. This is why hallucination is a structural feature, not a bug to be patched. The model is always completing a probable sequence, not retrieving verified facts. When confident-sounding wrong answers appear, the model found a high-probability token path that happened to be false.
Hallucination rates vary significantly by model and task type. GPT-4o hallucinates on factual recall tasks at roughly 3–5% in internal benchmarks, far lower than earlier models, but non-zero. Citation-heavy tasks (asking for specific paper titles, URLs, statistics) carry higher risk. Tasks grounded in content you provide (summarization, extraction, classification) carry far lower risk because the model is working from your text rather than reaching into training memory. Structure your prompts to keep the model anchored to source material whenever accuracy matters.
High-Risk Hallucination Scenarios
Goal: A personalized, single-page reference document you can consult before prompting on high-stakes tasks
1. Open a blank document in Notion, Google Docs, or any notes app, and title it 'AI Output Controls Reference.'
2. Create a section called 'Temperature Equivalents' and write two prompt phrases you'll actually use: one for precise/factual tasks and one for creative/exploratory tasks.
3. Create a section called 'Token Budget Rules' and record the context window size for the AI tool you use most (check the provider's current documentation).
4. Write one sentence describing what to do when a response gets cut off mid-output.
5. Create a section called 'High Hallucination Risk Topics' and list three categories of information you will always verify externally before using in your work.
6. Add a 'Grounding Reminder' note: write a one-sentence rule you'll follow when accuracy is critical (e.g., 'Paste the source document and ask the model to extract, never ask it to recall from memory').
Quick-Reference Cheat Sheet
- AI generates text one token at a time, sampling from probability distributions, not retrieving stored answers
- Temperature controls randomness: lower = more predictable, higher = more varied
- Context window = shared budget for input + output tokens; longer prompts leave less room for responses
- 1,000 words ≈ 1,333 tokens in English; non-English languages consume more
- Output cutoffs mean you've hit a token limit; type 'continue' to resume
- Hallucination is structural: the model always generates probable sequences, not verified facts
- Grounded tasks (summarize THIS document) hallucinate far less than memory-recall tasks
- Perplexity and Bing Copilot reduce hallucination via live search, but aren't immune
- You can mimic temperature settings through prompt language even without API access
Key Takeaways
- Temperature is the single most important variable controlling output character, even if you can only influence it through prompt language
- Token limits are a hard constraint; understanding them prevents confusion and lets you plan prompts for long tasks
- Hallucination is not random failure; it's a predictable risk that spikes on specific task types you can learn to avoid
- Grounding the model in provided content (rather than asking it to recall) is the most reliable accuracy strategy available today
- The gap between context window size and output limit matters; large context windows don't automatically mean long responses
