Lesson 6 of 11

Prediction One Token at a Time

~15 min read · Last reviewed May 2026

How AI Generates Text: One Word at a Time

Every time ChatGPT answers a question, writes an email, or explains a concept, it's doing one thing repeatedly: predicting the next most likely word. That's it. No understanding, no consciousness, no lookup table of facts, just a sophisticated probability engine trained on hundreds of billions of words. This sounds reductive until you realize the prediction process requires modeling grammar, logic, context, tone, and world knowledge simultaneously. Knowing exactly how this works makes you a better prompter, helps you spot AI errors before they cost you, and tells you which tasks AI genuinely excels at versus where it reliably fails.

7 Things You Need to Know About AI Text Generation

AI generates text token by token, not word by word; a token is roughly 3-4 characters, or about 0.75 words, on average.
Each token is chosen based on probability scores across the entire model vocabulary (roughly 100,000 possible tokens in GPT-4).
The model reads your full prompt before generating a single output token, so context shapes every prediction.
Temperature controls randomness: low temperature (0.1) = predictable and precise, high temperature (0.9) = creative and varied.
Models have a context window, a hard limit on how much text they can 'see' at once. GPT-4 Turbo: 128,000 tokens. Claude 3.5: 200,000 tokens.
AI has no memory between separate conversations unless the product explicitly stores and re-injects it.
Hallucinations happen because the model is optimizing for plausible text, not verified truth; it doesn't know what it doesn't know.

Tokens: The Real Unit of AI Language

Before a model reads a single character of your prompt, it converts everything into tokens. A token is a chunk of text, sometimes a full word, sometimes part of one. 'Running' is one token. 'Unbelievable' might split into 'Un', 'believ', 'able', three tokens. Numbers, punctuation, and spaces each consume tokens too. This matters practically: OpenAI charges by token, and GPT-4 costs $0.03 per 1,000 input tokens and $0.06 per 1,000 output tokens as of early 2024. A 500-word document is roughly 650-700 tokens. A long legal contract at 10,000 words runs around 13,000 tokens, about $0.39 just to read it.

Tokenization also explains some quirky AI behavior. Ask ChatGPT to count the letter 'r' in 'strawberry' and it often gets it wrong, because 'strawberry' may tokenize in a way that obscures individual characters from the model's perspective. The model never sees raw letters; it sees token IDs. Rare words, technical jargon, and non-English text tend to tokenize inefficiently, meaning they consume more tokens and cost more to process. Writing prompts in plain, common language isn't just good communication practice; it's also more economical and often produces better results.

1 token ≈ 4 characters or 0.75 English words
100 tokens ≈ 75 words ≈ a short paragraph
1,000 tokens ≈ 750 words ≈ a 1.5-page memo
Common English words = 1 token each ('the', 'and', 'is')
Long or rare words = 2-4 tokens ('cryptocurrency', 'pharmaceutical')
Non-Latin scripts (Chinese, Arabic) = 1-2 characters per token, making them 2-3x more expensive to process
Code is token-dense: a 50-line Python script can easily hit 400-600 tokens
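
The rules of thumb above, combined with the early-2024 GPT-4 input price quoted earlier ($0.03 per 1,000 tokens), are enough for a back-of-the-envelope estimate. Here is a minimal sketch; the function names are made up for illustration, and the 0.75 words-per-token ratio is only an average for English, so a real tokenizer will give somewhat different counts.

```python
import math

WORDS_PER_TOKEN = 0.75          # rule of thumb for English text (an approximation)
GPT4_INPUT_USD_PER_1K = 0.03    # early-2024 GPT-4 input price quoted in this lesson


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from a word count (English prose only)."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_input_cost(word_count: int, usd_per_1k: float = GPT4_INPUT_USD_PER_1K) -> float:
    """Approximate cost in USD of sending a document as input."""
    return estimate_tokens(word_count) * usd_per_1k / 1000


if __name__ == "__main__":
    for words in (500, 10_000):
        tokens = estimate_tokens(words)
        print(f"{words} words -> about {tokens} tokens, ${estimate_input_cost(words):.2f} to read")
```

Treat the result as a planning number, not a bill: rare words, code, and non-English text will push the real count higher.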

Check Your Token Count Before Sending Long Prompts

OpenAI's free Tokenizer tool at platform.openai.com/tokenizer lets you paste any text and see exactly how many tokens it uses. Use it before sending large documents to the API to avoid surprise costs. For ChatGPT Plus users ($20/month), token costs are bundled, but the context window limit still applies. Hitting the limit mid-conversation causes the model to 'forget' earlier content, which silently degrades response quality.

Model	Context Window	Approximate Page Limit	Best For
GPT-3.5 Turbo	16,385 tokens	~22 pages	Fast, cheap tasks, summaries, drafts
GPT-4 Turbo	128,000 tokens	~170 pages	Long documents, complex reasoning
GPT-4o	128,000 tokens	~170 pages	Multimodal tasks, speed + quality balance
Claude 3.5 Sonnet	200,000 tokens	~270 pages	Book-length analysis, legal/financial docs
Gemini 1.5 Pro	1,000,000 tokens	~1,350 pages	Entire codebases, very long research corpora
Llama 3 (Meta)	8,192 tokens	~11 pages	Local/private deployment, shorter tasks

Context window comparison across major models (2024). Page estimates assume roughly 750 tokens (about 560 words) per page.

Probability, Temperature, and Why AI Isn't Deterministic

After tokenizing your prompt, the model assigns a probability score to every possible next token in its vocabulary (roughly 100,000 of them in GPT-4). 'The sky is ___': 'blue' might score 34%, 'clear' 18%, 'dark' 12%, and so on down to absurd options with 0.001% probability. The model doesn't always pick the top scorer. A parameter called temperature adjusts how the model samples from this distribution. At temperature 0, it always picks the highest-probability token: maximally predictable, sometimes robotic. At temperature 1.0, lower-probability tokens get a real chance, producing more varied, creative output. Most production applications set temperature between 0.3 and 0.7.

This is why identical prompts produce different outputs on different runs unless temperature is explicitly set to zero (and even then, small variations can occur). OpenAI does not publish the temperature used in the ChatGPT app; a moderate value, often estimated at around 0.7, would explain why it 'feels' natural and varied rather than mechanical. Claude often feels more precise and measured, which may reflect training choices as much as sampling settings. When you need consistency (extracting data from invoices, classifying customer feedback, generating structured JSON), lower temperature is your friend. When you need creative ideation, higher temperature surfaces unexpected angles. You can't adjust temperature directly in ChatGPT's interface, but you can prompt for it: 'Give me 5 distinct, varied approaches' forces the model toward diversity.

Model tokenizes your full prompt into token IDs
Token IDs pass through the neural network's attention layers (more on this in Part 2)
Network outputs a probability distribution over every token in the vocabulary (~100,000 for GPT-4)
Temperature parameter reshapes this distribution, flattening it (high temp) or sharpening it (low temp)
One token is sampled from the distribution
That token is appended to the context, and the process restarts
Generation continues until the model produces a stop token or hits the maximum output length
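
The sampling steps in the middle of that list can be sketched in a few lines. This is a toy illustration, not how any particular provider implements it: the four-word vocabulary and the raw scores are invented, and the function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be greater than 0."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature, rng):
    """Pick one token index; temperature 0 is treated as greedy (always the top score)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy vocabulary and made-up scores for the prompt "The sky is"
vocab = ["blue", "clear", "dark", "falling"]
logits = [2.0, 1.4, 1.0, -1.0]

# Lower temperature sharpens the distribution; higher temperature flattens it.
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 2) for p in probs])

rng = random.Random(0)
print(vocab[sample_next_token(logits, 0, rng)])   # greedy pick: "blue"
```

In a real model the chosen token would be appended to the context and the whole calculation would run again for the next position.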

Temperature Setting	Behavior	Use Case	Risk
0.0 – 0.2	Highly deterministic, repetitive phrasing	Data extraction, classification, structured output	Formulaic, may miss nuance
0.3 – 0.5	Consistent but natural-sounding	Summarization, professional emails, reports	Occasionally flat tone
0.6 – 0.8	Balanced creativity and coherence	Marketing copy, explanations, brainstorming	Occasional unexpected tangents
0.9 – 1.0	High variance, surprising outputs	Creative writing, ideation, poetry	May produce incoherent results
Above 1.0	Chaotic, unreliable	Rarely used in production	High hallucination risk

Temperature settings and their practical effects. Most consumer AI products like ChatGPT and Claude don't expose this setting directly in their chat interfaces.

Temperature vs. Top-P: Two Knobs, One Goal

Professional AI platforms like OpenAI's API and Anthropic's API expose two sampling parameters: temperature and top-p (nucleus sampling). Top-p limits token selection to the smallest set of tokens whose combined probability exceeds a threshold, so top-p 0.9 means 'only consider tokens from the top 90% of probability mass.' Most practitioners adjust one or the other, not both. For business applications, temperature is more intuitive to tune. Top-p is preferred when you want to eliminate very low-probability 'wild card' tokens entirely.

Controlling Output Style via Prompt When You Can't Set Temperature

Prompt

You are a precise data analyst. Extract the following information from this invoice and return it ONLY as a JSON object with no additional commentary. Be exact: do not infer or estimate any values not explicitly stated. Invoice text: 'Invoice #4421, dated March 3 2024, from Apex Supplies Ltd. Total due: $3,847.50. Payment terms: Net 30. Line items: Office furniture $2,200, Delivery $147.50, Installation $1,500.'

AI Response

{ "invoice_number": "4421", "date": "2024-03-03", "vendor": "Apex Supplies Ltd", "total_due": 3847.50, "payment_terms": "Net 30", "line_items": [ { "description": "Office furniture", "amount": 2200.00 }, { "description": "Delivery", "amount": 147.50 }, { "description": "Installation", "amount": 1500.00 } ] }
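
Even with a tightly worded prompt, it is worth checking a reply like this in code before trusting it. The sketch below parses the response shown above and runs two basic checks: every expected field is present, and the line items add up to the stated total. The `validate_invoice` function and its list of required keys are assumptions made for this example, not part of any tool.

```python
import json

REQUIRED_KEYS = {"invoice_number", "date", "vendor", "total_due", "payment_terms", "line_items"}


def validate_invoice(raw: str) -> dict:
    """Parse the model's reply and raise ValueError if it fails basic checks."""
    data = json.loads(raw)                        # invalid JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    line_total = sum(item["amount"] for item in data["line_items"])
    if abs(line_total - data["total_due"]) > 0.005:
        raise ValueError(f"line items sum to {line_total}, but total_due is {data['total_due']}")
    return data


# The AI response from the example above, pasted as a string.
reply = """
{ "invoice_number": "4421", "date": "2024-03-03", "vendor": "Apex Supplies Ltd",
  "total_due": 3847.50, "payment_terms": "Net 30",
  "line_items": [ { "description": "Office furniture", "amount": 2200.00 },
                  { "description": "Delivery", "amount": 147.50 },
                  { "description": "Installation", "amount": 1500.00 } ] }
"""

invoice = validate_invoice(reply)
print(invoice["vendor"], invoice["total_due"])
```

A check like this catches a reply with commentary wrapped around the JSON, a dropped field, or a total that doesn't match the line items.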

Hallucinations: Why AI Confidently Gets Things Wrong

Hallucination is the term for when an AI model generates text that is fluent, confident, and factually wrong. It's not a bug in the traditional sense; it's a direct consequence of how text generation works. The model is always optimizing for the most plausible next token, not the most truthful one. It has no internal fact-checker. When asked about an obscure topic, a real person, or a specific statistic, the model fills gaps with whatever pattern fits the context. A 2023 study by Stanford found hallucination rates in GPT-3.5 as high as 27% on medical question-answering tasks. GPT-4 reduced this significantly, but never to zero.

Hallucinations cluster around specific failure modes. Fake citations are notorious: ask for academic sources and the model will generate plausible-looking author names, journal titles, and DOIs that don't exist. Specific numbers (revenue figures, dates, statistics) are high-risk. Names of real people combined with specific claims ('In her 2019 Forbes interview, [executive] said...') frequently contain fabricated quotes. The model has no reliable way to signal uncertainty; it generates 'I'm not sure' as a learned phrase, not as an actual epistemic flag. Perplexity AI partially addresses this by grounding responses in live web search results, which is why it's a better choice than ChatGPT for factual research tasks.

Historical Record (2023): New York lawyer

A New York lawyer was sanctioned in 2023 after submitting a court brief containing six AI-generated case citations, all fabricated. ChatGPT had invented the case names, courts, and legal holdings with complete confidence.

This incident demonstrates how AI hallucinations can have real legal consequences when models generate plausible-looking but entirely false information.

Content Type	Hallucination Risk	Why	Mitigation
Creative writing, fiction	Very Low	No factual standard to violate	None needed
Summarizing text you provide	Low	Model works from given context	Check for omissions
General explanations of concepts	Low–Medium	Well-represented in training data	Verify key claims
Specific statistics and numbers	High	Numbers are easy to plausibly fabricate	Cross-check primary sources
Academic or legal citations	Very High	Model pattern-matches citation formats	Never use without verification
Recent events (post-training cutoff)	Very High	Model has no data; must confabulate	Use Perplexity or Bing AI instead
Niche professional knowledge	High	Sparse training data = more guesswork	Consult domain expert

Hallucination risk by content type. Training cutoffs: GPT-4o (October 2023), Claude 3.5 (early 2024), Gemini 1.5 (early 2024). Check each provider's documentation for current values.

Temperature, Sampling, and Why the Same Prompt Gets Different Answers

You now know that AI picks the next token by calculating probabilities across its entire vocabulary. But there's a second layer of decision-making happening simultaneously, one that controls how adventurous or conservative those picks are. This is where temperature and sampling strategies come in. They don't change what the model knows; they change how it chooses from what it knows. Understanding this distinction separates professionals who get consistent, reliable outputs from those who wonder why ChatGPT seems unpredictable.

Temperature: The Randomness Dial

Temperature is a numerical setting, typically between 0 and 2, applied after the model calculates its token probabilities. At temperature 0, the model always picks the highest-probability token. Deterministic, predictable, occasionally repetitive. At temperature 1, it samples from the probability distribution as-is. At temperature 2, it flattens the distribution, making low-probability tokens far more likely to get selected. Think of it like a volume dial for creativity. Low temperature produces precise, consistent outputs. High temperature produces varied, sometimes surprising, sometimes incoherent, results.
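
The dial can be written as a single formula. This is the standard temperature-scaled softmax, and the symbols are notation introduced here for illustration: z_i is the model's raw score for token i, T is the temperature, and p_i is the resulting probability of picking that token.

```latex
p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}
```

At T = 1 the scores are used as-is. As T shrinks toward 0, the largest score dominates and the pick becomes effectively deterministic (T = 0 itself is handled as 'always take the top token', since the formula would divide by zero). As T grows above 1, the differences between scores shrink and the distribution flattens.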

Temperature 0: Same output every time, best for code, data extraction, structured tasks
Temperature 0.3–0.5: Slightly varied but still focused, good for summarization, Q&A
Temperature 0.7–1.0: Balanced creativity and coherence, default for most ChatGPT interactions
Temperature 1.5–2.0: High variance, experimental, useful for brainstorming, risky for accuracy
OpenAI's API defaults to temperature 1, and so does Anthropic's; the settings used inside the ChatGPT and Claude chat apps are not published
You cannot set temperature directly in ChatGPT's free interface, but you can approximate it with prompt instructions like 'be precise and consistent'

Control Temperature Through Your Prompt

When using ChatGPT or Claude without API access, add explicit instructions to mimic temperature effects. 'Give me exactly one answer, no alternatives' pushes toward low-temperature behavior. 'Give me five very different interpretations' mimics higher temperature. This works because your instructions change the probability distribution the model produces, even though the sampling settings themselves stay the same.

Task Type	Ideal Temperature	Why	Example Tool Setting
Extract data from text	0–0.2	Needs deterministic, repeatable output	API: temperature=0
Write a product description	0.6–0.8	Creative but on-brand	ChatGPT default
Brainstorm campaign ideas	1.0–1.2	Maximize variety of ideas	API: temperature=1.0–1.2 (where the provider allows values above 1)
Generate code	0–0.3	Syntax errors spike at high temps	GitHub Copilot: low temp
Write a poem or story	1.0–1.5	Originality requires variance	API: temperature=1.0–1.5 (where the provider allows values above 1)
Summarize a legal document	0–0.2	Accuracy over creativity	API: temperature=0

Temperature settings mapped to common professional tasks

Top-P Sampling: The Other Control Nobody Talks About

Alongside temperature, models use a second filter called top-p sampling (also called nucleus sampling). Instead of considering all possible tokens, top-p restricts selection to the smallest set of tokens whose combined probability adds up to p. At top-p 0.9, the model only samples from tokens that together account for 90% of the probability mass, cutting off the long tail of unlikely, potentially nonsensical choices. This is why even at higher temperatures, ChatGPT rarely produces complete gibberish: top-p is quietly pruning the worst options before temperature even applies.

Temperature and top-p work together, not independently. OpenAI recommends adjusting one or the other, not both simultaneously, when using the API. In practice, most professionals using consumer tools like ChatGPT or Claude never touch these settings directly. But knowing they exist explains a lot: why Claude tends to feel more measured than GPT-4 Turbo, why Perplexity AI gives consistent factual answers while creative tools feel more alive. Different products ship with different defaults baked in.

Model calculates raw probability scores for all ~100,000 tokens in its vocabulary
Top-p filter removes low-probability tokens, keeping only the top probability mass
Temperature scaling is applied to the remaining token probabilities
The model samples one token from this filtered, scaled distribution
That token is appended to the context window
The entire process repeats, now with the new token included as context
This continues until a stop token is reached or the max token limit is hit
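
The top-p step in that list can be sketched as a small filter. This is a toy illustration with an invented five-word vocabulary and made-up probabilities; the function name is hypothetical, and the order in which real systems apply temperature and top-p varies between implementations.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p,
    then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}


vocab = ["blue", "clear", "dark", "green", "falling"]
probs = [0.50, 0.25, 0.17, 0.05, 0.03]   # made-up probabilities for illustration

nucleus = top_p_filter(probs, top_p=0.9)
print({vocab[i]: round(p, 3) for i, p in nucleus.items()})   # only the top three survive

rng = random.Random(42)
choice = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(vocab[choice])
```

With top-p at 0.9, 'green' and 'falling' can never be picked, no matter how the remaining probabilities are reshaped afterwards.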

Parameter	What It Controls	Low Value Effect	High Value Effect	Who Sets It
Temperature	Randomness of token selection	Predictable, focused	Creative, inconsistent	API users; approximated via prompts
Top-P	Size of token candidate pool	Only top likely tokens considered	Wider range of tokens eligible	API users; product teams
Max Tokens	Maximum output length	Short, possibly cut-off responses	Longer outputs, higher cost	API users; some UI settings
Frequency Penalty	Penalizes repeated tokens	Model may repeat phrases	Forces more varied word choice	API users
Presence Penalty	Penalizes already-used topics	Model stays on original topic	Model explores new angles	API users

Key generation parameters and their practical effects on AI output
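
The two penalty rows are easier to see in code. The sketch below follows the general shape OpenAI has described for its API: the frequency penalty grows with each repeat of a token, while the presence penalty is a flat deduction applied once a token has appeared at all. Other providers may implement these differently, and the function name and toy scores here are assumptions for illustration.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the raw scores of tokens that already appear in the generated output."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty   # scales with number of repeats
        adjusted[token_id] -= presence_penalty            # flat, applied once if seen at all
    return adjusted


# Toy example: three tokens with equal scores; token 0 was used twice, token 2 once.
scores = [1.0, 1.0, 1.0]
already_generated = [0, 0, 2]

print(apply_penalties(scores, already_generated, frequency_penalty=0.5, presence_penalty=0.2))
```

Token 1, which has not been used yet, keeps its score, so it becomes relatively more likely than the two tokens that have already appeared.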

Why Claude Feels Different From ChatGPT

Anthropic and OpenAI ship their models with different default parameters and different training emphases. Claude (Anthropic) is trained with a heavier focus on harmlessness and tends toward more cautious, structured responses, partly a training choice, partly default parameter settings. GPT-4 Turbo tends to be more expansive and willing to explore edge cases. Neither is universally better; they're optimized for different use profiles. Knowing this helps you pick the right tool for a given task rather than assuming one AI is smarter than another.

The Context Window: Your AI's Working Memory

Every token the model generates gets added back into what's called the context window, the total text the model can 'see' at any given moment. This includes your original prompt, any conversation history, and every token the model has already generated in the current session. The model has no memory outside this window. When the conversation exceeds the window limit, older content gets dropped, silently, without warning. This is why long ChatGPT conversations sometimes feel like the model 'forgot' what you said earlier. It literally did.
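
Here is a minimal sketch of why early messages vanish. Products don't document exactly how they trim history, so this assumes the simplest possible strategy: keep the newest messages that fit and drop everything older. The function names and the four-characters-per-token estimate are assumptions for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def fit_to_window(messages, max_tokens):
    """Keep the most recent messages that fit the budget; older ones are dropped."""
    kept, used = [], 0
    for message in reversed(messages):            # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break                                 # everything older is lost, with no warning
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = [
    "Always answer in formal British English.",   # the early instruction
    "Here is the first draft of my client email, please tighten it up.",
    "Now make the second paragraph shorter and add a clear call to action.",
]

print(fit_to_window(history, max_tokens=40))      # the early instruction no longer fits
```

Nothing in the trimmed list tells the model that an instruction used to be there, which is why the drift feels sudden from the user's side.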

Context window sizes have grown dramatically. GPT-3.5 shipped with a 4,096-token window (roughly 3,000 words). GPT-4 Turbo expanded this to 128,000 tokens, about 96,000 words, or a short novel. Claude 3 supports up to 200,000 tokens. Gemini 1.5 Pro reached 1 million tokens in testing. Larger windows mean the model can hold more of your document, your conversation history, or your codebase in view simultaneously. But larger windows also cost more per API call, because every token in the window gets processed at inference time.

GPT-3.5 Turbo: 16,385 tokens (~12,000 words), current default for free ChatGPT
GPT-4 Turbo: 128,000 tokens (~96,000 words), available on ChatGPT Plus ($20/month)
Claude 3 Opus: 200,000 tokens (~150,000 words), Anthropic's flagship
Gemini 1.5 Pro: 1,000,000 tokens in testing, Google's long-context leader
Tokens ≠ words: English averages ~0.75 words per token; code and some other languages use more tokens per word
Filling the context window doesn't guarantee the model pays equal attention to all of it; early and late content gets more weight than middle content in very long windows

The 'Lost in the Middle' Problem

Research from Stanford (2023) showed that large language models perform significantly worse at retrieving information placed in the middle of a very long context window compared to information at the beginning or end. If you're pasting a 50-page document and asking questions about it, critical details buried in the middle may effectively be invisible to the model. Practical fix: put your most important instructions and key facts at the start or end of your prompt, not buried in the middle.
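
If you assemble prompts in code, the fix is easy to automate. This is a minimal sketch with a hypothetical `build_prompt` helper: it places the instructions before the document and repeats them after it, so neither copy sits in the middle. The marker text is arbitrary.

```python
def build_prompt(instructions, document):
    """Wrap a long document with the same numbered instructions at the start and the end."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(instructions, start=1))
    return (
        "Instructions:\n" + numbered + "\n\n"
        "[DOCUMENT STARTS HERE]\n" + document + "\n[DOCUMENT ENDS HERE]\n\n"
        "Reminder: apply these instructions to the document above:\n" + numbered
    )


prompt = build_prompt(
    ["Focus on competitive positioning data only", "Ignore methodology sections"],
    "...long report text goes here...",
)
print(prompt)
```

Repeating the instructions costs a few extra tokens, which is usually a small price next to the length of the document itself.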

Using the Context Window Strategically: Long Document Analysis

Prompt

I'm going to paste a 15-page market research report. Before I do, here are your instructions:
1. Focus on competitive positioning data only
2. Ignore methodology sections
3. Extract exactly 5 key findings as bullet points
4. Flag any claims that lack supporting data
[REPORT STARTS HERE]
[...15 pages of content...]
Remember: apply the 4 instructions I gave you at the top of this prompt.

AI Response

Understood. Applying your four instructions to the report:
**Key Findings (Competitive Positioning Only):**
• Market share data shows Brand X holds 34% in the 25–34 demographic, up from 28% in 2022
• Brand Y's pricing strategy undercuts the category average by 18%, no supporting survey data provided ⚠️
• Distribution advantage: Brand X present in 3x more retail locations than nearest competitor
• Customer loyalty scores favor Brand X by 12 points (NPS), sourced from 2,400-respondent survey ✓
• Brand Z's product refresh has not yet impacted purchase intent metrics, based on pre-launch data only ⚠️
**Flagged Claims (Insufficient Data):** Brand Y pricing claim (no source cited); Brand Z impact assessment (pre-launch data may not reflect actual market response).

Why Output Quality Degrades in Long Conversations

Two things happen as a conversation grows longer. First, older messages start getting truncated to stay within the context window, so the model loses access to early instructions or context you established. Second, even within the window, the model's attention mechanism has to process more tokens on every single generation step, which can dilute focus. What feels like the AI 'getting lazy' or 'going off-track' in long sessions is often a mechanical consequence of context saturation, not a drop in model intelligence.

Start a new chat session for genuinely new tasks; don't carry unrelated threads
Re-state critical constraints every 10–15 exchanges in long working sessions
Use system prompts (in the API, or through tools like ChatGPT's custom instructions) to anchor persistent rules so they are re-sent with every request instead of scrolling out of the conversation
Summarize long prior context yourself and paste the summary at the start of a new session
For document analysis, paste the document fresh in each new session rather than relying on previous session memory

Map Your Context Window Usage

Goal: Observe firsthand how context window saturation causes early instructions to decay, and discover whether anchoring instructions at both ends of a prompt extends their persistence.

1. Open ChatGPT (free or Plus) and start a new conversation.
2. Paste the following text at the top: 'For this entire conversation, always end every response with the phrase CONTEXT STABLE.'
3. Have a normal 5-message exchange on any work topic: ask follow-up questions, request revisions.
4. After 5 exchanges, check: is the model still appending CONTEXT STABLE to responses?
5. Continue the conversation for 10 more exchanges, deliberately asking long questions with detailed context.
6. Check again, and note at which point the model stops following the original instruction.
7. Record the approximate exchange number where the instruction dropped out, then start a fresh session and re-test with the instruction placed at both the beginning AND end of your first message.

Controlling the Output: Temperature, Tokens, and Practical Limits

You now know that AI generates text token by token, sampling from probability distributions shaped by training. That mechanism has direct, practical consequences for how you use tools like ChatGPT, Claude, and Gemini. Three variables (temperature, token limits, and stopping behavior) determine whether you get a tight executive summary or a rambling hallucination. Understanding them turns you from a passive user into someone who actually controls the output.

Temperature: The Creativity Dial

Temperature is a number, typically between 0 and 2, that scales the probability distribution before the model samples. At temperature 0, the model always picks the highest-probability token, so output is deterministic and repetitive. At temperature 1, it samples naturally from the distribution the model learned. Above 1.5, low-probability tokens get a real chance, producing surprising, sometimes incoherent results. Most professional use cases live between 0.2 and 0.9. ChatGPT's default is not published, but it is often estimated at around 0.7. You rarely set this directly in the chat UI, but API users control it precisely.

Temperature 0: Deterministic, consistent, best for data extraction and classification tasks
Temperature 0.3–0.5: Focused creativity, good for summaries, reports, structured analysis
Temperature 0.7–0.9: Balanced, default range for most conversational and writing tasks
Temperature 1.2+: High variance, useful for brainstorming, risky for factual accuracy
Temperature does NOT affect what the model knows, only how it samples from what it knows

Mimic Temperature Control in the Chat UI

You can't set temperature directly in ChatGPT's standard interface, but your prompt language can do a similar job. Write 'give me only the single most likely answer' to push toward low-temperature behavior. Write 'give me five unusual angles I probably haven't considered' to push toward high-temperature exploration. The model responds to intent encoded in words, not just sliders.

Setting	Temperature Range	Best For	Risk
Precise / Factual	0 – 0.3	Extraction, classification, Q&A	Repetitive, rigid phrasing
Balanced	0.4 – 0.8	Summaries, emails, analysis	Occasional mild drift
Creative	0.9 – 1.2	Brainstorming, copywriting, ideation	Inconsistency, tangents
Experimental	1.3 – 2.0	Novelty exploration only	Incoherence, hallucination

Temperature ranges and their practical trade-offs across common professional tasks

Token Limits: Why Outputs Get Cut Off

Every model has a context window measured in tokens: the combined budget for your input and the model's output. GPT-4o supports up to 128,000 tokens of context. Claude 3.5 Sonnet handles 200,000. Gemini 1.5 Pro reaches 1 million. But the output limit is a separate constraint: most models cap generated responses at 4,096 to 8,192 tokens by default, regardless of context window size. When a response gets cut off mid-sentence, you've hit the output token ceiling; it isn't a bug.

Context window = input tokens + output tokens combined; they share the same budget
Longer prompts leave less room for the model's response within that shared window
A 1,000-word document you paste in consumes roughly 1,300–1,500 tokens
If output cuts off, type 'continue' and the model resumes from where it stopped
For very long tasks, break them into chunks rather than one massive prompt
API users can set max_tokens explicitly to control response length and cost
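
The two limits in this list interact, and a couple of lines of arithmetic make that concrete. This is a sketch with a hypothetical helper name; the numbers plugged in are the GPT-4o figures quoted in this lesson (128,000-token window, 4,096-token default output cap).

```python
def remaining_output_budget(context_window, prompt_tokens, output_cap):
    """Tokens the model can still generate, limited by both the shared window and the output cap."""
    room_in_window = max(0, context_window - prompt_tokens)
    return min(room_in_window, output_cap)


GPT4O_WINDOW = 128_000   # context window quoted in this lesson
GPT4O_OUTPUT_CAP = 4_096 # default output limit quoted in this lesson

# A short prompt: the output cap is the binding limit.
print(remaining_output_budget(GPT4O_WINDOW, 1_400, GPT4O_OUTPUT_CAP))

# A prompt that nearly fills the window: the window itself becomes the limit.
print(remaining_output_budget(GPT4O_WINDOW, 126_000, GPT4O_OUTPUT_CAP))
```

For most everyday prompts the output cap is what cuts a response short; the shared window only becomes the constraint when you paste in very large documents.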

Model	Context Window	Typical Output Limit	Approx. Cost per 1M Input Tokens
GPT-4o	128K tokens	4,096 tokens default	$5.00
GPT-4o mini	128K tokens	4,096 tokens default	$0.15
Claude 3.5 Sonnet	200K tokens	8,192 tokens	$3.00
Gemini 1.5 Pro	1M tokens	8,192 tokens	$3.50
Gemini 1.5 Flash	1M tokens	8,192 tokens	$0.075

Context windows and costs as of mid-2025; check provider pricing pages for current rates

Tokens ≠ Words

A useful rule of thumb: 1 token ≈ 0.75 words in English. So 1,000 words ≈ 1,333 tokens. Technical content, code, and non-English languages tokenize differently. German and Japanese, for example, use more tokens per word than English because the tokenizer was trained on predominantly English text. This matters when you're pasting large documents and watching your context budget.

Controlling Output Length Explicitly

Prompt

Summarize the following meeting transcript in exactly 5 bullet points, each under 20 words. Do not exceed this format. [Transcript pasted here]

AI Response

• Client wants pricing proposal by Friday EOD. Sarah owns delivery.
• Q3 budget approved at $240K, no further sign-off needed.
• Integration timeline pushed to October, dev team capacity constraint.
• Competitor pricing flagged as risk. Marcus to run analysis by Wednesday.
• Next sync scheduled for Thursday 10am EST, full team required.

Stopping Behavior and Hallucination Risk

The model generates tokens until it hits a stopping condition: a special end-of-sequence token, a max token limit, or a stop string you define. It has no built-in sense of 'I'm done and this is accurate.' It stops because it ran out of budget or reached a termination signal. This is why hallucination is a structural feature, not a bug to be patched. The model is always completing a probable sequence, not retrieving verified facts. When confident-sounding wrong answers appear, the model found a high-probability token path that happened to be false.
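
The three stopping conditions can be shown with a toy generation loop. Everything here is an assumption for illustration: the `generate` function, the `<eos>` marker, and the scripted stand-in for a model. Note that none of the exit paths checks whether the text is true.

```python
def generate(next_token, prompt_tokens, max_tokens, eos="<eos>", stop=None):
    """Run a toy generation loop and report why it stopped.

    next_token is any function that maps the current token list to the next token.
    """
    output = []
    while True:
        if len(output) >= max_tokens:
            return output, "max_tokens"           # ran out of budget
        token = next_token(prompt_tokens + output)
        if token == eos:
            return output, "end_of_sequence"      # model emitted its end marker
        output.append(token)
        # This sketch keeps the stop string in the output for simplicity.
        if stop is not None and "".join(output).endswith(stop):
            return output, "stop_string"


# A scripted stand-in for a model: it replays a fixed list of tokens.
script = iter(["The", " invoice", " total", " is", " $3,847.50", ".", "<eos>"])
tokens, reason = generate(lambda context: next(script), ["Total", "?"], max_tokens=50)
print("".join(tokens), "|", reason)
```

Swap in a smaller `max_tokens` and the same script gets cut off mid-sentence, which is exactly what a truncated chat response looks like from the inside.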

Hallucination rates vary significantly by model and task type. GPT-4o hallucinates on factual recall tasks at roughly 3–5% in internal benchmarks, far lower than earlier models, but non-zero. Citation-heavy tasks (asking for specific paper titles, URLs, statistics) carry higher risk. Tasks grounded in content you provide (summarization, extraction, classification) carry far lower risk because the model is pattern-matching against your text, not reaching into training memory. Structure your prompts to keep the model anchored to source material whenever accuracy matters.

High-Risk Hallucination Scenarios

Confident-sounding wrong answers are most common in these categories: specific statistics without a source provided, named URLs and hyperlinks, people's credentials or publication histories, legal citations, drug dosages, and recent events after the model's training cutoff. Always verify these externally. Perplexity AI and Bing Copilot reduce this risk by grounding answers in live web search, but they are not immune.

Build Your Personal AI Output Cheat Sheet

Goal: A personalized, single-page reference document you can consult before prompting on high-stakes tasks

1. Open a blank document in Notion, Google Docs, or any notes app, and title it 'AI Output Controls Reference.'
2. Create a section called 'Temperature Equivalents' and write two prompt phrases you'll actually use: one for precise/factual tasks and one for creative/exploratory tasks.
3. Create a section called 'Token Budget Rules' and record the context window size for the AI tool you use most (check the provider's current documentation).
4. Write one sentence describing what to do when a response gets cut off mid-output.
5. Create a section called 'High Hallucination Risk Topics' and list three categories of information you will always verify externally before using in your work.
6. Add a 'Grounding Reminder' note: write a one-sentence rule you'll follow when accuracy is critical (e.g., 'Paste the source document and ask the model to extract, never ask it to recall from memory').

Quick-Reference Cheat Sheet

AI generates text one token at a time, sampling from probability distributions, not retrieving stored answers
Temperature controls randomness: lower = more predictable, higher = more varied
Context window = shared budget for input + output tokens; longer prompts leave less room for responses
1,000 words ≈ 1,333 tokens in English; non-English languages consume more
Output cutoffs mean you've hit a token limit; type 'continue' to resume
Hallucination is structural: the model always generates probable sequences, not verified facts
Grounded tasks (summarize THIS document) hallucinate far less than memory-recall tasks
Perplexity and Bing Copilot reduce hallucination via live search, but aren't immune
You can mimic temperature settings through prompt language even without API access

Key Takeaways

Temperature is the single most important variable controlling output character, even if you can only influence it through prompt language
Token limits are a hard constraint; understanding them prevents confusion and lets you plan prompts for long tasks
Hallucination is not random failure; it's a predictable risk that spikes on specific task types you can learn to avoid
Grounding the model in provided content (rather than asking it to recall) is the most reliable accuracy strategy available today
The gap between context window size and output limit matters: large context windows don't automatically mean long responses
