Back to Inside the Machine: How AI Models Think

Lesson 1 of 11

From Prompt to Output: The Real Process

~24 min readLast reviewed May 2026

This lesson counts toward:Grow Faster: AI for Small Teams How AI Actually Works

What Happens When You Send a Message to ChatGPT

2023

Historical Record

Klarna

In early 2023, Klarna deployed an AI assistant built on OpenAI's technology to handle customer service conversations. Within a month, it was managing two-thirds of all customer chats, work that previously required 700 human agents.

This demonstrates the rapid scaling potential of AI assistants in customer service operations.

This is the central tension in working with AI tools like ChatGPT: the model is fixed, but your input is a variable you control completely. Most professionals using ChatGPT, Claude, or Gemini treat the text box like a search engine, type a question, get an answer. That mental model leaves most of the model's capability untouched. Klarna's team understood something different: what you send in shapes everything that comes out. The journey your message takes from the moment you hit Enter to the moment text appears on screen is not magic. It's a specific, learnable process.

Understanding that process changes how you use every AI tool you'll encounter. It explains why some prompts produce sharp, precise outputs and others produce vague, hedging non-answers. It tells you why ChatGPT sometimes seems to misunderstand what you meant, and exactly how to fix it. It's the difference between a professional who uses AI effectively and one who tries it a few times, gets mediocre results, and concludes the technology is overhyped. The mechanics are not complicated. But they are specific, and they're worth knowing cold.

What this lesson covers

This lesson traces the full lifecycle of a single message to ChatGPT: how your text gets converted into a format the model can process, what the model actually does with it, and how a response gets generated word by word. No math required. By the end, you'll have a precise mental model of the process, one that makes you a significantly better prompt writer immediately.

The Moment You Hit Send

Picture a marketing director at a mid-sized SaaS company, call her Sarah. She opens ChatGPT, types 'write me a product launch email for our new analytics feature,' and hits Enter. From her perspective, there's a brief pause and then text appears. From the system's perspective, something much more interesting happens. The first thing ChatGPT does is not read her sentence the way a human would. It breaks it apart into fragments called tokens. A token is roughly three to four characters, sometimes a whole word, sometimes part of one. The word 'analytics' becomes a single token. The word 'new' is a token. The space before a word is often bundled into the token that follows it. Sarah's 11-word sentence becomes approximately 14 tokens.

Tokenization is not a technicality, it's the first place where things can go wrong. GPT-4 has a context window of 128,000 tokens, which sounds enormous until you realize a detailed project brief, three example emails, and a style guide can easily consume 3,000 to 5,000 tokens before Sarah has typed a single word of her actual request. Every token in that window costs money and affects processing time. OpenAI charges approximately $0.01 per 1,000 input tokens on GPT-4o as of mid-2024. A company running thousands of daily queries, like Klarna, optimizes token usage the same way a logistics firm optimizes fuel costs. For Sarah, the practical implication is simpler: longer isn't always better. Precise context beats verbose context every time.

Once tokenized, Sarah's message doesn't travel through the system as words at all. Each token gets converted into a number, specifically, a position in a vocabulary list of roughly 100,000 possible tokens that GPT-4 knows. Those numbers then get transformed into vectors: long strings of floating-point numbers that encode not just 'what this token is' but 'what this token means in relation to everything else in the sequence.' The word 'bank' near the word 'river' produces a different vector than 'bank' near 'loan.' This is how the model captures meaning rather than just matching words. Sarah's casual 11-word request has already become a rich mathematical structure before any 'thinking' happens.

Weak vs. Strong Prompt. Same Request, Different Token Efficiency

Prompt

WEAK (vague, forces the model to guess): "Write me a product launch email for our new analytics feature." STRONG (precise context, same token count): "Write a 200-word product launch email for B2B SaaS buyers. Feature: real-time dashboard analytics. Tone: confident, not salesy. CTA: book a demo. No jargon."

AI Response

The strong version gives the model six specific constraints in roughly the same number of tokens. The output will be dramatically more usable on the first attempt, reducing the back-and-forth that burns time and tokens.

Inside the Model: What 'Processing' Actually Means

Morgan Stanley's wealth management division deployed GPT-4 in 2023 to help financial advisors search across 100,000 internal research documents. The advisors would ask questions in plain English, 'What's our current view on European bank exposure?', and get synthesized answers with source citations. The reason this worked is the same reason Sarah's marketing email request works: the model doesn't retrieve a stored answer. It constructs one. GPT-4 is not a database. It's a prediction engine that has internalized patterns from an enormous volume of text, roughly 1 trillion words of training data, and uses those patterns to predict what a useful, coherent response looks like, one token at a time.

The core mechanism is called the transformer architecture, and the key component is attention. When the model processes Sarah's request for a product launch email, it doesn't treat all words as equally important. The attention mechanism runs calculations that determine which tokens in the input should most influence each token in the output. In her prompt, 'analytics feature' and 'product launch' carry enormous weight. The word 'me' carries almost none. This happens across multiple layers. GPT-4 has 96 of them, with each layer refining the model's understanding of relationships between concepts. By the time the model begins generating a response, it has effectively built a rich map of what the request is asking for, what domain it lives in, and what a good answer looks like.

For Morgan Stanley's advisors, this meant the model could handle questions that weren't literal matches to any document, it could synthesize across sources, resolve ambiguities, and produce answers that a keyword search engine never could. The same capability is available to every professional using ChatGPT. The model holds context across your entire conversation, weighing earlier messages against later ones. If you told ChatGPT in message three that your audience is non-technical, that constraint is still active in message twelve. The attention mechanism is why. It's not memory in the human sense, the model has no persistent state between separate conversations, but within a single session, it tracks everything.

How Different AI Tools Handle the Same Input

A UX researcher at a healthcare company recently ran the same complex prompt through ChatGPT (GPT-4o), Claude 3.5 Sonnet, and Gemini 1.5 Pro. The task: summarize a 4,000-word patient experience report and extract three actionable recommendations. All three models understood the request. All three produced structurally reasonable outputs. But the outputs were meaningfully different, in tone, in the specificity of recommendations, and in how they handled ambiguous language in the source document. This is not because one model is 'smarter.' It's because each model was trained differently, has different default behaviors, and processes the same tokens through different architectural choices.

Understanding these differences is practical knowledge, not trivia. Claude 3.5 Sonnet, built by Anthropic, tends to be more cautious with ambiguous healthcare language, often flagging uncertainty explicitly. GPT-4o tends toward confident synthesis even when the source material is unclear. Gemini 1.5 Pro has a context window of 1 million tokens, making it the right tool when you need to process truly massive documents, a full clinical trial report, say, or an entire codebase. The UX researcher's takeaway was not 'use one model for everything' but 'know what each model is optimized for and route tasks accordingly.' That's a professional-grade insight.

Model	Context Window	Strengths	Watch Out For
ChatGPT (GPT-4o)	128,000 tokens	Versatile, fast, strong at synthesis and coding	Confident tone even when uncertain, verify factual claims
Claude 3.5 Sonnet	200,000 tokens	Nuanced writing, careful reasoning, good at long documents	Can be more verbose; may over-hedge on sensitive topics
Gemini 1.5 Pro	1,000,000 tokens	Massive document processing, multimodal inputs	Less consistent on creative tasks; still maturing
Perplexity AI	~32,000 tokens (varies)	Real-time web search + synthesis, great for research	Not ideal for creative or generative tasks
GitHub Copilot (GPT-4)	~8,000–32,000 tokens	Code completion, debugging, inline suggestions	Context limited to open files; misses broader project context

Key characteristics of major AI tools as of mid-2024. Token counts and pricing change frequently, check provider documentation for current specs.

The Response: One Token at a Time

A senior consultant at McKinsey described using ChatGPT to draft a rapid-turnaround competitive analyzis. She noticed something interesting: when she asked for a structured output, 'give me a table comparing these five competitors on price, features, and market position', the model produced it cleanly. But when she asked a vaguer version, 'tell me about these competitors', the output was a series of paragraphs that buried the comparisons she needed. Same underlying information, radically different usability. What she was observing, without knowing it, is the token-by-token generation process in action.

When ChatGPT generates a response, it doesn't write the whole thing and then send it. It predicts the next most likely token, appends it, then predicts the next one, and so on, which is why you see text appear word by word in the interface. Each prediction is influenced by everything that came before: the original prompt, the conversation history, and every token the model has already generated in the current response. This is why the opening of a response matters so much. If the model starts generating in a rambling, unstructured way, it tends to continue that way, because its own output becomes part of the context shaping what comes next. The consultant's structured request gave the model a clear 'shape' to follow from the first token.

There's a parameter called temperature that controls how much randomness enters each token prediction. ChatGPT's default temperature sits around 0.7, a balance between predictable and creative. At temperature 0, the model always picks the single most likely next token, producing deterministic, consistent but sometimes robotic output. At temperature 1 or above, outputs get more creative and varied but also more prone to errors and hallucinations. You don't control temperature directly in the standard ChatGPT interface, but you do control it indirectly: asking for 'a creative, unexpected take' nudges the model toward higher-temperature behavior, while asking for 'a precise, factual summary' pulls it lower. The language of your prompt literally changes how the model generates its response.

The single most effective prompt habit

Specify the output format before describing the task. Instead of 'Summarize this report and give me recommendations,' write 'Give me a 3-item numbered list of recommendations, each under 30 words, based on this report.' The model locks onto that structure at token 1 and maintains it throughout. This one habit eliminates most of the reformatting work that follows a first-draft response.

What This Means in Practice

The tokenization-to-generation process has three direct implications for how you work with ChatGPT and similar tools. First, context is currency. Every token you include in a prompt is a resource, it takes up space in the context window, costs money in API-based applications, and competes for the model's attention. This doesn't mean be terse; it means be deliberate. A 50-token prompt with four specific constraints will outperform a 200-token prompt full of hedging and repetition. Klarna's engineers didn't make the AI smarter, they made the inputs leaner and more precise. The same principle applies whether you're a solo consultant or an enterprise team.

Second, the model has no intent, only prediction. When ChatGPT produces a response that misses what you meant, it's not being difficult or lazy. It predicted the most likely useful response given your input, and your input didn't fully specify what you needed. This reframes the debugging process entirely. Instead of 'why did it get this wrong,' the productive question is 'what was missing from my prompt that would have made the correct answer the most likely prediction?' That mindset shift, from evaluating AI outputs to engineering AI inputs, is what separates professionals who get consistent value from the tool from those who get intermittent results.

Third, the conversation structure matters as much as individual messages. Because the model attends to everything in the session, earlier messages set up later ones. Establishing your role, your audience, and your constraints at the start of a conversation, before you make any specific requests, changes the distribution of likely outputs across every subsequent exchange. The McKinsey consultant who got clean competitive analyzis tables didn't just write a better single prompt. She opened with 'I'm a strategy consultant preparing a board-level briefing. I need outputs that are precise, structured, and suitable for senior executives.' Every response in that session was shaped by that frame.

Trace Your Own Prompt Through the Pipeline

Goal: Experience firsthand how the same underlying request produces different outputs based on input precision, and develop a concrete sense of which prompt variables have the most impact on response quality.

1. Open ChatGPT (or Claude, or Gemini, any major chat interface). Start a fresh conversation. 2. Think of a real work task you've wanted AI help with, a document, an analyzis, an email. Write down the prompt you would normally send, exactly as you'd type it. 3. Count the approximate words in your draft prompt. Multiply by 1.3 to estimate the token count. 4. Identify every piece of information the model would have to guess or assume because you didn't specify it. List at least three assumptions. 5. Rewrite the prompt to eliminate those assumptions. Add: output format, length or scope, audience, and one constraint on tone or style. 6. Send the original prompt first. Copy the response into a separate document. 7. Send the rewritten prompt in a new conversation. Copy that response alongside the first. 8. Compare the two outputs directly. Note which required more editing, which was closer to what you actually needed, and what specific changes in the prompt produced the differences. 9. Write two sentences summarizing what you'd change about your default prompting approach based on what you observed.

Key Principles from This Section

Your input is the only variable you control. The model is fixed; your prompt is not. Every improvement in output quality traces back to an improvement in input quality.
Tokenization is the first transformation your text undergoes, and context window limits make token efficiency a real constraint, especially in production applications.
The model captures meaning through vector representations, not word matching. 'Bank' near 'river' and 'bank' near 'loan' are mathematically different inputs.
Attention mechanisms mean the model weighs some parts of your prompt more heavily than others. High-weight tokens, the specific task, the key constraints, drive the output.
Response generation is sequential and self-reinforcing. The structure of the first few tokens in a response shapes the entire output that follows.
Different models. GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, process the same input through different architectural choices and training, producing meaningfully different outputs for the same task.
Specifying output format at the start of a prompt is the single highest-leverage habit for improving first-draft quality.

What to Take Forward

ChatGPT converts your text to tokens, then to vectors, before any processing begins, meaning meaning is captured mathematically, not linguistically.
The context window (128K tokens for GPT-4o) holds your entire conversation, and the model attends to all of it when generating each response.
Token-by-token generation means early output shapes later output, give the model a clear structure to follow from the first line of your prompt.
Setting context at the start of a conversation (role, audience, constraints) improves every subsequent response in that session.
Choosing the right model for the task, not just defaulting to ChatGPT, is a professional skill. Match context window size, tone tendencies, and capability strengths to your specific need.
The debugging mindset for bad AI outputs: 'What was my prompt missing?' not 'What did the model get wrong?'

How Morgan Stanley Learned What Tokens Actually Cost

In 2023, Morgan Stanley rolled out an internal GPT-4 assistant to help financial advisors search 100,000+ pages of investment research. Early users loved it. Then the bills arrived. The team discovered that certain advisors were pasting entire client portfolios, sometimes 8,000 words of structured data, into every single prompt, asking questions like 'What should I recommend?' The model was processing millions of tokens per day on context that barely changed between queries. The cost wasn't the real problem. The quality was. Those bloated prompts were actually producing worse answers than shorter, targeted ones.

The Morgan Stanley team had stumbled onto something counterintuitive: more input doesn't mean better output. Their engineers dug into how GPT-4 was actually reading those massive prompts, and what they found reshaped how they built the entire system. The model wasn't reading the portfolio data the way a human analyzt would, skimming for relevance, weighting recent information, ignoring boilerplate. It was processing everything as tokens with equal initial weight, then using attention mechanisms to figure out what mattered. When you front-load a prompt with 6,000 tokens of spreadsheet data before asking your actual question, you're making the model do enormous work to locate the relevant signal.

Their fix was elegant. Instead of dumping raw data into prompts, they pre-processed portfolios into structured summaries, about 400 tokens, and placed the specific question first. Response quality jumped. Latency dropped by 40%. This case illustrates a principle that applies to every professional using ChatGPT, Claude, or Gemini: the journey from your keyboard to a response is a precise mechanical process, and understanding each step tells you exactly where to intervene to get better results.

Token Costs in Real Numbers

GPT-4o charges roughly $5 per million input tokens and $15 per million output tokens (as of mid-2024). A typical professional email is about 200 tokens. A 10-page report is roughly 4,000 tokens. Morgan Stanley's bloated prompts were costing 40x more per query than their optimized versions, not because the model was slow, but because tokenization scales linearly with input length.

The Attention Mechanism: Where Meaning Gets Built

When your tokenized prompt enters GPT-4, the first major transformation is embedding, each token gets converted into a vector, essentially a list of around 12,000 numbers that encodes its meaning in relationship to every other word the model has ever processed. The token 'bank' gets a different vector depending on whether nearby tokens suggest 'river' or 'loan.' This is fundamentally different from a keyword search, which just looks for exact matches. Embedding captures that 'physician' and 'doctor' are nearly identical concepts, and that 'not happy' is semantically close to 'disappointed', nuances that matter enormously when you're asking an AI to analyze customer sentiment or draft a negotiation email.

After embedding, the attention mechanism runs. This is the architectural breakthrough, introduced in the 2017 paper 'Attention Is All You Need', that made modern language models possible. Attention allows every token in your prompt to 'look at' every other token and decide how much to weight it when building meaning. When you write 'The board approved the merger but the CEO has reservations,' the word 'reservations' attends strongly to 'CEO' and 'merger' to understand it means doubts, not hotel rooms. This happens across 96 attention heads simultaneously in GPT-4, each head learning to track different types of relationships: grammatical structure, entity references, sentiment, logical dependencies.

The practical implication is significant. Attention operates over your entire prompt simultaneously, it's not sequential reading. This means the model genuinely processes context from anywhere in your message to interpret any specific word or phrase. But attention isn't infinite. GPT-4's context window is 128,000 tokens, but research consistently shows that retrieval accuracy degrades for information buried in the middle of very long contexts, a phenomenon researchers call the 'lost in the middle' problem. Information at the start and end of your prompt gets higher effective attention. The Morgan Stanley team's instinct to put the question first wasn't just tidiness; it was mechanically sound.

Attention-Aware Prompt Structure

Prompt

Question: Which of our three product lines has the weakest margin trend? Context: Q1 Product A margin: 34%, Q2: 31%, Q3: 29%. Product B margin: 22%, 23%, 22%. Product C margin: 41%, 38%, 34%. Answer in one sentence, then explain the trend in 2-3 sentences.

AI Response

Product C shows the weakest margin trend despite having the highest absolute margins, declining 7 percentage points over three quarters, a steeper relative drop than any other line. Product A has also declined consistently (5pp), while Product B has remained essentially flat. The pattern in C and A suggests either rising input costs or pricing pressure that isn't affecting B, which would be worth investigating separately.

A Hospital System, a Chatbot, and the Layers You Never See

In 2022, a large US hospital network deployed a patient-facing chatbot powered by GPT-3.5 to handle appointment scheduling and insurance questions. The clinical team was nervous about one thing: what if patients asked about symptoms or medications? OpenAI's base model had no particular reason to stay in its lane. The solution was a system prompt, a hidden block of instructions injected before every patient message, that established the bot's role, constraints, and persona. Patients never saw it. It read something like: 'You are an appointment scheduling assistant for [Hospital]. You only answer questions about scheduling, insurance, and visiting hours. For any medical questions, respond with: Please speak with a healthcare provider.'

This architecture, system prompt, then conversation history, then user message, is how virtually every production AI product is built. When you use ChatGPT's free tier, OpenAI's system prompt establishes safety guidelines and the model's identity. When you use a company's customer service bot, their system prompt shapes every response you receive. The model you're technically talking to might be identical in both cases. GPT-4o, say, but the effective behavior is entirely different. Understanding this explains why the same underlying model can feel like a cautious legal assistant in one product and a creative brainstorming partner in another.

Stage	What Happens	Where You Can Influence It	Common Mistake
Tokenization	Text split into ~4-character chunks; punctuation and spaces matter	Word choice, formatting, avoiding typos	Assuming the model reads words, not tokens
Embedding	Tokens converted to meaning vectors; context shapes interpretation	Providing relevant context near ambiguous terms	Using jargon without defining it in context
Attention	Every token weighs every other; position in prompt matters	Putting key question/task first; keeping prompts focused	Burying the actual request at the end of long context
System Prompt Layer	Hidden instructions set model behavior before your message	Choosing the right product/API configuration	Not knowing this layer exists or can be set
Forward Pass	96 transformer layers process the full context	Nothing directly, this is internal computation	Thinking more layers = longer wait (it's parallelized)
Sampling	Model selects each output token probabilistically	Temperature setting; asking for confident/creative output	Expecting deterministic answers to open-ended questions
Detokenization	Token IDs converted back to readable text	Nothing, automatic	Confusion when output has odd spacing or punctuation

The seven stages from your message to a ChatGPT response, and where professional users can intervene

Temperature: The Dial That Changes What 'Intelligence' Means

A product team at Spotify was using GPT-4 to generate playlist description copy, short, evocative blurbs like 'This is your Sunday morning, slow and golden.' They ran the same prompt 20 times and got nearly identical results. Frustrated, their copywriter assumed the AI was 'uncreative.' Their ML engineer checked the API configuration: temperature was set to 0.1. At that setting, the model almost always picks the single highest-probability next token, deterministic, consistent, safe. Useful for tasks like extracting structured data or answering factual questions where you want the same correct answer every time. Terrible for creative copy, where variation and surprise are the entire point.

Temperature is a parameter applied during the sampling stage, after the model has computed probability distributions across its entire 100,000+ token vocabulary for the next word. At temperature 1.0, sampling follows those probabilities naturally, producing varied, sometimes unexpected outputs. At temperature 0, it always picks the top candidate. At temperature 1.5 or higher, lower-probability tokens get amplified, producing genuinely strange and occasionally brilliant outputs, but also more errors and hallucinations. The Spotify team set temperature to 0.9 for their copy generation and suddenly had 20 meaningfully different options to choose from. Same model. Same prompt. Completely different utility.

Match Temperature to Your Task Type

Use low temperature (0–0.3) for extraction, classification, data formatting, factual Q&A, and code generation, tasks where correctness matters more than variety. Use mid-range (0.6–0.8) for drafting, summarization, and analyzis where you want coherent but not robotic output. Use high temperature (0.9–1.2) for brainstorming, creative writing, and generating options. ChatGPT's default sits around 0.7–0.8 for most tasks. If you're using the API directly, temperature is the single most impactful parameter you control.

What This Means When You're Waiting for a Response

Here's something most users never consider: ChatGPT doesn't write your response and then send it. It generates one token at a time, streaming each piece to your screen as it's produced. That's why you see text appearing word by word rather than all at once. Each token generation is a complete forward pass through the model, all 96 transformer layers, processing the full context window including everything generated so far. GPT-4 runs on clusters of NVIDIA A100 GPUs, and OpenAI routes requests across data centers in real time to manage load. When ChatGPT feels slow, it's usually because token generation speed is throttled under high demand, not because your question was harder.

This token-by-token generation creates an important characteristic: the model commits to each word as it goes. It doesn't draft a full answer, review it, and then send. This is why ChatGPT can confidently write itself into a corner, starting a sentence with apparent certainty, then producing tokens that contradict earlier ones because the probability landscape shifted mid-generation. This is the mechanical root of hallucination. The model isn't lying or guessing randomly; it's producing the statistically likely continuation of the text it has generated so far, which sometimes diverges from factual reality. Knowing this, the professional response is to treat confident-sounding outputs on factual matters as drafts requiring verification, not finished research.

Perplexity AI's approach addresses this directly. Rather than generating purely from model weights, Perplexity runs a web search first, injects retrieved sources into the context window as part of the system prompt layer, and then generates a response grounded in those documents. The generation process is identical to what GPT-4 does, tokenize, embed, attend, forward pass, sample, but the context it's attending to contains current, cited information rather than only the model's training data. This is called retrieval-augmented generation, or RAG, and it's the architecture behind most enterprise AI deployments where accuracy matters more than creativity. GitHub Copilot uses a similar approach, injecting your current file and related code as context before generating completions.

Map the Pipeline in Your Own Workflow

Goal: Identify which stages of the ChatGPT processing pipeline are creating friction or missed opportunity in a real task you use AI for regularly.

1. Choose one recurring task where you use ChatGPT, Claude, or a similar tool, something you do at least weekly (drafting, analyzis, summarization, etc.). 2. Write down your current standard prompt word-for-word, exactly as you typically send it. 3. Count the approximate word count of your prompt and multiply by 1.3 to estimate tokens. Note whether this is under 500 tokens (lean) or over 2,000 (heavy). 4. Identify where your question or core request appears in the prompt, first sentence, middle, or end. If it's not first, rewrite the prompt with the request at the top. 5. Look up or estimate the temperature setting for your tool. ChatGPT web interface uses ~0.7-0.8 by default. If you're doing a creative or generative task, note whether you need more variation. 6. Run your original prompt and your rewritten version on the same task. Save both outputs. 7. Compare the two responses: which better addresses your actual need, and which stage of the pipeline explains the difference? 8. Write a one-paragraph 'prompt spec' for this task, documenting the ideal structure, token target, and any constraints, so you can replicate it without rethinking from scratch each time. 9. Share your before/after prompts and outputs with a colleague who does similar work, and collect their reaction to the quality difference.

Principles Extracted from the Pipeline

Tokenization is mechanical, not semantic, the model reads chunks of characters, not words with inherent meaning, which is why formatting and phrasing affect output more than they seem like they should.
Attention is position-sensitive, your most important instruction or question belongs at the beginning of your prompt, not buried after paragraphs of context.
The system prompt layer is always present in production tools, every AI product you use has hidden instructions shaping behavior before your message is processed.
Temperature controls the creativity-accuracy tradeoff, deterministic tasks need low temperature, generative tasks need higher, and the default is a compromise that serves neither extreme well.
Token-by-token generation explains hallucination, the model commits forward without revision, so factual outputs require verification regardless of how confident the phrasing sounds.
Context window size and attention quality are different things, a 128K context window doesn't mean equal attention across all 128K tokens; information in the middle of long prompts is statistically disadvantaged.
Architecture shapes behavior. RAG systems like Perplexity and Copilot use the same generation mechanics but inject retrieved context, which is why they produce more current and verifiable outputs than pure generation.

Key Takeaways

Every prompt travels through seven distinct stages before you see a response, tokenization, embedding, attention, system prompt processing, forward pass, sampling, and detokenization.
Morgan Stanley's experience shows that prompt length and structure directly affect both cost and quality, more context isn't always better context.
The attention mechanism processes your entire prompt simultaneously, not sequentially, which is why the position of your request within a prompt measurably affects output quality.
Temperature is the most impactful API parameter for everyday professional use, matching it to your task type (low for accuracy, high for creativity) changes results significantly.
Hallucination is a structural feature of token-by-token generation, not a bug that will be patched away, building verification into your workflow is the appropriate response.
Production AI products like Perplexity, Copilot, and enterprise chatbots extend the basic pipeline with retrieval layers, making them more reliable for fact-sensitive professional work.

In 2023, the legal team at Allen & Overy, one of the world's largest law firms, began using an AI tool called Harvey, built on GPT-4, to draft contract summaries and flag risk clauses. Early users reported something curious: the same underlying model gave dramatically different output quality depending on how attorneys phrased their requests. Senior partners who understood what the model was doing under the hood consistently got tighter, more accurate summaries. Junior associates who treated it like a search engine got verbose, occasionally hallucinated responses. Same model, same firm, same documents, wildly different results. The difference wasn't talent. It was mental model.

What the senior partners had internalized, even if they couldn't articulate it technically, was that the model isn't retrieving answers from a database. It's predicting the most contextually appropriate continuation of everything it has received so far: the system prompt set by Harvey's engineers, the attorney's input, and the implicit patterns baked into billions of training examples. Every word they typed was shaping a probability distribution. Specific, structured, context-rich prompts narrowed that distribution toward useful outputs. Vague prompts left the model with too many equally plausible continuations, and it picked ones that sounded authoritative but weren't grounded in the actual document.

This is the core tension running through every professional use of ChatGPT, Claude, or Gemini: the model is extraordinarily capable, but it has no intent of its own. It has no goal beyond completing the sequence in the most statistically coherent way. That sounds like a limitation, and it is, but it's also the key to using these tools with precision. Once you understand the mechanism, you stop hoping the AI will 'figure out what you mean' and start engineering the context it needs to give you what you actually want.

What the model 'sees' when you hit send

ChatGPT processes your entire conversation as a single block of tokens every time you send a message. It doesn't have persistent memory of past sessions (unless explicitly built in). It sees: the system prompt (set by OpenAI or the platform), your full conversation history within this session, and your new message, all concatenated, all influencing the next token prediction.

Consider how Spotify's editorial team uses AI-assisted writing for playlist descriptions. A prompt like 'write a playlist description' produces generic filler. But when a Spotify editor writes 'You are a music journalist writing for a 28-year-old urban listener. This playlist contains late-night R&B from 2018–2022 with a melancholic tone. Write a 40-word description that feels like a recommendation from a friend,' the output is sharp, on-brand, and usable. The editor hasn't changed the model, they've changed the probability space the model is navigating. The richer context rules out thousands of plausible-but-wrong continuations before the model generates a single word.

This is token prediction working in your favor. The model's training exposed it to millions of music journalism examples, friend-recommendation speech patterns, and concise copywriting structures. Your detailed prompt activates that specific cluster of learned patterns rather than the average of everything the model knows about music. The temperature setting, which neither you nor the Spotify editor controls directly in standard ChatGPT, also shapes this: lower temperature means the model sticks closer to the highest-probability token at each step, producing more predictable, consistent output. Higher temperature introduces more variation, which is why ChatGPT's creative writing can feel surprisingly fresh.

The attention mechanism, the architectural innovation that makes modern language models work, is what allows all this context to interact. When the model processes the word 'melancholic' in the Spotify prompt, the attention layers connect it back to 'late-night,' 'R&B,' and 'friend recommendation' simultaneously. It's not reading left to right like a simple grammar checker. It's weighing every token against every other token in the context window, building a rich representation of meaning before committing to output. This is why longer, well-structured prompts genuinely outperform short ones, more signal for the attention mechanism to work with.

Weak vs. strong prompt, same task, different context richness

Prompt

WEAK: Summarize this report. STRONG: You are a senior analyzt briefing a CFO who has 3 minutes to read. Summarize the attached Q3 report in 5 bullet points. Lead with the single most important financial risk. Use plain language, no jargon. Flag any figures that require verification.

AI Response

The strong prompt activates a specific role (analyzt), audience (CFO), format (5 bullets), priority order (risk first), style constraint (plain language), and a quality-check instruction. Each element eliminates low-quality continuations. The weak prompt leaves all of those dimensions open, and the model fills them with statistical averages, which are rarely what you needed.

McKinsey's research division ran an internal study in early 2024 comparing consultant output quality when using AI with and without structured prompt frameworks. Consultants using a role-context-constraint-format structure consistently produced first drafts that required 40% less revision time than those using freeform prompts. This wasn't about following a rigid template, it was about giving the model's attention mechanism enough anchors to produce contextually appropriate output on the first pass. The consultants who understood why the structure worked adapted it fluidly to new tasks. Those who followed it mechanically without understanding it struggled when novel situations broke the template.

Prompt Element	What It Does in the Model	Example
Role assignment	Activates a specific cluster of training patterns	You are a tax attorney...
Audience definition	Shapes vocabulary and complexity calibration	...briefing a non-technical CEO
Format constraint	Narrows structural output choices	Respond in 3 bullet points
Tone instruction	Filters stylistic token probabilities	Direct and skeptical tone
Negative constraint	Explicitly rules out common failure modes	Do not include caveats or disclaimers
Context data	Gives attention mechanism concrete anchors	Based on this document: [paste text]

Each prompt element directly influences which token sequences the model assigns high probability, this is prompt engineering grounded in mechanism, not guesswork.

A product manager at a mid-sized SaaS company described her breakthrough moment with ChatGPT this way: 'I stopped asking it questions and started giving it jobs.' She had been using it like a search engine, querying it for answers. The shift was treating each prompt as a job specification: define the role, the deliverable, the audience, the constraints, and the format. Her time-to-usable-draft dropped from 25 minutes of back-and-forth to under 5 minutes of single-shot prompting. The model hadn't improved between Monday and Tuesday. Her understanding of what was happening inside it had.

This mental shift, from 'querying' to 'briefing', is the practical payoff of understanding token prediction and attention. When you know the model is building a probability-weighted continuation of your input, you understand why vague inputs produce average outputs. You understand why pasting the actual document beats describing the document. You understand why 'be concise' works better than hoping the model infers you want brevity. You're not manipulating the model, you're communicating with it in the language it actually processes: structured, specific, context-dense token sequences.

The 'cold start' problem in new conversations

Every new ChatGPT conversation starts with zero context about you, your role, or your standards. High-performing users keep a personal 'context block', 3–5 sentences describing their role, industry, preferred output style, and recurring constraints, that they paste at the start of any serious work session. This primes the model's attention mechanism before you ask your first real question. Takes 10 seconds. Consistently improves first-draft quality.

The broader implication reaches beyond prompt craft. Understanding that ChatGPT, Claude, and Gemini are all fundamentally token prediction engines, trained on human-generated text, shaped by RLHF to align with human preferences, constrained by context windows, and sensitive to temperature settings, tells you what these tools are structurally good and bad at. They're excellent at tasks where the right answer looks like skilled human writing: summarization, drafting, reformatting, explaining, brainstorming. They're unreliable at tasks requiring verified facts, real-time data, precise arithmetic, or causal reasoning about novel situations, because those tasks aren't well-served by predicting what a fluent continuation would look like.

Build Your Personal AI Context Block

Goal: Produce a reusable personal context block that primes ChatGPT's attention mechanism before your first real prompt, cutting revision time on AI-assisted work by giving the model role, style, and constraint anchors upfront.

1. Open a new document or note, this becomes a permanent reference you'll reuse. 2. Write one sentence describing your professional role and industry (e.g., 'I am a marketing manager at a B2B software company targeting mid-market CFOs'). 3. Write one sentence describing your most common AI use case (e.g., 'I primarily use AI to draft internal communications, summarize reports, and prepare talking points'). 4. Write one sentence specifying your preferred output style (e.g., 'I prefer direct, jargon-free language in short paragraphs or bullet points, no lengthy preambles'). 5. Write one sentence listing your most common constraints (e.g., 'Outputs should always be under 200 words unless I specify otherwise, and should never include unsourced statistics'). 6. Combine these four sentences into a single context block paragraph. 7. Open ChatGPT and paste your context block as the first message in a new conversation, then follow it immediately with a real work task you need done today. 8. Save the context block somewhere accessible, a pinned note, a text expander shortcut, or a browser bookmark, so you can paste it in under 10 seconds. 9. After the session, note one adjustment that would improve the block based on the output you received.

Token prediction is the mechanism: ChatGPT generates output one token at a time, selecting the most probable next token given everything in the context window, your prompt is literally shaping a probability distribution.
Attention connects everything: the transformer's attention mechanism weighs every token against every other token simultaneously, which is why richer prompts with more relevant context consistently outperform sparse ones.
RLHF shapes the surface: human feedback training steers outputs toward helpful, harmless, and honest responses, this is why ChatGPT sounds cooperative and avoids certain content, not because it has values but because those patterns were reinforced.
Context windows are finite and consequential: everything the model 'knows' in a session lives in the context window; what you include and exclude directly determines what the model can and cannot draw on when generating your response.
Briefing beats querying: professionals who treat AI prompts as job specifications, role, deliverable, audience, constraints, format, consistently get better first drafts than those who ask open-ended questions and hope for the best.
The model has no intent: it has no goal, no understanding, and no memory between sessions, this isn't a flaw to work around but a characteristic to design for, by supplying the context and structure the model cannot infer.

Every message you send is processed as tokens, discrete units the model predicts sequentially, not as meaning it comprehends
The attention mechanism is why context richness matters: more relevant anchors produce more targeted output
Temperature controls output randomness, lower for consistency, higher for creative variation
RLHF is why ChatGPT behaves helpfully and avoids harmful content, it's trained behavior, not built-in ethics
Your context block is now a tool: paste it at the start of any serious AI work session to prime the model before your first real request
The professionals getting the most from these tools aren't the most technically skilled, they're the ones with the most accurate mental model of what's actually happening when they hit send

Featured Reading

Practice this in a lab

Fix the Broken Prompt: Hospital Discharge Instructions

beginner · 12 min

Spot the Better Prompt: Hospital Discharge Instructions

beginner · 10 min