Lesson 8 of 11

Size, Speed, Cost: What Actually Matters

~16 min read · Last reviewed May 2026

This lesson counts toward: Grow Faster: AI for Small Teams · How AI Actually Works

What Makes One AI Better Than Another

Not all AI models are equal, and the differences aren't random. GPT-4 outperforms GPT-3.5 on complex reasoning for a specific set of architectural and training reasons. Claude 3 Opus handles long documents better than most competitors because of deliberate design choices around context. Gemini Ultra beats smaller models on multimodal tasks because of how it was trained across data types. Once you understand the five core dimensions that separate weak AI from strong AI, you stop guessing which tool to use and start choosing with precision. This lesson is your reference guide for exactly that.

7 Things You Need to Know

Model size (parameter count) is the single most discussed quality signal, but it's not the only one that matters.
Training data quality beats training data quantity. A smaller model trained on curated text often outperforms a larger model trained on web scraps.
Context window size determines how much of your conversation or document the AI can actually 'see' at once.
Benchmarks like MMLU, HumanEval, and HellaSwag are the industry's standardized tests: real scores, real comparisons.
Fine-tuning transforms a general model into a specialist. GPT-4 base and a fine-tuned GPT-4 for legal contracts behave very differently.
Alignment and safety training (RLHF) shapes how a model responds, not just what it knows.
Inference speed and cost are practical quality dimensions that benchmarks rarely capture but professionals feel immediately.

Parameter Count: The Size Signal

Parameters are the numerical weights inside a neural network, the values adjusted during training until the model learns to predict language accurately. A model with more parameters has more capacity to store patterns, relationships, and knowledge. GPT-3 had 175 billion parameters. GPT-4's exact count is undisclosed, but credible estimates place it above 1 trillion in a mixture-of-experts architecture. Llama 3 70B, Meta's open-source model, has 70 billion, powerful enough to run on high-end consumer hardware and competitive with GPT-3.5 on many tasks.

The catch: parameter count is a capacity measure, not a performance guarantee. A 70B model trained on excellent data with careful fine-tuning regularly beats a 200B model trained carelessly. Mistral 7B, just 7 billion parameters, outperformed Llama 2 13B on several benchmarks when it launched in 2023, precisely because of architectural efficiency. Think of parameters as engine displacement: a well-tuned 2.0L engine beats a poorly maintained 3.0L. Size creates potential; everything else determines whether that potential is realized.

More parameters = more capacity to learn complex patterns and store factual knowledge
Mixture-of-Experts (MoE) architecture activates only a subset of parameters for each token, gaining efficiency without sacrificing scale
Open-source models (Llama 3, Mistral, Falcon) publish parameter counts; proprietary models (GPT-4, Claude 3) often don't
Running larger models locally requires more VRAM: a 70B model needs roughly 140GB of GPU memory for its weights at 16-bit precision
Quantization compresses model weights: at 4-bit precision, a 70B model's weights fit in roughly 35GB, usually with modest quality loss
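
The memory figures for local models follow from simple arithmetic: parameter count times bytes per weight. The sketch below is a rough estimate under that assumption only; it covers the weights alone and ignores activations, the KV cache, and framework overhead, so real requirements are higher. The function name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    total_bytes = params_billions * 1e9 * bytes_per_weight
    return total_bytes / 1e9


# Compare a 70B-parameter model at common precisions.
for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
```

Halving the bits per weight halves the footprint, which is why quantization is the usual route to running large open models on smaller GPUs.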

When Parameter Count Actually Matters to You

If you're choosing between API pricing tiers, say, GPT-4 Turbo at $10/million input tokens vs. GPT-3.5 at $0.50/million, you're implicitly choosing by model size. For summarizing emails, GPT-3.5 is fast and cheap. For analyzing a 40-page contract with nuanced legal reasoning, the larger model earns its price. Match model scale to task complexity, not to habit.

Model Comparison: Size, Context, and Cost

Model	Est. Parameters	Context Window	Input Cost (per 1M tokens)	Best For
GPT-3.5 Turbo	~175B	16K tokens	$0.50	Fast drafting, simple Q&A, high-volume tasks
GPT-4 Turbo	Est. 1T+ (MoE)	128K tokens	$10.00	Complex reasoning, long documents, code
Claude 3 Haiku	Undisclosed	200K tokens	$0.25	Speed-sensitive tasks, large document intake
Claude 3 Opus	Undisclosed	200K tokens	$15.00	Research, nuanced analysis, difficult writing
Gemini 1.5 Pro	Undisclosed	1M tokens	$3.50	Massive document sets, multimodal tasks
Llama 3 70B	70B	8K tokens	Open-source	Self-hosted deployments, privacy-sensitive work
Mistral 7B	7B	8K tokens	Open-source	Lightweight local use, fast inference on limited hardware

Pricing as of mid-2024. Context windows and costs update frequently; verify at each provider's pricing page.

Training Data: What the Model Learned From

Every AI model is a compressed reflection of its training data. GPT-4 was trained on a mixture of web text, books, code repositories, and licensed datasets, with a knowledge cutoff of April 2023 for its base version. Claude 3 was trained by Anthropic on a curated corpus with heavy emphasis on alignment-safe content and long-form reasoning tasks, which explains why it handles nuanced ethical questions and extended documents differently than GPT-4. The training data shapes not just what the model knows, but how it reasons, what biases it carries, and where it fails.

2023

Historical Record

The New York Times vs. OpenAI

In 2023, The New York Times filed a lawsuit against OpenAI over training data licensing practices, putting the legality of AI model training data acquisition under legal scrutiny.

The lawsuit highlighted concerns about whether companies have the legal right to use published content for training large language models without explicit permission or compensation.

Check the knowledge cutoff date before using any model for current events or recent data. GPT-4 Turbo cuts off at April 2023, Gemini 1.5 Pro at November 2023.
Models trained on multilingual data (like GPT-4 and Claude 3) handle translation and non-English tasks far better than English-dominant models.
Code-specialized models (GitHub Copilot, Code Llama) use training sets where 30-50% of tokens are source code, which is why they're better at coding than general models.
Synthetic data (AI-generated training examples) is increasingly used to fill gaps; Llama 3 used synthetic data for math and reasoning tasks.
Retrieval-Augmented Generation (RAG) lets you inject fresh, specific data at query time, partially compensating for stale training data.

Benchmark Scores: The Industry's Standardized Tests

Benchmark	What It Tests	Score Range	GPT-4 Turbo	Claude 3 Opus	Gemini 1.5 Pro
MMLU	57-subject knowledge (science, law, math, history)	0–100%	86.5%	86.8%	81.9%
HumanEval	Python coding: write a correct function from a docstring	0–100%	87.0%	84.9%	71.9%
MATH	Competition-level mathematics problems	0–100%	52.2%	45.0%	58.5%
HellaSwag	Common-sense reasoning about everyday situations	0–100%	95.3%	95.4%	92.5%
GPQA	Graduate-level science questions (hard)	0–100%	35.7%	50.4%	41.7%
MT-Bench	Multi-turn conversation quality (judge model scores)	1–10	9.32	9.12	8.90

Scores from official model cards and third-party evals (early 2024). Benchmark results vary by prompt format and evaluation methodology.

How to Read Benchmark Claims Without Being Fooled

Companies publish benchmark scores selectively. A model might top one leaderboard because it was fine-tuned on benchmark-style questions, a practice called 'benchmark overfitting.' Always look for third-party evaluations, not just the model card. LMSYS Chatbot Arena (chat.lmsys.org) uses blind human preference voting across thousands of real conversations and is harder to game than static benchmarks. When a company says 'state-of-the-art,' ask: on which benchmark, under what conditions, compared to what?

Context Windows: The AI's Working Memory

The context window is the maximum amount of text an AI can process in a single interaction, both your input and its output combined. GPT-4 Turbo offers 128,000 tokens (roughly 96,000 words, about the length of a full novel). Gemini 1.5 Pro pushes this to 1 million tokens, enough to process an entire codebase or a year's worth of meeting transcripts. Claude 3 models offer 200,000 tokens. Smaller models like Llama 3's base version top out at 8,000 tokens: fine for a single document, but limiting for complex multi-document analysis.
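
A quick way to check whether a document fits is to convert words to tokens with the ratio implied above (128,000 tokens ≈ 96,000 words, or about 0.75 words per token). The sketch below assumes that ratio holds for your text, which is only a rule of thumb for English prose; a real tokenizer will give different counts. The helper names and the 2,000-token output reserve are assumptions made for this example.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb implied by 128,000 tokens ≈ 96,000 words

# Context windows quoted in this lesson, in tokens.
CONTEXT_WINDOWS = {
    "GPT-4 Turbo": 128_000,
    "Claude 3": 200_000,
    "Gemini 1.5 Pro": 1_000_000,
    "Llama 3 (base)": 8_000,
}


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate for English text."""
    return round(word_count / WORDS_PER_TOKEN)


def models_that_fit(word_count: int, reserve_for_output: int = 2_000) -> list[str]:
    """Models whose window holds the document plus room for the reply."""
    needed = estimate_tokens(word_count) + reserve_for_output
    return [name for name, window in CONTEXT_WINDOWS.items() if window >= needed]


# A 30,000-word report needs about 40,000 tokens plus the output reserve.
print(models_that_fit(30_000))
```

Because input and output share the same window, the reserve matters: a document that exactly fills the window leaves no room for an answer.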

Context window size has direct workflow implications. If you're asking an AI to analyze a 60-page report, a model with a 16K context window will either truncate your document or require you to chunk it manually, adding friction and risking lost continuity between chunks. A 200K context window ingests the entire report in one shot. But large context windows don't eliminate a subtler problem: 'lost in the middle.' Research from Stanford (2023) showed that models reliably recall information at the start and end of long contexts but miss details buried in the middle. Even with a 1M token window, position matters.

The 'Lost in the Middle' Problem

Don't assume a large context window means the AI reads everything with equal attention. A 2023 Stanford study found that retrieval accuracy dropped significantly for information placed in the middle of long contexts, even in models like GPT-4 and Claude. For high-stakes analysis of long documents, put your most critical questions and key facts near the beginning or end of your prompt, not buried in the middle of a massive document dump. This applies to every major model currently available.
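
One simple way to act on this advice is to state the question before the document and repeat it afterwards, so the instruction sits at both edges of the context. The sketch below is one possible layout, not a tested best practice; the function name and delimiter strings are invented for illustration.

```python
def build_long_document_prompt(question: str, document: str) -> str:
    """Place the question at the start and the end, with the document between."""
    return (
        f"Question: {question}\n\n"
        "--- DOCUMENT START ---\n"
        f"{document}\n"
        "--- DOCUMENT END ---\n\n"
        f"Reminder of the question: {question}"
    )


prompt = build_long_document_prompt(
    question="What are the termination clauses and notice periods?",
    document="[PASTE FULL CONTRACT TEXT HERE]",
)
print(prompt)
```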

Training Data, Fine-Tuning, and Why Context Windows Matter

You now know that model size and architecture shape raw capability. But two models with identical parameter counts can perform wildly differently, because what a model learned, and how it was shaped after initial training, determines real-world usefulness. This section covers the three factors professionals consistently underestimate: the quality and recency of training data, fine-tuning for specific behaviors, and context windows that determine how much a model can hold in working memory at once.

Training Data: Quantity Is the Floor, Quality Is the Ceiling

Every large language model learns by processing enormous text datasets. GPT-4 trained on an estimated 1 trillion tokens, which represents hundreds of billions of words. But raw volume is only the starting point. The composition of that data shapes everything: a model trained heavily on Reddit threads will reason differently than one trained on academic papers and legal documents. OpenAI, Anthropic, and Google all curate proprietary datasets and apply filtering layers to remove low-quality text, which is why frontier models from these labs outperform open-source models trained on unfiltered web scrapes, even when parameter counts are similar.

The other critical dimension is the training cutoff date. ChatGPT's GPT-4 has a knowledge cutoff of April 2023. Claude 3.5 Sonnet cuts off at April 2024. Gemini 1.5 Pro cuts off at November 2023, with Google's search integration partially compensating for gaps. This matters practically: ask any of these models about events after their cutoff and they either refuse or hallucinate with confidence. Perplexity AI sidesteps this by combining a language model with live web retrieval, a different architectural choice that trades some reasoning depth for real-time accuracy.

Training data volume: GPT-4 ~1T tokens (est.), Llama 3 70B ~15T tokens (more data, smaller model)
Data quality filtering removes spam, duplicates, and low-signal content before training begins
Domain weighting: models exposed to more code perform better at coding tasks, regardless of size
Multilingual coverage varies: GPT-4 handles 50+ languages well; many open-source models degrade sharply outside English
Knowledge cutoff is a hard wall: nothing after that date exists in the model's weights
Retrieval-augmented generation (RAG) is the standard fix for cutoff limitations in production systems
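
The idea behind RAG can be sketched in a few lines: find the stored passages most relevant to the question, then paste them into the prompt so the model answers from them instead of from stale weights. The example below is a toy under stated assumptions: it ranks passages by shared words rather than by embeddings (which production systems typically use), it stops at building the prompt rather than calling any model, and the notes are made-up placeholder text.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set used for a crude relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k documents sharing the most words with the query."""
    query_terms = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_terms & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_rag_prompt(query: str, documents: list[str], top_k: int = 2) -> str:
    """Inject the retrieved passages into the prompt at query time."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical internal notes, invented for this example.
notes = [
    "Q3 2025 pricing: the Pro plan moved to $49 per seat.",
    "Office closed on public holidays.",
    "The Pro plan includes priority support since July 2025.",
]
print(build_rag_prompt("What does the Pro plan cost?", notes))
```

The model never needs retraining: updating the note store is enough to change what it can answer.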

Check the cutoff before you trust the answer

When using any AI for market research, competitive analysis, or regulatory questions, verify the model's training cutoff first. In ChatGPT, ask: 'What is your knowledge cutoff date?' For anything time-sensitive, either use Perplexity (live retrieval) or manually provide the current context in your prompt. A model confidently describing last year's market leader as current is worse than no answer at all.

Model	Provider	Training Cutoff	Live Web Access	Approx. Training Tokens
GPT-4o	OpenAI	Oct 2023	With browsing plugin	~1T (est.)
Claude 3.5 Sonnet	Anthropic	Apr 2024	No (native)	Undisclosed
Gemini 1.5 Pro	Google	Nov 2023	Yes (native)	Undisclosed
Perplexity (online)	Perplexity AI	Real-time	Yes (core feature)	Base model varies
Llama 3 70B	Meta (open source)	Dec 2023	No (base model)	~15T
Mistral Large	Mistral AI	Early 2024	No (base model)	Undisclosed

Training cutoffs and web access across major models, accurate as of mid-2024

Fine-Tuning: How Raw Models Become Useful Products

Pre-training produces a model that predicts text statistically: it's powerful but unpredictable, prone to generating harmful content, and not naturally helpful. Fine-tuning is the process that transforms this raw capability into a product you'd actually use. OpenAI applies Reinforcement Learning from Human Feedback (RLHF) to GPT-4, which means human raters scored thousands of model outputs and those scores trained a reward model that then shaped the AI's behavior. This is why ChatGPT answers questions helpfully instead of just completing text in the style of whatever was in its training data.

Anthropic takes a different approach with Claude, using a technique called Constitutional AI (CAI). Instead of relying entirely on human raters, they give the model a written 'constitution', a set of principles, and have it critique its own outputs against those principles. This makes Claude notably more cautious on sensitive topics and more consistent in its refusals than ChatGPT. Neither approach is objectively superior; they represent different trade-offs between helpfulness and safety that explain behavioral differences you've probably noticed between the two products. GitHub Copilot adds another fine-tuning layer on top of OpenAI's base model, trained specifically on public GitHub repositories.

Pre-training: model learns language patterns from massive text datasets; this is where knowledge comes from
Supervised fine-tuning (SFT): model is trained on high-quality example conversations showing desired behavior
RLHF: human raters score outputs; those scores train a reward model that further shapes responses
Constitutional AI (Anthropic): model self-critiques against written principles, reducing human rater dependency
Instruction tuning: a lighter fine-tuning step that teaches models to follow explicit commands accurately
Domain fine-tuning: companies like Harvey (legal AI) fine-tune base models on proprietary legal text for specialist performance
Alignment tax: aggressive safety fine-tuning sometimes reduces raw capability, a real trade-off labs manage carefully

Fine-Tuning Method	Used By	Primary Effect	Trade-off
RLHF	OpenAI (ChatGPT)	More helpful, conversational responses	Can be sycophantic (agrees too readily)
Constitutional AI	Anthropic (Claude)	More consistent safety behavior	More refusals on edge cases
Instruction tuning	Most frontier models	Better instruction-following	Minimal capability cost
Domain fine-tuning	GitHub Copilot, Harvey, Notion AI	Expert performance in one domain	Loses generalist breadth
Direct Preference Optimization (DPO)	Mistral, Llama derivatives	Cheaper alternative to RLHF	Slightly less refined behavior

Fine-tuning methods and their real-world effects on model behavior

Why Claude refuses more than ChatGPT

Claude's Constitutional AI training makes it more likely to decline requests it interprets as potentially harmful, even when the intent is clearly benign. This isn't a bug; it's a deliberate design choice reflecting Anthropic's safety priorities. If Claude refuses a legitimate business task, rephrasing with explicit context ('I'm a compliance officer reviewing...') often resolves it. ChatGPT's RLHF training makes it more permissive but also more prone to telling you what you want to hear rather than what's accurate.

Context Windows: The Working Memory of AI

Context window size is the most practically important technical spec for daily professional use. It defines how much text (your prompt, the conversation history, any documents you've pasted in) the model can process at once. GPT-4o handles 128,000 tokens (roughly 96,000 words, or a short novel). Claude 3.5 Sonnet offers a larger 200,000-token window. Gemini 1.5 Pro extends to 1 million tokens in its research preview, enough to process an entire codebase or a year's worth of meeting transcripts in a single prompt. These numbers matter because anything outside the context window simply doesn't exist for the model.

Bigger context windows solve real problems: you can paste an entire 50-page report and ask specific questions without manually chunking the document. But there's a catch called the 'lost in the middle' problem: current models perform better on information at the start and end of a long context than in the middle. Anthropic's research showed Claude's recall accuracy drops for content buried deep in very long contexts. This means for critical tasks, strategic document placement inside your prompt still matters, even when the context window technically fits everything. Practical professionals treat context windows as capacity, not as a guarantee of equal attention.

Using a large context window effectively: document analysis

Prompt

I'm going to paste a 40-page vendor contract below. After reading it, answer three questions: 1. What are the termination clauses and required notice periods? 2. Are there any auto-renewal provisions, and what are the opt-out deadlines? 3. Identify any liability caps and whether they favor the vendor or the client. Focus only on these three areas. Flag any ambiguous language with [AMBIGUOUS]. [PASTE FULL CONTRACT TEXT HERE]

AI Response

**1. Termination Clauses** Section 14.2 allows either party to terminate for convenience with 90 days written notice. Section 14.3 permits immediate termination for material breach if uncured within 30 days of written notice. [AMBIGUOUS] Section 14.5 references 'substantial non-performance' without defining the threshold. **2. Auto-Renewal Provisions** Section 3.1 includes automatic annual renewal unless written notice is provided at least 60 days before the contract anniversary date. No grace period is specified. **3. Liability Caps** Section 19.4 caps vendor liability at fees paid in the prior 3 months, which strongly favors the vendor. Client liability is uncapped for IP indemnification obligations under Section 21.

Don't confuse context window size with memory

Context windows reset between sessions. Claude's 200K token window means it can process 200K tokens in one conversation, not that it remembers your previous conversations. When you start a new chat in ChatGPT or Claude, the model has no memory of past sessions unless you're using a product feature that explicitly stores and re-injects history (like ChatGPT's Memory feature or a custom GPT with retrieval). Information shared in one session is not carried into the model's context for the next one, which also means it is not available to help you there.

Benchmark Scores vs. Real-World Performance

AI labs publish benchmark scores obsessively: MMLU (massive multitask language understanding), HumanEval (coding), MATH, HellaSwag. These numbers create a ranking system that dominates tech press coverage. GPT-4 scored 86.4% on MMLU at launch; Claude 3 Opus scored 86.8%; Gemini Ultra claimed 90%. The problem is that benchmarks measure narrow, standardized tasks, and labs have strong incentives to optimize specifically for those tests. A model that scores 90% on a coding benchmark might still produce buggy code for your specific stack because the benchmark tested Python fundamentals, not your React/TypeScript monorepo.

Real-world performance depends on your specific use case far more than headline benchmark rankings suggest. Analysts at firms like Andreessen Horowitz and independent researchers at LMSYS run head-to-head comparisons on actual user tasks. The Chatbot Arena leaderboard, which uses blind human preference voting, consistently shows different rankings than lab-published benchmarks. For your professional workflow, the only benchmark that matters is performance on your actual tasks. This is why running parallel tests (the same prompt in ChatGPT, Claude, and Gemini) is worth 30 minutes of your time before committing to any tool for a recurring workflow.

Benchmark	What It Measures	Limitation	Better Alternative
MMLU	Knowledge across 57 subjects	Multiple choice only; no open-ended reasoning	Test on your actual domain questions
HumanEval	Python code generation	Narrow language/task coverage	Test on your actual codebase problems
MATH	Competition math problems	Not representative of business math tasks	Test on your financial modeling scenarios
Chatbot Arena (LMSYS)	Human preference in real conversations	Subjective; task mix may not match yours	Closest to real-world signal available
HellaSwag	Commonsense reasoning completion	Largely saturated; top models all score 95%+	No longer differentiates frontier models

Common AI benchmarks, what they actually measure, and their limitations for professional evaluation

Quick Reference: Key Concepts So Far

Training data quality beats quantity: curated datasets outperform raw web scrapes at equivalent model size
Knowledge cutoff is a hard limit: verify it before trusting any time-sensitive output
RLHF (OpenAI) → more helpful but potentially sycophantic; Constitutional AI (Anthropic) → more consistent but more refusals
Context window = working memory per session, NOT persistent memory across sessions
128K tokens ≈ 96,000 words ≈ a short novel; Claude 3.5 Sonnet offers 200K tokens
'Lost in the middle' effect: models recall information at prompt start/end better than content buried in the middle
Benchmark scores are lab-optimized; Chatbot Arena human preferences are closer to real-world signal
Domain fine-tuning (Copilot, Harvey) beats general models in specialized tasks, but loses generalist flexibility
RAG (retrieval-augmented generation) is the standard solution when models need current or proprietary information

Run Your Own Model Comparison Test

Goal: Produce a concrete, evidence-based decision about which model to use for one specific professional task, replacing guesswork with a repeatable evaluation method.

1. Choose one recurring task from your actual work: a type of email you write often, a report section you draft weekly, or a data interpretation you perform regularly.
2. Write a single, specific prompt for that task. Keep it under 150 words and include all necessary context.
3. Open ChatGPT (GPT-4o), Claude.ai (Claude 3.5 Sonnet), and Gemini Advanced in three separate browser tabs.
4. Paste the identical prompt into all three without modification. Record the time each takes to respond.
5. Score each response on three criteria: accuracy, tone/style fit, and actionability. Use a simple 1-5 scale for each.
6. Note which model required the least editing to produce a usable output; this is your efficiency signal.
7. Document your winning model and the specific task type. This becomes the start of your personal AI tool routing guide.
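
If you want to tally the 1-5 scores from this exercise rather than eyeball them, a few lines of Python will do. The sketch below averages the three criteria with equal weight, which is an assumption; weight them differently if, say, accuracy matters most for your task. The model labels and scores are placeholders, not real results.

```python
def rank_models(scores: dict[str, dict[str, int]]) -> list[tuple[str, float]]:
    """Average each model's criteria scores and sort best-first."""
    averages = {
        model: sum(criteria.values()) / len(criteria)
        for model, criteria in scores.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Placeholder scores on the 1-5 scale; replace with your own observations.
my_scores = {
    "Model A": {"accuracy": 4, "tone": 3, "actionability": 4},
    "Model B": {"accuracy": 5, "tone": 4, "actionability": 4},
    "Model C": {"accuracy": 3, "tone": 4, "actionability": 3},
}

for model, average in rank_models(my_scores):
    print(f"{model}: {average:.2f}")
```

Keeping the raw per-criterion scores, not just the average, makes it easier to see why a model won when you revisit the decision later.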

You now understand training data and model architecture. The final piece is knowing how to read AI capability signals (benchmarks, context windows, and multimodal reach) so you can match the right tool to the right job without guessing. Most professionals pick AI tools based on brand familiarity. The ones who get the best results pick based on fit: task type, input format, output quality, and cost. These next concepts give you the vocabulary and the framework to do exactly that.

Benchmarks: What the Numbers Actually Mean

AI benchmarks are standardized tests; think of them as the SAT for language models. MMLU (Massive Multitask Language Understanding) tests knowledge across 57 subjects. HumanEval tests code generation. HellaSwag tests commonsense reasoning. When OpenAI claims GPT-4 scores 86.4% on MMLU, that's a real, reproducible number against a fixed test set. The catch: benchmark scores measure performance on known test types, not real-world usefulness. A model can ace MMLU and still write mediocre marketing copy. Use benchmarks to filter out weak models, not to crown a winner.

MMLU: General knowledge across law, medicine, history, and STEM; the best proxy for 'how smart is it overall'
HumanEval: Coding accuracy; critical if you use GitHub Copilot or ask models to write scripts
HellaSwag: Commonsense completion; tests whether the model picks the most plausible ending for a sentence
TruthfulQA: Measures the tendency to give false answers; a higher score means fewer hallucinations
MATH benchmark: Formal, competition-level mathematics; GPT-4 scores ~52%, humans average ~40%
MT-Bench: Multi-turn conversation quality; most relevant for chatbot and assistant use cases

Benchmark shortcut for professionals

For most business tasks, prioritize MT-Bench (conversation quality) and TruthfulQA (hallucination rate) over raw MMLU scores. A model that stays accurate and coherent across a long conversation is worth more than one that aces trivia but drifts off-topic by message five.

Model	MMLU Score	HumanEval (Code)	Context Window	Best For
GPT-4o	88.7%	90.2%	128K tokens	General reasoning, coding, vision
Claude 3.5 Sonnet	88.3%	92.0%	200K tokens	Long documents, nuanced writing
Gemini 1.5 Pro	85.9%	84.1%	1M tokens	Massive document analysis
Llama 3 70B	82.0%	81.7%	8K tokens	Self-hosted, cost-sensitive tasks
Mistral Large	81.2%	45.1%	32K tokens	European data compliance needs

Benchmark comparison across leading models (2024 figures). Scores shift with new versions; treat as relative guidance, not absolute truth.

Context Windows and Multimodal Capability

The context window is how much text a model can hold in working memory during a single conversation. GPT-4o supports 128,000 tokens, roughly 96,000 words, or a full novel. Claude 3.5 Sonnet supports 200,000 tokens. Gemini 1.5 Pro goes to 1 million tokens, enough to process an entire codebase. This matters immediately when you're summarizing long contracts, analyzing research reports, or maintaining a complex back-and-forth over many messages. Small context windows force you to chunk your work, which breaks continuity and increases errors.

Multimodal capability means the model processes more than text. GPT-4o handles text, images, audio, and video frames in a single prompt. Gemini 1.5 Pro processes text, images, audio, and video natively. Claude 3.5 Sonnet reads images and text but not audio. Midjourney and DALL-E 3 are image-output specialists: they don't reason, they generate visuals. GitHub Copilot is a code specialist: it reads your existing codebase as context. Matching modality to your actual input type cuts errors and saves time immediately.

Capability	GPT-4o	Claude 3.5 Sonnet	Gemini 1.5 Pro	Perplexity
Text input/output	✓	✓	✓	✓
Image input	✓	✓	✓	✓
Audio input	✓	✗	✓	✗
Video input	✓ (frames)	✗	✓	✗
Web search / live data	✓ (with tool)	✗ (limited)	✓	✓ (core feature)
Code execution	✓ (Code Interpreter)	✗	✓	✗

Multimodal capability matrix. Perplexity's strength is real-time retrieval, not generation; use it for research, not drafting.

Why context window size changes your workflow

With a 4K token window (older GPT-3.5), you could paste roughly 3 pages of text. With Claude's 200K window, you paste a 150-page report and ask questions about any section. This isn't incremental: it removes an entire category of manual chunking work that used to consume hours.

Choosing the right model for a specific task

Prompt

I need to analyze a 120-page vendor contract for liability clauses and unusual termination terms. Which AI tool should I use and why? My company uses Microsoft 365.

AI Response

For a 120-page contract (~90,000 words), you need a model with a large context window. Claude 3.5 Sonnet (200K tokens via Claude.ai) or Gemini 1.5 Pro are your best fits, both can ingest the full document without chunking. Since you're on Microsoft 365, Copilot for Microsoft 365 can analyze documents directly in Word using GPT-4, which keeps data within your existing compliance boundary. Upload the contract, then ask: 'List all liability caps, indemnification clauses, and termination-for-convenience terms with their section numbers.' Avoid using free-tier tools with unknown data retention policies for sensitive contracts.

Cost, Speed, and the Quality Trade-off

Every AI model sits on a triangle: quality, speed, and cost. You rarely get all three. GPT-4o costs approximately $5 per million input tokens and $15 per million output tokens via API. GPT-4o mini costs $0.15 input / $0.60 output, roughly 33x cheaper on input and 25x cheaper on output, with meaningfully lower reasoning quality. For high-volume, simpler tasks (classifying support tickets, summarizing short emails), the mini model is the right call. For complex analysis, client-facing writing, or high-stakes decisions, pay for the full model. Perplexity Pro costs $20/month flat and includes real-time search, making it the cheapest route to live information.
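
To see what these per-token prices mean for a concrete job, multiply token counts by the per-million rates. The sketch below uses the prices quoted above; the workload (10,000 emails at roughly 500 input and 50 output tokens each) is an assumed example, and prices change often, so treat the numbers as illustrative.

```python
# Prices in USD per 1 million tokens, as quoted in this lesson.
PRICES_PER_MILLION = {
    "GPT-4o": {"input": 5.00, "output": 15.00},
    "GPT-4o mini": {"input": 0.15, "output": 0.60},
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total API cost in USD for a job with the given token counts."""
    price = PRICES_PER_MILLION[model]
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000


# Assumed workload: 10,000 emails, ~500 input and ~50 output tokens each.
emails = 10_000
input_tokens = emails * 500
output_tokens = emails * 50

for model in PRICES_PER_MILLION:
    print(f"{model}: ${job_cost(model, input_tokens, output_tokens):.2f}")
```

Under these assumptions the job costs about $32.50 on GPT-4o and about $1.05 on GPT-4o mini. Because output tokens are priced higher than input tokens, tasks with long replies shift the ratio between the two models.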

Don't optimize for cheapest; optimize for fit

Using GPT-4o mini to draft a board presentation because it's cheaper is a false economy. You'll spend more time editing than the cost difference justifies. Run expensive models on high-value, low-frequency tasks. Run cheap or fast models on repetitive, low-stakes tasks. The mistake most teams make is applying one model to everything.

Task Type	Recommended Model	Why	Approx. Cost Signal
Summarize 50-page report	Claude 3.5 Sonnet	Large context, strong comprehension	Mid ($20/mo subscription)
Write client proposal	GPT-4o or Claude 3.5	High output quality	Mid (subscription tier)
Classify 10,000 support emails	GPT-4o mini or Llama 3	Speed + low cost at volume	Low (API pricing)
Research competitor landscape	Perplexity Pro	Live web retrieval	Low ($20/mo flat)
Generate product images	Midjourney or DALL-E 3	Image-specialist models	Low-Mid per image
Debug and write code	GitHub Copilot + GPT-4o	Code-native context reading	Mid ($10-19/mo)

Task-to-model matching guide. 'Cost signal' is relative; actual API costs depend on volume and token length.

Build Your Personal AI Model Selection Card

Goal: A personalized, dated AI tool selection reference card mapped to your actual work tasks, with model names, context window requirements, costs, and relevant benchmarks recorded.

1. Open a blank document or notes app; this becomes your permanent reference card.
2. List the five most common AI-related tasks you do or want to do at work (examples: drafting emails, analyzing data, research, coding, summarizing documents).
3. For each task, use the tables in this lesson to assign a primary model recommendation and one backup option.
4. Add the context window size next to each model you selected, and note whether your task requires more than 32K tokens.
5. Record the monthly cost for each tool's subscription or your estimated API cost if volume is high.
6. Add one benchmark that matters most for each task (e.g., TruthfulQA for research tasks, HumanEval for coding tasks).
7. Save the card with today's date, and plan to review and update it every six months as models improve.

Benchmarks filter weak models: MMLU for general smarts, TruthfulQA for hallucination risk, MT-Bench for conversation quality
Context window = working memory: 128K (GPT-4o), 200K (Claude 3.5), 1M (Gemini 1.5 Pro)
Multimodal means audio, image, or video input: GPT-4o and Gemini 1.5 Pro lead; Claude handles images only
Perplexity is the go-to for live web data; it's a retrieval tool, not a generation specialist
GPT-4o mini is roughly 33x cheaper than GPT-4o on input tokens; use it for volume tasks, not high-stakes outputs
GitHub Copilot reads your codebase as context, which is its core advantage over generic models
Match modality to input type: don't paste audio transcripts into a text-only model when an audio-native one exists
Update your model choices every 6 months; the benchmark leaders shift faster than most software categories

Benchmark scores are useful filters, not final verdicts; real-world task fit matters more than leaderboard position
Context window size determines whether you can process a full document or must manually chunk it
Multimodal capability should match your input type; check before you paste
Cost optimization means tiering tasks by complexity, not defaulting to the cheapest or most expensive model
The best AI users build a small toolkit of 2-4 specialized models rather than forcing one tool to do everything
