When ChatGPT gets it wrong: spotting and fixing errors
In early 2023, a team of lawyers at a New York firm called Levidow, Levidow & Oberman used ChatGPT to research case precedents for a personal injury lawsuit. The AI produced a polished legal brief citing six court cases — complete with docket numbers, judge names, and ruling summaries. The brief was filed. The problem: the cases did not exist. ChatGPT had invented them wholesale, a phenomenon now known as 'hallucination.' When the opposing counsel couldn't locate the cases, a federal judge demanded explanations. The lawyers faced sanctions, public humiliation, and a legal profession suddenly very alert to what AI tools actually do — and don't do.
This wasn't a fringe failure caused by misuse. The attorneys were intelligent professionals doing exactly what most people do with ChatGPT: asking it a reasonable question, receiving a confident, well-formatted answer, and trusting the output. The AI wrote in the authoritative register of a legal document. It used specific details — the kind of details that signal accuracy. Nothing in the response flagged uncertainty. This is the core tension every ChatGPT user needs to understand: the model's fluency and its accuracy are completely independent of each other. A response can be beautifully written and completely wrong at the same time.
The Levidow case became a landmark lesson not because lawyers are careless, but because they're not. Professionals trained to scrutinize language trusted an AI output they shouldn't have. The lesson to extract isn't 'don't use AI.' It's something more precise and more useful: every ChatGPT output can mix three types of content — things that are accurate, things that are plausible-sounding but wrong, and things that are missing entirely. Your job, as the human in the loop, is to know which category you're dealing with and respond accordingly.
What 'Hallucination' Actually Means
Three Types of Errors — and Why They Look Identical
Think about how a smart, well-read colleague behaves when they're slightly out of their depth. They don't freeze or say 'I have no idea.' They synthesize what they know, make confident-sounding inferences, and present the result as if it's solid. ChatGPT does the same thing, but at scale and without the self-awareness to know when it's doing it. This is why the errors are so dangerous for professionals: they arrive dressed in the same formatting, tone, and confidence as the accurate outputs. Understanding the three distinct error types is the first skill in working with AI effectively.
The first error type is factual hallucination — the Levidow problem. The model generates specific details (names, dates, statistics, citations) that sound credible but are fabricated. This happens because the model learned that legal briefs contain case citations, so it produces case citations. It learned that research papers cite studies, so it produces study citations. The content fits the pattern of the format, even when the underlying facts don't exist. Factual hallucinations are the most dangerous type because they're the hardest to detect without external verification. A hallucinated statistic reads identically to a real one.
The second error type is the reasoning error — the facts are real, but the logic connecting them is wrong. The third is the omission error — the response is technically accurate but leaves out critical context that changes the meaning entirely. A marketing manager asking ChatGPT for advice on a promotional pricing strategy might get a tactically sound answer that completely ignores the legal constraints around price advertising in their industry. Nothing in the response is fabricated. Everything is just... incomplete in a way that could cost the business. These three error types — hallucination, faulty reasoning, and critical omission — require different detection strategies. The comparison table below extends them with two further failure modes, outdated information and context mismatch, which is why the exercise and takeaways that follow refer to five error types in total.
Prompt
What were the key findings of the McKinsey Global Institute's 2021 report on remote work productivity?
AI Response
The McKinsey Global Institute's 2021 report found that remote workers were approximately 13% more productive than office workers, with knowledge workers showing the highest gains. The report also noted that companies with flexible work policies saw 20% lower attrition rates and recommended that organizations adopt a hybrid model with at least three remote days per week for optimal output. [Note: This response contains hallucinated specifics. McKinsey published research on remote work in this period, but the exact figures and recommendations above were generated to fit the pattern of a credible research summary — not retrieved from the actual report. Always verify specific statistics against the source document.]
A Different Industry, the Same Pattern
Move from law to healthcare, and the error pattern repeats — just with different stakes. A clinical content team at a mid-sized health information company was using ChatGPT to draft patient education articles in 2023. For articles about common conditions like hypertension or type 2 diabetes, the outputs were generally strong: accurate, readable, well-structured. The team's workflow saved them roughly 60% of drafting time. Then a writer used the same workflow for an article on drug interactions with a newer medication class. ChatGPT produced confident, plausible guidance — some of which contradicted current FDA labeling. No hallucinated citations this time. Just subtly wrong clinical reasoning, dressed in authoritative health-content prose.
The team caught it, because they had a clinical review step. But the near-miss revealed something important about how error risk scales with topic specificity. ChatGPT performs better on well-represented, stable topics — conditions that have been written about extensively across its training data. It performs worse on niche topics, recent developments, and areas where the training data itself may contain conflicting or outdated information. The health content team's eventual rule: use ChatGPT freely for structure, tone, and general explanations; treat every specific clinical claim as unverified until a qualified reviewer confirms it. That's not distrust of AI — it's a calibrated workflow.
| Error Type | What It Looks Like | Most Common Triggers | Detection Method |
|---|---|---|---|
| Factual Hallucination | Specific names, dates, stats, citations that don't exist | Niche topics, specific citations, recent events, numerical data | External verification against primary sources |
| Reasoning Error | Real facts connected by flawed logic | Multi-step analysis, cause-effect claims, comparisons | Step-by-step logic review; ask ChatGPT to explain its reasoning |
| Critical Omission | Accurate but dangerously incomplete | Legal/regulatory context, exceptions, edge cases | Ask: 'What important caveats or exceptions apply here?' |
| Outdated Information | Accurate as of training cutoff, wrong today | Market data, regulations, product specs, personnel | Check training cutoff; verify time-sensitive claims independently |
| Context Mismatch | Correct in general, wrong for your specific situation | Generic advice applied to specific industries or geographies | Add context to your prompt; ask 'Does this apply to [your context]?' |
The Analyst Who Changed Her Workflow
Sarah Chen, a senior market analyst at a consumer goods consultancy in Chicago, started using ChatGPT in mid-2023 to accelerate competitive analysis reports. Her initial approach mirrored most professionals': prompt, read, paste. Within three months, a client flagged an error in a report — a market share figure for a competitor that was off by nearly eight percentage points. ChatGPT had produced a number that was plausible for the category but sourced from a pattern, not a dataset. The figure had passed through Sarah's review because it looked right — it fit her prior assumptions about the market. That's the second danger: confirmation bias amplifies AI errors. We're less likely to scrutinize outputs that match what we already believe.
Sarah's response was methodical. She didn't abandon ChatGPT — she restructured how she used it. She stopped asking ChatGPT for numbers and started asking it for frameworks, structures, and analytical approaches. She used it to generate the skeleton of a competitive analysis — the categories to investigate, the questions to ask, the dimensions to compare — and then populated the skeleton with verified data from sources like Statista, Bloomberg, and primary research. Her output quality went up. Her verification time went down, because she was only fact-checking specific data points rather than auditing entire AI-generated narratives. The tool didn't change. Her mental model of what it was good for did.
The Separation Principle
What This Means When You're Actually Working
The practical implication isn't a checklist — it's a mindset shift. Before you accept any ChatGPT output, ask yourself one question: what would it cost if this specific claim were wrong? A brainstormed list of email subject lines? The cost of an error is near zero — you'd notice a bad subject line before sending. A regulatory compliance summary you're about to share with a client? The cost of an undetected error could be significant. This risk-calibrated approach means you don't treat every sentence with equal suspicion, which would make the tool useless. You focus your verification energy on the claims with actual stakes.
This also changes how you read ChatGPT responses. Most people read AI output the way they read a trusted colleague's email — linearly, looking for the main point. Effective users read it the way an editor reads a submitted draft: looking for specific claims that need sourcing, logical jumps that need examination, and gaps where important context should be. This isn't slower, once it becomes habit. Experienced ChatGPT users report that their review process takes roughly 20-30% of the time they used to spend on the full drafting task — a net time saving, even with active scrutiny built in.
There's also a prompting dimension to error prevention — not just error detection. The way you phrase your prompt influences the type and frequency of errors in the response. Vague, broad prompts invite the model to fill in gaps with plausible-sounding content. Specific, constrained prompts narrow the space where hallucinations can occur. A prompt that says 'Summarize the competitive landscape of the US cloud storage market' gives the model enormous room to generate plausible-but-wrong specifics. A prompt that says 'Based only on what you know with high confidence, list the top five US cloud storage providers by market presence, and flag anything you're uncertain about' produces a more honest, more verifiable output. The prompt is your first line of defense.
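If you find yourself retyping that kind of constraint, it can live in a reusable helper. The sketch below is illustrative only, assuming nothing beyond standard Python; the suffix wording is an assumption, not a template prescribed by this lesson.

```python
# Illustrative sketch only: the suffix wording is an assumption,
# not a template prescribed by the lesson.
VERIFY_SUFFIX = (
    "\n\nAnswer based only on what you know with high confidence. "
    "Flag anything you're uncertain about or that I should verify independently."
)

def constrain(prompt: str) -> str:
    """Turn a broad prompt into a constrained, uncertainty-flagging one."""
    return prompt.rstrip() + VERIFY_SUFFIX

if __name__ == "__main__":
    broad = "Summarize the competitive landscape of the US cloud storage market."
    print(constrain(broad))
```

The point is not the code itself but the habit it encodes: the constraint travels with every prompt instead of depending on you remembering to add it.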
Goal: Develop a personal, role-specific error detection instinct by testing ChatGPT against your actual work context — not a generic example.
1. Open ChatGPT and ask it a question you'd realistically ask in your job — something involving specific facts, figures, or industry details. Write down the exact prompt you used.
2. Read the response carefully and highlight or list every specific factual claim: any number, name, date, statistic, citation, or concrete recommendation.
3. Categorize each claim using the five error types from the comparison table: hallucination risk, reasoning claim, potential omission, time-sensitive, or context-specific.
4. Choose the three claims with the highest potential cost if wrong and verify each one against an independent source (a primary report, official website, or recognized industry database).
5. Note which claims were accurate, which were wrong, and which were unverifiable — record this in a simple table with columns: Claim | Source Checked | Status.
6. Return to ChatGPT and add this line to your original prompt: 'Flag any claims you're uncertain about or that I should verify independently.' Compare this new response to your first one.
7. Identify one structural change to your standard prompting approach — based on what you found — that would reduce your verification burden in future tasks. Write it as a one-sentence rule you'll apply going forward.
8. Share your one-sentence rule with a colleague who also uses ChatGPT and compare your rules. Note whether they identified a different error type as their primary risk.
What These Cases Teach Us
- Fluency is not accuracy. ChatGPT's confident, well-formatted prose is a product of its language model architecture — not a signal that the underlying facts are correct. The two qualities are entirely independent.
- Error types are distinct and require different responses. Hallucinations need external verification. Reasoning errors need logic review. Omissions need targeted follow-up questions. Treating all errors the same makes you inefficient at catching any of them.
- Error risk scales with topic specificity. ChatGPT is more reliable on widely-covered, stable topics and less reliable on niche, recent, or highly specific claims — especially numerical data and citations.
- Confirmation bias is the user's contribution to AI errors. We overlook wrong outputs that match our expectations. Active scrutiny requires deliberately questioning claims that feel right, not just ones that feel off.
- The prompt is your first error-prevention tool. Specific, constrained prompts with explicit requests for uncertainty flagging produce more verifiable, more honest outputs than broad, open-ended questions.
- Structure tasks and fact tasks have different reliability profiles. Using ChatGPT for frameworks, drafts, and analytical structures — then populating them with verified data — produces better results than treating it as a factual oracle.
- A calibrated workflow beats blanket trust or blanket distrust. The goal is a sustainable process where your verification effort is proportional to the stakes of each specific claim.
Key Takeaways So Far
- ChatGPT generates text by predicting likely word sequences — it doesn't retrieve verified facts, which is why hallucinations occur even when the output looks authoritative.
- The five error types are: factual hallucination, reasoning errors, critical omissions, outdated information, and context mismatches — each requiring a different detection strategy.
- The Levidow legal case, the health content near-miss, and Sarah Chen's analyst workflow all illustrate the same pattern: professional trust in fluent AI output, followed by a costly or near-costly error.
- Risk-calibrated review — focusing scrutiny on high-stakes specific claims rather than auditing every sentence — is the practical approach used by experienced AI users.
- Prompting for uncertainty ('flag what you're not sure about') is a simple, high-impact technique that shifts the error detection burden onto the model before you even start reviewing.
The Confident Fabricator: How a Law Firm Learned to Stop Trusting Smooth Sentences
In 2023, lawyers at Levidow, Levidow & Oberman submitted a legal brief citing six court cases to support their argument. ChatGPT had generated the citations. Every single case was fabricated — plausible-sounding names, realistic docket numbers, invented judges. The cases did not exist. When the opposing counsel couldn't find them, a federal judge demanded explanations. The attorneys faced sanctions and public humiliation. What made this disaster so instructive wasn't that ChatGPT lied maliciously. It was that the model produced citations with the same confident, authoritative tone it uses for everything else. There was no hesitation. No asterisk. No 'I'm not certain about this one.' The fluency was the trap.
This case became infamous, but the underlying pattern repeats across industries every week. A marketing analyst uses ChatGPT to pull competitor statistics for a board presentation. A consultant asks for market size figures for an emerging sector. A journalist requests background facts about a CEO. In each case, the model produces clean, specific, confident-sounding information — and in each case, some of that information may be entirely invented. The technical term is 'hallucination,' but that word undersells the problem. These aren't vague confusions. They're precise fabrications: specific percentages, named individuals, exact dates, quoted passages that were never written.
The principle the law firm's experience reveals is this: ChatGPT's confidence level is not a signal of accuracy. The model generates the most probable next token — the word or phrase that statistically fits the context — not the most verifiably true one. When asked for a court case supporting a specific legal argument, producing a convincing-sounding case name is exactly what the model is optimized to do. It has no internal fact-checker running in parallel. It cannot feel uncertainty the way a human expert does. Fluency and accuracy are completely separate variables in how this technology works.
Why Hallucinations Feel So Real
Three Categories of Errors — and Why Each Needs a Different Fix
Not all ChatGPT errors are the same kind of wrong. The law firm's fabricated citations represent one category: invented specifics — facts, names, statistics, and sources that don't exist. But there are two other distinct failure modes that trip up professionals just as often, and each requires a different detection strategy. Conflating them leads to patchy verification habits. You might dutifully Google every statistic but completely miss an error that lives in the logic of an argument rather than a data point. Understanding the three categories turns error-spotting from a vague anxiety into a concrete skill.
- Factual fabrication: Invented specifics presented as real — statistics, citations, names, dates, product features, historical events. These are the most dangerous because they're hardest to detect without external verification.
- Logical errors: Conclusions that don't follow from premises, flawed reasoning chains, or correct facts assembled into an incorrect argument. ChatGPT can cite real statistics and still draw the wrong conclusion from them.
- Outdated information: Accurate facts as of the model's training cutoff (early 2024 for GPT-4o) that have since changed — market conditions, regulations, personnel, product versions, pricing. The model presents these with no expiration date attached.
Each category has a different origin inside the model. Factual fabrications happen when the model's training data didn't contain the specific answer, so it generates a plausible-sounding substitute. Logical errors happen because the model predicts text, not truth — it can produce a grammatically coherent argument that is logically broken. Outdated information happens because the model was frozen at a point in time and has no mechanism to know what has changed since. Recognizing which type of error you're dealing with tells you immediately whether to Google it, interrogate the reasoning, or check a current source.
Logical Errors in the Wild: A Strategy Consultant's Near-Miss
A strategy consultant at a mid-sized firm was preparing a competitive analysis for a retail client. She asked ChatGPT to analyze why a competitor's same-store sales had declined, providing it with several real data points she'd gathered. The model produced a polished five-paragraph analysis attributing the decline to three factors: supply chain disruptions, a failed loyalty program relaunch, and increased competition from discount retailers. The facts it used were real. The writing was crisp. But the logic was broken: the model had treated correlation as causation, and it had weighted the loyalty program failure heavily — even though the data she'd provided showed the decline predated the loyalty program relaunch by two quarters.
She caught it only because she knew the timeline. A colleague reviewing the same output without that context would have accepted the analysis entirely. This is the insidious quality of logical errors in ChatGPT output: they hide inside correct facts and fluent prose. The model assembled real information into a plausible-sounding narrative without checking whether the causal sequence actually held. For professionals whose job is reasoning — consultants, analysts, lawyers, strategists — this category of error is arguably more dangerous than a fabricated statistic, because it's harder to spot with a simple fact-check.
Prompt
Here is an analysis I generated with your help earlier: [paste analysis] Now act as a skeptical critic. Identify any places where the reasoning jumps from correlation to causation, where a conclusion doesn't follow logically from the evidence provided, or where an important alternative explanation has been ignored. Be specific and direct.
AI Response
Looking at this analysis critically: 1. Causal leap in paragraph 2: The analysis states the loyalty program relaunch 'caused' the sales decline, but the data you provided shows the decline began Q2 2022 while the relaunch was Q4 2022. The relaunch may have failed to reverse the decline, but it cannot be its cause. 2. Missing alternative: Supply chain disruptions are cited without comparing them to industry-wide benchmarks. If competitors faced identical disruptions without the same sales impact, this factor's weight in the analysis is overstated. 3. Paragraph 4 assumes the discount retail threat is new, but no baseline comparison is provided. This needs either a market share trend or a direct citation to support it as a recent shift rather than an ongoing condition.
The Outdated Information Problem: Where Healthcare Gets Burned
A clinical operations manager at a regional hospital network used ChatGPT to draft a summary of reimbursement rates for a specific category of outpatient procedures. The model produced a detailed, formatted table — CPT codes, descriptions, and dollar figures. The figures were accurate. For 2022. CMS had updated the rates in January 2024, and several codes had been restructured entirely. The operations manager was working in late 2024. She submitted the summary internally before anyone caught the discrepancy, which required a correction memo and a conversation with the CFO she'd rather not have had.
Healthcare is an extreme case because regulatory and reimbursement data changes on fixed annual cycles — but the same vulnerability exists anywhere rules, prices, or standards shift over time. Tax law. Software documentation. Compliance requirements. Import regulations. Organizational structures. ChatGPT's training data has a cutoff, and the model does not spontaneously flag when its information is likely to be stale. It answers questions about 2022 reimbursement rates with the same present-tense confidence it uses for everything else. For any domain where currency of information matters, treating ChatGPT output as 'current' is the error — not the model's fault, but definitely the user's risk to manage.
| Error Type | What It Looks Like | How to Detect It | How to Fix It |
|---|---|---|---|
| Factual Fabrication | Specific names, stats, citations, dates that sound precise and authoritative | Google the specific claim; check primary sources; ask ChatGPT 'how confident are you and what's your source?' | Never use unverified specifics in external-facing work; use ChatGPT for structure, not sourcing |
| Logical Error | Correct facts assembled into flawed conclusions; causation/correlation confusion; missing alternatives | Ask ChatGPT to critique its own reasoning; check the logical sequence manually; look for unsupported causal claims | Use the 'skeptical critic' prompt; explicitly ask for counterarguments before finalizing |
| Outdated Information | Regulatory figures, pricing, personnel, software features, market data presented in present tense | Check the model's training cutoff; verify time-sensitive claims against current official sources | Use Perplexity or ChatGPT with web browsing enabled for anything where recency matters; always date-check regulatory content |
A Product Manager's Playbook: Building Verification Into the Workflow
Priya, a senior product manager at a SaaS company, uses ChatGPT daily for competitive research, feature framing, and stakeholder communications. Early on, she made the same mistakes most people make: treating outputs as drafts to polish rather than claims to verify. After one embarrassing moment — quoting, during a sales call, a competitor's pricing that had changed six months earlier — she rebuilt her workflow around a simple rule: 'ChatGPT generates structure and language; I own the facts.' This distinction sounds obvious but changes everything about how you interact with the tool.
Her current process runs in two passes. In the first pass, she uses ChatGPT freely — brainstorming, drafting, structuring arguments, generating options. She's not verifying anything yet; she's generating raw material. In the second pass, she reads the output specifically hunting for claims that are specific and verifiable: any number, any named company or person, any regulatory reference, any feature description, any trend assertion. Each one gets a quick check against a primary source — the company's own website, an official report, a current news article. Claims that can't be verified get either removed or reframed as hypotheses. The whole process adds maybe ten minutes to a typical work product. The credibility protection is worth far more.
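For readers who like a bit of tooling, a rough sketch of that second pass appears below. It only illustrates the idea of pulling specific, checkable claims out of a draft; the patterns are assumptions for this sketch and will both over- and under-match, so it supplements a careful read rather than replacing it.

```python
import re

# Rough illustrative patterns for claim-hunting. They are assumptions for this
# sketch, not a complete or reliable claim detector.
CLAIM_PATTERNS = {
    "number/percentage": r"\b\d[\d,.]*\s?%?",
    "year": r"\b(?:19|20)\d{2}\b",
    "money figure": r"[$€£]\s?\d[\d,.]*(?:\s?(?:million|billion|bn|k))?",
    "named company": r"\b(?:[A-Z][a-z]+\s){1,3}(?:Inc\.|Corp\.|LLC|Ltd\.)",
}

def extract_claims(draft: str) -> list[tuple[str, str]]:
    """Return (claim type, matched text) pairs to seed a verification checklist."""
    hits = []
    for label, pattern in CLAIM_PATTERNS.items():
        for match in re.finditer(pattern, draft):
            hits.append((label, match.group().strip()))
    return hits

if __name__ == "__main__":
    draft = "Acme Corp. grew 23% in 2023, reaching $1.2 billion in cloud revenue."
    for label, text in extract_claims(draft):
        print(f"[{label}] {text} -> check against a primary source")
```

Whatever form the checklist takes, the output is the same: a short list of specific claims, each of which gets a source or gets cut.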
The Two-Pass Rule
What This Means When You're Working Under Pressure
The scenarios above — a law firm, a consultant, a hospital administrator, a product manager — share a common pressure: real deadlines, real stakes, real audiences. In each case, the professional using ChatGPT wasn't careless. They were busy, and the output looked good. This is the actual context in which AI errors cause damage. Nobody reads a beautifully formatted, fluently written analysis and thinks 'I should check whether this reasoning is logically sound.' The polished surface creates a cognitive shortcut that bypasses critical evaluation. Recognizing this bias in yourself is as important as any technical understanding of how the model works.
The practical implication is that your verification effort should be proportional to the stakes of the output — not the length of it. A 500-word internal Slack summary about a process change carries very different risk than a 500-word section in a board presentation. Both might take the same time to generate with ChatGPT. The board presentation section needs every factual claim checked against a source; the Slack summary probably doesn't. Calibrating your verification effort this way is how experienced AI users protect their credibility without spending an hour fact-checking every paragraph they generate. Risk-weighted verification, not blanket skepticism.
There's also a category of errors that no amount of external verification will catch: errors of omission. ChatGPT can produce a technically accurate, logically sound analysis that still misleads because it left out the most important counterargument, the most relevant exception, or the most critical caveat. The model optimizes for producing complete-seeming responses, not genuinely complete ones. For high-stakes work, the final quality check isn't 'is this accurate?' but 'what would a knowledgeable skeptic say is missing from this?' That question — asked of yourself or built into a follow-up prompt — catches the class of error that fact-checking alone never will.
Goal: Build the habit of reading ChatGPT outputs through three distinct lenses — factual accuracy, logical soundness, and currency — and experience firsthand how many errors survive a single casual read-through.
1. Choose a work task you'd realistically use ChatGPT for — a competitive summary, a policy explanation, a market overview, or a project proposal. Pick something with real stakes.
2. Write and submit a prompt for that task in ChatGPT. Let it generate a full response without any special instructions — just your normal ask.
3. Read the output once through without editing. Note your gut reaction: does it feel credible? Identify what made it feel that way.
4. Now re-read specifically hunting for Category 1 errors — factual fabrications. Highlight every specific number, name, citation, date, or product claim. List them in a separate document.
5. For each highlighted item, spend 60 seconds checking it against a primary source (official website, published report, news article). Mark each one: Verified, Unverified, or Incorrect.
6. Re-read the output a third time looking for Category 2 errors — logical issues. Ask yourself: does each conclusion follow from the evidence? Are there causal claims? Are alternative explanations ignored? Write down any reasoning you'd challenge.
7. Identify the three most time-sensitive claims in the output — anything that could have changed in the past 12 months. Check at least one against a current source.
8. Rewrite the output incorporating your corrections. Where claims couldn't be verified, either remove them or reframe them explicitly as estimates or hypotheses.
9. Compare the original and revised versions side by side. Note how many changes you made and what category each fell into. This comparison is your learning artifact.
Lessons From the Cases
- Fluency is not accuracy. The smoothness of ChatGPT's prose is a product of how it generates text, not a signal that the content is correct. These are completely independent qualities.
- There are three distinct error types — fabrication, logical errors, and outdated information — and each requires a different detection approach. Treating them as one problem leads to incomplete checking.
- High-stakes outputs deserve proportionally higher verification effort. A board presentation and a Slack message are not the same risk category, even if they're the same length.
- Asking ChatGPT to critique its own output is a legitimate and effective tool. The model can identify weaknesses in its own reasoning when explicitly asked to do so.
- Errors of omission are invisible to fact-checking. The question 'what's missing?' is as important as 'is this accurate?'
- The two-pass approach — generate first, verify second — separates the creative and critical modes of working with AI, which makes both more effective.
- Currency matters in any domain where rules, prices, or conditions change. For anything time-sensitive, ChatGPT's output is a starting point, not a current source.
- ChatGPT's training cutoff means it presents stale information with the same confidence as current facts — you are responsible for checking recency on anything time-sensitive.
- The most dangerous errors are often the most polished — if an output reads beautifully, that's the moment to read it most critically.
- Verification effort should be risk-weighted: match your checking intensity to the audience and stakes of the output, not its length.
- Building a two-pass workflow (generate, then verify) protects your credibility without adding significant time to most tasks.
- Asking ChatGPT to play 'skeptical critic' on its own outputs is one of the highest-value prompting techniques for professional work.
When ChatGPT Sounds Right But Isn't: The Confidence Problem
In 2023, a New York attorney named Steven Schwartz submitted a legal brief citing six court cases to support his client's position. ChatGPT had generated the research. The problem: every single case was fabricated. Mata v. Avianca became a landmark cautionary tale not because the lawyer was careless, but because ChatGPT's output was indistinguishable from real legal citations — complete with plausible case numbers, realistic judge names, and coherent summaries of rulings that never happened. The court sanctioned Schwartz and his firm. The story made front pages globally.
What makes this case instructive isn't the scale of the error — it's the mechanism. ChatGPT doesn't retrieve facts from a database the way Google searches the web. It predicts the most statistically plausible next word, then the next, then the next. When asked for case law, it produces text that looks exactly like case law, because that's what follows the pattern of legal research prompts in its training data. It has no internal alarm that fires when a case doesn't exist. Confidence in the output is baked into the generation process itself. This is the core tension every professional must understand before trusting AI with anything consequential.
The extracted principle here is stark: fluency is not accuracy. ChatGPT's polished, authoritative tone is a feature of how language models work, not a signal of factual reliability. Every professional who uses it — lawyer, marketer, analyst, consultant — needs a personal verification layer that sits between AI output and final use. Building that layer isn't paranoia; it's the difference between Schwartz's outcome and a colleague who used the same tool effectively on the same day.
Why ChatGPT Hallucinates
The Pattern: Where Errors Cluster
A pharmaceutical marketing team at a mid-sized biotech firm learned this pattern the hard way in late 2023. They used ChatGPT to draft a competitive landscape summary, asking it to compare their drug's clinical trial results against three named competitors. The prose was sharp, the structure was clean, and the statistics cited were wrong — not wildly wrong, but subtly off. One competitor's Phase 3 completion rate was stated as 78% when the published figure was 71%. Another drug's approval timeline was compressed by eight months. None of it was invented wholesale; it was distorted, like a photocopy of a photocopy.
The team's medical affairs lead caught the errors during review — but only because she already knew the landscape. A newer team member using the same output would have circulated misinformation internally. This points to a critical pattern: ChatGPT makes the most errors in domains where precision matters most — statistics, dates, named entities, regulatory details, and anything requiring knowledge of events after its training cutoff. It performs best on tasks that don't require factual precision: structuring arguments, rewriting for clarity, generating options, summarizing documents you've already verified.
Prompt
Summarize the key differences between IFRS 16 and ASC 842 lease accounting standards. For any detail you're uncertain about or that may have changed since your training, flag it explicitly with [VERIFY].
AI Response
Both IFRS 16 and ASC 842 require lessees to recognize most leases on the balance sheet, replacing the old operating/finance lease distinction for lessees under IFRS 16 entirely. Key differences include: 1. Lessee P&L treatment: IFRS 16 separates depreciation and interest expense; ASC 842 allows a straight-line single lease cost for operating leases. 2. Short-term lease threshold: Both exempt leases under 12 months, but application details differ. [VERIFY — specific practical expedient elections may have updated guidance.] 3. Lessor accounting: Largely unchanged under both, but alignment between the two frameworks is imperfect. [VERIFY — any FASB or IASB amendments post-2023.] Recommendation: Confirm items marked [VERIFY] against current FASB/IASB publications before use in client materials.
The Fix Isn't Less AI — It's Smarter Use
A senior analyst at a management consultancy describes her approach as 'trust the structure, verify the substance.' She uses ChatGPT to build the skeleton of every deliverable — the argument flow, the section headers, the framing of trade-offs — and treats every specific number, name, or regulatory reference as unverified until she's confirmed it herself. In her words: 'I'm not fact-checking AI. I'm using AI for the 60% of work that doesn't require facts, and doing the fact-sensitive 40% myself.' Her team's output speed doubled. Their error rate didn't increase.
This division of labor maps cleanly onto error types. ChatGPT rarely gets wrong the things that don't have a single correct answer — tone, structure, persuasion, creative options, reframing a problem. It most frequently gets wrong the things that do have a single correct answer — specific figures, proper names, citations, recent events, and domain-specific technical details. Once you internalize this map, you stop being surprised by errors and start routing tasks accordingly.
| Task Type | ChatGPT Reliability | Verification Needed | Example |
|---|---|---|---|
| Structuring an argument | High | Minimal | Outline for a client proposal |
| Rewriting for clarity | High | Minimal | Simplifying a technical report |
| Brainstorming options | High | Minimal | Marketing campaign angles |
| Summarizing your own document | High | Light | Meeting notes → action items |
| Named statistics or data | Low–Medium | Always verify | Market size figures, survey results |
| Legal or regulatory specifics | Low | Always verify | Compliance deadlines, case citations |
| Recent events (post-cutoff) | Very Low | Always verify | Anything after early 2024 |
| Technical domain details | Variable | Verify if consequential | Medical dosages, engineering specs |
Teaching ChatGPT to Show Its Uncertainty
A content strategist at a B2B SaaS company found a practical fix that changed how her entire team works. Instead of asking ChatGPT to simply produce content, she added a standing instruction to every research-adjacent prompt: 'Flag anything you're not confident about.' The model began marking uncertain claims with phrases like 'I believe this figure was approximately...' or 'You should confirm this against current documentation.' It wasn't perfect, but it shifted the team's mental posture — they stopped reading AI output as finished product and started reading it as a first draft with annotations.
The deeper lesson from her experience is about prompt design as a form of quality control. You can build verification behaviors directly into your prompts. Asking ChatGPT to 'list any assumptions you've made' or 'note which claims I should independently verify' doesn't make the model more accurate — it makes the uncertainty visible, which is the next best thing. Perplexity AI takes a different architectural approach, citing sources inline, which helps with recency but doesn't eliminate hallucination. The professional's job is to pick the right tool for the right task and build the right habits around it.
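For teams that reach the model through the API rather than the chat window, the same standing instruction can be attached to every research-adjacent call. The sketch below is a minimal illustration, assuming the official openai Python package and an API key in the environment; the model name, wording, and function name are placeholders, not recommendations from this lesson.

```python
from openai import OpenAI

# Sketch only: assumes the openai Python package (v1+) and an API key in the
# OPENAI_API_KEY environment variable. Wording and model name are placeholders.
STANDING_INSTRUCTION = (
    "Flag anything you are not confident about, list any assumptions you have "
    "made, and note which claims the reader should verify independently."
)

client = OpenAI()

def research_ask(question: str, model: str = "gpt-4o") -> str:
    """Send a research-adjacent question with the standing instruction attached."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": STANDING_INSTRUCTION},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(research_ask("What are the key differences between IFRS 16 and ASC 842?"))
```

The design choice mirrors the strategist's habit: the instruction lives alongside every question, so nobody has to remember to type it.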
The Two-Prompt Verification Trick
What this means in practice is a shift in how you read AI output. Treat every ChatGPT response the way a good editor treats a first draft from a new writer — assume the ideas may be strong, assume the specific details need checking. This isn't a burden if you've already decided which tasks you're routing to AI. You're using ChatGPT for the heavy lifting on structure and language, and you're applying your professional judgment to the details that could cause real damage if wrong.
The professionals who get burned by AI errors share a common behavior: they read polished output and assume polish equals accuracy. The professionals who use AI effectively share a different behavior: they've built a mental model of where errors live, they've designed prompts that surface uncertainty, and they've established a fast personal verification routine for anything consequential. That routine doesn't have to be elaborate — it just has to exist.
The Schwartz case, the biotech team's distorted statistics, the SaaS strategist's flagging system — each story points to the same operating principle. ChatGPT is a powerful collaborator with a specific, knowable failure mode. Understanding that failure mode doesn't make you a skeptic; it makes you someone who can use the tool at full power without the risk of a front-page mistake. The goal isn't distrust. The goal is calibrated trust — knowing exactly what to hand over and exactly what to hold onto.
Goal: Produce a personalized AI Trust Map that reflects your specific role and tasks — a reference you'll use every time you work with ChatGPT.
1. Open ChatGPT and paste in a real work task you've been putting off — a summary, a draft email, a short analysis. Make it something you'd actually use.
2. Add this instruction at the end of your prompt: 'After your response, list any specific facts, figures, dates, or named sources you've included, and rate your confidence in each as High, Medium, or Low.'
3. Read the full response. Highlight or copy every item ChatGPT rated Medium or Low confidence.
4. Send a follow-up prompt: 'Which single claim in your response are you most likely to be wrong about, and what would I need to check to verify it?'
5. Take the highest-risk claim and spend 3 minutes verifying it using a primary source — the original report, the official website, or a known-reliable publication.
6. Note whether ChatGPT's self-assessment was accurate. Did it flag the thing that turned out to be wrong?
7. Create a simple two-column note (in any app you use daily) titled 'AI Trust Map' with columns: 'Hand to AI' and 'Verify myself.' Populate it with at least four task types from your actual job based on what you've learned.
8. Save this note. It becomes your standing operating procedure for AI-assisted work going forward.
- Fluency is not accuracy — ChatGPT's confident tone is a product of how it generates text, not a signal that the content is correct.
- Errors cluster in specific places: statistics, citations, named entities, regulatory details, and anything after the model's training cutoff.
- The most effective professionals split tasks: use AI for structure, framing, and language; apply personal verification to facts that matter.
- You can build verification behaviors into prompts by asking ChatGPT to flag uncertainty, list assumptions, or identify its weakest claims.
- Polished output is the most dangerous kind of error — it bypasses the skepticism that rough output would trigger.
- Tools like Perplexity AI offer source citations and can reduce recency errors, but no current tool eliminates hallucination entirely.
- A fast, consistent personal verification routine — not elaborate, just existent — is what separates professionals who trust AI safely from those who get burned.
- ChatGPT predicts plausible text — it doesn't retrieve verified facts, which is why it can fabricate citations with complete confidence
- High-risk error zones: specific numbers, legal or regulatory references, named sources, and events after its training cutoff
- Low-risk zones: structuring arguments, rewriting, brainstorming, and summarizing documents you've already verified
- Prompt design is quality control — asking for flagged uncertainty makes errors visible even when it can't eliminate them
- The two-prompt verification trick (ask ChatGPT what it's least confident about) catches a meaningful share of errors in seconds
- Calibrated trust — knowing exactly what to delegate and what to hold — is the professional skill that makes AI genuinely useful
Why did ChatGPT fabricate convincing legal citations in the Mata v. Avianca case?
A marketing analyst asks ChatGPT to help structure a competitive analysis report. Based on the reliability framework in this lesson, what's the safest approach?
You add the instruction 'flag anything you're not confident about' to a research prompt. What does this actually achieve?
Which task types carry the highest risk of a consequential ChatGPT error?
A consultant reads a ChatGPT response that is well-written, clearly structured, and cites a specific regulation with a plausible-sounding reference number. She thinks: 'This looks professional, so it's probably accurate.' What's the core error in her reasoning?
