Lesson 2 of 5

Spot the Real Results: Vendor Claims Decoded

~24 min read. Last reviewed May 2026.

This lesson counts toward: Lead Through AI: The Executive Playbook; Teach Smarter, Learn Faster; Lead AI Adoption, Drive Results; Run Smarter: AI for Operations Leaders; Grow Faster: AI for Small Teams

Evaluating AI Vendor Claims

Picture this: It's Tuesday afternoon and a software vendor has just wrapped up a 45-minute demo in your conference room. Their slides were slick. The live demo looked impressive. And now your CFO is asking you whether the company should sign the $80,000 annual contract. The vendor claimed their AI tool will "reduce your team's administrative workload by 60%" and "deliver ROI within 90 days." Your gut says something feels off, but you can't quite put your finger on what. You don't want to look uninformed in front of your leadership team, and you definitely don't want to approve a purchase that fails to deliver. This is exactly the situation AI literacy is built for, and exactly what this lesson helps you navigate.

Why AI Vendor Claims Deserve Serious Scrutiny

The AI software market is projected to exceed $500 billion by 2028, according to IDC. That growth has attracted thousands of vendors, many of whom are competing fiercely for budget dollars. In that environment, marketing language tends to get elastic. Claims like "our AI is 98% accurate," "powered by the latest GPT technology," or "trusted by over 10,000 companies" are common. Some of those claims are meaningful. Many are not. Evaluating them requires a specific kind of literacy: not technical expertise, but the ability to ask the right questions and to tell a benchmark apart from a real-world result. Most non-technical professionals were never taught this skill. This lesson fixes that.

Vendor claims typically fall into three categories. First, performance claims: "Our AI achieves 95% accuracy" or "processes 10,000 documents per hour." Second, outcome claims: "Customers save 8 hours per week" or "reduce churn by 30%." Third, technology claims: "Built on GPT-4," "uses proprietary large language models," or "enterprise-grade AI." Each category requires a different set of questions. A performance claim without context about the test conditions is nearly meaningless. An outcome claim without information about which customers, doing what tasks, in what industry, is a marketing number, not a business case. A technology claim that name-drops a well-known AI model tells you about ingredients, not about whether the final product actually works for your specific use case.

Here is the core principle to carry through this entire lesson: AI tools are not evaluated the same way traditional software is evaluated. Traditional software either does what it says or it doesn't: a spreadsheet either calculates correctly or it has a bug. AI systems produce probabilistic outputs. They are right most of the time, wrong some of the time, and confidently wrong in ways that can be hard to detect. That fundamental difference changes how you assess vendor claims, how you structure pilots, and how you define success. When a vendor says their tool is "highly accurate," your first question should always be: accurate at what, measured how, and compared to what baseline?

The Benchmark Trap

Vendors frequently quote accuracy or performance numbers from internal benchmarks: tests they designed themselves using data they selected. Always ask: "Is this benchmark result from an independent third-party evaluation, or did your team run the test?" Also ask: "Was the test data similar to our actual data?" A hiring AI trained on tech-company resumes may perform very differently when your company processes resumes for retail or healthcare roles. Benchmark numbers are a starting point for a conversation, not a buying signal.

The Five Questions That Cut Through the Noise

Experienced technology buyers develop a mental checklist they run through for every vendor claim. You don't need a technical background to use this checklist; you need a clear head and a willingness to ask follow-up questions that vendors sometimes find uncomfortable. The five questions below work across every AI product category, from AI-powered CRM tools to automated HR screening software to AI writing assistants. They work in the demo room, in the proposal review, and in the pilot debrief. Commit them to memory and you will immediately become the most rigorous AI evaluator in most rooms.

Accurate at what, exactly? Ask the vendor to define the specific task their accuracy number measures. "95% accurate" at summarizing a document is very different from "95% accurate" at extracting contract clauses or detecting fraud.
Measured on whose data? Ask whether the benchmark used the vendor's own curated dataset or independent, real-world data. Ask whether the test data resembles your industry, your documents, your customers.
Compared to what baseline? A 40% improvement over doing nothing is very different from a 40% improvement over the current best alternative. Ask what the control condition was.
Which customers, doing what? When vendors cite customer outcomes, ask for case studies from companies your size, in your industry, doing the same type of work. Outcomes from a 5,000-person enterprise rarely transfer to a 200-person company.
What does failure look like? Ask the vendor to describe a case where their tool underperformed or failed. A vendor who can't answer this question honestly has not done serious evaluation of their own product.

Putting the Questions to Work: A Real Vendor Scenario

Imagine you manage a customer service team of 25 people. A vendor is pitching an AI tool that automatically drafts responses to customer emails, claiming it will "cut response time by 50% and reduce agent workload by 6 hours per week per agent." Using the five questions above, you'd immediately probe: 50% faster than what, your current average or an industry benchmark? Was the 6-hour figure measured in a contact center handling 500 emails per day or one handling 50? Which types of customer emails did the AI handle well, and which ones did it struggle with? Were the test customers in your industry? And critically, what happened when the AI got it wrong: did agents catch the errors, and how long did correction take? Those questions take five minutes to ask and could save you from an $80,000 mistake.
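The last question in that scenario, whether correction time is counted in the savings, is easy to check with rough arithmetic. The sketch below is illustrative only: the drafts per week, error rate, minutes per fix, and working weeks per year are assumed values, not vendor data, and the function names are made up for this example. Only the 25 agents, the 6-hour claim, and the $80,000 price come from the scenario.

```python
# Figures taken from the scenario above.
AGENTS = 25                     # team size
CLAIMED_HOURS_SAVED = 6.0       # vendor claim: hours per agent per week
ANNUAL_CONTRACT_COST = 80_000   # contract price in dollars


def net_hours_saved(claimed_hours, drafts_per_week, error_rate, minutes_per_fix):
    """Weekly hours saved per agent after subtracting time spent fixing bad drafts."""
    correction_hours = drafts_per_week * error_rate * minutes_per_fix / 60
    return claimed_hours - correction_hours


def break_even_hourly_cost(net_hours, agents, annual_cost, weeks_per_year=48):
    """Loaded hourly staff cost at which the saved time just pays for the contract."""
    annual_hours = net_hours * agents * weeks_per_year
    if annual_hours <= 0:
        return float("inf")  # the tool never pays for itself
    return annual_cost / annual_hours


# ASSUMED pilot observations (replace with your own measurements):
# 40 AI drafts per agent per week, 10% need correction, 15 minutes per correction.
net = net_hours_saved(CLAIMED_HOURS_SAVED, drafts_per_week=40,
                      error_rate=0.10, minutes_per_fix=15)
threshold = break_even_hourly_cost(net, AGENTS, ANNUAL_CONTRACT_COST)

print(f"Net hours saved per agent per week: {net:.1f}")
print(f"Break-even loaded hourly cost: ${threshold:.2f}")
```

If the break-even hourly cost comes out above what your agents actually cost per hour, the contract does not pay for itself on time savings alone, even if the vendor's headline figure holds.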

Analyzing a vendor's claim before a procurement meeting

Prompt

I'm a customer service manager evaluating an AI email-response tool. The vendor claims their tool reduces agent workload by 6 hours per week per agent and cuts response time by 50%. They say it's "98% accurate" and "trusted by over 500 companies." I have a meeting with them tomorrow. Help me prepare 8 specific questions I should ask to evaluate whether these claims are credible and relevant to my team of 25 agents handling about 200 customer emails per day in the financial services industry.

AI Response

Great, here are 8 targeted questions to bring to your meeting:
1. **Define the 98% accuracy figure.** What specific task is it measuring: correct grammar, appropriate tone, accurate information, or resolution rate? Who conducted the measurement, and when?
2. **Clarify the 6-hours-per-week claim.** Which customers achieved this? What was their email volume, team size, and industry? How long after implementation was it measured?
3. **Unpack the 50% response-time reduction.** Is that 50% faster than your current average handle time, or compared to a manual-only baseline? What was the starting point?
4. **Ask about financial services specifically.** Of the 500+ companies, how many are in financial services? Regulatory language and compliance requirements make this sector different. Can they share a relevant case study?
5. **Ask about failure modes.** What types of customer emails does the AI handle poorly? How does it behave with complaints, sensitive topics, or account-specific queries that require system lookups?
6. **Request a pilot structure.** Can you run a 30-day pilot on a subset of your email queue before committing? What metrics will they help you track?
7. **Ask about human review.** Does the tool send responses automatically or queue them for agent approval? What's the recommended review workflow for a regulated industry?
8. **Ask about error correction time.** When the AI drafts an incorrect response, how long does it take an agent to identify and fix it? Does that time get factored into the workload savings calculation?

Comparing Claim Patterns Across AI Tool Categories

Different AI tool categories (writing assistants, meeting summarizers, HR screening tools, sales intelligence platforms, document processors) present claims in different ways. Understanding the typical claim patterns for each category helps you spot what's missing. The table below compares common claim types across five AI tool categories that non-technical professionals encounter most often, and flags the critical question each claim type requires.

AI Tool Category	Common Vendor Claim	What It Sounds Like	Critical Question to Ask
Writing Assistants (Grammarly AI, Jasper, Copy.ai)	Improves writing quality by X%	"Teams using our tool produce 40% higher-quality content"	How is 'quality' defined and measured? By whom?
Meeting Summarizers (Otter.ai, Fireflies, Microsoft Copilot)	Saves X hours per week	"Users save 5 hours per week on meeting follow-up"	Does this include time spent correcting AI errors or reformatting outputs?
HR Screening Tools (HireVue, Workday AI, Eightfold)	X% more accurate than human review	"Our AI is 3x more accurate at predicting job performance"	Accurate compared to what outcome metric, over what time period, in which roles?
Sales Intelligence (Salesforce Einstein, Gong, Apollo AI)	Increases win rate by X%	"Customers see a 28% increase in deal close rates"	Which customer segment? Enterprise or SMB? What was the baseline close rate?
Document Processing (Adobe AI, Notion AI, DocuSign AI)	Processes documents X times faster	"Reduce contract review time by 70%"	For which document types and complexity levels? What error rate is acceptable?

Common AI vendor claim patterns by tool category, with the critical question each claim requires.

What Good Evidence Actually Looks Like

Not all vendor evidence is equally credible. There is a meaningful difference between a vendor saying "our customers love us" and a vendor presenting a structured case study with before-and-after metrics from a named company in your industry. Recognizing the hierarchy of evidence helps you quickly sort strong claims from weak ones. At the top of the credibility ladder: independent third-party evaluations, peer-reviewed research, and audited customer outcome data. In the middle: detailed named case studies with specific metrics and methodology. At the bottom: anonymous testimonials, aggregate statistics across wildly different customer types, and internal benchmark results without disclosed methodology.

Historical Record: Microsoft (2022)

Microsoft claims that Copilot for Microsoft 365 saves users an average of 1.2 hours per week, based on a study Microsoft commissioned.

Understanding the source and methodology of vendor-published performance claims is essential when evaluating AI tool effectiveness.

There is one more dimension of evidence that gets overlooked: recency. AI tools improve rapidly, but they can also degrade. A model that performed well in 2022 may have been updated in ways that changed its behavior. Likewise, a case study from 2021 may not reflect the current product at all. Always ask when the cited results were measured and whether the product has had major updates since then. For AI tools specifically, a "major update" can mean a completely different underlying model, and therefore meaningfully different performance characteristics. Asking "is this the same model version that was used in the case study?" is a legitimate and important question.

Watch for 'AI-Powered' as a Label, Not a Description

Many vendors now describe their products as 'AI-powered' or 'AI-driven' without specifying what the AI actually does. A tool might use AI for one minor feature, like auto-completing a text field, while the core functionality is conventional software. Before evaluating any AI claim, ask the vendor to specifically describe which features use AI, what type of AI (rules-based automation, machine learning, or a large language model), and what the tool does when the AI component is unavailable or uncertain. This separates genuine AI capability from marketing labeling.

Applying This at Work: Before, During, and After the Demo

The best time to evaluate vendor claims is before the demo even starts. Request the vendor's product documentation, any published case studies, and their data privacy or security overview at least 48 hours in advance. Skim the case studies specifically for the details that matter: company size, industry, the specific task the AI handled, the metrics used, and the measurement methodology. If the case studies are vague (lots of percentages without context, lots of unnamed companies), that tells you the vendor either doesn't have strong outcome data or has chosen not to share it. Walk into the demo with your five questions already written down.

During the demo, watch for the gap between what the AI does in the demo and what it will do with your real data. Vendors typically demo with ideal-condition inputs: clean, well-formatted documents, clearly worded requests, neatly structured data. Ask the vendor to demo the tool on a messy real-world example from your actual workflow. Bring a genuinely difficult customer email, a badly formatted report, or an ambiguous request. If the vendor declines to demo on your data, ask why. If they do demo on your data and the result is noticeably worse than the polished demo, that's the real performance baseline you should be evaluating.

After the demo, the most valuable thing you can do is request a structured pilot before any significant financial commitment. A well-designed pilot runs for 30 to 60 days, uses a defined subset of your real work, tracks specific metrics you care about (not the metrics the vendor prefers), and includes a clear success threshold that both parties agree on before the pilot starts. If a vendor resists a pilot, that resistance is itself a data point. Most reputable AI vendors, including enterprise offerings from Microsoft, Salesforce, and Google, offer structured pilot programs precisely because they're confident their tools perform in real conditions.

Audit an AI Vendor Claim You've Already Encountered

Goal: Apply the five-question framework to a real AI vendor claim from your own professional context, producing a written evaluation you could share with a colleague or manager before a purchasing decision.

1. Identify one AI tool your organization currently uses, is piloting, or has recently been pitched. This could be Microsoft Copilot, Salesforce Einstein, an HR screening tool, a meeting summarizer, or any other AI-powered product.
2. Find one specific performance or outcome claim the vendor makes about this tool. Check their website, your sales materials, or any proposal documents you received.
3. Write the claim down exactly as the vendor states it, in a document or notes app.
4. Apply each of the five evaluation questions to this claim: (a) Accurate at what? (b) Measured on whose data? (c) Compared to what baseline? (d) Which customers, doing what? (e) What does failure look like?
5. For each question, write one sentence describing what information you currently have and one sentence describing what information is missing.
6. Open ChatGPT, Claude, or Microsoft Copilot and paste in your claim along with this prompt: 'This is an AI vendor claim: [paste claim]. Help me identify what additional evidence I would need to evaluate whether this claim is credible and relevant to a [your role] at a [your industry] company with [your team size] employees.'
7. Review the AI's response and add any additional gaps it identifies to your notes.
8. Write a one-paragraph summary of how confident you are in this vendor claim, what evidence would increase your confidence, and what your recommended next step is: pilot, more information, or pass.
9. Share your one-paragraph summary with one colleague who is involved in the tool evaluation or purchasing decision.

How This Looks Different Depending on Your Role

The five-question framework applies universally, but the specific claims you'll encounter, and the stakes of getting it wrong, vary significantly by role. Here's how evaluating AI vendor claims plays out differently across common professional contexts.

HR Managers: You're most likely evaluating AI screening, interviewing, or performance-review tools. The stakes are high: biased AI in hiring carries real legal exposure. Ask specifically about disparate impact testing: has the vendor tested whether the AI screens out protected groups at higher rates? Ask for third-party bias audits, not just internal testing.
Sales Managers: You'll see claims about AI tools improving forecast accuracy, increasing pipeline, or boosting close rates. Always ask for results segmented by deal size and sales cycle length. A tool that works brilliantly for transactional sales may be useless for complex enterprise deals with 9-month cycles.
Marketing Managers: AI content and personalization tools will claim engagement lifts, conversion improvements, and time savings. Ask whether A/B test results were run on audiences comparable to yours in size and demographics, and whether the lift was sustained beyond the initial novelty period.
Operations and Finance Managers: You'll encounter AI tools claiming to automate document processing, reduce errors, or speed up reporting. Ask specifically about exception rates (the percentage of documents that require human review or correction), because that number is where hidden labor costs live.
Teachers and L&D Professionals: AI tutoring or content-generation tools will claim learning outcome improvements. Ask what assessment methodology was used, how long the study ran, and whether results were peer-reviewed or just vendor-reported.
Small Business Owners: You're often the only decision-maker, with less budget for a bad call. Prioritize vendors who offer monthly contracts or no-commitment pilots over those requiring annual commitments upfront. That gives you a real-world test without locking in.
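For the HR item above, one widely used first-pass screen for disparate impact is the "four-fifths rule" from the US EEOC Uniform Guidelines: a group whose selection rate falls below 80% of the highest group's rate is treated as showing possible adverse impact. The sketch below shows the arithmetic so you can recognize it in a vendor's audit report. The group names and counts are invented for illustration, and passing this check is not a legal determination or a substitute for an independent audit.

```python
def selection_rates(outcomes):
    """Map each group to its selection rate: selected / applicants."""
    return {group: selected / applicants
            for group, (selected, applicants) in outcomes.items()}


def impact_ratios(outcomes):
    """Each group's selection rate divided by the highest group's rate."""
    rates = selection_rates(outcomes)
    best = max(rates.values())
    if best == 0:
        return {group: 0.0 for group in rates}  # nobody was selected at all
    return {group: rate / best for group, rate in rates.items()}


def flagged_groups(outcomes, threshold=0.8):
    """Groups whose impact ratio falls below the four-fifths threshold."""
    ratios = impact_ratios(outcomes)
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# HYPOTHETICAL screening results: (candidates advanced, candidates screened).
sample = {
    "group_a": (60, 200),  # 30% advanced
    "group_b": (30, 150),  # 20% advanced
}

for group, ratio in impact_ratios(sample).items():
    print(f"{group}: impact ratio {ratio:.2f}")
print("Below four-fifths threshold:", flagged_groups(sample))
```

A vendor who has done serious bias testing should be able to show you numbers of this kind, broken down by group, for roles similar to yours.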

Evaluating an AI hiring tool's bias and accuracy claims

Prompt

I'm an HR director and a vendor is pitching an AI resume screening tool. They claim it's '3 times more accurate than human reviewers at predicting job performance' and that it 'eliminates unconscious bias.' They haven't mentioned how they tested for bias or what 'job performance' means in their accuracy measurement. Write me a clear, professional email I can send to their sales rep asking for the specific evidence I need to evaluate these two claims before we move forward.

AI Response

Subject: Follow-Up Questions on Accuracy and Bias Claims ([Tool Name] Evaluation)
Hi [Sales Rep Name],
Thank you for the demo last week. The team found it useful and we're interested in exploring further. Before we move to the next stage, I'd like to request some additional documentation on two specific claims from your presentation.
**On the '3x more accurate than human reviewers' claim:** Could you share the study or evaluation that produced this figure? Specifically, I'm looking to understand: (1) how 'job performance' was defined and measured (manager ratings, 90-day retention, promotion rates, or another metric); (2) how long after hiring the performance outcome was measured; (3) which roles and industries the study covered; and (4) whether the evaluation was conducted by your team or an independent third party.
**On the 'eliminates unconscious bias' claim:** This is a significant claim and one we take seriously given our legal and compliance obligations. Could you provide: (1) documentation of any disparate impact analysis your team has conducted, specifically whether the tool screens candidates differently across gender, race, or age groups; (2) whether a third-party bias audit has been completed and, if so, by whom; and (3) how the tool performs on candidates whose backgrounds differ from the training data.
We'd also welcome the opportunity to speak with a reference customer in a similar industry before moving forward.
Thank you in advance. I look forward to your response.
[Your name]

Key Takeaways from Part 1

AI vendor claims fall into three types (performance, outcome, and technology claims), and each requires different questions.
The five-question framework (accurate at what, whose data, what baseline, which customers, what failure looks like) applies to every AI product category.
Benchmark numbers produced by the vendor are a starting point, not a buying signal. Always ask whether testing was independent.
AI tools produce probabilistic outputs, not deterministic ones. 'Highly accurate' is meaningless without knowing the task, the test conditions, and the error rate.
The hierarchy of evidence runs from independent third-party studies at the top to anonymous testimonials at the bottom.
Recency matters: ask when case study results were measured and whether the product has changed significantly since then.
A vendor who resists demoing on your actual messy data, or who can't describe failure cases, is telling you something important.
A structured 30-to-60-day pilot with pre-agreed success metrics is the most reliable way to evaluate any AI tool before a significant financial commitment.
Your role shapes which claims matter most. HR professionals should prioritize bias audits, sales managers should segment outcome data by deal type, and operations teams should focus on exception rates.

When the Numbers Look Impressive But Tell You Nothing

Picture this: you're in a vendor demo on a Tuesday afternoon. The sales rep clicks to a slide showing '94% accuracy' in bold green text. The room nods. Your CFO leans forward. Someone says 'that's pretty good.' But nobody asks the obvious question: 94% accurate at what, measured how, tested on whose data? That silence is exactly what vendors count on. Impressive-sounding metrics without context are one of the most common tactics in AI sales presentations, not because vendors are dishonest, but because buyers rarely push back. This section gives you the vocabulary and the instincts to ask the right questions before your company signs anything.

The Metric Trap: Understanding What AI Numbers Actually Mean

Vendors love accuracy scores because they sound scientific and reassuring. But accuracy is almost meaningless without knowing what problem the model was solving and what the data looked like. Consider a fraud detection tool that claims 99% accuracy. If only 0.5% of transactions are actually fraudulent, a system that labels every single transaction as 'not fraud' would score 99.5% accuracy, while catching zero fraud. This is called the base rate problem, and it trips up buyers across every industry. The same logic applies to hiring tools claiming to predict 'top performers,' content moderation systems, or sales forecasting software. The number sounds good. The underlying math can be hollow.
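The fraud example above can be verified with a few lines of arithmetic. This sketch assumes a batch of 100,000 transactions (an arbitrary volume chosen for illustration) and a "model" that does nothing except label every transaction as not fraud.

```python
TOTAL_TRANSACTIONS = 100_000   # ASSUMED volume, for illustration only
FRAUD_RATE = 0.005             # 0.5% of transactions are fraudulent, as in the example

fraud_cases = round(TOTAL_TRANSACTIONS * FRAUD_RATE)
legit_cases = TOTAL_TRANSACTIONS - fraud_cases

# A do-nothing model: every transaction is labeled "not fraud".
true_positives = 0             # fraud correctly flagged
false_positives = 0            # legitimate transactions wrongly flagged
true_negatives = legit_cases   # legitimate transactions correctly passed
false_negatives = fraud_cases  # fraud that slipped through

accuracy = (true_positives + true_negatives) / TOTAL_TRANSACTIONS
share_of_fraud_caught = true_positives / fraud_cases

print(f"Accuracy: {accuracy:.1%}")
print(f"Share of fraud caught: {share_of_fraud_caught:.1%}")
```

The accuracy figure looks excellent while the share of fraud caught is zero, which is why an accuracy number on its own says little when the thing you care about is rare.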

There are three metrics vendors should provide alongside accuracy, and if they can't or won't, that's a red flag. Precision tells you: of all the cases the AI flagged as positive, how many actually were? Recall tells you: of all the actual positive cases that existed, how many did the AI catch? F1 score combines the two into a single number (their harmonic mean), so it is only high when both are. For a hiring tool screening CVs, high precision means most candidates it recommends are genuinely strong. High recall means it's not quietly filtering out great candidates. You don't need to calculate these yourself; you just need to ask for them. A vendor with a genuinely good product will have these numbers ready. One selling hype will change the subject.
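You won't need to compute these in a meeting, but seeing the formulas once makes the vendor's numbers easier to interrogate. The sketch below uses the standard definitions of precision, recall, and F1. The CV-screening counts are hypothetical, and the function name is made up for this example.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1) from raw counts of screening outcomes."""
    flagged = true_pos + false_pos          # everything the AI recommended
    actual_positive = true_pos + false_neg  # every genuinely strong candidate
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual_positive if actual_positive else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# HYPOTHETICAL CV screen: the tool recommended 50 candidates, 40 of whom were
# genuinely strong, and it rejected 60 other candidates who were also strong.
precision, recall, f1 = precision_recall_f1(true_pos=40, false_pos=10, false_neg=60)

print(f"Precision: {precision:.0%}")  # how many recommendations were good
print(f"Recall:    {recall:.0%}")     # how many good candidates were found
print(f"F1 score:  {f1:.2f}")
```

In this made-up case the tool's recommendations are mostly good, yet it misses more strong candidates than it finds, which is exactly the pattern a precision-only or accuracy-only claim would hide.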

The Three Questions That Cut Through Any Demo

Before accepting any AI performance claim, ask the vendor: (1) What dataset was this tested on, our data or yours? (2) What does 'failure' look like in this system, and how often does it happen? (3) Can you show us a case where it got it wrong? Vendors with solid products welcome these questions. Those with inflated claims will pivot, deflect, or suddenly remember another slide they need to show you.

Benchmarks: The Fine Print of AI Performance

Many AI vendors back their claims with benchmark results: standardized tests designed to measure model performance on specific tasks. The problem is that benchmarks are often chosen because the model performs well on them, not because they reflect real-world use. A customer service AI might score brilliantly on the GLUE benchmark (a standard language understanding test) but struggle when your actual customers write in regional slang, use product nicknames, or mix languages. Microsoft, Google, and OpenAI all publish benchmark results for their models. Those numbers are real. What they don't tell you is how the model performs on your specific emails, your specific documents, your specific workflows. That gap between benchmark performance and real-world performance is where buyer disappointment lives.

Preparing Sharp Questions Before a Vendor Demo

Prompt

I'm a marketing director meeting with an AI vendor tomorrow. They're pitching an AI tool that claims to 'predict which leads are most likely to convert' with 87% accuracy. Help me prepare 8 sharp questions to ask during the demo that will reveal whether this claim is genuinely meaningful or just marketing. I'm not technical, so frame the questions so a non-technical person can ask them naturally in a meeting.

AI Response

Here are 8 questions that will separate a real product from a polished pitch:
1. **'What exactly does 87% accuracy mean here?'** Ask them to define it precisely. Accuracy at predicting conversions within 30 days? 90 days? All deals, or only deals above a certain size?
2. **'What does the 13% failure look like?'** Understanding how the system fails is more important than knowing how it succeeds. Are the misses random, or does it consistently miss a particular type of customer?
3. **'Was this tested on data from businesses like ours?'** If they benchmarked on e-commerce companies and you're a B2B professional services firm, the number may be meaningless for your use case.
4. **'Can we see a real example: a lead it scored highly that didn't convert, and why?'** Any vendor confident in their product should be able to show you edge cases, not just wins.
5. **'How long does it take to start working with our data?'** Many AI tools need 6-12 months of your historical data to calibrate. If they say it works 'out of the box,' ask how.
6. **'What happens when our sales process changes? Does the model need retraining?'** Models trained on last year's data can degrade quickly when market conditions shift.
7. **'Who owns the data we put into this system?'** A critical compliance question, especially if you're in a regulated industry.
8. **'Can you connect us with two customers in our industry who've used this for more than a year?'** References from recent customers are table stakes. Long-term references reveal whether the product delivers after the honeymoon period.

Comparing How Major AI Vendors Present Their Claims

Not all AI vendors operate the same way when it comes to transparency. Enterprise software providers like Microsoft and Salesforce, who've been selling to large organizations for decades, tend to publish more detailed documentation, offer contractual SLAs (service level agreements), and have established audit trails. Newer AI-native startups often lead with impressive demo videos and benchmark scores but have thinner documentation around data governance and failure modes. This doesn't make startups worse (some are genuinely innovative), but it means the due diligence process is different. Knowing which type of vendor you're dealing with shapes which questions to prioritize.

Vendor Type	Typical Strengths	Common Gaps	Key Questions to Ask
Enterprise AI (Microsoft Copilot, Salesforce Einstein, Google Workspace AI)	Detailed documentation, compliance certifications, SLAs, enterprise support	Can be slow to update, benchmark-heavy marketing, complex pricing	What's included in our SLA? How do you handle model updates that affect outputs?
AI-Native Startups (newer tools, niche AI platforms)	Fast innovation, specific use-case focus, often impressive demos	Limited compliance history, thinner documentation, funding risk	How long have you been in production? Who are your longest-tenured customers?
Open-Source AI Tools (deployed by vendors on top of models like Llama)	Transparent model architecture, customizable, often lower cost	Support can be limited, security depends on implementation, variable quality	Who maintains the underlying model? What's your security and patching process?
Embedded AI (AI features inside tools you already use, e.g., HubSpot AI, Notion AI)	Low switching cost, familiar interface, integrated with existing data	Less control over AI decisions, limited benchmarking data published	What data is this feature trained on? Can I turn it off for specific workflows?

Four vendor categories and how to approach due diligence for each

Reading the Room: Pilot Projects as Your Real Benchmark

The single most reliable way to evaluate an AI vendor's claims is to run a structured pilot before committing to a full contract. A pilot doesn't need to be complex. It means giving the tool a real, bounded task using your actual data and measuring the results against a clear baseline. If you're evaluating an AI tool that summarizes customer feedback, spend two weeks having both a human analyst and the AI tool summarize the same 50 customer interviews, then compare quality, time, and any errors. That comparison tells you more than any vendor benchmark. The key is defining success before the pilot starts, not after, when there's a temptation to move the goalposts.

Most reputable vendors will offer a free trial or proof-of-concept period. Microsoft Copilot, for instance, offers 30-day enterprise trials. Salesforce Einstein has sandbox environments. If a vendor resists a structured pilot and pushes for immediate full commitment, treat that as a warning signal. The vendors most confident in their product want you to test it on your own terms, because they know the results will sell the contract better than any slide deck. During a pilot, track three things: output quality (does it do what was promised?), reliability (does it work consistently, or does it hallucinate or fail randomly?), and integration friction (how much human time does it actually save versus create?).

When the pilot ends, you want a simple decision framework. Did the AI tool reduce time on this task by at least 30%? Did the output quality meet the standard your team would accept without heavy editing? Did any failures create risks, such as wrong information in a client-facing document, biased shortlisting in a hiring workflow, or inaccurate numbers in a financial summary? If yes to the first two and no to the third, you have a strong case for adoption. If the time savings were marginal or the failure rate was uncomfortable, you have data to either renegotiate the contract or walk away. That's a much stronger position than trusting a demo.
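The three-question decision framework above is simple enough to write down before the pilot starts, which makes it harder to move the goalposts afterward. The sketch below is one possible encoding, not a standard: the `PilotResult` fields and the sample numbers are hypothetical, and only the 30% threshold and the three questions come from the text.

```python
from dataclasses import dataclass


@dataclass
class PilotResult:
    baseline_minutes: float    # average time per task before the tool
    pilot_minutes: float       # average time per task with the tool, including review and fixes
    quality_acceptable: bool   # would the team accept the output without heavy editing?
    risky_failures: int        # failures that created real risk (client-facing errors, bias, bad numbers)


def time_reduction(result):
    """Fraction of time saved per task, relative to the pre-pilot baseline."""
    return 1 - result.pilot_minutes / result.baseline_minutes


def recommend(result, min_reduction=0.30):
    """Apply the framework: yes to time and quality, no to risky failures."""
    saved_enough = time_reduction(result) >= min_reduction
    if saved_enough and result.quality_acceptable and result.risky_failures == 0:
        return "adopt"
    return "renegotiate or walk away"


# HYPOTHETICAL pilot outcomes, for illustration only.
strong_pilot = PilotResult(baseline_minutes=20, pilot_minutes=13,
                           quality_acceptable=True, risky_failures=0)
weak_pilot = PilotResult(baseline_minutes=20, pilot_minutes=18,
                         quality_acceptable=True, risky_failures=0)

print(f"Strong pilot: {time_reduction(strong_pilot):.0%} faster ->", recommend(strong_pilot))
print(f"Weak pilot:   {time_reduction(weak_pilot):.0%} faster ->", recommend(weak_pilot))
```

Note that `pilot_minutes` is meant to include the time people spend reviewing and correcting the AI's output; leaving that out is the most common way a pilot flatters the tool.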

Run a Mini Vendor Evaluation on an AI Tool You're Already Considering

Goal: Apply a structured evaluation process to an AI tool your team is currently trialling or considering, producing a one-page assessment you can share with your manager or team.

1. Choose one AI tool you've recently seen demoed, are currently trialling, or have been asked to evaluate. This could be Copilot, an HR AI platform, a sales tool, or any AI feature embedded in software you already use.
2. Write down the specific performance claim the vendor made. Quote it exactly if you can, or paraphrase from memory. Example: '85% reduction in time spent on meeting notes.'
3. Ask yourself: what would need to be true for this claim to be meaningful for your team? Write two sentences defining what 'success' would look like in your specific context.
4. Identify the metric type being used: is it accuracy, time savings, cost reduction, or something else? Note whether the vendor provided any information about how it was measured.
5. Use ChatGPT or Claude to generate a list of 6 clarifying questions specific to this vendor claim and your industry. Use the prompt structure: 'I work in [your role/industry]. A vendor is claiming [specific claim]. Generate 6 questions I should ask to verify this claim is meaningful for my context.'
6. Check whether the vendor has published a case study, white paper, or third-party audit of their product. Search '[vendor name] independent evaluation' or '[vendor name] third-party audit.' Note what you find, or don't find.
7. Draft a one-page evaluation summary with four sections: The Claim, What We'd Need to Verify It, What We Found, and Recommended Next Step (pilot, more information needed, or decline). Share this with one colleague and ask for their reaction.
8. If you have access to the tool in a trial, assign it one real task from your actual workflow this week and record the time taken and quality of output compared to doing it manually.
9. Based on everything above, write a single-sentence verdict: 'This tool's claim appears [credible / partially credible / unverified] because [one specific reason].'

How This Looks Different Depending on Your Role

HR Managers: When evaluating AI hiring tools, ask specifically about bias audits. The EU AI Act classifies hiring AI as 'high-risk,' meaning vendors operating in Europe must meet specific transparency requirements. Ask for their bias testing methodology and whether it's been independently audited.
Sales Leaders: Focus your vendor evaluation on CRM integration and data freshness. A lead-scoring AI is only as good as the data it's trained on. If your CRM data is incomplete or inconsistently updated, the AI's predictions will reflect that, regardless of the vendor's benchmark scores.
Marketing Managers: Evaluate AI content tools by testing output on your brand voice, not generic samples. Ask the vendor to generate three pieces of content that follow your brand guidelines and assess whether the output would pass your editorial review without significant rewriting.
Finance and Operations Leaders: Prioritize auditability. Any AI tool touching financial data or operational decisions needs a clear audit trail: who made which decision, based on what AI output, at what time. Ask vendors to demonstrate this feature specifically.
Teachers and Learning Designers: When evaluating AI tools for education or training, ask how the system handles incorrect information. AI tutoring tools that confidently deliver wrong answers are worse than no tool at all. Test it with questions where you already know the correct answer.
Consultants and Client-Facing Professionals: Your main risk is an AI tool producing outputs you share with clients that contain errors. Evaluate tools specifically on their hallucination rate for factual claims. Ask vendors directly: 'In what percentage of outputs does the model generate information that isn't accurate?'
Small Business Owners: Cost-per-outcome matters more than raw capability. A tool that costs £200/month and saves 5 hours is better value than one costing £800/month and saving 8 hours, because the second costs 2.5 times as much for each hour saved. Build a simple ROI calculation into every vendor evaluation, using your actual hourly cost of staff time.
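
The cost-per-outcome comparison above can be written down as a short calculation. This is a minimal sketch in Python, not a finished ROI model. It assumes the hours saved are per month (the example does not state the period), and the function names and the £30 hourly staff cost are hypothetical placeholders to replace with your own figures.

```python
def cost_per_hour_saved(monthly_price: float, hours_saved_per_month: float) -> float:
    """Price paid for each hour of staff time the tool gives back."""
    if hours_saved_per_month <= 0:
        raise ValueError("hours_saved_per_month must be positive")
    return monthly_price / hours_saved_per_month


def net_monthly_value(monthly_price: float, hours_saved_per_month: float,
                      hourly_staff_cost: float) -> float:
    """Value of the time saved minus the subscription price."""
    return hours_saved_per_month * hourly_staff_cost - monthly_price


# Prices and hours from the example; "per month" for the hours is an assumption.
tool_a_cost = cost_per_hour_saved(200, 5)
tool_b_cost = cost_per_hour_saved(800, 8)

# Hypothetical staff cost in pounds per hour; substitute your real figure.
HOURLY_STAFF_COST = 30.0
tool_a_net = net_monthly_value(200, 5, HOURLY_STAFF_COST)
tool_b_net = net_monthly_value(800, 8, HOURLY_STAFF_COST)

print(f"Tool A: £{tool_a_cost:.2f} per hour saved, net £{tool_a_net:.2f} per month")
print(f"Tool B: £{tool_b_cost:.2f} per hour saved, net £{tool_b_net:.2f} per month")
```

A tool only pays for itself when your hourly staff cost is higher than its cost per hour saved, so the cheaper tool can still lose money if the time it saves is worth less than the subscription.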

Using AI to Help You Evaluate AI

There's a practical irony here worth acknowledging: one of the best tools for evaluating AI vendor claims is AI itself. ChatGPT, Claude, or Microsoft Copilot can help you research a vendor's reputation, decode technical documentation, draft evaluation frameworks, and generate stress-test questions you wouldn't have thought of. This isn't circular logic. It's using a general-purpose tool to investigate a specialized one. The key is knowing what to ask and how to verify the output, which is exactly what this lesson is building toward.

Decoding a Vendor's Technical Documentation

Prompt

I've been sent a technical data sheet by an AI vendor pitching us a document-processing tool. Here's an excerpt from their documentation: 'Our model achieves an F1 score of 0.91 on the DocVQA benchmark, with a WER of 6.2% on handwritten text and 99.1% field-level extraction accuracy on structured forms.' I'm a non-technical HR director. Explain what these metrics mean in plain language, tell me what questions I should still be asking, and flag anything that concerns you about how this is being presented.

AI Response

Here's what those numbers mean in plain language:

**F1 score of 0.91 on DocVQA**: DocVQA is a standard benchmark for testing whether AI can answer questions about documents. A score of 0.91 out of 1.0 is genuinely strong. It suggests the model understands document content well in controlled test conditions. The important caveat is that DocVQA uses a specific set of document types. If your documents look different (different layouts, fonts, or content types), this score may not translate directly.

**WER of 6.2% on handwritten text**: WER means 'Word Error Rate', or how often the AI misreads a word. 6.2% sounds small, but in a 500-word handwritten document, that's roughly 31 errors. Whether that's acceptable depends entirely on how you're using the output. For rough sorting, probably fine. For legal or compliance documents, probably not.

**99.1% field-level extraction accuracy on structured forms**: This is the most relevant number if you're processing standard forms like job applications or onboarding documents. 99.1% is strong, but 'structured forms' is doing a lot of work in that sentence. Ask them to define exactly what 'structured' means and whether your forms qualify.

**What concerns me**: The vendor is presenting three different metrics for three different scenarios, which makes it hard to compare. Ask them for a single, consistent performance figure on documents that look like yours, not their benchmark documents. Also ask: what happens to the 0.9% of fields it gets wrong? Does it flag them for human review, or does it silently pass them through?
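
The arithmetic in that response (6.2% of 500 words is roughly 31 errors) is worth repeating for your own volumes. The sketch below is one way to do it in Python. The helper name is invented for illustration, and the 20-field form and 1,000-form workload are hypothetical figures, not numbers from the vendor.

```python
def expected_errors(item_count: int, error_rate: float) -> float:
    """Expected number of wrong items, given a per-item error rate from 0 to 1."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return item_count * error_rate


# Word Error Rate of 6.2% on a 500-word handwritten document (from the example).
word_errors = expected_errors(500, 0.062)

# 99.1% field-level accuracy leaves a 0.9% field error rate.
# Hypothetical workload: 1,000 forms with 20 fields each.
field_errors = expected_errors(1000 * 20, 1 - 0.991)

print(f"Expected misread words per document: {word_errors:.0f}")
print(f"Expected wrong fields across the workload: {field_errors:.0f}")
```

Turning a percentage into a count makes it easier to judge whether the error level is tolerable, and how much human review you would need to budget for.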

Key Takeaways from Part 2

Accuracy scores without context are almost meaningless. Always ask what the metric measures, on what data, and what failure looks like.
Precision and recall matter more than accuracy for most real-world AI applications. You don't need to calculate them yourself. Just ask vendors to provide them.
Benchmark performance and real-world performance are not the same thing. Vendors tend to report the benchmarks that flatter their model. Your workflows are not a benchmark.
Different vendor types (enterprise, startup, open-source, embedded) require different due diligence approaches. The category shapes the questions you ask.
A structured pilot using your actual data and a pre-defined success measure is more reliable than any demo or case study a vendor provides.
Your role determines which failure modes matter most. HR leaders need bias audits. Finance leaders need audit trails. Client-facing professionals need to worry about hallucinations in shared outputs.
AI tools like ChatGPT and Claude can help you decode technical documentation and generate evaluation questions. Use them as part of your vendor assessment process.

Picture this: it's Thursday afternoon and a vendor has just sent your team a glossy one-pager claiming their AI platform delivers "94% accuracy" and "10x productivity gains backed by real-world data." Your CFO wants your recommendation by Monday. Your gut says something is off, but you're not sure what questions to ask or how to verify the numbers. This is the moment where AI literacy pays for itself: not by building models, but by reading between the lines of a sales pitch and protecting your organization from a costly mistake.

Why Vendor Benchmarks Are Often Misleading

Most AI vendors publish benchmark numbers that are technically true but practically useless for your context. "94% accuracy" sounds impressive until you ask: accurate at what, measured on whose data, compared to what baseline? A spam filter that labels 94% of emails correctly still fails your team if the 6% it misses are all your most important client messages. Benchmarks are typically run on curated, clean datasets that look nothing like the messy spreadsheets and inconsistent records your organization actually has. The gap between benchmark performance and real-world performance is one of the most documented problems in applied AI, and it almost never appears in a vendor deck.
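
The spam filter example can be made concrete with a small calculation. The sketch below, in Python, uses a hypothetical inbox of 1,000 emails with 60 from key clients. The function names and the data are invented for illustration. It shows how a filter can score 94% overall while catching none of the messages that matter.

```python
def accuracy(labels, predictions):
    """Share of all items labelled correctly."""
    correct = sum(1 for actual, predicted in zip(labels, predictions)
                  if actual == predicted)
    return correct / len(labels)


def recall(labels, predictions, target):
    """Share of the items that really are `target` which were labelled as such."""
    relevant = [predicted for actual, predicted in zip(labels, predictions)
                if actual == target]
    if not relevant:
        raise ValueError(f"no items labelled {target!r}")
    return sum(1 for predicted in relevant if predicted == target) / len(relevant)


# Hypothetical inbox: 1,000 emails, 60 of them from key clients.
labels = ["client"] * 60 + ["other"] * 940
# A filter that never recognises a client email: right on 940, wrong on 60.
predictions = ["other"] * 1000

overall_accuracy = accuracy(labels, predictions)
client_recall = recall(labels, predictions, "client")

print(f"Overall accuracy: {overall_accuracy:.0%}")
print(f"Client emails caught: {client_recall:.0%}")
```

This is why a single accuracy figure is not enough: ask for performance on the specific category of cases your team cannot afford to miss.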

The second trap is confusing capability with fit. A vendor might accurately claim their tool summarizes documents with 90% fewer errors than a previous version, but if your team never struggled with document errors in the first place, that improvement is irrelevant to your ROI calculation. Always anchor vendor claims to a specific problem your organization actually has. If the vendor cannot name your problem before you describe it, their case study does not apply to you. Ask them to walk you through a live demo on your own sample data, not their pre-prepared showcase scenario.

Independent validation is rare in AI vendor marketing, but it exists. Look for evaluations published by third parties: analysts, academic researchers, or peer organizations in your sector. Stanford's HAI annual index, Gartner's AI market reports, and sector-specific trade bodies occasionally publish comparative assessments that are not funded by the vendors themselves. If a vendor cannot point to any external validation of their claims, that is a meaningful data point. It does not mean the tool is bad, but it does mean you are being asked to take their word for it.

The Five Questions Every Vendor Must Answer

Before any AI procurement decision, get written answers to these: (1) What dataset was this benchmark run on? (2) How does performance change on messy or incomplete data? (3) What does failure look like, and how often does it happen? (4) Can we run a pilot on our own data before committing? (5) Who outside your company has independently verified these results? Vendors who resist these questions are telling you something important.

Using AI to Audit AI Claims

Here is a practical move most managers overlook: use a general-purpose AI assistant to help you interrogate a specialist vendor's claims. Paste the vendor's one-pager or key claims into ChatGPT or Claude and ask it to identify assumptions, flag missing context, and generate the questions a sceptical buyer should ask. You are not asking the AI to make the decision. You are using it as a fast research partner and devil's advocate. This takes about ten minutes and routinely surfaces angles your team would have missed under time pressure.

Auditing a vendor claim before a procurement meeting

Prompt

I'm evaluating an AI vendor that claims their recruitment screening tool 'reduces time-to-hire by 60% and improves quality-of-hire by 35%.' Here is their supporting text: [paste vendor claims]. Act as a sceptical procurement advisor. Identify the assumptions behind these numbers, flag what information is missing, and give me 8 sharp questions I should ask in tomorrow's demo call.

AI Response

Here's a critical breakdown of those claims and the questions you should bring to the table:

**Assumptions buried in the numbers:**
- "60% reduction in time-to-hire" almost certainly measures time from job posting to offer, not from offer to accepted start date, which is where most delays actually occur.
- "35% improvement in quality-of-hire" requires a definition of quality. Who defined it, and over what time period? Quality-of-hire measured at 90 days looks very different from 12 months.
- Both figures likely come from a pilot with a self-selected group of early adopters, not a randomized comparison.

**What's missing:**
- No mention of the baseline they improved from: a 60% improvement from a 40-day process is very different from a 60% improvement from a 10-day process.
- No data on false positives: candidates screened out who would have been strong hires.
- No sector or company size context for the case studies.

**8 questions for tomorrow's demo:**
1. How exactly is 'quality-of-hire' defined and measured in your studies?
2. What was the baseline time-to-hire before your tool was introduced?
3. Were these results from a controlled trial or self-reported by clients?
4. What percentage of screened-out candidates were later hired through other channels and performed well?
5. Has an independent third party audited these results?
6. What happens to performance when job descriptions are inconsistent or incomplete?
7. Can we see the results broken down by role type and seniority level?
8. What's your error rate on candidates from non-traditional educational backgrounds?
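
The point about baselines in that response is easy to check with arithmetic. The following minimal Python sketch applies the vendor's claimed 60% reduction to the two baselines mentioned, 40 days and 10 days. The function name is invented for illustration.

```python
def days_saved(baseline_days: float, claimed_reduction: float) -> float:
    """Absolute days saved when a percentage reduction is applied to a baseline."""
    if not 0.0 <= claimed_reduction <= 1.0:
        raise ValueError("claimed_reduction must be between 0 and 1")
    return baseline_days * claimed_reduction


CLAIMED_REDUCTION = 0.60  # the vendor's "60% reduction in time-to-hire"

savings = {}
for baseline in (40, 10):
    saved = days_saved(baseline, CLAIMED_REDUCTION)
    savings[baseline] = saved
    print(f"Baseline {baseline} days: {saved:.0f} days saved, "
          f"{baseline - saved:.0f} days remaining")
```

The same percentage is worth four times as many days on the slower process, which is why the vendor's baseline matters as much as the headline figure.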

Comparing How Different Tools Handle Transparency

AI Tool / Platform	Transparency Features	What It Shows You	Watch Out For
ChatGPT Plus (OpenAI)	Cites sources when browsing is on; shows reasoning in the o1 model	Step-by-step logic on complex queries	Citations can still be hallucinated; always verify links
Claude Pro (Anthropic)	Explains uncertainty; flags when it's guessing	Confidence signals in responses	Doesn't always volunteer limitations unprompted; ask directly
Microsoft Copilot (M365)	Links back to source documents in your tenant	Which internal file a summary came from	Only as good as your internal data quality
Google Gemini (Workspace)	Google Search grounding shows live sources	Web sources for factual claims	Grounding is inconsistent; check when it matters most
Specialist AI Vendors	Varies widely; often minimal by default	Usually just output, not reasoning	Least transparent category; apply maximum scrutiny here

Transparency varies significantly across AI platforms. General-purpose tools often give you more visibility into their reasoning than specialist vendors do.

Putting This Into Practice on Monday

The most immediate change you can make is adding a single slide to your internal evaluation process: a "Claim Audit" slide that forces any AI vendor recommendation through five standard questions before it reaches a decision-maker. This is not a technical process. It is a governance habit. Assign one person in every vendor evaluation to play the role of sceptic, specifically tasked with finding what is missing from the vendor's evidence. Rotate this role so the skill spreads across your team.

When you do run pilots, define success before you start, not after. Write down, in advance, the specific metric that would make this tool worth adopting, and the threshold it needs to hit. "We will adopt this tool if it reduces first-draft report time by at least 30% for at least 70% of the team within four weeks." Vendors who know you have a pre-defined success metric behave differently in pilots. They also tend to be more honest about what their tool actually does.
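
A pre-defined criterion like the one quoted above is simple enough to check mechanically once the pilot ends. The sketch below is one possible way to do that in Python. The function name and the sample timings are hypothetical, and the 30% and 70% thresholds are taken from the example sentence.

```python
def pilot_passes(before_minutes, after_minutes,
                 min_reduction=0.30, min_share_of_team=0.70):
    """True if enough team members cut their time by at least `min_reduction`."""
    if len(before_minutes) != len(after_minutes) or not before_minutes:
        raise ValueError("need one before/after pair per team member")
    hits = 0
    for before, after in zip(before_minutes, after_minutes):
        reduction = (before - after) / before
        if reduction >= min_reduction:
            hits += 1
    return hits / len(before_minutes) >= min_share_of_team


# Hypothetical first-draft report times (minutes) for a team of ten.
before = [60] * 10
after_good = [40] * 8 + [55, 60]   # 8 of 10 people improve by about 33%
after_weak = [40] * 6 + [55] * 4   # only 6 of 10 people improve by about 33%

print("Good pilot passes:", pilot_passes(before, after_good))
print("Weak pilot passes:", pilot_passes(before, after_weak))
```

Writing the rule down this precisely before the trial removes the temptation to move the goalposts once the results are in.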

Finally, build a one-page vendor evaluation template and share it with your team now, before the next vendor pitch lands in your inbox. Templates created under time pressure are poor templates. A good one covers: the problem being solved, the claimed benefit, the evidence quality, the pilot design, the failure mode, and the exit plan if it does not work. This document takes two hours to create once and saves dozens of hours, and potentially significant budget, every time you use it.

Build Your AI Vendor Evaluation Checklist

Goal: Create a reusable one-page checklist that helps you assess any AI vendor's claims critically, using free AI tools to accelerate the process.

1. Open ChatGPT or Claude in your browser. Both have a free tier, although you may need to create a free account first.
2. Type this prompt: 'I manage a team and need to evaluate AI vendor pitches. Create a one-page vendor evaluation checklist with sections for: problem definition, claimed benefits, evidence quality, pilot design, risk and failure modes, and exit criteria. Make it practical for a non-technical manager.'
3. Review the output and highlight any section that feels generic. Note where you want more specificity for your industry.
4. Follow up with: 'Now customize this checklist for a [your sector, e.g. marketing agency / HR team / secondary school] evaluating an AI tool for [specific use case, e.g. content creation / candidate screening / lesson planning].'
5. Copy the final checklist into a Word document or Google Doc and add your organization's logo and name.
6. Add one row to the checklist manually: 'Independent validation. Has any third party outside the vendor confirmed these results? Yes / No / Partial.'
7. Share the document with one colleague and ask them to use it on the next vendor pitch they receive, then compare notes.
8. After the first real use, add one question you wish had been on the original list.
9. Save the final version as your team's standard AI procurement template.

How This Applies Differently by Role

Managers: Use the five-question framework in every vendor demo. Make it a team norm, not a personal habit.
HR professionals: Scrutinise any AI hiring tool for bias audit results and ask specifically about performance on diverse candidate pools.
Marketers: Demand to see AI content tool outputs on your actual brand voice and product category, not the vendor's demo content.
Sales leaders: When evaluating AI CRM or forecasting tools, ask for accuracy data on deals similar in size and cycle length to yours.
Teachers and school leaders: Any EdTech AI tool should show evidence from students with similar demographics and learning contexts to your school.
Consultants: Your clients will ask you to evaluate AI tools on their behalf, and this checklist becomes a billable service deliverable.
Small business owners: You have less margin for error than large enterprises, so pilot costs and exit clauses matter even more. Always negotiate a free trial with your own data.

Drafting a pilot success criteria document before a vendor trial

Prompt

We're about to start a 4-week pilot of an AI meeting summary tool with our sales team of 12 people. Help me write a one-page pilot success criteria document that defines what 'this worked' looks like before we start. Include: the problem we're solving, three measurable success metrics with specific thresholds, how we'll collect data, and what a fair failure looks like.

AI Response

**AI Meeting Summary Tool: Pilot Success Criteria**
*Sales Team | 4-Week Trial | 12 Participants*

**Problem we're solving:** Sales reps currently spend an average of 25 minutes per client call writing up notes and next steps in the CRM. This delays follow-up and produces inconsistent records.

**What success looks like (defined before Day 1):**
1. **Time saving:** Average post-call admin time drops from 25 minutes to 15 minutes or less for at least 8 of 12 reps by Week 4. Measured via a weekly 3-question team survey.
2. **Accuracy:** At least 80% of AI-generated summaries require only minor edits (not rewrites) before being saved to CRM. Reps log edit level (none / minor / major rewrite) for each summary during the trial.
3. **Adoption:** At least 10 of 12 reps use the tool on 75% or more of their calls by Week 3. Usage data provided by vendor dashboard.

**How we collect data:**
- Weekly 3-question survey sent every Friday (Google Forms, takes about 2 minutes to complete)
- Vendor usage dashboard reviewed by team lead each Monday
- One 20-minute group debrief at Week 2 and Week 4

**What fair failure looks like:** If fewer than 6 reps hit the time-saving threshold, or if more than 40% of summaries require full rewrites, we consider the pilot unsuccessful for our use case, regardless of what the vendor's benchmarks say.

**Decision date:** End of Week 4. Go / No-go confirmed in writing within 48 hours of final debrief.
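
The criteria in that sample document can be turned into a simple decision rule for the Week 4 debrief. The Python sketch below is a hedged illustration rather than part of the pilot document. The function name and the sample data are hypothetical, and the "inconclusive" outcome is an assumption added here for results that fall between the success and failure thresholds, which the sample document does not define.

```python
from collections import Counter


def pilot_verdict(reps_at_time_target: int, edit_log: list[str],
                  reps_at_adoption_target: int) -> str:
    """Apply the sample success and failure thresholds to pilot results.

    edit_log holds one entry per summary: "none", "minor", or "major".
    """
    if not edit_log:
        raise ValueError("edit_log must contain at least one entry")
    counts = Counter(edit_log)
    major_share = counts["major"] / len(edit_log)
    light_share = (counts["none"] + counts["minor"]) / len(edit_log)

    # Failure conditions from the sample document.
    if reps_at_time_target < 6 or major_share > 0.40:
        return "unsuccessful"
    # All three success metrics must be met.
    if (reps_at_time_target >= 8 and light_share >= 0.80
            and reps_at_adoption_target >= 10):
        return "success"
    # Assumption: anything in between needs a judgment call at the debrief.
    return "inconclusive"


# Hypothetical log of 100 summaries: 30 unedited, 55 minor edits, 15 rewrites.
sample_log = ["none"] * 30 + ["minor"] * 55 + ["major"] * 15

print(pilot_verdict(9, sample_log, 11))
print(pilot_verdict(7, sample_log, 11))
print(pilot_verdict(5, sample_log, 11))
```

Agreeing in advance how to treat the in-between results is worth a line in your own criteria document, since that is where most real pilots land.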

Key Takeaways

Benchmark numbers are real but rarely relevant. Always ask what dataset they came from and whether it resembles your actual data.
Accuracy claims only matter in context: accurate at what, for whom, compared to what baseline, and with what consequences for failure.
Independent validation is the gold standard. If only the vendor can confirm their results, treat the claim as unverified.
You can use general-purpose AI tools (ChatGPT, Claude) to audit specialist vendor claims. Paste in their pitch and ask for sceptical analysis.
Define pilot success criteria before the trial starts, not after you've seen the results.
A reusable one-page vendor evaluation template is one of the highest-ROI documents your team can create this month.
Different roles face different vendor risks. HR teams face bias risks, marketers face brand-fit risks, small business owners face lock-in risks.
Transparency varies by platform: general-purpose tools typically show more reasoning than specialist vendors, who often show only outputs.
