
Lesson 2 of 5

What Your AI Product Does Better Than Everything Else

~23 min read. Last reviewed May 2026.

This lesson counts toward: Run Smarter: AI for Operations Leaders; Grow Faster: AI for Small Teams

Scoping an AI feature is not the same as scoping a regular software feature. The rules are different, the failure modes are different, and the questions you need to ask before writing a single requirement are different. Most AI product failures trace back to one moment: someone started specifying what the AI should do before they understood what AI actually can and cannot do reliably. This lesson gives you the vocabulary, the frameworks, and the reference tools to scope AI features with precision, so you can write requirements that developers and data teams can actually build from.

7 Things You Need to Know Before Scoping Any AI Feature

AI features produce probabilistic outputs: they are right most of the time, not all of the time. Your spec must account for the error rate.
The quality of an AI feature depends heavily on the quality and quantity of input data. Scoping must include a data audit, not just a feature description.
There are four core AI task types (classification, generation, prediction, and recommendation), and each has different accuracy benchmarks and failure consequences.
AI features degrade over time as the world changes but the model doesn't. A 'maintenance' requirement is not optional.
User trust in AI outputs varies wildly by profession and context. A salesperson will act on a lead score; a physician will not act on a diagnosis without verification. Your spec must reflect this.
Regulatory exposure differs by AI task type and industry. A content moderation feature in HR carries legal risk that a product description generator in e-commerce does not.
Every AI feature needs a fallback. If the model returns low confidence, is unavailable, or produces an error, what happens? This must be in the spec.

What Makes AI Feature Scoping Different

Traditional software features have deterministic behavior: press the button, the thing happens. AI features are statistical. The same input can produce different outputs on different runs. This means your requirements document cannot just describe what the feature should do; it must also describe what 'good enough' looks like numerically. If you're building a meeting summary feature using a tool like Microsoft Copilot or an internal model, you need to define acceptable accuracy before launch, not after. That number becomes your acceptance criterion, the same way a load time of under two seconds would be an acceptance criterion for a web page.

The second major difference is that AI features have inputs you don't fully control. A regular feature reads from a database you manage. An AI feature reads from user language, documents, images, or behavioral signals: messy, inconsistent, real-world data. This means scoping must include explicit assumptions about input quality. If your AI email triage tool only works on emails written in English, that is a scope boundary, and it needs to be written down. Unstated assumptions in AI feature specs are among the most common causes of late-stage rework.

Define what 'correct' means for your specific AI task before writing any other requirement.
Specify the minimum acceptable accuracy rate, e.g., 'correctly classifies customer intent at least 85% of the time on a held-out test set.'
List every assumption about input data: language, format, length, source, and freshness.
Document what the feature must NOT do. Exclusions are as important as inclusions in AI specs.
Identify who reviews or acts on AI outputs and whether human sign-off is required before any action is taken.
State the retraining or update schedule. AI features are not 'ship and forget.'

Write the Failure Case First

Before writing what your AI feature should do when it works, write what happens when it fails. What does the user see if the model returns a confidence score below your threshold? What happens if the AI produces an output that is factually wrong? Defining failure modes first forces you to set concrete accuracy thresholds and design fallback states, which are two things most early-stage AI specs completely omit.
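To make the failure-first idea concrete, here is a minimal sketch of how a fallback rule could look once it is written down. The names `TriageResult` and `apply_fallback`, the messages, and the 0.75 threshold are all illustrative assumptions, not part of any real product; the point is only that the failure branches are handled before the success branch.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold; the real value belongs in your spec.
CONFIDENCE_THRESHOLD = 0.75


@dataclass
class TriageResult:
    label: Optional[str]        # label shown to the user, or None if withheld
    needs_manual_review: bool   # True when a person must handle the item
    message: str                # what the user sees


def apply_fallback(label: Optional[str], confidence: Optional[float]) -> TriageResult:
    """Decide what the user sees, handling the failure cases first."""
    # Failure case 1: the model was unavailable or returned an error.
    if label is None or confidence is None:
        return TriageResult(None, True, "AI unavailable: routed to manual review")
    # Failure case 2: the model answered but is not confident enough.
    if confidence < CONFIDENCE_THRESHOLD:
        return TriageResult(None, True, "Low confidence: routed to manual review")
    # Success case: show the AI label.
    return TriageResult(label, False, "AI label applied")
```

A call such as `apply_fallback("High", 0.5)` withholds the label and flags the item for a person, which is the kind of behavior the spec should state in plain language.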

The Four Core AI Task Types

Task Type	What It Does	Workplace Example	Key Accuracy Metric	Common Failure Mode
Classification	Assigns an input to one of several predefined categories	Automatically tagging incoming support tickets as Billing, Technical, or General	Precision and Recall per category	Misclassifying edge cases; poor performance on minority categories
Generation	Creates new text, images, or content from a prompt or context	Drafting first-pass job descriptions from a role title and bullet points	Human eval score (relevance, tone, accuracy)	Hallucination: plausible but incorrect facts
Prediction	Forecasts a future value or outcome based on historical patterns	Scoring sales leads by likelihood to close within 30 days	Mean Absolute Error or AUC-ROC score	Model drift: accuracy degrades as market conditions change
Recommendation	Surfaces the most relevant options from a large set	Suggesting the next best action for a customer success manager after a churn signal	Click-through or adoption rate	Filter bubbles: over-recommending the same narrow set of options

The four AI task types every product manager must be able to distinguish. Each requires different success metrics and has different failure risks.
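For readers who want to see what "precision and recall per category" means in practice, the sketch below computes both from a list of human labels and AI labels. The function name and the six sample tickets are made up for illustration. Precision answers "when the AI said Billing, how often was it right?" and recall answers "of the real Billing tickets, how many did the AI catch?"

```python
from collections import Counter


def precision_recall_per_category(actual, predicted):
    """Return {category: (precision, recall)} for paired label lists."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for human_label, ai_label in zip(actual, predicted):
        if human_label == ai_label:
            true_pos[human_label] += 1
        else:
            false_pos[ai_label] += 1      # AI claimed this category wrongly
            false_neg[human_label] += 1   # AI missed the real category

    results = {}
    for category in sorted(set(actual) | set(predicted)):
        predicted_total = true_pos[category] + false_pos[category]
        actual_total = true_pos[category] + false_neg[category]
        precision = true_pos[category] / predicted_total if predicted_total else 0.0
        recall = true_pos[category] / actual_total if actual_total else 0.0
        results[category] = (precision, recall)
    return results


# Made-up example: six tickets labeled by a human agent and by the AI.
human = ["Billing", "Billing", "Technical", "Technical", "General", "Billing"]
ai = ["Billing", "Technical", "Technical", "Technical", "General", "Billing"]
scores = precision_recall_per_category(human, ai)
```

In this sample the AI never mislabels something as Billing (precision 1.0) but misses one of the three real Billing tickets (recall of about 0.67), which is why a single overall accuracy number can hide weak performance on one category.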

Matching the Right Task Type to the Right Business Problem

One of the most common scoping mistakes is mislabeling the AI task type, and it costs weeks of rework. A product manager who asks for a 'smart recommendation engine' when what they actually need is a classification feature will receive a system optimized for the wrong objective. Classification sorts items into fixed buckets. Recommendation ranks items by relevance to a specific user. They share similar underlying technology, but they are built, evaluated, and maintained differently. Getting this distinction right in your initial brief sets the technical direction correctly from day one.

Prediction and generation are the most frequently confused pair among non-technical product managers. Prediction answers a question about the future using historical data: 'Will this customer churn in the next 60 days?' Generation creates new content from scratch or from a template: 'Write a win-back email for customers who haven't purchased in 90 days.' Both use AI, but a prediction model needs a labeled historical dataset to train on, while a generation feature can often be built quickly using an existing large language model like GPT-4 or Claude 3 Opus via a no-code tool. The data requirements and timelines are completely different.

Read the business problem statement and extract the core verb: categorize, create, forecast, or surface.
Match that verb to a task type using the table above.
Ask your data team: do we have labeled historical examples for this task? If yes, classification or prediction is feasible. If no, generation using an existing model may be faster.
Confirm the task type in writing with your technical lead before the spec is drafted. A one-line confirmation email is enough.
Use the task type to select your primary success metric from the table above.
Document the task type explicitly in the feature spec under a section called 'AI Approach.'
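The steps above can be sketched as a small lookup. This is only an illustration of the reasoning, not a tool the lesson provides: the function name `draft_ai_approach` and the `data_gap` flag are assumptions, while the verb and metric mappings are copied from the task type table.

```python
# Mapping taken from the "core verb" test and the task type table in this lesson.
TASK_TYPE_BY_VERB = {
    "categorize": "Classification",
    "create": "Generation",
    "forecast": "Prediction",
    "surface": "Recommendation",
}

PRIMARY_METRIC = {
    "Classification": "Precision and recall per category",
    "Generation": "Human eval score (relevance, tone, accuracy)",
    "Prediction": "Mean Absolute Error or AUC-ROC score",
    "Recommendation": "Click-through or adoption rate",
}

# Task types that depend on labeled historical examples.
NEEDS_LABELED_HISTORY = {"Classification", "Prediction"}


def draft_ai_approach(core_verb: str, has_labeled_history: bool) -> dict:
    """Turn a core verb and a data answer into a draft 'AI Approach' entry."""
    task_type = TASK_TYPE_BY_VERB[core_verb.lower()]
    return {
        "task_type": task_type,
        "primary_metric": PRIMARY_METRIC[task_type],
        # Hypothetical flag: raise it when the task needs labels that don't exist.
        "data_gap": task_type in NEEDS_LABELED_HISTORY and not has_labeled_history,
    }
```

For example, `draft_ai_approach("forecast", has_labeled_history=False)` returns a Prediction entry with `data_gap` set to `True`, which is the signal to talk to your data team before drafting the spec.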

Business Problem	Sounds Like	Actual Task Type	Right Question to Ask
Route inbound emails to the right team automatically	Recommendation	Classification	What are the fixed categories? How many?
Help reps write follow-up emails faster	Automation	Generation	What inputs does the AI need? What tone and length?
Identify which accounts are at risk of churning	Reporting	Prediction	Do we have 12+ months of labeled churn history?
Show each user the most relevant knowledge base article	Search	Recommendation	What signal do we have about user intent and past behavior?
Flag policy-violating content before it goes live	Moderation	Classification	Who defines the violation categories? Who reviews edge cases?
Summarize meeting transcripts from Zoom or Teams	Integration	Generation	What's the max transcript length? What format should summaries follow?

Common business problems mapped to their actual AI task type. Use this to correct mismatches early in discovery.

Why Task Type Determines Timeline

Generation features using existing models (GPT-4, Claude, Gemini) can reach a working prototype in days using tools like Zapier, Make, or a simple ChatGPT integration. Prediction features require historical training data, model training, and validation, which typically takes 6 to 16 weeks. Classification sits in between. When a stakeholder asks 'how long will this AI feature take?', the honest answer starts with identifying the task type. Without that, any timeline estimate is fiction.

Writing an AI Feature Brief That Actually Gets Built

Using ChatGPT to Structure an AI Feature Brief

Prompt

I'm a product manager at a mid-size B2B SaaS company. We want to add an AI feature that automatically summarizes customer support tickets and suggests a priority level (High, Medium, Low) for each one before it reaches a human agent. I need to write a feature brief. Help me identify: the AI task type, the data we'll need, the key success metrics, the main failure risks, and the fallback behavior if the AI confidence is low. Keep it practical and non-technical.

AI Response

Here's a structured breakdown for your AI ticket triage feature:

**AI Task Type:** This feature combines two tasks. Summarization is a generation task: the AI produces a condensed version of the ticket text. Priority scoring is a classification task: the AI assigns each ticket to one of three fixed categories (High, Medium, Low).

**Data You'll Need:**
- At least 6 to 12 months of historical support tickets with priority labels already applied by your human agents
- A minimum of 1,000 labeled examples per priority tier for reliable classification
- Clean ticket text. If tickets contain attachments, images, or non-English text, you'll need to define how those are handled

**Key Success Metrics:**
- Classification accuracy: target 85%+ agreement between AI priority label and human agent label on a test set
- Summary usefulness: measured by agent survey ('Did this summary save you time?'), with a target of 70%+ positive responses
- Time-to-first-response: reduction of 20%+ after feature launch

**Main Failure Risks:**
- Misclassifying a High-priority ticket as Low: a critical risk requiring a safety net
- Summaries that omit key details from long or complex tickets
- Model drift: priority patterns change seasonally and the model won't know unless retrained

**Fallback Behavior:**
- If confidence score is below 75%, display the ticket without a priority label and flag it for manual review
- Always show the AI-generated summary but mark it 'AI Draft: please verify' for the first 60 days post-launch

The Data Requirement Problem Most PMs Ignore

Every AI feature runs on data. Not 'data' in the abstract sense, but specific, structured, labeled, accessible data. When you scope an AI feature without auditing your data first, you are writing requirements for a building without checking if the land exists. The most common discovery, usually made embarrassingly late, is that the data needed to train or run the AI feature either does not exist, exists in an unusable format, is locked in a system the team cannot access, or contains privacy and compliance issues that require legal review before it can be used at all.

Data requirements belong in the feature spec, not in a separate technical document that PMs never read. For a generation feature using an existing model like Claude or GPT-4, the data question is relatively simple: what text or context do you feed into the prompt, and where does it come from? For a prediction or classification feature, the question is much heavier: how much labeled historical data exists, who labeled it, is it still accurate, and is it stored somewhere your engineering team can access without a six-month procurement process? Answering these questions in week one saves months of pain.

AI Task Type	Minimum Data Requirement	Where PMs Usually Find It	Common Blocker
Generation (using existing model)	A clear prompt template + context inputs (e.g., ticket text, customer name, product type)	CRM records, support systems, internal documents	Data is unstructured or locked in PDFs
Classification	1,000–10,000+ labeled examples per category; labels must be consistent and accurate	Historical tickets, past decisions, manually tagged records	Labels don't exist or were applied inconsistently
Prediction	12–24 months of historical outcomes with associated feature variables (e.g., deal size, activity logs)	CRM, product analytics, billing systems	Data lives in three systems that have never been joined
Recommendation	User behavior data: clicks, purchases, views, ratings; ideally 6+ months	Product analytics tools like Mixpanel or Amplitude; CRM systems like Salesforce	Not enough users yet; cold-start problem for new products

Data requirements by AI task type. Add this table to your feature brief template as a standard section.

Don't Scope Around a Data Gap

If the data you need doesn't exist, the correct response is to delay the AI feature and start a data collection initiative, not to proceed and hope the team figures it out. Scoping an AI feature on the assumption that 'we'll get the data later' is the single most reliable way to launch a feature six months late with half the promised functionality. If your data audit reveals a gap, document it explicitly in the spec as a prerequisite, assign an owner, and set a date by which the gap must be resolved before development begins.

Scope Your First AI Feature Brief

Goal: Produce a one-page AI feature brief for a real or hypothetical feature at your organization, using the frameworks from this lesson.

1. Choose one business problem at your organization that you believe could be addressed with an AI feature. Write one sentence describing the problem in plain language, with no technical jargon.
2. Use the Four Core AI Task Types table to identify which task type (or combination) your feature requires. Write your answer down and confirm it with a colleague or technical contact.
3. Open ChatGPT Plus, Claude Pro, or Microsoft Copilot. Paste the prompt example from this lesson, replacing the ticket triage scenario with your own business problem. Review the output and edit it to reflect your actual context.
4. Complete a data audit: list the specific data inputs your feature needs, identify where that data currently lives, and note any access, quality, or privacy issues. Be honest: if the data doesn't exist, write that down.
5. Define your primary success metric using the task type table. Write it as a specific, measurable statement: 'The AI must [do X] at [Y% accuracy / Z% adoption] within [N weeks of launch].'
6. Write your fallback behavior: one to two sentences describing exactly what happens in the product if the AI returns a low-confidence result or fails entirely.

Part 1 Cheat Sheet

AI features are probabilistic. Always define an acceptable accuracy rate before writing other requirements.
The four task types are: Classification, Generation, Prediction, Recommendation. Each has different data needs, timelines, and metrics.
Generation features using existing models (GPT-4, Claude) can prototype in days. Prediction features need months of historical data.
Always match the business problem to the correct task type before briefing your team. Use the 'core verb' test: categorize, create, forecast, or surface.
Data requirements belong in the feature spec, not a separate document. Audit data availability in week one.
Minimum labeled data for classification: 1,000–10,000+ examples per category.
Every AI feature spec needs: task type, success metric, data requirements, failure mode, and fallback behavior.
A data gap is a prerequisite blocker: document it, assign an owner, set a resolution date.
Use ChatGPT Plus, Claude Pro, or Microsoft Copilot to draft your initial feature brief structure, then edit for your specific context.

Key Takeaways from Part 1

AI feature scoping is fundamentally different from traditional feature scoping because outputs are statistical, not deterministic.
Misidentifying the AI task type at the start of a project is one of the most costly and avoidable mistakes in AI product management.
Data availability is a prerequisite, not a detail, and it must be assessed before any development work begins.
A complete AI feature brief includes task type, success metrics, data requirements, failure modes, and fallback behavior as non-negotiable sections.
Non-technical product managers can use AI tools like ChatGPT Plus and Claude Pro to structure and draft feature briefs, accelerating the scoping process significantly.

Part 1 covered why AI features fail when scoped loosely. Now the work gets practical. This section gives you the frameworks, tables, and checklists you'll use to write actual AI feature specs: the kind that engineers understand, stakeholders approve, and product teams can build against. Keep this open the next time you're in a sprint planning meeting or writing a product brief.

7 Things Every AI Feature Spec Must Address

The trigger: What action or event causes the AI feature to activate? A button click, a scheduled task, a data threshold, a user request?
The input: What data does the AI receive? Who provides it, in what format, and where does it come from?
The output: What exactly does the AI produce? A score, a summary, a recommendation, a generated draft, a classification?
The success condition: How do you know the output is good? Define this in plain terms before development starts.
The failure mode: What happens when the AI is wrong, uncertain, or produces garbage? Who sees it, and what's the fallback?
The human role: Does a person review, approve, edit, or simply receive the AI output? Specify this explicitly.
The scope boundary: What is this feature NOT doing? Stating the edges prevents scope creep and misaligned builds.
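One way to keep the seven elements from slipping is to treat the spec as a structured record and check it for blanks before handoff. The sketch below is an illustration only: the class name `AIFeatureSpec`, the field names, and the sample text are assumptions, chosen to mirror the checklist above.

```python
from dataclasses import dataclass, fields


@dataclass
class AIFeatureSpec:
    """One field per item in the seven-point checklist."""
    trigger: str = ""
    input_data: str = ""
    output: str = ""
    success_condition: str = ""
    failure_mode: str = ""
    human_role: str = ""
    scope_boundary: str = ""


def missing_elements(spec: AIFeatureSpec) -> list:
    """Return the names of checklist items that are still blank."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


# Made-up draft with only three of the seven items filled in.
draft = AIFeatureSpec(
    trigger="Rep opens a ticket",
    input_data="Ticket text exported from the support system as plain text",
    output="2-3 sentence summary shown above the ticket",
)
gaps = missing_elements(draft)
```

Here `gaps` lists the four items the draft still owes its readers, in checklist order, so an incomplete spec is visible before it reaches engineering.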

Defining Inputs and Outputs with Precision

Vague inputs produce vague outputs. If your spec says 'the AI analyzes customer feedback,' engineers have to guess what 'customer feedback' means. Survey responses? Support tickets? App store reviews? All three? Each option changes the build entirely. Good AI specs name the data source, describe its format, and note any quality issues. For example: 'The AI receives the last 90 days of closed support tickets exported from Zendesk as plain text, averaging 200 words each, tagged with resolution status.'

Outputs need equal precision. 'The AI summarizes the feedback' tells nobody what to build. 'The AI produces a 3-5 bullet summary of recurring complaint themes, ranked by frequency, displayed in the manager dashboard before the weekly team meeting' is a spec. Notice the format (bullets), the content (themes, ranked), the length (3-5), and the context of use (manager, dashboard, weekly cadence). Every one of those details shapes how the feature gets built and evaluated.

Name the data source explicitly: Salesforce, Zendesk, Google Sheets, uploaded CSV, typed user input
Specify format: free text, structured fields, numerical scores, audio transcripts, images
Note volume and frequency: one record at a time, batch of 500, real-time stream
Define the output type: summary, score, label, ranked list, generated draft, yes/no flag
Specify output length or size: word count, number of items, character limit
State where the output appears: email, dashboard widget, Slack message, PDF export, in-app tooltip
Identify who sees the output first: end user, manager, system log, review queue

The 'So What' Test for Outputs

After you write your output definition, ask: 'So what does the user do with this?' If you can't answer in one sentence, the output isn't specific enough. A good output definition makes the next user action obvious. 'The AI flags invoices over $10,000 for manager approval' passes the test. 'The AI processes invoice data' does not.

Input Type	Example Source	Common Format	Watch Out For
Customer feedback	Zendesk, Typeform, App Store	Free text	Sarcasm, mixed languages, very short responses
Employee data	Workday, BambooHR, spreadsheet	Structured fields	Missing values, outdated records, PII sensitivity
Sales activity	Salesforce, HubSpot, CRM export	Structured + notes	Inconsistent rep input quality, blank fields
Documents	SharePoint, Google Drive, email	PDF, DOCX, plain text	Scanned images vs. text-based files, formatting artifacts
Meeting content	Zoom, Teams, Otter.ai transcripts	Text transcript	Speaker labels missing, crosstalk errors, jargon
Product usage	Analytics platform, event logs	Numerical/event data	Bot traffic, incomplete sessions, timezone mismatches
Financial records	ERP, accounting software, CSV	Structured numerical	Currency formats, fiscal year variations, manual overrides

Common AI feature input types and their known quality risks. Address these in your spec before handoff.

Writing Acceptance Criteria for AI Features

Standard software acceptance criteria ask: does the button work? AI acceptance criteria ask something harder: is the output good enough, often enough? You can't test AI like a calculator. The same input can produce slightly different outputs. Your criteria need to define a quality threshold, not perfection. A useful format is: 'In [X]% of test cases, the AI output must [meet condition Y], as evaluated by [person or method Z].' This gives QA a concrete target and gives your team a shared definition of done.

For non-technical product managers, the most practical approach is to build a small test set before development starts. Collect 20-30 real examples of the input your feature will process. Write down what a good output looks like for each one. This becomes your evaluation set. When the feature is built, you run these examples through it and check the results against your written expectations. No coding required, just structured judgment. This is sometimes called a 'golden set' or 'eval set' in AI teams.

Choose 20-30 real input examples that represent the range of what the feature will encounter: easy cases, edge cases, and tricky ones
For each example, write what a correct or acceptable output looks like in plain language
Define your threshold: what percentage of outputs need to be acceptable for the feature to ship? (80%? 90%? 95%?)
Decide who evaluates outputs: the PM, a subject matter expert, a sample of end users, or a combination
Document the evaluation rubric: is the output judged on accuracy, tone, completeness, format compliance, or all of these?
Run the eval set after each major build iteration, not just at the end
Track results in a simple spreadsheet: input, expected output, actual output, pass/fail, notes
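The tracking spreadsheet described above maps directly onto a simple pass-rate calculation. The sketch below assumes each row has already been judged pass or fail by a reviewer; the names `EvalCase`, `pass_rate`, and `ready_to_ship`, and the sample numbers, are illustrative.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One row of the eval-set spreadsheet."""
    input_text: str
    expected_output: str
    actual_output: str
    passed: bool          # the reviewer's judgment, not an automated check
    notes: str = ""


def pass_rate(cases) -> float:
    """Share of eval cases the reviewer marked as acceptable."""
    if not cases:
        raise ValueError("The eval set is empty; collect examples first.")
    return sum(case.passed for case in cases) / len(cases)


def ready_to_ship(cases, threshold: float) -> bool:
    """Compare the pass rate with the threshold written in the spec."""
    return pass_rate(cases) >= threshold


# Made-up results: 25 cases, 22 judged acceptable.
results = [EvalCase(f"input {i}", "expected", "actual", passed=i < 22) for i in range(25)]
rate = pass_rate(results)
```

With 22 of 25 cases passing, the rate is 0.88: enough to clear an 85% threshold but not a 90% one, which is why the threshold has to be fixed in writing before the results come in.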

Criteria Type	What It Measures	Example	Who Evaluates
Accuracy	Is the AI factually or functionally correct?	Does the AI correctly classify 90% of support tickets by category?	PM + subject matter expert
Completeness	Does the output include all required elements?	Does every generated summary include a recommended next action?	PM or QA reviewer
Format compliance	Does the output match the specified structure?	Are all outputs under 150 words and in bullet format?	Automated check or QA
Tone and voice	Does the output match brand or professional standards?	Does the AI-drafted email sound like the company's usual tone?	Marketing or comms lead
Hallucination rate	How often does the AI invent information not in the input?	In 5% or fewer cases, the summary references facts not in the source ticket	PM + spot-check review
Edge case handling	What happens with unusual or incomplete inputs?	When the transcript is under 50 words, does the AI return a graceful fallback message?	QA team

AI acceptance criteria types. Use at least three of these in every feature spec.
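Format compliance is the one criterion in the table that lends itself to a fully automated check. As a rough sketch, assuming the table's example rule (under 150 words, bullet format) and assuming bullets are lines starting with "- ", "* ", or "• ", a check could look like this. The function name is hypothetical.

```python
# Assumed bullet markers; adjust to whatever your product actually renders.
BULLET_MARKERS = ("- ", "* ", "• ")


def is_format_compliant(output: str, max_words: int = 150) -> bool:
    """Check that every non-empty line is a bullet and the total is under max_words."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return False  # an empty output never passes
    if not all(line.startswith(BULLET_MARKERS) for line in lines):
        return False
    # Count words after removing the two-character bullet marker from each line.
    word_count = sum(len(line[2:].split()) for line in lines)
    return word_count < max_words
```

A check like this only covers structure. Whether the bullets say anything accurate or useful still needs a human reviewer, which is why the table pairs it with other criteria types.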

The 80/90/95 Rule of Thumb

Not every AI feature needs to be right 95% of the time. A feature that drafts marketing copy for human review can tolerate more variance than one that flags compliance violations. Match your accuracy threshold to the stakes. Low-stakes, human-reviewed outputs: 75-85% acceptable. High-stakes, automated decisions: 95%+ with mandatory human override. Set this threshold in writing before development starts; it prevents endless post-launch debates about whether the feature is 'good enough.'

PM Using Claude to Draft AI Feature Acceptance Criteria

Prompt

I'm a product manager writing a spec for an AI feature that automatically summarizes customer support tickets before a support rep opens them. The summary should be 2-3 sentences, highlight the customer's main problem, and note any previous tickets from the same customer. Help me write clear acceptance criteria that a QA team could actually use to test this feature. I don't have a technical background.

AI Response

Here are testable acceptance criteria for your ticket summary feature:

**Format Criteria**
- Every summary is between 2 and 3 sentences in length (measurable: word count 30–75 words)
- Summary appears in the ticket view before the rep clicks to open the full ticket
- If no previous tickets exist for the customer, the summary omits the history line entirely (no 'No previous tickets found' clutter)

**Content Criteria**
- The summary identifies the customer's primary complaint or request in the first sentence
- When prior tickets exist, the summary notes how many and the most recent issue topic
- The summary does not include information not present in the ticket text (no invented details)

**Quality Thresholds**
- In a test set of 30 real tickets, at least 85% of summaries are rated 'accurate and useful' by two support team reviewers
- In 100% of cases, the summary does not expose data from a different customer's ticket

**Failure Handling**
- If the ticket is fewer than 20 words, the feature displays: 'Ticket too short to summarize. See full text below.'
- If the AI fails to generate a summary within 3 seconds, the ticket opens normally with no summary shown

Run these criteria against a batch of 30 archived tickets before launch. Have two experienced support reps do the rating independently, then compare scores.

Specifying the Human-in-the-Loop

Every AI feature sits somewhere on a spectrum from 'fully automated' to 'AI assists, human decides.' Most enterprise AI features, especially early versions, should sit closer to the human end. The spec must state this explicitly. 'The AI generates a recommendation' means nothing until you specify: does a human see it before it goes to the customer? Can they edit it? Can they override it? Is there an audit trail? Leaving this undefined is how AI features end up making autonomous decisions nobody signed off on.

There are five distinct human-in-the-loop models, and each creates different UX, workflow, and accountability requirements. Choosing the right model isn't a technical decision; it's a product and risk decision. A feature that auto-sends AI-generated responses to customers carries very different liability than one that drafts responses for agent review. Your spec should name the model, explain why you chose it, and list any conditions under which the model should change. For example, the spec might allow moving from 'human approves' to 'auto-send' once the feature hits 95% accuracy over 60 days.

Model Name	How It Works	Best For	Risk Level	Example
Full automation	AI acts with no human review	Low-stakes, high-volume, reversible actions	High; requires a high confidence threshold	Auto-tagging support tickets by category
Human notification	AI acts, then notifies a human after	Time-sensitive tasks where speed matters more than perfection	Medium	AI schedules a meeting, sends calendar invite, notifies organizer
Human approval	AI drafts or decides, human approves before action is taken	Customer-facing outputs, financial actions, HR decisions	Low-Medium	AI drafts a contract clause, legal reviews before sending
Human editing	AI produces a draft, human edits freely before use	Content creation, communications, proposals	Low	Copilot drafts a sales email, rep edits and sends
Human override only	AI acts automatically, human can reverse within a time window	Routine tasks with occasional exceptions	Medium	AI moves a deal to 'closed-lost' after 90 days inactive, rep can reopen within 7 days

Five human-in-the-loop models. Choose one per feature and document the rationale in your spec.
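A condition for changing the human-in-the-loop model, such as "95% accuracy over 60 days," is easier to enforce when it is written as an explicit rule. The sketch below is one possible reading of that condition, and it rests on assumptions: accuracy is measured in aggregate over the most recent 60 days, and the daily counts of human-verified outputs are available as (correct, total) pairs. The function name is hypothetical.

```python
def can_reduce_oversight(daily_counts, window_days: int = 60, min_accuracy: float = 0.95) -> bool:
    """Check whether aggregate accuracy over the latest window meets the bar.

    daily_counts: list of (correct, total) tuples, oldest first, one per day.
    """
    # Not enough history yet: keep the current level of human review.
    if len(daily_counts) < window_days:
        return False
    recent = daily_counts[-window_days:]
    correct = sum(day_correct for day_correct, _ in recent)
    total = sum(day_total for _, day_total in recent)
    # No reviewed outputs in the window means there is nothing to justify a change.
    return total > 0 and correct / total >= min_accuracy
```

A team could reasonably choose a stricter reading, such as requiring every single day to clear the bar. Whichever reading you pick, state it in the spec so the move to less oversight is a recorded decision rather than a drift.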

Don't Default to Full Automation

Product teams under pressure to ship sometimes default to full automation because it's simpler to build. Resist this. Full automation is appropriate only when the AI has been validated at high accuracy on your specific data, the consequences of errors are low or easily reversible, and legal/compliance teams have signed off. For most first-version AI features, start with human approval or human editing. You can reduce the human role later as confidence builds, but you can't undo the reputational damage from an AI feature that sent 10,000 wrong emails to your customers.

Practice: Write a One-Page AI Feature Spec

Draft a Scoped AI Feature Spec Using the Frameworks from This Lesson

Goal: Produce a one-page AI feature spec that addresses all seven core spec elements, includes measurable acceptance criteria, and explicitly names the human-in-the-loop model, ready to share with a development team or stakeholder for review.

1. Pick one AI feature idea relevant to your job: a meeting summarizer, a lead-scoring tool, a document drafter, a report generator, or any other workflow you'd like AI to assist with. Write the feature name and a one-sentence description at the top of a blank document.
2. Using the 7-item checklist from this lesson, write one sentence for each item: trigger, input, output, success condition, failure mode, human role, and scope boundary. Don't overthink it; first-draft answers are fine.
3. Open the input/output table from this lesson. Identify which input type your feature uses. Note any data quality risks that apply to your specific case and write them as a bullet list under your spec.
4. Write three acceptance criteria for your feature using the criteria types from the second table. Include at least one format criterion and one content criterion. State your quality threshold (e.g., '85% of outputs rated acceptable by two reviewers').
5. Choose one of the five human-in-the-loop models from the third table. Write two sentences explaining why you chose it and what conditions would cause you to change it.
6. Use ChatGPT, Claude, or Microsoft Copilot to review your spec draft. Paste it in and ask: 'I'm a non-technical product manager. Review this AI feature spec and tell me what's ambiguous, missing, or unclear. Be specific.' Revise based on the feedback.

AI Feature Scoping: Quick Reference Cheat Sheet

Every spec needs 7 elements: trigger, input, output, success condition, failure mode, human role, scope boundary
Name your data source explicitly: 'customer feedback' is not a spec; 'Zendesk tickets, last 90 days, plain text' is
Define output by format + length + location + audience: not just what the AI produces but where it lands and who acts on it
Build a golden set of 20-30 real examples before development starts; this is your evaluation baseline
Set your accuracy threshold in writing: match it to the stakes (low-stakes: 75-85%, high-stakes: 95%+)
Use at least three acceptance criteria types: format compliance, content accuracy, and one more relevant to your context
Choose one of five human-in-the-loop models and document why; never leave human oversight undefined
Default to human approval or human editing for v1 features; reduce human involvement only after validation
Run your spec through an AI tool and ask for gaps before handing it to engineering
State scope boundaries explicitly: what this feature does NOT do is as important as what it does

Key Takeaways from Part 2

Precise input definitions prevent misbuilt features: name the source, format, volume, and known quality issues
Output definitions must specify format, length, location, and the user's next action, not just what the AI produces
Acceptance criteria for AI features measure quality thresholds, not binary pass/fail; build an eval set before development
Human-in-the-loop is a product decision, not a technical one; document it explicitly in every spec
Starting with more human oversight and reducing it over time is always safer than starting with full automation

You have a feature idea. You know the problem it solves. Now comes the hardest part: writing a spec that engineers, designers, and stakeholders can actually build from. Most AI feature specs fail not because the idea was bad, but because the requirements were vague, the success metrics were missing, or nobody thought through what happens when the model is wrong. This reference section covers the three remaining pillars: acceptance criteria, failure modes, and stakeholder sign-off.

Acceptance criteria define the minimum bar for 'done'; without them, every demo looks like a success.
AI features need probabilistic acceptance criteria, not binary pass/fail thresholds like traditional software.
Every AI feature has at least three failure modes: wrong output, no output, and confidently wrong output.
Stakeholders sign off on different things: legal on data use, finance on cost-per-inference, UX on error states.
A feature spec without a rollback plan is incomplete: models degrade, data drifts, outputs shift.
Human-in-the-loop checkpoints should be specified explicitly, not assumed.
Your spec is a living document: version it, date it, and assign a single owner.

Writing Acceptance Criteria for AI Features

Traditional software acceptance criteria are binary: the button saves the file or it doesn't. AI features don't work that way. A resume-screening tool that correctly ranks candidates 85% of the time might be acceptable for a first-pass filter but unacceptable as a final decision engine. Your acceptance criteria must name the threshold, the evaluation method, and the dataset used to measure it. 'The model performs well' is not a criterion. '82% precision on a held-out set of 500 tagged support tickets' is a criterion.
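To make the difference concrete, here is a minimal Python sketch of how a data team might check a precision criterion like the one above. The `LabeledExample` type, the function names, and the sample numbers are all hypothetical, chosen only for illustration; a real evaluation would load your own labeled test set.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    predicted_positive: bool  # what the model said
    actually_positive: bool   # what the human label says


def precision(examples: list[LabeledExample]) -> float:
    """Share of the model's positive predictions that were correct."""
    flagged = [e for e in examples if e.predicted_positive]
    if not flagged:
        raise ValueError("model made no positive predictions; precision is undefined")
    correct = sum(1 for e in flagged if e.actually_positive)
    return correct / len(flagged)


def meets_performance_floor(examples: list[LabeledExample], threshold: float = 0.82) -> bool:
    """True when measured precision is at or above the agreed threshold."""
    return precision(examples) >= threshold


# Hypothetical test set of 500 tickets: the model flagged 100, and 85 of those were right.
sample = (
    [LabeledExample(True, True)] * 85
    + [LabeledExample(True, False)] * 15
    + [LabeledExample(False, False)] * 400
)
print(f"precision = {precision(sample):.2f}, passes = {meets_performance_floor(sample)}")
```

The point for a spec writer is that all three parts of the criterion appear as explicit inputs: the metric is the function, the threshold is a parameter, and the test set is the data passed in.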

Write acceptance criteria in three layers. First, the performance floor: the minimum accuracy, recall, or confidence score the feature must hit before launch. Second, the user experience bar: what does a correct output look like to the end user, and how quickly must it appear? Third, the edge case protocol: what does the system do when it cannot produce a confident answer? Specifying all three prevents the classic PM trap of approving a model that scores well on benchmarks but frustrates real users within a week of launch.

Name the metric: precision, recall, F1, BLEU score, or user satisfaction rating.
Name the threshold: '80% or above' is a criterion; 'high accuracy' is not.
Name the test set: how many examples, from what time period, labeled by whom.
Name the latency limit: users tend to abandon AI responses that feel slow, so set an explicit target (for example, 3–5 seconds).
Name the fallback: what the UI shows when confidence is below threshold.
Name the review cadence: how often performance is re-evaluated post-launch.
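The "name the fallback" item in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed design: `ModelResult`, `choose_display`, and the 0.60 default are hypothetical, and the model is assumed to report a confidence score between 0 and 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelResult:
    answer: str
    confidence: float  # assumed to be a score between 0.0 and 1.0


# Fallback wording taken from the edge case example in this lesson.
FALLBACK_MESSAGE = "I'm not sure, here are 3 articles to review"


def choose_display(result: Optional[ModelResult], min_confidence: float = 0.60) -> str:
    """Return the text the UI should show for one model result."""
    if result is None:
        # The model was unavailable or timed out.
        return FALLBACK_MESSAGE
    if result.confidence < min_confidence:
        # The model answered, but not confidently enough to show the answer.
        return FALLBACK_MESSAGE
    return result.answer


print(choose_display(ModelResult("Reset your password from Settings.", 0.91)))
print(choose_display(ModelResult("Maybe try rebooting?", 0.42)))
```

Writing the threshold as a named parameter keeps it visible in the spec and easy to adjust after launch.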

Use ChatGPT to Draft Your Acceptance Criteria

Paste your feature description into ChatGPT or Claude and prompt: 'Write acceptance criteria for this AI feature in three layers: performance floor, user experience bar, and edge case protocol. Use specific, measurable language.' Then edit the output against your actual data constraints. You will get a solid first draft in under two minutes.

Criteria Layer	What to Specify	Example
Performance Floor	Metric + threshold + test dataset	≥80% precision on 500 labeled support tickets from Q3
UX Bar	Output format + response time + confidence display	Plain-language summary under 100 words, delivered in <3 seconds
Edge Case Protocol	Behavior when confidence is low or input is invalid	Show 'I'm not sure, here are 3 articles to review' below 60% confidence
Bias & Fairness Check	Demographic parity or equal error rate across subgroups	Error rates within 5 percentage points of each other across gender and age groups in the test set
Rollback Trigger	Condition that forces revert to prior version	Precision drops below 70% for 48 consecutive hours in production

Acceptance Criteria Framework for AI Features
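The bias and fairness row in the table can be checked with a short script. The sketch below treats the 5% tolerance in the table's example as a maximum gap of 0.05 between the highest and lowest group error rates, which is an assumption about how the criterion would be measured. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def error_rates_by_group(records: list[tuple[str, bool]]) -> dict[str, float]:
    """records holds (group_label, model_was_correct) pairs from the test set."""
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for group, was_correct in records:
        totals[group] += 1
        if not was_correct:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


def passes_fairness_check(records: list[tuple[str, bool]], max_gap: float = 0.05) -> bool:
    """True when the best and worst group error rates differ by at most max_gap."""
    if not records:
        raise ValueError("fairness check needs at least one record")
    rates = error_rates_by_group(records)
    return max(rates.values()) - min(rates.values()) <= max_gap


# Hypothetical test set: group A has a 10% error rate, group B has 18%.
unbalanced = (
    [("A", True)] * 90 + [("A", False)] * 10
    + [("B", True)] * 82 + [("B", False)] * 18
)
print(error_rates_by_group(unbalanced))
print("passes:", passes_fairness_check(unbalanced))
```

In this hypothetical sample the gap is 8 points, so the check fails and the feature would not meet the criterion as written.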

Mapping Failure Modes Before You Build

Every AI feature has three fundamental failure modes. Wrong output: the model gives an answer, but the answer is incorrect. No output: the model refuses, times out, or returns nothing useful. Confidently wrong output: the model gives a plausible-sounding answer with high apparent certainty. This is the most dangerous failure mode because users trust it. Your spec must address all three explicitly. For each one, define who is harmed, how badly, and what the system does next.

Failure mode mapping is not pessimism; it is how you earn engineering and legal trust. When you walk into a spec review having already documented what happens when the sentiment analyzer misreads sarcasm, or when the contract summarizer hallucinates a clause, stakeholders stop asking 'but what if it's wrong?' and start asking 'how do we prioritize the build?' That shift is worth every hour spent on failure analysis. Use a simple table. Keep it in the spec. Update it as you learn more from testing.

List every output type the feature can produce (text, score, classification, recommendation).
For each output type, write one 'wrong output' scenario with a specific example.
Write one 'no output' scenario: what triggers it, what the UI shows.
Write one 'confidently wrong' scenario: why a user might act on it before noticing the error.
Rate each failure: Low (inconvenience), Medium (rework required), High (compliance or financial risk).
Assign a mitigation: human review gate, confidence threshold, disclaimer text, or audit log.
Get legal or compliance sign-off on any High-rated failure mode before sprint planning.

Failure Mode	Example	Risk Level	Mitigation
Wrong output	Sentiment tool flags positive review as negative	Low–Medium	Display raw score alongside label; allow user override
No output	Summarizer returns blank on PDFs over 50 pages	Medium	Show fallback message with manual upload prompt
Confidently wrong	Contract AI states wrong renewal date with no disclaimer	High	Add 'Always verify dates against source document' warning
Bias error	Job description rewriter defaults to gendered language	High	Bias audit before launch; human review gate for HR use cases
Data drift failure	Churn model trained on 2022 data misfires after pricing change	Medium–High	Monthly retraining schedule; drift alert in monitoring dashboard

Failure Mode Reference Table for AI Feature Specs
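If the failure mode table lives in a tracker rather than a document, the rule that High-rated failure modes need legal or compliance sign-off before sprint planning can be enforced with a simple check. The sketch below is one hypothetical way to model it; the `FailureMode` fields and the `legal_signed_off` flag are assumptions, and the sample rows are simplified to a single risk rating each.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    example: str
    risk: str  # "Low", "Medium", or "High"
    mitigation: str
    legal_signed_off: bool = False


def blocking_failure_modes(register: list[FailureMode]) -> list[str]:
    """Names of High-risk failure modes that still lack legal sign-off."""
    return [fm.name for fm in register if fm.risk == "High" and not fm.legal_signed_off]


register = [
    FailureMode("No output", "Summarizer returns blank on long PDFs", "Medium",
                "Show fallback message with manual upload prompt"),
    FailureMode("Confidently wrong", "Contract AI states wrong renewal date", "High",
                "Add a verify-against-source warning"),
    FailureMode("Bias error", "Rewriter defaults to gendered language", "High",
                "Bias audit before launch", legal_signed_off=True),
]

blockers = blocking_failure_modes(register)
print("Ready for sprint planning:", not blockers, "| blockers:", blockers)
```

Here only the "Confidently wrong" row blocks planning, because the other High-rated row already has sign-off.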

Legal and Compliance Need Their Own Failure Mode Column

Add a 'Regulatory Impact' column to your failure mode table before sharing with legal. In regulated industries such as finance, healthcare, and HR tech, a confidently wrong AI output can trigger audit obligations or discrimination claims. Naming the risk explicitly in the spec shows legal you have already thought about it, which tends to speed up their review cycle.

Generating a Failure Mode Analysis with Claude

Prompt

I am a product manager specifying an AI feature that automatically summarizes customer support tickets and suggests a priority level (Low, Medium, High). Write a failure mode analysis covering: wrong output, no output, and confidently wrong output. For each, give a specific example, rate the risk as Low/Medium/High, and suggest one mitigation. Format as a table.

AI Response

| Failure Mode | Example | Risk Level | Mitigation |
|---|---|---|---|
| Wrong output | Ticket about billing error classified as 'Low' instead of 'High' | High | Require human agent review for all tickets touching billing or refunds regardless of AI classification |
| No output | Ticket written in Spanish returns blank summary because model was trained on English only | Medium | Display 'Translation required, please review original text' fallback; route to multilingual queue |
| Confidently wrong | Ticket flagged 'High, legal threat detected' based on common phrase 'I'll take action,' triggering unnecessary escalation | High | Add confidence score display; set escalation trigger only above 90% confidence; include human override button |
| Data drift | Model trained on pre-2023 ticket language misreads new product terminology introduced after a rebrand | Medium | Schedule quarterly retraining; add monitoring alert if classification distribution shifts more than 15% month-over-month |
| Bias error | Tickets from non-native English speakers receive lower priority scores due to informal phrasing | High | Audit classification rates by language pattern before launch; include fairness metric in monthly reporting dashboard |
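The drift mitigation in the response above mentions an alert when the classification distribution shifts more than 15% month-over-month. A minimal Python sketch of that check follows. It assumes the shift is measured as the change in each priority label's share of all tickets, which is one of several reasonable readings; the function names and sample counts are hypothetical.

```python
from collections import Counter


def label_shares(labels: list[str]) -> dict[str, float]:
    """Fraction of tickets assigned to each priority label."""
    if not labels:
        raise ValueError("need at least one classified ticket")
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def drifted_labels(last_month: list[str], this_month: list[str],
                   max_shift: float = 0.15) -> list[str]:
    """Labels whose share of tickets moved by more than max_shift."""
    before = label_shares(last_month)
    after = label_shares(this_month)
    all_labels = sorted(set(before) | set(after))
    return [
        label for label in all_labels
        if abs(after.get(label, 0.0) - before.get(label, 0.0)) > max_shift
    ]


# Hypothetical months: the share of 'High' tickets jumps from 10% to 30%.
last_month = ["Low"] * 60 + ["Medium"] * 30 + ["High"] * 10
this_month = ["Low"] * 40 + ["Medium"] * 30 + ["High"] * 30
print("Labels to investigate:", drifted_labels(last_month, this_month))
```

A shift like this does not prove the model is wrong, since the ticket mix may really have changed, but it is a cheap signal that someone should look.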

Stakeholder Sign-Off and the Living Spec

A spec nobody has formally approved is just a document. Stakeholder sign-off is not bureaucracy; it is a forcing function that surfaces disagreements before engineering starts. Different stakeholders own different parts of the spec. Legal owns the data use section. Finance owns the cost-per-inference estimate. UX owns the error state designs. The data team owns the training data assumptions. Your job is to route the right sections to the right people and log their sign-off with a date and version number.

AI feature specs must be versioned because the feature itself will change. Models get retrained. Thresholds get adjusted. New failure modes appear in production. Keep a changelog at the top of every spec: date, version number, what changed, and who approved the change. This is not overhead; it is the audit trail that protects you when a model underperforms six months after launch and someone asks, 'Did anyone know this was a risk?' With a versioned spec, the answer is documented.

Stakeholder	Section They Own	What They Are Approving
Legal / Compliance	Data sources, retention, output disclaimers	No regulatory violations in data use or AI-generated content
Finance	Inference cost estimate, API budget	Cost per query is within unit economics model
UX / Design	Error states, confidence display, fallback UI	User experience for failed or low-confidence outputs is acceptable
Data / ML Team	Training data, model card, performance metrics	Data quality and model assumptions are sound
Security	Data access controls, logging, PII handling	Feature meets data security and privacy standards
Business Owner	Success metrics, rollback triggers, launch criteria	Feature aligns with business objective and risk tolerance

Stakeholder Sign-Off Matrix for AI Feature Specs
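Logging each sign-off with a date and version number also makes it easy to spot approvals that have gone stale after a spec revision. The sketch below shows one hypothetical way to track this; the `SignOff` fields, the version strings, and the dates are illustrative assumptions, not a required format.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SignOff:
    stakeholder: str
    section: str
    approved_on: Optional[date] = None   # None means not yet approved
    spec_version: Optional[str] = None   # version the approval applies to


def pending_signoffs(matrix: list[SignOff], current_version: str) -> list[str]:
    """Stakeholders who have not approved the current version of the spec."""
    return [
        s.stakeholder for s in matrix
        if s.approved_on is None or s.spec_version != current_version
    ]


matrix = [
    SignOff("Legal / Compliance", "Data sources, retention, output disclaimers",
            date(2026, 5, 4), "1.1"),
    SignOff("Finance", "Inference cost estimate, API budget",
            date(2026, 4, 20), "1.0"),  # approved an older version
    SignOff("UX / Design", "Error states, confidence display, fallback UI"),
]

print("Still needed for v1.1:", pending_signoffs(matrix, "1.1"))
```

Whether an approval of an older version should count is a judgment call for your organization; this sketch takes the strict view that it does not.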

Never Launch Without a Rollback Plan in the Spec

Every AI feature spec must include a rollback plan before it enters sprint planning. Define the exact trigger condition (a drop in precision below a threshold, a spike in user complaints, or a legal flag) and name who has the authority to pull the feature. If rollback is not in the spec, it will not be built into the release process, and you will scramble to undo a live feature under pressure. That is a recoverable problem. The reputational damage from a confidently wrong AI feature at scale is harder to fix.
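A trigger condition is only useful if it is precise enough to automate. As an illustration, the rollback example from the acceptance criteria table (precision below 70% for 48 consecutive hours) could be checked with something like the Python sketch below. It assumes production precision is measured once per hour and stored oldest-first; the function name and defaults are hypothetical.

```python
def should_roll_back(hourly_precision: list[float],
                     floor: float = 0.70,
                     window_hours: int = 48) -> bool:
    """True when the most recent window_hours readings are all below floor."""
    if len(hourly_precision) < window_hours:
        # Not enough history yet to say the drop has been sustained.
        return False
    recent = hourly_precision[-window_hours:]
    return all(reading < floor for reading in recent)


# Hypothetical history: ten healthy hours, then 48 hours at 65% precision.
sustained_drop = [0.80] * 10 + [0.65] * 48
# A single recovery in the latest hour resets the clock.
brief_recovery = [0.65] * 47 + [0.72]

print(should_roll_back(sustained_drop))
print(should_roll_back(brief_recovery))
```

The code only raises the flag. Who acts on it, and how quickly, still has to be named in the spec.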

Write a One-Page AI Feature Spec Using Free AI Tools

Goal: Produce a complete, stakeholder-ready one-page spec for a real or hypothetical AI feature you want to propose, covering problem statement, acceptance criteria, failure modes, and sign-off requirements.

1. Open ChatGPT (free) or Claude (free tier) and type: 'I am a product manager. Help me write a one-page AI feature spec. I will give you the feature idea and you will ask me clarifying questions before drafting.' Press enter and answer its questions.
2. When the AI produces a draft, check that it includes: a problem statement, the target user, the AI task type (classification, generation, summarization, etc.), and a success metric with a number.
3. Prompt the AI: 'Now add an acceptance criteria section with three layers: performance floor, UX bar, and edge case protocol. Use specific, measurable language.'
4. Prompt the AI: 'Add a failure mode table with columns for: failure type, example, risk level (Low/Medium/High), and one mitigation per row. Include at least three rows.'
5. Prompt the AI: 'Add a stakeholder sign-off section listing who must approve this spec and what they are approving.' Review the output and add or remove stakeholders based on your actual organization.
6. Copy the full output into a Google Doc or Notion page. Add a changelog table at the top with today's date, Version 1.0, and your name as owner. Share it with one colleague for feedback before your next team meeting.

Cheat Sheet: Every Section Your AI Feature Spec Needs
Problem statement: Who has the problem, what they do today, what the AI does instead.
Feature definition: AI task type, input format, output format, integration point.
Success metrics: Business KPI + model performance metric + user satisfaction measure.
Acceptance criteria: Performance floor (metric + threshold + test set) + UX bar + edge case protocol.
Failure mode table: Wrong output, no output, confidently wrong, each with risk level and mitigation.
Stakeholder sign-off matrix: Legal, Finance, UX, Data, Security, Business Owner.
Rollback plan: Trigger condition + who approves rollback + timeline for revert.
Changelog: Date, version, what changed, who approved.
Out-of-scope list: At least three things the feature will not do in v1.

Takeaways
Acceptance criteria for AI must be probabilistic: name the metric, the threshold, and the test dataset.
Every AI feature has three failure modes: wrong output, no output, and confidently wrong output. Spec all three.
Different stakeholders own different sections of the spec; route the right sections to the right people.
A spec without a rollback plan is not ready for sprint planning.
Version your spec with a changelog. It is your audit trail and your credibility protection.
Free AI tools like ChatGPT and Claude can draft a complete spec structure in under ten minutes; your job is to pressure-test and own the output.

