Business Goal Identification in the ML Lifecycle
Starting with WHY, not how: Define the business problem and success metrics before considering ML. Most ML projects fail because they optimize for the wrong objectives, not because algorithms are weak.
The Critical 20 Questions
Before starting any ML project, answer these questions in writing. If stakeholders disagree on answers, the project isn’t ready:
Problem Definition (5 questions)
- What is the business problem? (Not “use ML” — what operational pain point?)
- Why does this problem exist? (Is it data-driven, process-driven, or both?)
- What will success look like? (Specific, measurable outcome)
- What is the current baseline performance? (Heuristic, status quo, or human)
- Who is the customer/user impacted by this solution? (Internal or external?)
Stakeholder Alignment (5 questions)
- Who are the key stakeholders? (List title, motivation, success criteria)
- Does everyone agree on the goal? (CEO, PM, Ops, Legal, Privacy)
- Who owns the outcome? (Accountability for ROI, not just model accuracy)
- What are stakeholder concerns? (Job displacement, privacy, fairness, cost)
- What happens if we do nothing? (Cost of inaction vs cost of building)
Feasibility & Constraints (5 questions)
- Is labeled data available? (Quantity, quality, cost to label)
- Can we compute features at prediction time? (Real-time vs batch)
- What is the latency requirement? (10ms for ads vs 5s for email)
- Are there regulatory/legal constraints? (GDPR, HIPAA, bias audits)
- What is the timeline for ROI? (3 months, 1 year, longer acceptable?)
Data & Measurement (5 questions)
- How will we measure business impact offline? (Holdout test set, simulations)
- How will we measure business impact online? (A/B test, canary, metrics)
- What is the cost of a false positive vs false negative? (Different costs matter)
- How frequently does ground truth change? (Concept drift risk)
- How will we collect feedback at scale? (User corrections, implicit signals)
Key Properties by Goal Type
| Dimension | Classification | Regression | Ranking | Clustering | Anomaly Detection |
|---|---|---|---|---|---|
| Business Q | Will X happen? | What is quantity Y? | In what order? | What segments? | Is this normal? |
| Example | Click ad? | Orders tomorrow? | Best movies for user? | Customer types? | Fraudulent transaction? |
| ML Metric | Accuracy, AUC | RMSE, MAPE | NDCG, MAP | Silhouette score | Precision, Recall |
| Data Req | Labeled examples | Historical outcomes | Ranked pairs | Unlabeled clusters | Labeled anomalies |
| Baseline Ease | Easy | Easy | Tricky | Hard | Very hard |
| Concept Drift Risk | Medium | Low | High | Low | Very high |
| Sample Deployment | Stripe fraud | Amazon demand forecasting | Netflix recommendations | Stripe merchant segmentation | DoorDash delivery fraud |
SMART Goal Framework
Convert vague goals into measurable objectives:
| Dimension | Bad Goal | SMART Goal |
|---|---|---|
| Specific | “Improve recommendations” | “Increase CTR on recommended items by 5% within 6 months” |
| Measurable | “Better personalization” | “NDCG@10 >= 0.75 offline; 3% CTR lift in A/B test” |
| Achievable | “100% customer satisfaction” | “NPS improvement of 10 points (from 45 to 55) through better recs” |
| Relevant | “Use deep learning” | “Reduce recommendation latency from 2s to 500ms (enables mobile app)” |
| Time-bound | “Eventually deploy” | “Offline eval in 8 weeks; A/B test running by week 14” |
Stakeholder Analysis & Alignment
Stakeholder Map
| Stakeholder | Primary Goal | ML Impact | Concerns | Win Condition |
|---|---|---|---|---|
| Product Manager | Increase feature adoption, engagement | Faster iteration, better UX | Takes too long, derails roadmap | 5% CTR improvement, ships in Q3 |
| CEO/CFO | Revenue, profitability | Unlock new revenue stream, reduce ops cost | ROI uncertain, high infrastructure cost | $2M incremental revenue or $500k cost savings |
| End User | Better experience, privacy | Personalized recommendations, faster | Privacy erosion, unfair treatment | Discovers movies they love, no data misuse |
| Ops/Finance | Cost reduction, efficiency | Automate decisions, reduce headcount needs | Automation backfires, regulatory scrutiny | 20% reduction in decision latency, cost savings |
| Data/ML Team | Technical growth, interesting problems | Solve hard problem, publish | Impossible data quality, wrong metrics | Ship product, measure impact, learn |
| Legal/Privacy | Risk mitigation, compliance | Explainability, fairness testing, data governance | Model bias, GDPR violation, unfair impact | Audit trail, explainability, fairness metrics |
Alignment Exercise: Schedule a 90-min workshop with all stakeholders. Go through the 20 questions together. If there’s disagreement on the goal, the project isn’t ready — alignment is a prerequisite, not a deliverable.
Success Metrics: The Business to ML Chain
The core insight: business metrics are what ultimately matter, but they often have high variance and long feedback loops. Use ML metrics as proxies:
Business Goal (Observable but Noisy)
↓
Business Metrics (Weekly/monthly measurement)
↓
ML Metrics (Per-prediction measurement)
↓
Model Optimization Target (What to train on)
Example 1: Recommendation CTR Improvement
Business Goal: Increase user engagement and ad revenue
Business Metric: CTR (clicks / impressions shown)
- Measured daily, but has variance from user behavior, time-of-day, seasonality
- 1% CTR improvement → $50k/month additional revenue (for Netflix scale)
ML Metrics:
- NDCG@10 (ranking quality) — correlates with CTR
- Diversity — ensures recommendations aren’t repetitive
- Freshness — favor recent/trending content
Training Objective: Maximize NDCG@10 + 0.1 * Diversity - 0.05 * Latency
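A minimal sketch of how such a composite objective could be scored offline. The weights mirror the formula above; `ndcg_at_k` and `diversity` are illustrative helpers, not a production ranking loss:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for a ranked list of graded relevance labels."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def diversity(categories):
    """Fraction of distinct categories among recommended items."""
    return len(set(categories)) / len(categories)

def combined_objective(relevances, categories, latency_s, w_div=0.1, w_lat=0.05):
    # Mirrors NDCG@10 + 0.1 * Diversity - 0.05 * Latency from the text
    return ndcg_at_k(relevances) + w_div * diversity(categories) - w_lat * latency_s

# Perfectly ordered 3-item slate spanning 2 of 3 categories, served in 200ms
score = combined_objective([3, 2, 1], ["drama", "comedy", "drama"], latency_s=0.2)
```

A perfectly ordered list scores NDCG@10 = 1.0, so the diversity and latency terms act as small nudges rather than dominating the objective — that balance is exactly what the weights encode.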
Example 2: Fraud Detection
Business Goal: Minimize fraud loss while allowing legitimate transactions
Business Metrics:
- Fraud loss: $ stolen / total transaction volume
- False positive rate: legitimate transactions declined / total legit transactions
- Cannot optimize both simultaneously — trade-off set by stakeholders
ML Metrics:
- Precision: % of flagged transactions actually fraudulent
- Recall: % of actual fraud caught
- Threshold: balance is business decision, not model decision
Training Objective: Precision-Recall curve; ops team chooses threshold
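One way to make the threshold a business decision is to sweep candidate thresholds against stakeholder-supplied costs. A sketch, assuming hypothetical per-transaction costs of $5 per false positive (declined legit transaction) and $100 per missed fraud:

```python
def pick_threshold(scores, labels, cost_fp=5.0, cost_fn=100.0):
    """Choose the decision threshold that minimizes expected business cost.

    scores: model fraud probabilities; labels: 1 = fraud, 0 = legitimate.
    cost_fp / cost_fn are hypothetical dollar costs set by the ops team.
    """
    best_t, best_cost = None, float("inf")
    for t in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Toy scores/labels; in practice these come from a held-out validation set
threshold, expected_cost = pick_threshold([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

Changing the cost ratio moves the chosen threshold along the precision-recall curve, which is why the ratio belongs to stakeholders, not the model.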
Example 3: Demand Forecasting
Business Goal: Reduce stockouts and overstock simultaneously
Business Metrics:
- Stockout rate: customer requests for out-of-stock items / total requests
- Overstock cost: holding cost on excess inventory
- These are naturally opposed; balance matters
ML Metrics:
- MAPE (Mean Absolute Percentage Error): forecast accuracy
- Directional accuracy: did we predict up/down correctly?
- Uncertainty quantification: confidence intervals (not just point estimate)
Training Objective: Minimize MAPE while calibrating uncertainty
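The two accuracy metrics above can be computed in a few lines (the demand numbers are illustrative):

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent (assumes no zero actuals)."""
    n = len(actual)
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / n

def directional_accuracy(actual, forecast):
    """Fraction of periods where the forecast moved in the same direction
    as actual demand (up vs down relative to the previous period)."""
    hits = sum(
        1 for i in range(1, len(actual))
        if (actual[i] - actual[i - 1]) * (forecast[i] - forecast[i - 1]) > 0
    )
    return hits / (len(actual) - 1)

# Four periods of demand vs forecast (illustrative numbers)
actual = [100, 120, 110, 130]
forecast = [95, 125, 105, 128]
```

Note the two metrics can disagree: a forecast can have low MAPE while consistently missing turning points, which is why both are tracked.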
Constraints & Feasibility Assessment
Constraints are binding — they eliminate candidate solutions:
Technical Constraints
Data Availability:
- Do labeled examples exist? (e.g., transaction fraud labels)
- How many? (100 examples → too few; 100M → plenty)
- What’s the label quality? (User-supplied vs expert vs proxy)
- Can we collect more? (Cost to crowdsource labels?)
Latency Budget:
| Scenario | Budget | Consequence of Missing |
|---|---|---|
| Search ads | <10ms | User clicks before ad loads → lost revenue |
| Real-time ranking | 50–100ms | Slow page → users leave |
| Batch email | <5s per user | Infrastructure feasible |
| Offline analytics | <1hr | Almost anything works |
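Before committing to a budget, benchmark a candidate model's tail latency on sample inputs — the p99, not the mean, is what the budget must cover. A minimal harness (`predict_fn` and `payloads` are placeholders for your model and sample requests):

```python
import statistics
import time

def benchmark_latency(predict_fn, payloads, warmup=10):
    """Measure per-request latency (ms) and report median and p99."""
    for p in payloads[:warmup]:          # warm caches before measuring
        predict_fn(p)
    samples = []
    for p in payloads:
        start = time.perf_counter()
        predict_fn(p)
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p99_ms": samples[int(0.99 * (len(samples) - 1))],
    }

# Toy stand-in for a model: sum a 100-dim feature vector, 200 requests
stats = benchmark_latency(sum, [list(range(100))] * 200)
```

Run this on hardware comparable to production; a model that fits the budget on a laptop GPU may blow it on the serving fleet.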
Feature Availability:
- Which features can be computed at prediction time? (inference latency)
- Which require lookback windows? (how much history to cache?)
- Are some features unavailable during serving? (batch training data != serving data)
Infrastructure Capacity:
- Peak QPS required? (10 req/s → single server; 100k req/s → distributed)
- Model update frequency? (hourly retraining → Kubernetes; monthly → batch job)
- Budget for compute, storage, monitoring?
Regulatory & Fairness Constraints
Privacy (GDPR, CCPA, HIPAA):
- Can we use personal data? (Consent exists? Deletion-capable?)
- Retention period? (Can’t train on deleted data)
- Cross-border data flows? (EU data ≠ US servers)
Explainability (Finance, Healthcare, Legal):
- Must decisions be explainable? (Interpretable model required)
- Is “feature importance” sufficient or full reasoning path needed?
- Can we use black-box models (neural nets) or must it be rules/trees?
Fairness & Bias:
- Protected attributes (race, gender, age): training or measurement?
- Fairness definition? (Equal opportunity, demographic parity, equalized odds?)
- Audit requirement? (Periodic bias testing, human review)
Feasibility Checklist
Data Availability
[ ] Labeled data exists (at least 1k examples)
[ ] Label quality acceptable (>90% agreement on random sample)
[ ] Can collect more labels if needed (reasonable cost)
[ ] No data leakage issues identified
[ ] Data retention policy allows model training
Problem Clarity
[ ] Business goal is specific and measurable
[ ] ML problem clearly framed (X, y, distribution)
[ ] Success metrics defined (business + ML)
[ ] Baseline performance documented
[ ] Trade-offs (precision/recall, latency/accuracy) explicitly stated
Technical Feasibility
[ ] Latency budget achievable (benchmarked on sample data)
[ ] Required features can be computed in time
[ ] Serving infrastructure estimated (cost, architecture)
[ ] Team has required skills (data eng, ML, DevOps)
Organizational Readiness
[ ] Executive sponsor identified (CEO/CPO level)
[ ] Stakeholders aligned on goal (documented in meeting notes)
[ ] Ownership clear (who owns retraining, monitoring?)
[ ] Timeline realistic (4–6 months minimum)
[ ] Budget approved (data, compute, engineering)
Regulatory & Ethical
[ ] Legal review completed (privacy, terms of service)
[ ] Fairness concerns identified (protected attributes, disparate impact)
[ ] Explainability requirements understood
[ ] User consent/transparency plan in place
ROI & Business Case Framework
A common mistake: spending 6 months building, then discovering nobody wants it.
3-Level ROI Calculation
Level 1: Revenue Impact
- Base case: “Lift CTR by 3% → $500k incremental revenue annually”
- Probability: 70% (model performs as expected)
- Expected value: $500k x 0.7 = $350k
Level 2: Cost Savings
- Base case: “Automate moderation, save 2 FTE ($200k/year)”
- Probability: 80% (automation doesn’t catch all cases, some human review remains)
- Expected value: $200k x 0.8 = $160k
Level 3: Opportunity Cost
- Investment: 3 engineers x 6 months x $180k/yr fully loaded = $270k labor
- Infrastructure: $50k/month x 6 months = $300k
- Opportunity: Same 3 engineers could build X instead (know the alternative)
- Net: ($350k + $160k) - ($270k + $300k) = -$60k (do not proceed!)
Decision Rule: If expected value < 0 or probability < 50%, don’t start. If expected value high but probability uncertain, run cheaper POC first.
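The three-level arithmetic above packages naturally into a quick calculator (the probabilities are stakeholder estimates, not model outputs):

```python
def expected_net_value(revenue_lift, p_revenue, cost_savings, p_savings,
                       labor_cost, infra_cost):
    """Probability-weighted value minus total investment (3-level ROI)."""
    expected_value = revenue_lift * p_revenue + cost_savings * p_savings
    return expected_value - (labor_cost + infra_cost)

# Numbers from the worked example above, all in dollars
net = expected_net_value(500_000, 0.7, 200_000, 0.8, 270_000, 300_000)
# Negative net → don't start, or run a cheaper POC first
```

Re-running the calculation under pessimistic probabilities (e.g. halving `p_revenue`) is a cheap sensitivity check before any engineering time is spent.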
Real-World Examples: How Companies Frame Goals
Google Search Quality (Ranking)
Business Goal: Increase user satisfaction and ad revenue
- Business Metric: Aggregate NDCG@10 (pairwise relevance judgments on 10k query samples quarterly)
- ML Metric: Offline LambdaMART loss, online CTR
- Success Threshold: 0.5% improvement in NDCG = go/no-go for production
- Timeline: 3 months offline eval → 4-week canary on 5% → full rollout
- Team: 50+ engineers (signals, feature engineering, large-scale training, serving)
- Challenge: Concept drift (new entities, news); need continuous retraining
Netflix Recommendations (Ranking)
Business Goal: Increase engagement (watch hours, retention)
- Business Metric: Watch hours per member, churn rate
- ML Metric: NDCG@10 (member feedback on recommendations), diversity (avoid algorithm bubble)
- Success Threshold: 2% watch hour lift in A/B test
- Latency Constraint: <100ms (member sees recs in seconds)
- Timeline: 8 weeks offline → 2-week slow rollout → monitor weekly
- Team: 20+ ML engineers + data platform team
- Challenge: Cold-start (new members, new content); multi-objective (relevance + diversity + freshness)
Stripe Fraud Detection (Classification)
Business Goal: Prevent fraud while minimizing false positives (good UX)
- Business Metrics: Fraud rate (% of transactions), decline rate (% of legit declined)
- ML Metrics: Precision (TP/(TP+FP)), Recall (TP/(TP+FN))
- Success Threshold: Precision >= 99% at 95% recall (catch 95% of fraud; fewer than 1% of flagged transactions are legitimate)
- Latency Constraint: <50ms (payment approval)
- Timeline: 6 weeks of offline modeling → 2-week shadow mode → threshold calibration → production
- Team: 10 ML engineers + fraud operations team
- Challenge: Adversarial (fraudsters adapt); concept drift is daily
Implementation Example: Problem Statement Template
Use this one-pager to lock in alignment before starting:
PROJECT: [Name] — [One-sentence goal]
DATE: YYYY-MM-DD
OWNER: [Name, Title]
PROBLEM DEFINITION
==================
Current State: [Describe operational reality]
Desired State: [After ML deployment]
Why Matters: [Business impact if solved]
STAKEHOLDERS & SUCCESS CRITERIA
================================
Product: [Motivated by X, success = Y]
Finance: [ROI target, timeline]
Ops: [Operational burden, scale requirements]
Legal/Privacy: [Compliance needs, fairness concerns]
SUCCESS METRICS
===============
Business: [KPI, how measured, target]
ML: [Offline metric, online metric, targets]
Trade-offs: [Precision vs recall? Latency vs accuracy?]
CONSTRAINTS & ASSUMPTIONS
==========================
Data: [Sources, volume, quality, labeling cost/timeline]
Latency: [Budget, implication of missing]
Regulatory: [GDPR? Explainability required? Fairness audits?]
Team Skills: [Python? Distributed systems? Do we need to hire?]
Infrastructure: [Available? Cost estimate?]
FEASIBILITY ASSESSMENT
======================
Data availability: pass / warn / fail
Problem clarity: pass / warn / fail
Technical feasibility: pass / warn / fail
Stakeholder alignment: pass / warn / fail
Timeline realistic: pass / warn / fail
RECOMMENDATION
==============
GO / NO-GO / POC FIRST (why?)
NEXT STEP
=========
[If GO: Schedule kick-off; if POC: Design 2-week experiment]
How Real Companies Use This
Airbnb (Smart Pricing Goal: Increase Host Revenue): Airbnb’s ML Lifecycle for Smart Pricing starts with a clear business goal identified by the platform team: increase host revenue per night while maintaining guest satisfaction. ML framing translated this into a regression problem: predict optimal nightly price given property attributes, calendar date, local events, and historical demand. Success metric was explicitly tied to business outcome: revenue per listing per month, not model accuracy (MAPE). The team ran stakeholder alignment workshops with hosts (who want higher prices), guests (who want affordability), and operations (who wanted stable supply). Key constraint: the pricing model must be explainable to hosts (“why did you suggest this price?”), limiting black-box models. Timeline: 90 days to MVP with weekly stakeholder reviews. Result: 3–5% revenue increase for hosts adopting Smart Pricing, validated via A/B test on 50k listings over 6 weeks.
Spotify (Discover Weekly Goal: Increase Engagement): Spotify’s Discover Weekly feature originated from a business goal: increase user engagement and retention via personalization. Problem framing: generate 30 song recommendations per user per week such that the user listens for 30+ seconds (proxy for genuine interest). Success metric was NOT model accuracy; it was stream minutes and subscriber retention, tracked via A/B testing. Stakeholder alignment required buy-in from product (feature visibility), marketing (retention messaging), and data (label quality). Key challenge: cold-start for new artists (no listening data) solved via content-based filtering. Feasibility assessment: do we have enough listening history? Yes (500M users x 10 years). Timeline: 12 weeks from goal to production. Business impact: 4% increase in listener retention, driving $100M+ incremental annual revenue.
LinkedIn (Feed Ranking Goal: Meaningful Professional Interactions): LinkedIn’s feed ranking goal was ambiguous at first: “increase engagement.” Stakeholder workshops revealed disagreement on what “engagement” meant. Product wanted clicks, recruiters wanted profile views, advertising wanted ad clicks. Alignment took 6 weeks of negotiation, resulting in redefined goal: “increase meaningful professional interactions” (defined as comments, shares, job applies, message starts). ML framing: rank feed items by probability of positive action within 24 hours. Features: post recency, author influence, member interest signals. Constraints: explainability required (members see reasons in “Why seeing this?” prompt). Feasibility: 500M members x 1,000 actions/day = sufficient labeled data. Timeline: 16 weeks due to stakeholder alignment overhead. Result: 12% increase in meaningful interactions, improving platform health and reducing low-quality content.
Uber (Delivery Time Prediction Goal: Minimize Lateness Complaints): Uber Eats’ business goal was to reduce “late delivery” complaints, which correlate with churn. Problem framing initially: predict exact delivery time (regression, RMSE metric). However, stakeholder analysis revealed customers don’t care about precise predictions; they care about threshold promises (“30 minutes or free”). Reframed as: predict ordinal class (on-time / 5-min late / 10-min late / 15+ min late), optimizing F1 score at each threshold. ROI calculation: each late delivery costs ~$5 in refunds + churn risk. A 10% reduction in late deliveries = $2M annual savings. Constraints: real-time serving (<100ms), rely only on features available at order placement. Feasibility: 100M+ delivery labels available. Timeline: 8 weeks. Result: 8% improvement in on-time delivery rate, directly impacting customer satisfaction scores.
Netflix (Reducing Churn via Personalization Goal): Netflix’s original business goal for their ML Lifecycle was “reduce churn,” defined as members canceling within 30 days. Problem framing revealed two strategies: (1) improve content recommendations (relevance), (2) detect at-risk members early (prediction). The second led to an early-warning system: predict probability of churn within 7 days based on viewing patterns. Success metrics: precision (% of predicted-churn members who actually churn) must be 80%+ to avoid false positives (wasted retention offers). Constraints: GDPR-compliant (no location data), explainability required (why is this member at risk?). Feasibility: billions of viewing records available. Stakeholder alignment took 4 weeks (product wanted features, retention ops wanted predictions early). Timeline: 12 weeks to production. Result: 2–3% churn reduction, equivalent to $20M+ annual revenue retention.
References
- Machine Learning Yearning (Andrew Ng) — Free ebook on problem framing and project management
- Rules of Machine Learning: Best Practices for ML Engineering (Google) — Rules 1–10 cover goal-setting
- What You Need To Know Before Building a Recommender System (Eugene Yan) — Framework for recommendations specifically
- An Introduction to Statistical Learning (James et al.) — Chapter on problem formulation
- Fairness and Machine Learning (Barocas, Hardt, Narayanan) — Ethical considerations in problem definition
- Chip Huyen: ML Systems Design – Scoping Phase — Problem scoping in-depth
- The AI Hierarchy of Needs (Monica Rogati) — Why most companies fail at ML (wrong problem)