The ML Lifecycle
Systematic end-to-end journey from business question through production deployment and continuous monitoring -- a feedback loop designed to deliver measurable value while managing risk.
The Six-Stage Lifecycle
The ML lifecycle is fundamentally iterative. Each stage produces outputs that feed into the next, but monitoring creates a feedback loop that often requires returning to earlier stages when performance degrades or business objectives shift.
Stage 1: Business Goal Identification (1–2 weeks, 0.5 FTE)
Objective: Define WHAT problem to solve and WHY, before considering HOW.
This stage prevents wasted effort on technically interesting but business-irrelevant problems. Key activities:
- Stakeholder alignment: Map who benefits (product, revenue, operations) and who loses (privacy concerns, job displacement, fairness impacts)
- Problem type: Classify as classification, regression, ranking, clustering, or anomaly detection
- Success metrics: Link business metrics (revenue, engagement, cost savings) to ML metrics (accuracy, latency, fairness)
- Feasibility assessment: Data availability, regulatory constraints, timeline, budget
- Baseline establishment: What simple heuristic or status quo are we comparing against?
Key Deliverable: One-page problem statement with aligned stakeholders, clear success metrics, and decision criteria.
Stage 2: ML Problem Framing (2–4 weeks, 1 FTE)
Objective: Translate business problem into a concrete ML formulation.
This stage defines the exact machine learning task:
- Feature definition: What input variables (X) are available and predictable at serving time?
- Target definition: What are we predicting (y)? How do we collect ground truth labels?
- Cold-start problem: How do we handle new items/users without historical data?
- Temporal aspects: Is this time-series? Do we need lookback windows?
- Label quality: Can we programmatically label, or do we need human annotation?
- Proxy metrics: If the true metric is expensive to compute, what proxy correlates with it?
Key Deliverable: Feature specification document, labeling strategy, dataset schema, and initial data requirements.
Stage 3: Data Processing (4–12 weeks, 1–2 FTE)
Objective: Collect, label, clean, and prepare data at scale.
This is typically the longest and most resource-intensive stage:
- Data collection: Internal logs, third-party APIs, user-generated, or synthetic data
- Labeling: Crowdsourcing (Amazon Mechanical Turk, Appen), programmatic rules, active learning, or human expertise
- Data quality: Duplicates, missing values, outliers, measurement errors
- Feature engineering: Normalization, categorical encoding, domain-specific transformations
- Train/val/test split: Temporal splits for time-series, stratified splits for imbalanced data, avoiding data leakage
- Data versioning: Track dataset snapshots (DVC, Delta Lake) so models can be reproduced
Key Deliverable: Clean, reproducible dataset with train/val/test splits and feature definitions.
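The leakage-avoidance advice above can be sketched as a temporal split: train on the past, validate and test on the future. This is a minimal sketch on a synthetic interaction log; the column names and 70/15/15 ratios are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic interaction log (illustrative schema)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2026-01-01", periods=1000, freq="h"),
    "feature": rng.normal(size=1000),
    "clicked": rng.integers(0, 2, size=1000),
})

# Temporal split: a random split here would leak future information into training.
df = df.sort_values("timestamp")
n = len(df)
train = df.iloc[: int(0.70 * n)]
val = df.iloc[int(0.70 * n): int(0.85 * n)]
test = df.iloc[int(0.85 * n):]

# Every training row strictly precedes every validation/test row
assert train["timestamp"].max() < val["timestamp"].min()
assert val["timestamp"].max() < test["timestamp"].min()
```

For non-temporal, imbalanced data, swap this for a stratified split; the point is that the split strategy is chosen by the data's structure, not by default.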
Stage 4: Model Development (4–16 weeks, 1–2 FTE)
Objective: Train, tune, and iterate on candidate models.
This stage involves rapid experimentation:
- Algorithm selection: Rule-based baseline → simple model (logistic regression, decision tree) → complex model (gradient boosting, neural networks)
- Hyperparameter tuning: Grid search, random search, or Bayesian optimization (Optuna, Hyperopt)
- Cross-validation: K-fold for small datasets, time-series split for temporal data
- Error analysis: Confusion matrix, learning curves, segment-level performance (which user groups fail?)
- Experiment tracking: Log all runs in MLflow or Weights & Biases for reproducibility
- Model comparison: Statistical testing (is improvement statistically significant?)
Key Deliverable: Trained model, experiment logs, error analysis report, and model card (documentation of capabilities and limitations).
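The baseline-first progression and the statistical-significance check can be sketched together on synthetic data. The models, fold count, and dataset here are illustrative; the pattern is what matters: score baseline and challenger on the same folds, then test whether the lift exceeds fold-to-fold noise.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Baseline first, then the challenger -- same folds, so scores are paired
baseline = DummyClassifier(strategy="most_frequent")
challenger = GradientBoostingClassifier(n_estimators=50, random_state=42)

base_scores = cross_val_score(baseline, X, y, cv=10, scoring="accuracy")
chal_scores = cross_val_score(challenger, X, y, cv=10, scoring="accuracy")

# Paired t-test over fold scores: is the improvement real or noise?
t_stat, p_value = stats.ttest_rel(chal_scores, base_scores)
print(f"baseline {base_scores.mean():.3f}, "
      f"challenger {chal_scores.mean():.3f}, p={p_value:.4f}")
```

If the p-value is above your threshold, the honest conclusion is "no measurable improvement over baseline", which is exactly the outcome a missing baseline hides.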
Stage 5: Model Deployment (2–8 weeks, 1–2 FTE)
Objective: Package and release model to production safely.
This stage minimizes risk through staged rollout:
- Model serialization: Save model weights + preprocessing pipeline (ONNX, TF SavedModel, PyTorch JIT)
- API serving: Real-time (low latency) via TorchServe/TF Serving, or batch (high throughput) via Spark/Airflow
- Serving infrastructure: Containerization (Docker), orchestration (Kubernetes), autoscaling
- Shadow mode: Run new model in parallel without impacting users; log predictions for audit
- Canary release: Roll out to 1–5% of traffic first, monitor for issues
- A/B testing: Compare new model against baseline on 50/50 traffic split; measure business metrics, not just ML metrics
Key Deliverable: Deployed model, serving API, A/B test infrastructure, and monitoring dashboards.
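The A/B comparison above reduces to a significance test on the business metric. Below is a minimal hand-rolled two-proportion z-test for a CTR difference under a 50/50 split; the click counts are hypothetical, and production teams normally rely on an experimentation platform rather than this arithmetic.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a CTR difference between control (A) and treatment (B)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: baseline CTR ~3.5%, new model ~3.9%, 100k users each
z, p = two_proportion_z_test(clicks_a=3500, n_a=100_000,
                             clicks_b=3900, n_b=100_000)
print(f"z={z:.2f}, p={p:.4f}")  # reject H0 at alpha=0.05 if p < 0.05
```

Note the test runs on the business metric (CTR), not the offline ML metric; a model can improve accuracy while leaving CTR flat.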
Stage 6: Model Monitoring (Ongoing, 0.2–0.5 FTE)
Objective: Detect degradation and trigger retraining before users are impacted.
This is the stage where most production failures surface:
- Performance monitoring: Track accuracy, latency, cost in production
- Data drift: Input distribution shift detected via Kolmogorov-Smirnov test or Population Stability Index
- Concept drift: True label distribution changes (ground truth changed, user behavior shifted)
- Retraining triggers: Automatic (weekly/monthly) or on-demand (when drift detected)
- Feedback loops: Integrate user feedback (corrected labels) into retraining pipeline
- SLA enforcement: Alert if accuracy drops below threshold or latency exceeds budget
Key Deliverable: Real-time dashboards (Grafana, Prometheus), drift detection alerts, automated retraining pipeline.
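The two drift checks named above can be sketched as follows. `ks_2samp` is SciPy's two-sample Kolmogorov-Smirnov test; the PSI implementation and the 0.25 threshold are common conventions rather than a standard library API, and the shifted-mean "production" sample is synthetic.

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference, current, bins=10):
    """Population Stability Index over quantile bins of the reference sample."""
    cuts = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_frac = np.bincount(np.searchsorted(cuts, reference), minlength=bins) / len(reference)
    cur_frac = np.bincount(np.searchsorted(cuts, current), minlength=bins) / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # avoid log(0) on empty bins
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(7)
reference = rng.normal(0.0, 1.0, 10_000)  # feature distribution at training time
current = rng.normal(1.0, 1.0, 10_000)    # production distribution, mean shifted

ks_stat, ks_pvalue = ks_2samp(reference, current)
drift_detected = ks_pvalue < 0.05 or psi(reference, current) > 0.25
print(f"KS p={ks_pvalue:.2e}, PSI={psi(reference, current):.2f}, drift={drift_detected}")
```

In practice this check runs per feature on a schedule, and `drift_detected` feeds the retraining trigger rather than a print statement.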
When ML Is Appropriate
ML is powerful but not always the right tool. Use this framework:
Use ML when:
- Problem has clear objective functions and sufficient labeled data
- Patterns are non-obvious (human heuristics insufficient)
- Scalability matters (thousands/millions of decisions per day)
- Environment changes over time (feedback loop justifies retraining)
- Trade-offs are acceptable (explainability vs accuracy, latency vs quality)
Avoid ML when:
- Data is scarce (<1k examples) and humans are reliable
- Rules are simple enough (hardcoded logic is maintainable and sufficient)
- Explainability is legally required and model is a black box
- Cost of wrong predictions is catastrophic (healthcare without oversight)
- Latency or infrastructure requirements prohibitive
Key Properties by Stage
| Dimension | Goal ID | Problem Frame | Data Processing | Model Dev | Deployment | Monitoring |
|---|---|---|---|---|---|---|
| Duration | 1–2 weeks | 2–4 weeks | 4–12 weeks | 4–16 weeks | 2–8 weeks | Ongoing |
| Effort (FTE) | 0.5 | 1 | 1–2 | 1–2 | 1–2 | 0.2–0.5 |
| Cost (if outsourced) | ~$5–10k | ~$10–20k | ~$20–100k | ~$30–150k | ~$10–50k | ~$5–20k/month |
| Primary Risk | Wrong problem | Bad labels | Data leakage | Overfitting | Silent failure | Undetected drift |
| Tools | Spreadsheet | Python/SQL | dbt, DVC, Spark | scikit-learn, PyTorch | Docker, K8s, TF Serving | Prometheus, Grafana |
Company Maturity Impact
The ML lifecycle differs dramatically by company stage:
Startups (0–50 engineers)
- Fast iteration, accept technical debt, focus on business impact
- One engineer wearing many hats (data collection → deployment)
- Simple tools (scikit-learn, hosted serving on Heroku/Lambda)
- Timeline: 4–8 weeks, minimal monitoring
- Example: Early-stage fraud detection with simple decision tree
Growth-Stage (50–500 engineers)
- Dedicated ML team, standardized infrastructure (Kubernetes, feature stores)
- Experiment tracking, A/B testing framework in place
- Multiple concurrent projects
- Timeline: 8–16 weeks, robust monitoring
- Example: Recommendation system with Airflow pipelines, Kafka for feedback
Enterprise (500+ engineers)
- ML platforms team building internal tools for other teams
- Strict governance, compliance, audit trails
- Complex models serving millions of requests/day
- Timeline: 12–24 weeks with regulatory review
- Example: Credit risk modeling with explainability, fairness testing, SLA enforcement
Feedback Loop: Why Monitoring Matters
The lifecycle isn’t linear; it’s a loop. Monitoring data feeds back into retraining:
Deployment → Production Model → Monitoring & A/B Tests → Business Metrics
Concept drift detected → Problem reframing → Retrain on new data → Redeploy
Example failure scenario:
- Model deployed with 95% accuracy (Stage 5)
- Three months pass; accuracy drops to 88% (Stage 6 detects via monitoring)
- Investigation reveals user behavior shifted (concept drift)
- Root cause: competitor added feature, changed user workflows
- Reframe problem (Stage 2), collect new labels (Stage 3), retrain (Stage 4)
- Redeploy with new model (Stage 5)
Companies that skip monitoring often discover problems through customer complaints, not dashboards.
Common Pitfalls & How to Avoid Them
| Pitfall | Stage | Impact | Prevention |
|---|---|---|---|
| Data before problem | 1→2 | 3–6 months wasted | Write problem statement FIRST |
| Wrong success metrics | 1 | Optimize for wrong objective | Link business→ML metrics with stakeholders |
| Data leakage | 3 | Inflated eval metrics, poor production performance | Use time-series split, never touch test set |
| Overfitting | 4 | Model fails on new data | Strict holdout test set, cross-validation |
| No baseline | 4 | Can’t measure improvement | Always train simple model first |
| Deploying untested model | 5 | Breaks production | Use canary + shadow mode first |
| No monitoring | 6 | Silent degradation | Real-time dashboards + automated alerts |
| Retraining without feedback | 6 | Model converges to wrong solution | Integrate user corrections into pipeline |
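The data-leakage row above has a second, subtler form worth demonstrating: fitting preprocessing on the full dataset leaks test-set statistics into training. A minimal sketch of the safe pattern with scikit-learn, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Leaky: the scaler sees test-set statistics before evaluation
# scaler = StandardScaler().fit(X)  # <- fit on ALL data: leakage
# X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Safe: preprocessing is fit only on training data, inside the pipeline,
# so cross-validation and final evaluation never see test statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
acc = pipeline.score(X_test, y_test)
print(f"test accuracy: {acc:.3f}")
```

The pipeline also travels as one artifact to serving, which closes the related pitfall of training-time and serving-time preprocessing drifting apart.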
Real-World Timeline Expectations
Netflix Recommendation System:
- Goal ID: 1 week (already motivated)
- Problem Framing: 3 weeks (massive feature space, cold-start problem)
- Data Processing: 8 weeks (terabytes of watch history, complex labeling)
- Model Dev: 12 weeks (Bayesian personalization, multi-armed bandit for exploration)
- Deployment: 4 weeks (canary on 1% traffic, A/B test infrastructure)
- Monitoring: 8+ engineers full-time (daily retraining, drift detection)
- Total: 6–8 months; ongoing investment
Stripe Fraud Detection:
- Goal ID: 1 week
- Problem Framing: 2 weeks (binary classification, real-time serving requirement)
- Data Processing: 4 weeks (billions of transactions, programmatic labeling)
- Model Dev: 6 weeks (gradient boosting + neural network ensemble, feature importance analysis)
- Deployment: 3 weeks (sub-10ms latency requirement, canary on 0.1%)
- Monitoring: Continuous (fraud patterns evolve daily, concept drift detected hourly)
- Total: 3–4 months; heavy ongoing monitoring and retraining
Implementation Example: Full Lifecycle Workflow
Below is a realistic end-to-end Python workflow sketch (simplified; production versions are 10x larger):
```python
import mlflow
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stage 1: Business Goal (documented in problem_statement.md)
# Goal: Predict if user will click recommended item (CTR prediction)
# Success metric: NDCG@10 >= 0.75 (offline), 5% CTR lift (online A/B test)

# Stage 2: Problem Framing
# X: user features (age, location, history), item features (genre, popularity)
# y: binary (clicked=1, not clicked=0)
# Baseline: 35% CTR (current heuristic: recommend popular items)

# Stage 3: Data Processing
df = pd.read_csv('s3://data-lake/user_item_interactions_2026_q1.csv')
X = df[['user_age', 'user_location', 'item_genre', 'item_popularity']]
y = df['clicked']

# Train/val/test split (stratified, fixed seed for reproducibility;
# prefer a temporal split if interactions are strongly time-dependent)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

# Stage 4: Model Development
with mlflow.start_run():
    params = {'max_depth': 8, 'n_estimators': 200, 'learning_rate': 0.05}
    mlflow.log_params(params)
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)

    y_pred_val = model.predict(X_val)
    y_pred_test = model.predict(X_test)
    test_acc = accuracy_score(y_test, y_pred_test)
    test_precision = precision_score(y_test, y_pred_test)
    test_recall = recall_score(y_test, y_pred_test)
    mlflow.log_metrics({
        'val_accuracy': accuracy_score(y_val, y_pred_val),
        'test_accuracy': test_acc,
        'test_precision': test_precision,
        'test_recall': test_recall,
    })
    mlflow.sklearn.log_model(model, 'model')
    print(f"Test Accuracy: {test_acc:.3f}, "
          f"Precision: {test_precision:.3f}, Recall: {test_recall:.3f}")
    # Expected: Test Acc ~73%, Precision ~62%, Recall ~55% (baseline: 35%)

# Stage 5: Model Deployment (infrastructure setup, canary testing)
# model_registry.register_model('ctr-predictor', version=1, stage='staging')
# k8s_deploy(model_uri='models:/ctr-predictor/staging', traffic_percent=1)
# wait_for_metrics(latency_p99 < 50ms, error_rate < 0.1%)
# gradual_rollout(1% → 10% → 100% traffic)

# Stage 6: Model Monitoring (automated daily)
# cron_job: 'SELECT * FROM user_item_interactions WHERE date >= yesterday'
# drift_detector.ks_test(current_feature_dist, reference_dist, threshold=0.05)
# if drift_detected: trigger_retraining()
```
How Real Companies Use This
Meta (Recommendation Systems at 3B Daily Active Users): Meta manages 3,000+ ML models in production across their platform, with the full 6-stage lifecycle supporting recommendation systems for Feed, Stories, and Reels. Their ML lifecycle spans 6–12 weeks from business goal identification to production deployment, involving cross-functional teams (product, data science, MLOps, infrastructure). Key challenge: concept drift happens weekly due to shifting user behavior and competitive pressure. Meta’s FBLearner Flow platform automates data collection, feature engineering, model training, evaluation, and canary deployment, enabling them to retrain recommendation models daily while running 50+ model lifecycles in parallel. Failure rate on deployment: <0.1% due to rigorous shadow mode testing (comparing current vs new model predictions on 10M users).
Google (Search Ranking at Millions of Queries/Day): Google’s search ranking system involves thousands of ranking features and 100+ models, with the ML Lifecycle structured around quarterly release cycles. Business goal is always “improve user satisfaction” (proxy: pairwise relevance judgments), translated into ML metrics (LambdaMART loss, CTR, dwell time). Problem framing takes 4 weeks to define which 5–10 ranking signals are most important. Data processing pipeline: queries from users → labels from human raters (10k queries quarterly) → offline evaluation using NDCG@10. Model development runs 1,000+ A/B experiments per year using Vizier (Bayesian hyperparameter optimization) on 10,000+ TPUs; single model training takes 4–6 hours. Deployment: strict canary protocol — new ranking model rolls out to 1% traffic first, monitored for 2 weeks, then scales to 100%. Timeline: idea → production = 3–6 months.
Uber (Real-Time Predictions Across Rideshare, Eats, Freight): Uber’s Michelangelo platform orchestrates the ML Lifecycle for 100+ production models handling real-time decisions: ETA prediction, surge pricing, driver matching, fraud detection. Business goals vary per model but share common traits: low latency (<50ms), high volume (1M+ predictions/second), concept drift is hourly. Data processing handles 5TB raw data daily from Kafka (ride events, GPS, payment) into feature store (Cassandra, 1TB cached features). Model development uses 100+ experiments per week with automated cross-validation. Deployment strategy: shadow mode for 2 weeks, canary on 1%, A/B test on 50%. Monitoring is continuous: drift detection triggers automatic retraining when Population Stability Index > 0.25 on any top-10 feature. Lifecycle per model: 8–12 weeks; retraining cadence: weekly (scheduled) + on-demand (drift-triggered).
Netflix (Personalization at 250M Subscribers): Netflix’s ML Lifecycle for recommendation systems aims to increase “watch hours” (business metric) via improved NDCG@10 (ML metric). Business goal identification is straightforward (already motivated by engagement), so problem framing dominates (3 weeks) due to cold-start complexity and need for both collaborative filtering and content-based features. Data processing uses 500TB+ of watch history data daily; labels are implicit (watch >2 minutes = relevant). Model development explores 50+ architectures over 6 months, comparing baselines (popularity, collaborative filtering) vs challengers (neural CF, graph neural networks). Chosen model: two-tower architecture with 8% precision@20 improvement. Deployment: blue/green with automatic rollback if watch hours drop >2%, shadow mode runs for 2 weeks before any new ranking model reaches 1% of users. Monitoring tracks concept drift (user preferences change seasonally) and triggers weekly retraining. Full lifecycle: 6–8 months from goal identification to steady-state production with continuous improvement.
Stripe (Fraud Detection with Adversarial Environment): Stripe’s ML Lifecycle for fraud detection is shaped by rapid concept drift (fraudsters adapt daily) and asymmetric costs (false positives damage UX; false negatives cost money). Business goal: minimize fraud loss while keeping false positive rate <0.1%. Problem framing: classification (fraud vs legitimate), with real-time serving constraint (<50ms). Data processing: billions of transactions, programmatic labels (chargebacks), ground truth arrives 30–90 days delayed. Model development baseline: gradient boosting (fraud detection is well-studied). Challenges: dataset imbalance (99.9% legitimate), adversarial adaptation. Solution: ensemble of gradient boosting + neural net, retraining weekly, with threshold calibration based on fraud team feedback. Deployment: canary on 0.1% (fraud is rare enough that 0.1% still covers ~10k transactions for evaluation). Monitoring is aggressive: false positive rate tracked hourly (SLA < 0.1%), automatic rollback if breached. Lifecycle: 3–4 months to initial production; continuous evolution thereafter (weekly retraining, bi-weekly threshold adjustments, monthly architecture experiments).
References
- Machine Learning Engineering for Production (Andrew Ng, MLOps.community) — Comprehensive course on full ML systems
- Rules of Machine Learning: Best Practices for ML Engineering (Google, Martin & Polyzotis) — 43 rules covering the full lifecycle
- Machine Learning Engineering (Andriy Burkov) — Practical guide emphasizing software engineering practices
- Chip Huyen: ML Systems Design (YouTube) — Systems thinking applied to ML
- Hidden Technical Debt in Machine Learning Systems (Google, 2015) — Why ML systems fail in production
- Designing Machine Learning Systems (Chip Huyen) — Recent comprehensive reference
- Google Cloud: AI/ML Best Practices — Enterprise lifecycle practices