The ML Lifecycle
Systematic end-to-end journey from business question through production deployment and continuous monitoring -- a feedback loop designed to deliver measurable value while managing risk.
The Six-Stage Lifecycle
The ML lifecycle is fundamentally iterative. Each stage produces outputs that feed into the next, but monitoring creates a feedback loop that often requires returning to earlier stages when performance degrades or business objectives shift.
Stage 1: Business Goal Identification (1–2 weeks, 0.5 FTE)
Objective: Define WHAT problem to solve and WHY, before considering HOW.
This stage prevents wasted effort on technically interesting but business-irrelevant problems. Key activities:
- Stakeholder alignment: Map who benefits (product, revenue, operations) and who loses (privacy concerns, job displacement, fairness impacts)
- Problem type: Classify as classification, regression, ranking, clustering, or anomaly detection
- Success metrics: Link business metrics (revenue, engagement, cost savings) to ML metrics (accuracy, latency, fairness)
- Feasibility assessment: Data availability, regulatory constraints, timeline, budget
- Baseline establishment: What simple heuristic or status quo are we comparing against?
Key Deliverable: One-page problem statement with aligned stakeholders, clear success metrics, and decision criteria.
Stage 2: ML Problem Framing (2–4 weeks, 1 FTE)
Objective: Translate business problem into a concrete ML formulation.
This stage defines the exact machine learning task:
- Feature definition: What input variables (X) are available and predictable at serving time?
- Target definition: What are we predicting (y)? How do we collect ground truth labels?
- Cold-start problem: How do we handle new items/users without historical data?
- Temporal aspects: Is this time-series? Do we need lookback windows?
- Label quality: Can we programmatically label, or do we need human annotation?
- Proxy metrics: If the true metric is expensive to compute, what proxy correlates with it?
Key Deliverable: Feature specification document, labeling strategy, dataset schema, and initial data requirements.
Stage 3: Data Processing (4–12 weeks, 1–2 FTE)
Objective: Collect, label, clean, and prepare data at scale.
This is typically the longest and most resource-intensive stage:
- Data collection: Internal logs, third-party APIs, user-generated, or synthetic data
- Labeling: Crowdsourcing (Amazon Mechanical Turk, Appen), programmatic rules, active learning, or human expertise
- Data quality: Duplicates, missing values, outliers, measurement errors
- Feature engineering: Normalization, categorical encoding, domain-specific transformations
- Train/val/test split: Temporal splits for time-series, stratified splits for imbalanced data, avoiding data leakage
- Data versioning: Track dataset snapshots (DVC, Delta Lake) so models can be reproduced
Key Deliverable: Clean, reproducible dataset with train/val/test splits and feature definitions.
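The leakage-avoidance advice above can be sketched as a temporal split: train on the past, validate and test on the future. This is a minimal sketch on a synthetic interaction log; the column names and 70/15/15 ratios are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic interaction log (illustrative schema)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2026-01-01", periods=1000, freq="h"),
    "feature": rng.normal(size=1000),
    "clicked": rng.integers(0, 2, size=1000),
})

# Temporal split: a random split here would leak future information into training.
df = df.sort_values("timestamp")
n = len(df)
train = df.iloc[: int(0.70 * n)]
val = df.iloc[int(0.70 * n): int(0.85 * n)]
test = df.iloc[int(0.85 * n):]

# Every training row strictly precedes every validation/test row
assert train["timestamp"].max() < val["timestamp"].min()
assert val["timestamp"].max() < test["timestamp"].min()
```

For non-temporal, imbalanced data, swap this for a stratified split; the point is that the split strategy is chosen by the data's structure, not by default.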
Stage 4: Model Development (4–16 weeks, 1–2 FTE)
Objective: Train, tune, and iterate on candidate models.
This stage involves rapid experimentation:
- Algorithm selection: Rule-based baseline → simple model (logistic regression, decision tree) → complex model (gradient boosting, neural networks)
- Hyperparameter tuning: Grid search, random search, or Bayesian optimization (Optuna, Hyperopt)
- Cross-validation: K-fold for small datasets, time-series split for temporal data
- Error analysis: Confusion matrix, learning curves, segment-level performance (which user groups fail?)
- Experiment tracking: Log all runs in MLflow or Weights & Biases for reproducibility
- Model comparison: Statistical testing (is improvement statistically significant?)
Key Deliverable: Trained model, experiment logs, error analysis report, and model card (documentation of capabilities and limitations).
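The baseline-first progression and the statistical-significance check can be sketched together on synthetic data. The models, fold count, and dataset here are illustrative; the pattern is what matters: score baseline and challenger on the same folds, then test whether the lift exceeds fold-to-fold noise.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Baseline first, then the challenger -- same folds, so scores are paired
baseline = DummyClassifier(strategy="most_frequent")
challenger = GradientBoostingClassifier(n_estimators=50, random_state=42)

base_scores = cross_val_score(baseline, X, y, cv=10, scoring="accuracy")
chal_scores = cross_val_score(challenger, X, y, cv=10, scoring="accuracy")

# Paired t-test over fold scores: is the improvement real or noise?
t_stat, p_value = stats.ttest_rel(chal_scores, base_scores)
print(f"baseline {base_scores.mean():.3f}, "
      f"challenger {chal_scores.mean():.3f}, p={p_value:.4f}")
```

If the p-value is above your threshold, the honest conclusion is "no measurable improvement over baseline", which is exactly the outcome a missing baseline hides.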
Stage 5: Model Deployment (2–8 weeks, 1–2 FTE)
Objective: Package and release model to production safely.
This stage minimizes risk through staged rollout:
- Model serialization: Save model weights + preprocessing pipeline (ONNX, TF SavedModel, PyTorch JIT)
- API serving: Real-time (low latency) via TorchServe/TF Serving, or batch (high throughput) via Spark/Airflow
- Serving infrastructure: Containerization (Docker), orchestration (Kubernetes), autoscaling
- Shadow mode: Run new model in parallel without impacting users; log predictions for audit
- Canary release: Roll out to 1–5% of traffic first, monitor for issues
- A/B testing: Compare new model against baseline on 50/50 traffic split; measure business metrics, not just ML metrics
Key Deliverable: Deployed model, serving API, A/B test infrastructure, and monitoring dashboards.
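The A/B comparison above reduces to a significance test on the business metric. Below is a minimal hand-rolled two-proportion z-test for a CTR difference under a 50/50 split; the click counts are hypothetical, and production teams normally rely on an experimentation platform rather than this arithmetic.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a CTR difference between control (A) and treatment (B)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: baseline CTR ~3.5%, new model ~3.9%, 100k users each
z, p = two_proportion_z_test(clicks_a=3500, n_a=100_000,
                             clicks_b=3900, n_b=100_000)
print(f"z={z:.2f}, p={p:.4f}")  # reject H0 at alpha=0.05 if p < 0.05
```

Note the test runs on the business metric (CTR), not the offline ML metric; a model can improve accuracy while leaving CTR flat.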
Stage 6: Model Monitoring (Ongoing, 0.2–0.5 FTE)
Objective: Detect degradation and trigger retraining before users are impacted.
This is the stage where most production failures surface:
- Performance monitoring: Track accuracy, latency, cost in production
- Data drift: Input distribution shift detected via Kolmogorov-Smirnov test or Population Stability Index
- Concept drift: True label distribution changes (ground truth changed, user behavior shifted)
- Retraining triggers: Automatic (weekly/monthly) or on-demand (when drift detected)
- Feedback loops: Integrate user feedback (corrected labels) into retraining pipeline
- SLA enforcement: Alert if accuracy drops below threshold or latency exceeds budget
Key Deliverable: Real-time dashboards (Grafana, Prometheus), drift detection alerts, automated retraining pipeline.
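The two drift checks named above can be sketched as follows. `ks_2samp` is SciPy's two-sample Kolmogorov-Smirnov test; the PSI implementation and the 0.25 threshold are common conventions rather than a standard library API, and the shifted-mean "production" sample is synthetic.

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference, current, bins=10):
    """Population Stability Index over quantile bins of the reference sample."""
    cuts = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_frac = np.bincount(np.searchsorted(cuts, reference), minlength=bins) / len(reference)
    cur_frac = np.bincount(np.searchsorted(cuts, current), minlength=bins) / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # avoid log(0) on empty bins
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(7)
reference = rng.normal(0.0, 1.0, 10_000)  # feature distribution at training time
current = rng.normal(1.0, 1.0, 10_000)    # production distribution, mean shifted

ks_stat, ks_pvalue = ks_2samp(reference, current)
drift_detected = ks_pvalue < 0.05 or psi(reference, current) > 0.25
print(f"KS p={ks_pvalue:.2e}, PSI={psi(reference, current):.2f}, drift={drift_detected}")
```

In practice this check runs per feature on a schedule, and `drift_detected` feeds the retraining trigger rather than a print statement.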
When ML Is Appropriate
ML is powerful but not always the right tool. Use this framework:
Use ML when:
- Problem has clear objective functions and sufficient labeled data
- Patterns are non-obvious (human heuristics insufficient)
- Scalability matters (thousands/millions of decisions per day)
- Environment changes over time (feedback loop justifies retraining)
- Trade-offs are acceptable (explainability vs accuracy, latency vs quality)
Avoid ML when:
- Data is scarce (<1k examples) and humans are reliable
- Rules are simple enough (hardcoded logic is maintainable and sufficient)
- Explainability is legally required and model is a black box
- Cost of wrong predictions is catastrophic (healthcare without oversight)
- Latency or infrastructure requirements prohibitive
Key Properties by Stage
| Dimension | Goal ID | Problem Frame | Data Processing | Model Dev | Deployment | Monitoring |
|---|---|---|---|---|---|---|
| Duration | 1–2 weeks | 2–4 weeks | 4–12 weeks | 4–16 weeks | 2–8 weeks | Ongoing |
| Effort (FTE) | 0.5 | 1 | 1–2 | 1–2 | 1–2 | 0.2–0.5 |
| Cost (if outsourced) | ~$5–10k | ~$10–20k | ~$20–100k | ~$30–150k | ~$10–50k | ~$5–20k/month |
| Primary Risk | Wrong problem | Bad labels | Data leakage | Overfitting | Silent failure | Undetected drift |
| Tools | Spreadsheet | Python/SQL | dbt, DVC, Spark | scikit-learn, PyTorch | Docker, K8s, TF Serving | Prometheus, Grafana |
Company Maturity Impact
The ML lifecycle differs dramatically by company stage:
Startups (0–50 engineers)
- Fast iteration, accept technical debt, focus on business impact
- One engineer wearing many hats (data collection → deployment)
- Simple tools (scikit-learn, hosted serving on Heroku/Lambda)
- Timeline: 4–8 weeks, minimal monitoring
- Example: Early-stage fraud detection with simple decision tree
Growth-Stage (50–500 engineers)
- Dedicated ML team, standardized infrastructure (Kubernetes, feature stores)
- Experiment tracking, A/B testing framework in place
- Multiple concurrent projects
- Timeline: 8–16 weeks, robust monitoring
- Example: Recommendation system with Airflow pipelines, Kafka for feedback
Enterprise (500+ engineers)
- ML platforms team building internal tools for other teams
- Strict governance, compliance, audit trails
- Complex models serving millions of requests/day
- Timeline: 12–24 weeks with regulatory review
- Example: Credit risk modeling with explainability, fairness testing, SLA enforcement
Feedback Loop: Why Monitoring Matters
The lifecycle isn’t linear; it’s a loop. Monitoring data feeds back into retraining:
Deployment → Production Model → Monitoring & A/B Tests → Business Metrics
Concept drift detected → Problem reframing → Retrain on new data → Redeploy
Example failure scenario:
- Model deployed with 95% accuracy (Stage 5)
- Three months pass; accuracy drops to 88% (Stage 6 detects via monitoring)
- Investigation reveals user behavior shifted (concept drift)
- Root cause: competitor added feature, changed user workflows
- Reframe problem (Stage 2), collect new labels (Stage 3), retrain (Stage 4)
- Redeploy with new model (Stage 5)
Companies that skip monitoring often discover problems through customer complaints, not dashboards.
Common Pitfalls & How to Avoid Them
| Pitfall | Stage | Impact | Prevention |
|---|---|---|---|
| Data before problem | 1→2 | 3–6 months wasted | Write problem statement FIRST |
| Wrong success metrics | 1 | Optimize for wrong objective | Link business→ML metrics with stakeholders |
| Data leakage | 3 | Inflated eval metrics, poor production performance | Use time-series split, never touch test set |
| Overfitting | 4 | Model fails on new data | Strict holdout test set, cross-validation |
| No baseline | 4 | Can’t measure improvement | Always train simple model first |
| Deploying untested model | 5 | Breaks production | Use canary + shadow mode first |
| No monitoring | 6 | Silent degradation | Real-time dashboards + automated alerts |
| Retraining without feedback | 6 | Model converges to wrong solution | Integrate user corrections into pipeline |
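The data-leakage row above has a second, subtler form worth demonstrating: fitting preprocessing on the full dataset leaks test-set statistics into training. A minimal sketch of the safe pattern with scikit-learn, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Leaky: the scaler sees test-set statistics before evaluation
# scaler = StandardScaler().fit(X)  # <- fit on ALL data: leakage
# X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Safe: preprocessing is fit only on training data, inside the pipeline,
# so cross-validation and final evaluation never see test statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
acc = pipeline.score(X_test, y_test)
print(f"test accuracy: {acc:.3f}")
```

The pipeline also travels as one artifact to serving, which closes the related pitfall of training-time and serving-time preprocessing drifting apart.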
Real-World Timeline Expectations
Netflix Recommendation System:
- Goal ID: 1 week (already motivated)
- Problem Framing: 3 weeks (massive feature space, cold-start problem)
- Data Processing: 8 weeks (terabytes of watch history, complex labeling)
- Model Dev: 12 weeks (Bayesian personalization, multi-armed bandit for exploration)
- Deployment: 4 weeks (canary on 1% traffic, A/B test infrastructure)
- Monitoring: 8+ engineers full-time (daily retraining, drift detection)
- Total: 6–8 months; ongoing investment
Stripe Fraud Detection:
- Goal ID: 1 week
- Problem Framing: 2 weeks (binary classification, real-time serving requirement)
- Data Processing: 4 weeks (billions of transactions, programmatic labeling)
- Model Dev: 6 weeks (gradient boosting + neural network ensemble, feature importance analysis)
- Deployment: 3 weeks (sub-10ms latency requirement, canary on 0.1%)
- Monitoring: Continuous (fraud patterns evolve daily, concept drift detected hourly)
- Total: 3–4 months; heavy ongoing monitoring and retraining
Implementation Example: Full Lifecycle Workflow
Below is a realistic end-to-end Python workflow sketch (simplified; production versions are 10x larger):
```python
import mlflow
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stage 1: Business Goal (documented in problem_statement.md)
# Goal: Predict if user will click recommended item (CTR prediction)
# Success metric: NDCG@10 >= 0.75 (offline), 5% CTR lift (online A/B test)

# Stage 2: Problem Framing
# X: user features (age, location, history), item features (genre, popularity)
# y: binary (clicked=1, not clicked=0)
# Baseline: 35% CTR (current heuristic: recommend popular items)

# Stage 3: Data Processing
df = pd.read_csv('s3://data-lake/user_item_interactions_2026_q1.csv')
X = df[['user_age', 'user_location', 'item_genre', 'item_popularity']]
y = df['clicked']

# Train/val/test split (stratified, fixed seed for reproducibility;
# prefer a temporal split if interactions are strongly time-dependent)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

# Stage 4: Model Development
with mlflow.start_run():
    params = {'max_depth': 8, 'n_estimators': 200, 'learning_rate': 0.05}
    mlflow.log_params(params)
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)

    y_pred_val = model.predict(X_val)
    y_pred_test = model.predict(X_test)
    test_acc = accuracy_score(y_test, y_pred_test)
    test_precision = precision_score(y_test, y_pred_test)
    test_recall = recall_score(y_test, y_pred_test)
    mlflow.log_metrics({
        'val_accuracy': accuracy_score(y_val, y_pred_val),
        'test_accuracy': test_acc,
        'test_precision': test_precision,
        'test_recall': test_recall,
    })
    mlflow.sklearn.log_model(model, 'model')
    print(f"Test Accuracy: {test_acc:.3f}, "
          f"Precision: {test_precision:.3f}, Recall: {test_recall:.3f}")
    # Expected: Test Acc ~73%, Precision ~62%, Recall ~55% (baseline: 35%)

# Stage 5: Model Deployment (infrastructure setup, canary testing)
# model_registry.register_model('ctr-predictor', version=1, stage='staging')
# k8s_deploy(model_uri='models:/ctr-predictor/staging', traffic_percent=1)
# wait_for_metrics(latency_p99 < 50ms, error_rate < 0.1%)
# gradual_rollout(1% → 10% → 100% traffic)

# Stage 6: Model Monitoring (automated daily)
# cron_job: 'SELECT * FROM user_item_interactions WHERE date >= yesterday'
# drift_detector.ks_test(current_feature_dist, reference_dist, threshold=0.05)
# if drift_detected: trigger_retraining()
```
How Real Companies Use This
Meta (Recommendation Systems at 3B Daily Active Users): Meta manages 3,000+ ML models in production across their platform, with the full 6-stage lifecycle supporting recommendation systems for Feed, Stories, and Reels. Their ML lifecycle spans 6–12 weeks from business goal identification to production deployment, involving cross-functional teams (product, data science, MLOps, infrastructure). Key challenge: concept drift happens weekly due to shifting user behavior and competitive pressure. Meta’s FBLearner Flow platform automates data collection, feature engineering, model training, evaluation, and canary deployment, enabling them to retrain recommendation models daily while running 50+ model lifecycles in parallel. Failure rate on deployment: <0.1% due to rigorous shadow mode testing (comparing current vs new model predictions on 10M users).
Google (Search Ranking at Millions of Queries/Day): Google’s search ranking system involves thousands of ranking features and 100+ models, with the ML Lifecycle structured around quarterly release cycles. Business goal is always “improve user satisfaction” (proxy: pairwise relevance judgments), translated into ML metrics (LambdaMART loss, CTR, dwell time). Problem framing takes 4 weeks to define which 5–10 ranking signals are most important. Data processing pipeline: queries from users → labels from human raters (10k queries quarterly) → offline evaluation using NDCG@10. Model development runs 1,000+ A/B experiments per year using Vizier (Bayesian hyperparameter optimization) on 10,000+ TPUs; single model training takes 4–6 hours. Deployment: strict canary protocol — new ranking model rolls out to 1% traffic first, monitored for 2 weeks, then scales to 100%. Timeline: idea → production = 3–6 months.
Uber (Real-Time Predictions Across Rideshare, Eats, Freight): Uber’s Michelangelo platform orchestrates the ML Lifecycle for 100+ production models handling real-time decisions: ETA prediction, surge pricing, driver matching, fraud detection. Business goals vary per model but share common traits: low latency (<50ms), high volume (1M+ predictions/second), concept drift is hourly. Data processing handles 5TB raw data daily from Kafka (ride events, GPS, payment) into feature store (Cassandra, 1TB cached features). Model development uses 100+ experiments per week with automated cross-validation. Deployment strategy: shadow mode for 2 weeks, canary on 1%, A/B test on 50%. Monitoring is continuous: drift detection triggers automatic retraining when Population Stability Index > 0.25 on any top-10 feature. Lifecycle per model: 8–12 weeks; retraining cadence: weekly (scheduled) + on-demand (drift-triggered).
Netflix (Personalization at 250M Subscribers): Netflix’s ML Lifecycle for recommendation systems aims to increase “watch hours” (business metric) via improved NDCG@10 (ML metric). Business goal identification is straightforward (already motivated by engagement), so problem framing dominates (3 weeks) due to cold-start complexity and need for both collaborative filtering and content-based features. Data processing uses 500TB+ of watch history data daily; labels are implicit (watch >2 minutes = relevant). Model development explores 50+ architectures over 6 months, comparing baselines (popularity, collaborative filtering) vs challengers (neural CF, graph neural networks). Chosen model: two-tower architecture with 8% precision@20 improvement. Deployment: blue/green with automatic rollback if watch hours drop >2%, shadow mode runs for 2 weeks before any new ranking model reaches 1% of users. Monitoring tracks concept drift (user preferences change seasonally) and triggers weekly retraining. Full lifecycle: 6–8 months from goal identification to steady-state production with continuous improvement.
Stripe (Fraud Detection with Adversarial Environment): Stripe’s ML Lifecycle for fraud detection is shaped by rapid concept drift (fraudsters adapt daily) and asymmetric costs (false positives damage UX; false negatives cost money). Business goal: minimize fraud loss while keeping false positive rate <0.1%. Problem framing: classification (fraud vs legitimate), with real-time serving constraint (<50ms). Data processing: billions of transactions, programmatic labels (chargebacks), ground truth arrives 30–90 days delayed. Model development baseline: gradient boosting (fraud detection is well-studied). Challenges: dataset imbalance (99.9% legitimate), adversarial adaptation. Solution: ensemble of gradient boosting + neural net, retraining weekly, with threshold calibration based on fraud team feedback. Deployment: canary on 0.1% (fraud is rare enough that 0.1% still covers ~10k transactions for evaluation). Monitoring is aggressive: false positive rate tracked hourly (SLA < 0.1%), automatic rollback if breached. Lifecycle: 3–4 months to initial production; continuous evolution thereafter (weekly retraining, bi-weekly threshold adjustments, monthly architecture experiments).
References
- Machine Learning Engineering for Production (Andrew Ng, MLOps.community) — Comprehensive course on full ML systems
- Rules of Machine Learning: Best Practices for ML Engineering (Google, Martin & Polyzotis) — 43 rules covering the full lifecycle
- Machine Learning Engineering (Andriy Burkov) — Practical guide emphasizing software engineering practices
- Chip Huyen: ML Systems Design (YouTube) — Systems thinking applied to ML
- Hidden Technical Debt in Machine Learning Systems (Google, 2015) — Why ML systems fail in production
- Designing Machine Learning Systems (Chip Huyen) — Recent comprehensive reference
- Google Cloud: AI/ML Best Practices — Enterprise lifecycle practices