ML Problem Framing
Translating business objectives into a precise ML formulation: define features (X), targets (y), a labeling strategy, cold-start handling, and a baseline before touching any data at scale.
The Golden Rule: Learnability
A problem is learnable if:
- Features predictive of target exist and are measurable
- Target is observable (we can get ground truth labels)
- Enough examples exist to find patterns (statistical power)
- Problem is not deterministic chaos (prediction possible in principle)
Corollary: If you can’t define X and y precisely, the problem isn’t ready for ML.
Feature Definition: X
Features are observed input variables available at prediction time.
Golden Rule of Features
If you cannot compute it when you need to make a prediction, it’s not a valid feature.
This eliminates many seemingly useful signals:
| Signal | Example | Available at Serving? | Status |
|---|---|---|---|
| Computed | User age (from DOB) | Yes (cache in DB) | Valid |
| Historical | User’s avg rating | Yes (precomputed cache) | Valid |
| Contextual | Current time, location | Yes (from request) | Valid |
| Future | Tomorrow’s weather | No (unknown) | Invalid |
| Label proxy | User eventually clicked | No (haven’t shown yet) | Invalid (leakage) |
| Delayed | User’s private feedback form | No (unknown at decision time) | Invalid |
| Real-time expensive | Complex computed feature | Maybe (but slow; cache?) | Expensive trade-off |
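The availability rule above can be mechanized. A minimal sketch, assuming each candidate feature carries a timestamp recording when its value was computed; the helper name and data layout are hypothetical:

```python
from datetime import datetime, timedelta

def validate_feature_timestamps(features, prediction_time):
    """Return names of features whose values were computed AFTER the
    prediction timestamp; such signals cannot exist at serving time."""
    return [name for name, (value, computed_at) in features.items()
            if computed_at > prediction_time]

now = datetime(2025, 1, 15, 12, 0)
features = {
    "user_avg_rating":   (4.2, now - timedelta(hours=6)),  # precomputed cache: valid
    "request_hour":      (12,  now),                       # from the request: valid
    "tomorrows_weather": (1.0, now + timedelta(days=1)),   # future data: invalid
}
invalid = validate_feature_timestamps(features, now)
print(invalid)  # ['tomorrows_weather']
```

A check like this belongs in the training pipeline, not just code review: it catches future-looking features before they inflate offline metrics.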
Feature Categories
User Features:
- Static: age, account creation date, country, subscription tier
- Behavioral: total orders, avg order value, favorite category, churn risk score
- Temporal: orders in last 7/30/90 days, days since last order, session count
- Network: friend count, follower influence, group membership
Item/Content Features:
- Static: title, genre, duration, release date, language, source
- Engagement: views, rating (avg + count), comments, shares
- Temporal: trending score, seasonality factor, freshness (hours since published)
- Metadata: category, tags, embeddings (pretrained vectors)
Contextual Features:
- Request context: time of day, day of week, user location
- Device: mobile vs desktop, OS, screen size, connection type
- Traffic source: organic search, paid ad, direct, referral
Interaction Features (use with caution for training):
- User x Item: past interaction history (viewed, liked, purchased)
- These may leak information about the label; careful with train/test split
Temporal Considerations
Lookback Window: How far back in history to use?
```text
Event Timeline:  T-90d ---- T-30d ---- T-7d ---- T-1d ---- T (predict)
Feature window:  |------------- historical -----------------| predict
Lookback 30d:               [==========================]
Lookback 7d:                           [===============]
Lookback 1d:                                      [=====]
```
Decision: Longer lookback = more stable signal but misses recent trends. Shorter = responsive but noisier. Typical ranges: 7d (very recent), 30d (recent trends), 90d (seasonal patterns).
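A point-in-time lookback feature needs a strict `ts < as_of` cutoff so the event being predicted never counts itself. A naive pandas sketch (a feature store would precompute this in production; the table and column names are illustrative):

```python
import pandas as pd

# Toy event log; names are illustrative.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "ts": pd.to_datetime(["2025-01-01", "2025-01-20", "2025-02-15", "2025-02-10"]),
})

def orders_in_window(df, user_id, as_of, days):
    """Count a user's events in [as_of - days, as_of): strictly BEFORE the
    prediction time, so the feature is computable at serving time."""
    lo = as_of - pd.Timedelta(days=days)
    mask = (df.user_id == user_id) & (df.ts >= lo) & (df.ts < as_of)
    return int(mask.sum())

t = pd.Timestamp("2025-02-15")
print(orders_in_window(events, 1, t, 7))    # 0
print(orders_in_window(events, 1, t, 30))   # 1 (Jan 20 only)
print(orders_in_window(events, 1, t, 90))   # 2 (Jan 1 and Jan 20)
```

Computing the same feature at three lookbacks makes the stability-vs-responsiveness trade-off concrete: the 7d value is 0 while the 90d value is 2.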
Feature Staleness: How often to recompute?
- High-frequency (hourly): engagement metrics, trending signals
- Daily: user statistics, cumulative metrics, recent activity
- Weekly: longer-term behavior trends, seasonal patterns
- Static: user attributes, item metadata (rarely changes)
Asynchronous Updates: What if feature computation is slow?
- Option 1: Pre-compute and cache in feature store (Redis, Cassandra)
- Option 2: Accept staleness trade-off (cache for 1 hour, refresh hourly)
- Option 3: Use batch features (daily or weekly updates) for non-critical signals
Feature Engineering During Problem Framing
DO (necessary transformations):
- Normalize numeric features (e.g., order count: 0–10,000 → 0–1 range)
- Encode categorical features (one-hot, ordinal, target encoding)
- Handle missing values explicitly (imputation strategy, indicator column)
- Create domain-specific features (e.g., day-of-week from timestamp)
DON’T (premature optimization):
- Complex feature interactions (XGBoost/neural nets do this automatically)
- Over-engineered transformations (keep it simple, reproducible)
- Feature selection (train models first, then analyze importance)
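The "DO" transformations fit naturally into one scikit-learn preprocessing pipeline. A sketch assuming one numeric and one categorical column (the column names and data are invented):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "order_count": [3, 250, np.nan, 40],
    "country": ["US", "DE", "US", np.nan],
})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),  # explicit missing flag
    ("scale", MinMaxScaler()),                                          # map to 0-1 range
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

pre = ColumnTransformer([
    ("num", numeric, ["order_count"]),
    ("cat", categorical, ["country"]),
])
X = pre.fit_transform(df)   # 4 rows x 4 cols: scaled value, missing flag, 2 one-hot cols
```

Fitting the whole thing as one object also helps with the leakage rules later in this document: `fit` on train, `transform` on test.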
Example: Click prediction in ads:
```text
Raw Features:        Framed Features:
user_id          →   user_avg_ctr_30d    (avg CTR on historical ads)
                 →   user_ads_seen_7d    (engagement frequency)
ad_id            →   ad_ctr_7d           (ad's historical CTR)
                 →   ad_freshness_hours  (hours since created)
context_time     →   hour_of_day         (0-23)
                 →   day_of_week         (0-6)
                 →   is_weekend          (binary: Sat/Sun)
context_device   →   device_type         (mobile=1, desktop=0)
                 →   is_mobile_app       (vs web)
```
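The context_time derivations in this example are one-liners in pandas (the timestamps below are chosen arbitrarily):

```python
import pandas as pd

# Derive the framed context features from a raw request timestamp.
requests = pd.DataFrame({
    "context_time": pd.to_datetime(["2025-01-04 09:30",   # a Saturday
                                    "2025-01-06 22:10"]), # a Monday
})
requests["hour_of_day"] = requests["context_time"].dt.hour          # 0-23
requests["day_of_week"] = requests["context_time"].dt.dayofweek     # 0=Mon .. 6=Sun
requests["is_weekend"] = (requests["day_of_week"] >= 5).astype(int) # Sat/Sun
```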
Target Definition: y
The target is the observed outcome we’re trying to predict.
Target Requirements
Observable: Can we determine the ground truth?
- Click: user clicks or doesn’t → observable immediately
- Purchase: user buys or not → observable within hours
- Satisfaction: user’s internal preference → need proxy (rating, review)
- Counterfactual: what if we’d shown different item? → impossible to observe
Timely: Can we get labels fast enough to retrain?
- Click: 1 second (online, real-time)
- Purchase: 1 hour (batch process logs hourly)
- User satisfaction: 1 week (survey delayed)
- Career satisfaction: 10 years (feedback loop too slow)
Unbiased (in data): Are labels created independently of the model’s predictions?
- Historical data: users saw items independently of our system
- Post-deployment data: only see labels for items we showed (selection bias)
Label Collection Strategies
| Strategy | Cost | Quality | Speed | Best For | Risk |
|---|---|---|---|---|---|
| Programmatic | Low | Depends on rules | Instant | Clear signal (click, purchase) | Rule brittleness; changes may invalidate old labels |
| Crowdsourcing | Medium | Good (80–90%) | 1–2 weeks | Relevance judgments, content classification | Inter-annotator disagreement |
| Expert | High | Excellent (95%+) | Slow (weeks) | High-stakes (medical, legal, financial) | Cost limits volume; bottleneck |
| User feedback | Free | Variable | Delayed | Corrections, refinements | Noisy; biased toward shown items |
| Weak labeling | Low | Noisy (60–80%) | Instant | Multiple weak signals combined | Label noise accumulates |
Common Label Collection Approaches
Programmatic Labels (Click Prediction):
```text
Observation: User sees recommended item
             → User clicks or not (binary outcome)
Label:       y = 1 if user clicked within 2 seconds, else 0
Challenge:   Only items we showed get labels (selection bias)
             → Retraining data is biased toward what the model showed
Solution:    Inverse propensity weighting (IPW): weight samples by showing probability
             → Or counterfactual / off-policy learning methods
```
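IPW reduces to passing `1/propensity` as sample weights when the serving system logs each example's probability of being shown. A sketch on synthetic data; the features, labels, and propensities here are simulated, not from any real logging system:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)   # clicks driven by feature 0
propensity = rng.uniform(0.1, 0.9, size=n)           # logged P(item was shown)

# Up-weight examples the old policy rarely showed, so the reweighted training
# distribution approximates uniform-random exposure.
weights = 1.0 / propensity
model = LogisticRegression().fit(X, y, sample_weight=weights)
```

Propensities must be clipped away from 0 in practice, or a few rare impressions dominate the loss.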
Crowdsourced Labels (Relevance Judgment):
```text
Task:      10 annotators rate relevance of each (query, item) pair
Scale:     0 (irrelevant) to 4 (perfect match)
Label:     y = majority vote if >=7 annotators agree, else discard
Quality:   Only ~65% inter-annotator agreement on subjective tasks
           → Filter to high-agreement examples (Fleiss' kappa > 0.75)
Challenge: Cost ~$0.50-$2 per judgment
           → For 1M examples: $500k-$2M budget
Solution:  Active learning: annotate uncertain examples; skip easy ones
```
Expert Labels (Medical Diagnosis):
```text
Data:      Patient imaging (MRI, CT scan)
Task:      3 board-certified radiologists independently diagnose
Label:     y = 1 if >=2 agree on disease, else 0 (gold standard)
Challenge: Very expensive (~$100-$500 per case)
           Limited volume (100-1000 cases typical)
Solution:  Combine with weak labels for scale (automated detection + expert review)
           Semi-supervised learning: train on weak labels + few expert labels
```
User Feedback (Recommendation Corrections):
```text
Post-deployment: Model shows 5 recommendations
User action:     "Hide item X, show Y instead"
Label:           y = 1 for corrected items; 0 for shown but uncorrected
Challenge:       Only get feedback for shown items (selection bias)
                 Feedback is sparse (most users don't rate)
Solution:        Use propensity weighting; treat as a weak signal only
                 Flag for careful retraining (distribution shift)
```
Label Quality Metrics
Inter-annotator Agreement:
- Cohen’s Kappa (binary): 0.0 (random), 0.4–0.6 (moderate), 0.8+ (excellent)
- Fleiss’ Kappa (multiple annotators): similar scale
- Action: Accept only examples with agreement >= 0.75 (high confidence)
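Cohen's kappa for two annotators is available directly in scikit-learn; the toy labels below are invented to show the scale:

```python
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Observed agreement is 0.8; chance agreement is 0.52, so
# kappa = (0.8 - 0.52) / (1 - 0.52) = 0.583: moderate agreement,
# below the 0.75 acceptance bar suggested above.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(round(kappa, 2))  # 0.58
```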
Label Noise Tolerance:
- Models robust to 5–10% label noise
- Performance degrades significantly at 20%+ noise
- Measure: If 10% of labels are incorrect, what happens to model accuracy?
Temporal Label Leakage:
- Can we compute label only AFTER making prediction? (yes → valid)
- Do we have information about future? (yes → leakage, invalid)
- Check: Plot label distribution over time; sudden shifts = possible leakage
Baseline Selection
Golden Rule: Always train a simple baseline model first.
If your fancy algorithm doesn’t beat the baseline, investigate:
- Is the problem harder than expected?
- Is there data quality issue?
- Are features actually predictive?
- Did you introduce leakage?
Simple Baseline Strategies
Classification (e.g., click prediction):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Baseline 1: always predict the majority class (no click)
y_pred = np.zeros_like(y_test)            # CTR ~3%, so the majority class is 0
accuracy = (y_pred == y_test).mean()      # ~97%; misleading on imbalanced data
auc = roc_auc_score(y_test, np.full(len(y_test), 0.03))  # 0.5 (constant score = random)

# Baseline 2: per-user empirical click rate (weak personalization)
# Assumes y_train/y_test are pandas Series and *_user_ids align with them.
user_rates = y_train.groupby(train_user_ids).mean()      # user_id -> historical CTR
y_pred = test_user_ids.map(user_rates).fillna(y_train.mean())
auc = roc_auc_score(y_test, y_pred)       # ~0.60-0.65 typical

# Baseline 3: simple logistic regression on the framed features
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = lr.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred)       # ~0.70 typical for a linear model
```
Regression (e.g., price prediction):
```python
import numpy as np

# Baseline 1: predict the global mean
y_pred = np.full(len(y_test), y_train.mean())
rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))

# Baseline 2: per-category mean (stratified)
# Assumes y_train/y_test are pandas Series and *_categories align with them.
category_means = y_train.groupby(train_categories).mean()  # category -> mean target
y_pred = test_categories.map(category_means).fillna(y_train.mean())
rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))
```
Ranking (e.g., recommendation):
```python
import numpy as np
from sklearn.metrics import ndcg_score

# Baseline: rank items by global popularity (exposure as a crude quality proxy)
popularity = df.groupby('item_id')['show_count'].sum()
ranking = popularity.sort_values(ascending=False).index

# Serve the same top-K popular items to every user; score with NDCG@10
ndcgs = []
for user_id in test_set['user_id'].unique():
    user_items = set(test_set.loc[test_set.user_id == user_id, 'item_id'])
    predicted_ranking = ranking[:10]
    relevance = [1 if item in user_items else 0 for item in predicted_ranking]
    # y_true = true relevance of the served list, y_score = baseline's ranking scores
    ndcgs.append(ndcg_score([relevance], [list(range(10, 0, -1))]))
mean_ndcg_at_10 = np.mean(ndcgs)
```
Baseline Performance Documentation
Create a results table:
| Model | Accuracy | AUC | Precision | Recall | Latency | Training Time | Notes |
|---|---|---|---|---|---|---|---|
| Majority class | 96.5% | 0.500 | — | — | <1ms | — | Binary: always predict 0 |
| Per-user avg | 97.2% | 0.620 | 0.38 | 0.25 | 5ms | 2 min | Simple personalization |
| Logistic Regression | 97.8% | 0.680 | 0.52 | 0.38 | 2ms | 30 min | Linear; interpretable |
| Decision Tree | 98.1% | 0.695 | 0.54 | 0.42 | 8ms | 45 min | Non-linear; starts overfitting |
| XGBoost | 98.5% | 0.735 | 0.58 | 0.48 | 25ms | 2 hours | Gradient boosting; more complex |
| Neural Network | 98.6% | 0.745 | 0.59 | 0.50 | 80ms | 4 hours | Highest accuracy; latency cost |
Decision Framework:
- Is the improvement real and meaningful? (verify with a significance test; as a rough heuristic, >1% absolute metric gain is noticeable, >5% is major)
- Is latency trade-off acceptable? (50ms acceptable? 200ms too slow for real-time?)
- Is model complexity justified? (XGBoost vs Neural Net: both 0.74 AUC, but XGBoost 3x faster)
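Whether an AUC gain is real can be checked with a paired bootstrap over the test set instead of an eyeballed threshold. A sketch on synthetic scores; the two "models" here are simulated, with B given genuinely more signal than A:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 5000
y = rng.integers(0, 2, size=n)
score_a = y * 0.3 + rng.normal(size=n)   # weaker model
score_b = y * 0.5 + rng.normal(size=n)   # stronger model

# Paired bootstrap: resample the SAME test rows and compare AUCs per replicate.
deltas = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    if y[idx].min() == y[idx].max():
        continue  # AUC needs both classes present
    deltas.append(roc_auc_score(y[idx], score_b[idx])
                  - roc_auc_score(y[idx], score_a[idx]))
deltas = np.array(deltas)
ci = np.percentile(deltas, [2.5, 97.5])
print(ci)  # if the interval excludes 0, the AUC gain is likely real
```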
The Cold-Start Problem
Cold-start: new items, users, or contexts with no historical data. A frequent cause of recommendation failures in production.
Cold-Start Categories
| Scenario | Problem | Data Available | Solution |
|---|---|---|---|
| New user | No history for personalization | Item features, global patterns | Recommend popular items; ask preferences |
| New item | No interaction history | Item metadata (title, genre, tags) | Content-based recommendations |
| New market | Different user behavior/language | Similar markets’ data | Transfer learning; hybrid approach |
| New model version | Previous data doesn’t reflect new logic | Old model predictions only | Exploration; accept lower initial performance |
Cold-Start Handling Strategies
Content-Based: Use item/user features directly (no history needed)
```text
New movie:  "Action sci-fi from 2025"
Algorithm:  Find users who've rated similar movies highly
            (match genre, director, release era)
Result:     Recommend to similar-taste users
Pros:       Works immediately for new items
Cons:       Needs good metadata; misses unpopular but good items
```
Popularity-Based: Recommend global top items (safe fallback)
```text
New user:    Show all-time top-10 items (e.g., top movies)
New country: Show top items from similar countries
Pros:        Always works; gives users best-of-category
Cons:        Not personalized; boring after a while
```
Hybrid (Best Practice):
```text
Ranking score = 0.4 * popularity + 0.3 * content_similarity + 0.3 * collaborative_signal
  - Popularity: safe default for unknowns
  - Content similarity: leverage metadata
  - Collaborative: use if any history is available
Result: More personalized than popularity alone, more robust than pure collaborative
```
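The blend is a weighted sum; when no history exists the collaborative term is zero, and it helps to redistribute its weight to the signals that remain. A sketch (the weights are the document's illustrative values, not tuned, and the fallback split is an assumption):

```python
def hybrid_score(popularity, content_sim, collab, has_history):
    """Blend ranking signals; fall back to popularity + content-only
    weights when the user/item has no interaction history."""
    if not has_history:
        # Redistribute the 0.3 collaborative weight across the other signals.
        return 0.55 * popularity + 0.45 * content_sim
    return 0.4 * popularity + 0.3 * content_sim + 0.3 * collab

print(hybrid_score(0.8, 0.6, 0.0, has_history=False))  # cold-start: ~0.71
print(hybrid_score(0.8, 0.6, 0.9, has_history=True))   # warm: ~0.77
```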
Active Learning: Actively gather feedback
```text
New user:  Show diverse items, ask "which do you like?"
Collect:   Top-5 items they rate highly
Bootstrap: Use rated items to find similar recommendations
Pros:      Quickly personalizes with few interactions (10-20)
Cons:      Requires user effort; abandonment risk
```
Proxy Metrics vs True Metrics
Often, the true metric is expensive/slow to measure. Use a proxy that correlates:
Example 1: Recommendation System
| Metric | Measurement | Cost | Latency | Use When? |
|---|---|---|---|---|
| Watch hours | Total minutes user watches recommendations | $$$ | Weeks (accumulate) | True metric; validate A/B test |
| CTR | Clicks on recommendations | $ | Hours | Proxy; optimize during development |
| NDCG@10 | Expert relevance judgments | $$ | Hours (offline) | Proxy; fast development feedback |
| Impression count | Recommendations shown (+ engagement binary) | $ | Real-time | Monitoring; leading indicator |
Best practice: Optimize for NDCG in offline eval (fast iteration) → validate against watch hours in A/B test (ground truth).
Example 2: Healthcare Diagnosis
| Metric | Measurement | Cost | Latency | Use When? |
|---|---|---|---|---|
| Patient outcome | Health after treatment (recovery, mortality) | $$$ | Months–years | True metric; validate annually |
| Expert agreement | Board-certified radiologist diagnosis | $$$ | Days | Gold standard for training |
| ROC-AUC | Diagnostic accuracy on labeled test set | $ | Hours | Development metric; fast iteration |
| Sensitivity/Specificity | True positive/negative rates | $ | Hours | Monitoring; public reporting |
Best practice: Train on expert labels (available, fast) → validate against true outcomes (periodic, expensive).
Data Leakage: The Silent Killer
Data leakage = information that would not be available at prediction time leaks into the training process, inflating offline metrics while production performance collapses.
Leakage Types and Examples
Temporal Leakage (Most Common):
```text
Problem: Training uses features computed from the future relative to the
         prediction point (effectively training on 2025 to "predict" 2024)
Result:  Metrics look great offline (95% accuracy)
         In production, accuracy collapses (~50%): the features don't exist yet
Example: Churn prediction
         Feature: "has_opened_support_ticket_in_next_7_days"
         Problem: We don't know this when predicting churn
         Fix:     Use only support tickets BEFORE the prediction point
```
Label Leakage (Feature uses target information):
```text
Problem: A feature directly encodes the target
         The model learns to "cheat"
Result:  Perfect training accuracy; fails completely in production
Example: Fraud detection
         Feature: "transaction_marked_as_fraud_in_system"
         Problem: We're trying to predict fraud; this IS the label
         Fix:     Remove any feature that requires knowing the label
```
Sample Leakage (Train/test contamination):
```text
Problem: The same user/transaction appears in BOTH train and test
         The model memorizes specific entities
Result:  Good test accuracy; poor on new users
Example: Click prediction
         User 12345 has 1000 clicks in the training set
         User 12345's clicks also appear in the test set
         Fix: Split by user_id, not randomly
              → ensures no user is in both train and test
```
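scikit-learn's GroupShuffleSplit implements the entity-level split directly (the arrays below are toy data):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

user_ids = np.array([101, 101, 101, 202, 202, 303, 303, 404])
X = np.arange(len(user_ids)).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 0, 1, 1])

# Split by entity (user), not by row: no user appears in both train and test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=user_ids))
assert set(user_ids[train_idx]).isdisjoint(user_ids[test_idx])
```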
Leakage Detection Checklist
- Using future data to predict past? (temporal split)
- Using test set to compute feature statistics? (compute on train only)
- Features that require knowing the target? (would be unavailable at serving)
- Same entity in train and test? (split by entity_id, not row_id)
- Information from evaluation period in training features? (time-series split)
Leakage Prevention
Best practice: Split data BEFORE feature engineering
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 1: split FIRST, before computing any feature statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: fit the preprocessor on the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # transform test with TRAIN statistics

# Don't do this: scaler.fit(np.vstack([X_train, X_test])) -> leakage!
```
For time-series: Use temporal split
```python
# Wrong: a random split mixes past and future
X_train, X_test = train_test_split(X, test_size=0.2)

# Right: sort by time, then split at a cutoff point
split_point = int(0.8 * len(X))            # X assumed sorted by timestamp
X_train, X_test = X[:split_point], X[split_point:]
# e.g., train on 2020-2024, test on 2025 only
```
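For cross-validation on time-ordered data, scikit-learn's TimeSeriesSplit generalizes the single cutoff into walk-forward folds, each training strictly on the past:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)   # rows assumed already sorted by time

# Walk-forward validation: every fold trains on the past and
# evaluates on the block of data immediately after it.
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    assert train_idx.max() < test_idx.min()   # never train on the future
    print(train_idx, "->", test_idx)
```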
Implementation Example: Problem Framing Document
```text
PROBLEM FRAMING: Click Prediction for Ad Recommendations
DATE: YYYY-MM-DD
OWNER: Data Science Lead

PROBLEM TYPE
============
Supervised classification (binary)
Definition: Predict probability user clicks recommended ad

FEATURE DEFINITION (X)
======================
Feature Set:
User Features:
  - user_avg_ctr_30d = mean(click=1) for user's ads shown past 30d
  - user_ads_seen_7d = count of ads shown to user in last 7d
Ad Features:
  - ad_ctr_7d        = mean(click=1) for this ad shown past 7d
  - ad_freshness     = hours since ad created
Context:
  - hour_of_day      = 0-23 (temporal)
  - day_of_week      = 0-6 (Mon-Sun)
  - is_mobile        = 1 if mobile device, 0 if desktop

Feature Availability Check:
  [x] All features available at serving time (<50ms latency)
  [x] No future data used
  [x] No label information in features
  [x] Feature values stable (not dependent on random seed)

TARGET DEFINITION (y)
=====================
Definition: Binary outcome
  y = 1 if user clicks recommended ad within 2 seconds
  y = 0 otherwise
Label Timing: Observable within 1 second of showing ad
Label Quality:
  Source: Production event logs (user click events)
  Accuracy: 100% (deterministic; user clicked or not)
  Volume: ~50M labeled examples/day available
Label Bias Risk:
  Selection bias: only see labels for ads we showed
  → Post-deployment data biased toward model's recommendations
  Mitigation: Use inverse propensity weighting in retraining
              Also collect A/B test data with random exposure

BASELINE MODEL
==============
Strategy: Per-ad historical CTR
  For each ad_id, predict y = ad_ctr (empirical click rate)
Expected performance:
  AUC: ~0.62 (weak signal; only uses item popularity)
  Precision: ~0.032 (3.2% CTR baseline)
Success criterion:
  New model must beat AUC 0.62 to be worth deploying

PROXY METRICS
=============
True metric: Ad revenue generated ($$); too slow to optimize directly
Proxy 1: CTR (clicks/impressions); fast, correlates with revenue
Proxy 2: eCTR = predicted click rate; the optimized metric
Proxy 3: Engagement score = CTR + hover_rate; broader signal
Validation:
  Offline: optimize eCTR in ML model
  Online (A/B test): measure true revenue lift

COLD-START STRATEGY
===================
New ads:   Recommend based on global popularity (top 100 ads overall)
           → After 1000 impressions, use estimated CTR
New users: Show high-popularity ads (safe default)
           → After 10 interactions, use collaborative signals

DATA SOURCES
============
Source 1: Event logs (BigQuery)
  - User exposure: which ads shown to which users
  - User clicks: clicks on shown ads
  - Update frequency: streaming (real-time, but batch consumed)
Source 2: Ad metadata (PostgreSQL)
  - Ad text, image URL, landing page
  - Update frequency: daily
Source 3: User profile (Hive)
  - User age, location, account age
  - Update frequency: static
Data Retention: 24-month lookback available (GDPR-compliant)
Volume: ~1B events/day; training set ~100B examples

TEMPORAL SPLIT
==============
Training data:   2024-01-01 to 2024-11-30 (11 months)
Validation data: 2024-12-01 to 2024-12-20 (20 days)
Test data:       2024-12-21 to 2024-12-31 (10 days)
Reason: Time-series split prevents leakage (train on past, test on future)

LEAKAGE CHECK
=============
[x] Features use only historical data (before prediction time)
[x] No test set information used for feature engineering
[x] No users/ads duplicated between train/test
[x] Time split enforced (no past-future mixing)
[x] Target computable only after showing ad (no cheating)
```
How Real Companies Use This
Google Translate (Pivoting from Sequence-to-Sequence to Quality Ranking): Google’s neural machine translation project initially framed the problem as end-to-end sequence-to-sequence: predict the target-language sentence given the source-language input. However, the team discovered that optimizing for BLEU score (the standard NMT metric) didn’t correlate with user satisfaction in production. Reframed as: rank candidate translations by cross-lingual similarity scores (using embeddings of the source and backtranslated target). This required redefining features (monolingual context becomes less important; semantic equivalence becomes critical) and the target (pairwise ranking instead of likelihood). Problem framing alone took 4 weeks of experimentation to validate the new formulation. Result: the new framing improved user-perceived translation quality by 15% without changing the underlying model architecture.
Amazon “Customers Also Bought” (Item-to-Item vs User-Based Collaborative Filtering): Amazon’s recommendation engine initially framed the problem as: predict which items user will buy next (user-based collaborative filtering). Data showed this performed poorly on new product categories (cold-start). Reframed to: predict items frequently bought together in same session (item-to-item CF). Key insight: user relationships are volatile (preferences change); item relationships are stable (complementary products don’t change). Features shifted from user history to item metadata and co-purchase statistics. Cold-start strategy: new items immediately get recommendations if similar (by metadata) items exist. This simple reframing gave 25% improvement in recommendation coverage and 5% lift in conversion rate.
DoorDash (Regression to Ordinal Classification for Delivery Time): DoorDash’s initial problem framing for delivery time prediction was continuous regression: predict exact minutes. However, customers care about ordinal promises (“guaranteed <30 min or free”). Reframed to: predict ordinal class (on-time / 5-min late / 10-min late / 15+ min late). This changed features (now focus on order complexity, traffic, driver history rather than precise geolocation) and target (multi-class labels instead of continuous values). Cold-start: new restaurants get conservative time estimates using restaurant type and neighborhood aggregates. Result: 8% improvement in on-time delivery rate, driven not by better model accuracy but by better problem formulation aligned with business reality.
Twitter/X (Content Ranking with Virality Signals): Twitter’s problem framing for feed ranking evolved over years. Initially: predict engagement (likes, retweets). Revealed hidden concept drift: virality signals (retweets from influential accounts) matter more than raw volume. Reframed to: rank by “engagement probability weighted by influencer reach” (combining CTR + virality into composite score). Feature engineering became critical: follower count of engager, tweet recency (tweets decay in interest), conversation depth (threaded replies). Proxy metric innovation: use 1-hour engagement as proxy for 24-hour engagement (can’t wait weeks for labels). Cold-start: use tweet age, author follower count, content type (media, link, quote) as default signals. Concept drift is managed by weekly retraining on fresh data.
Pinterest (Image Embeddings for Cold-Start Pins): Pinterest’s “visual search” problem framing required a pivot: instead of collaborative filtering (which fails for new pins with no saves), use image embeddings (trained via contrastive learning on pin saves) as features. Reframed from “predict user x pin interaction” to “rank pins by semantic similarity to user’s saved pins + visual attractiveness.” Labeling strategy: implicit (saved = relevant), but with propensity weighting (people save more from items they see). Cold-start pins use image content alone (CNN features) until enough engagement data accumulates. Baseline: random ranking from trending category. New framing gave 40% improvement in save rate for new pins, directly impacting user retention (fresh content discoverability).