ML Problem Framing
Translating business objectives into a precise ML formulation: define features (X), targets (y), a labeling strategy, cold-start handling, and a baseline before touching any data at scale.
The Golden Rule: Learnability
A problem is learnable if:
- Features predictive of target exist and are measurable
- Target is observable (we can get ground truth labels)
- Enough examples exist to find patterns (statistical power)
- Problem is not deterministic chaos (prediction possible in principle)
Corollary: If you can’t define X and y precisely, the problem isn’t ready for ML.
Feature Definition: X
Features are observed input variables available at prediction time.
Golden Rule of Features
If you cannot compute it when you need to make a prediction, it’s not a valid feature.
This eliminates many seemingly useful signals:
| Signal | Example | Available at Serving? | Status |
|---|---|---|---|
| Computed | User age (from DOB) | Yes (cache in DB) | Valid |
| Historical | User’s avg rating | Yes (precomputed cache) | Valid |
| Contextual | Current time, location | Yes (from request) | Valid |
| Future | Tomorrow’s weather | No (unknown) | Invalid |
| Label proxy | User eventually clicked | No (haven’t shown yet) | Invalid (leakage) |
| Delayed | User’s private feedback form | No (unknown at decision time) | Invalid |
| Real-time expensive | Complex computed feature | Maybe (but slow; cache?) | Expensive trade-off |
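The availability rule above can be mechanized. A minimal sketch, assuming each candidate feature carries a timestamp recording when its value was computed; the helper name and data layout are hypothetical:

```python
from datetime import datetime, timedelta

def validate_feature_timestamps(features, prediction_time):
    """Return names of features whose values were computed AFTER the
    prediction timestamp; such signals cannot exist at serving time."""
    return [name for name, (value, computed_at) in features.items()
            if computed_at > prediction_time]

now = datetime(2025, 1, 15, 12, 0)
features = {
    "user_avg_rating":   (4.2, now - timedelta(hours=6)),  # precomputed cache: valid
    "request_hour":      (12,  now),                       # from the request: valid
    "tomorrows_weather": (1.0, now + timedelta(days=1)),   # future data: invalid
}
invalid = validate_feature_timestamps(features, now)
print(invalid)  # ['tomorrows_weather']
```

A check like this belongs in the training pipeline, not just code review: it catches future-looking features before they inflate offline metrics.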
Feature Categories
User Features:
- Static: age, account creation date, country, subscription tier
- Behavioral: total orders, avg order value, favorite category, churn risk score
- Temporal: orders in last 7/30/90 days, days since last order, session count
- Network: friend count, follower influence, group membership
Item/Content Features:
- Static: title, genre, duration, release date, language, source
- Engagement: views, rating (avg + count), comments, shares
- Temporal: trending score, seasonality factor, freshness (hours since published)
- Metadata: category, tags, embeddings (pretrained vectors)
Contextual Features:
- Request context: time of day, day of week, user location
- Device: mobile vs desktop, OS, screen size, connection type
- Traffic source: organic search, paid ad, direct, referral
Interaction Features (use with caution for training):
- User x Item: past interaction history (viewed, liked, purchased)
- These may leak information about the label; careful with train/test split
Temporal Considerations
Lookback Window: How far back in history to use?
```text
Event Timeline:  T-90d ---- T-30d ---- T-7d ---- T-1d ---- T (predict)
Feature window:  |------------- historical -----------------| predict
Lookback 30d:               [==========================]
Lookback 7d:                           [===============]
Lookback 1d:                                      [=====]
```
Decision: Longer lookback = more stable signal but misses recent trends. Shorter = responsive but noisier. Typical ranges: 7d (very recent), 30d (recent trends), 90d (seasonal patterns).
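A point-in-time lookback feature needs a strict `ts < as_of` cutoff so the event being predicted never counts itself. A naive pandas sketch (a feature store would precompute this in production; the table and column names are illustrative):

```python
import pandas as pd

# Toy event log; names are illustrative.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "ts": pd.to_datetime(["2025-01-01", "2025-01-20", "2025-02-15", "2025-02-10"]),
})

def orders_in_window(df, user_id, as_of, days):
    """Count a user's events in [as_of - days, as_of): strictly BEFORE the
    prediction time, so the feature is computable at serving time."""
    lo = as_of - pd.Timedelta(days=days)
    mask = (df.user_id == user_id) & (df.ts >= lo) & (df.ts < as_of)
    return int(mask.sum())

t = pd.Timestamp("2025-02-15")
print(orders_in_window(events, 1, t, 7))    # 0
print(orders_in_window(events, 1, t, 30))   # 1 (Jan 20 only)
print(orders_in_window(events, 1, t, 90))   # 2 (Jan 1 and Jan 20)
```

Computing the same feature at three lookbacks makes the stability-vs-responsiveness trade-off concrete: the 7d value is 0 while the 90d value is 2.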
Feature Staleness: How often to recompute?
- High-frequency (hourly): engagement metrics, trending signals
- Daily: user statistics, cumulative metrics, recent activity
- Weekly: longer-term behavior trends, seasonal patterns
- Static: user attributes, item metadata (rarely changes)
Asynchronous Updates: What if feature computation is slow?
- Option 1: Pre-compute and cache in feature store (Redis, Cassandra)
- Option 2: Accept staleness trade-off (cache for 1 hour, refresh hourly)
- Option 3: Use batch features (daily or weekly updates) for non-critical signals
Feature Engineering During Problem Framing
DO (necessary transformations):
- Normalize numeric features (e.g., order count: 0–10,000 → 0–1 range)
- Encode categorical features (one-hot, ordinal, target encoding)
- Handle missing values explicitly (imputation strategy, indicator column)
- Create domain-specific features (e.g., day-of-week from timestamp)
DON’T (premature optimization):
- Complex feature interactions (XGBoost/neural nets do this automatically)
- Over-engineered transformations (keep it simple, reproducible)
- Feature selection (train models first, then analyze importance)
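The "DO" transformations fit naturally into one scikit-learn preprocessing pipeline. A sketch assuming one numeric and one categorical column (the column names and data are invented):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "order_count": [3, 250, np.nan, 40],
    "country": ["US", "DE", "US", np.nan],
})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),  # explicit missing flag
    ("scale", MinMaxScaler()),                                          # map to 0-1 range
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

pre = ColumnTransformer([
    ("num", numeric, ["order_count"]),
    ("cat", categorical, ["country"]),
])
X = pre.fit_transform(df)   # 4 rows x 4 cols: scaled value, missing flag, 2 one-hot cols
```

Fitting the whole thing as one object also helps with the leakage rules later in this document: `fit` on train, `transform` on test.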
Example: Click prediction in ads:
```text
Raw Features:        Framed Features:
user_id          →   user_avg_ctr_30d    (avg CTR on historical ads)
                 →   user_ads_seen_7d    (engagement frequency)
ad_id            →   ad_ctr_7d           (ad's historical CTR)
                 →   ad_freshness_hours  (hours since created)
context_time     →   hour_of_day         (0-23)
                 →   day_of_week         (0-6)
                 →   is_weekend          (binary: Sat/Sun)
context_device   →   device_type         (mobile=1, desktop=0)
                 →   is_mobile_app       (vs web)
```
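The context_time derivations in this example are one-liners in pandas (the timestamps below are chosen arbitrarily):

```python
import pandas as pd

# Derive the framed context features from a raw request timestamp.
requests = pd.DataFrame({
    "context_time": pd.to_datetime(["2025-01-04 09:30",   # a Saturday
                                    "2025-01-06 22:10"]), # a Monday
})
requests["hour_of_day"] = requests["context_time"].dt.hour          # 0-23
requests["day_of_week"] = requests["context_time"].dt.dayofweek     # 0=Mon .. 6=Sun
requests["is_weekend"] = (requests["day_of_week"] >= 5).astype(int) # Sat/Sun
```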
Target Definition: y
The target is the observed outcome we’re trying to predict.
Target Requirements
Observable: Can we determine the ground truth?
- Click: user clicks or doesn’t → observable immediately
- Purchase: user buys or not → observable within hours
- Satisfaction: user’s internal preference → need proxy (rating, review)
- Counterfactual: what if we’d shown different item? → impossible to observe
Timely: Can we get labels fast enough to retrain?
- Click: 1 second (online, real-time)
- Purchase: 1 hour (batch process logs hourly)
- User satisfaction: 1 week (survey delayed)
- Career satisfaction: 10 years (feedback loop too slow)
Unbiased (in data): Are labels created independently of the model’s predictions?
- Historical data: users saw items independently of our system
- Post-deployment data: only see labels for items we showed (selection bias)
Label Collection Strategies
| Strategy | Cost | Quality | Speed | Best For | Risk |
|---|---|---|---|---|---|
| Programmatic | Low | Depends on rules | Instant | Clear signal (click, purchase) | Rule brittleness; changes may invalidate old labels |
| Crowdsourcing | Medium | Good (80–90%) | 1–2 weeks | Relevance judgments, content classification | Inter-annotator disagreement |
| Expert | High | Excellent (95%+) | Slow (weeks) | High-stakes (medical, legal, financial) | Cost limits volume; bottleneck |
| User feedback | Free | Variable | Delayed | Corrections, refinements | Noisy; biased toward shown items |
| Weak labeling | Low | Noisy (60–80%) | Instant | Multiple weak signals combined | Label noise accumulates |
Common Label Collection Approaches
Programmatic Labels (Click Prediction):
```text
Observation: User sees recommended item
             → User clicks or not (binary outcome)
Label:       y = 1 if user clicked within 2 seconds, else 0
Challenge:   Only items we showed get labels (selection bias)
             → Retraining data is biased toward what the model showed
Solution:    Inverse propensity weighting (IPW): weight samples by showing probability
             → Or counterfactual / off-policy learning methods
```
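IPW reduces to passing `1/propensity` as sample weights when the serving system logs each example's probability of being shown. A sketch on synthetic data; the features, labels, and propensities here are simulated, not from any real logging system:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)   # clicks driven by feature 0
propensity = rng.uniform(0.1, 0.9, size=n)           # logged P(item was shown)

# Up-weight examples the old policy rarely showed, so the reweighted training
# distribution approximates uniform-random exposure.
weights = 1.0 / propensity
model = LogisticRegression().fit(X, y, sample_weight=weights)
```

Propensities must be clipped away from 0 in practice, or a few rare impressions dominate the loss.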
Crowdsourced Labels (Relevance Judgment):
```text
Task:      10 annotators rate relevance of each (query, item) pair
Scale:     0 (irrelevant) to 4 (perfect match)
Label:     y = majority vote if >=7 annotators agree, else discard
Quality:   Only ~65% inter-annotator agreement on subjective tasks
           → Filter to high-agreement examples (Fleiss' kappa > 0.75)
Challenge: Cost ~$0.50-$2 per judgment
           → For 1M examples: $500k-$2M budget
Solution:  Active learning: annotate uncertain examples; skip easy ones
```
Expert Labels (Medical Diagnosis):
```text
Data:      Patient imaging (MRI, CT scan)
Task:      3 board-certified radiologists independently diagnose
Label:     y = 1 if >=2 agree on disease, else 0 (gold standard)
Challenge: Very expensive (~$100-$500 per case)
           Limited volume (100-1000 cases typical)
Solution:  Combine with weak labels for scale (automated detection + expert review)
           Semi-supervised learning: train on weak labels + few expert labels
```
User Feedback (Recommendation Corrections):
```text
Post-deployment: Model shows 5 recommendations
User action:     "Hide item X, show Y instead"
Label:           y = 1 for corrected items; 0 for shown but uncorrected
Challenge:       Only get feedback for shown items (selection bias)
                 Feedback is sparse (most users don't rate)
Solution:        Use propensity weighting; treat as a weak signal only
                 Flag for careful retraining (distribution shift)
```
Label Quality Metrics
Inter-annotator Agreement:
- Cohen’s Kappa (binary): 0.0 (random), 0.4–0.6 (moderate), 0.8+ (excellent)
- Fleiss’ Kappa (multiple annotators): similar scale
- Action: Accept only examples with agreement >= 0.75 (high confidence)
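Cohen's kappa for two annotators is available directly in scikit-learn; the toy labels below are invented to show the scale:

```python
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Observed agreement is 0.8; chance agreement is 0.52, so
# kappa = (0.8 - 0.52) / (1 - 0.52) = 0.583: moderate agreement,
# below the 0.75 acceptance bar suggested above.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(round(kappa, 2))  # 0.58
```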
Label Noise Tolerance:
- Models robust to 5–10% label noise
- Performance degrades significantly at 20%+ noise
- Measure: If 10% of labels are incorrect, what happens to model accuracy?
Temporal Label Leakage:
- Can we compute label only AFTER making prediction? (yes → valid)
- Do we have information about future? (yes → leakage, invalid)
- Check: Plot label distribution over time; sudden shifts = possible leakage
Baseline Selection
Golden Rule: Always train a simple baseline model first.
If your fancy algorithm doesn’t beat the baseline, investigate:
- Is the problem harder than expected?
- Is there data quality issue?
- Are features actually predictive?
- Did you introduce leakage?
Simple Baseline Strategies
Classification (e.g., click prediction):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Baseline 1: always predict the majority class (no click)
y_pred = np.zeros_like(y_test)            # CTR ~3%, so the majority class is 0
accuracy = (y_pred == y_test).mean()      # ~97%; misleading on imbalanced data
auc = roc_auc_score(y_test, np.full(len(y_test), 0.03))  # 0.5 (constant score = random)

# Baseline 2: per-user empirical click rate (weak personalization)
# Assumes y_train/y_test are pandas Series and *_user_ids align with them.
user_rates = y_train.groupby(train_user_ids).mean()      # user_id -> historical CTR
y_pred = test_user_ids.map(user_rates).fillna(y_train.mean())
auc = roc_auc_score(y_test, y_pred)       # ~0.60-0.65 typical

# Baseline 3: simple logistic regression on the framed features
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = lr.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred)       # ~0.70 typical for a linear model
```
Regression (e.g., price prediction):
```python
import numpy as np

# Baseline 1: predict the global mean
y_pred = np.full(len(y_test), y_train.mean())
rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))

# Baseline 2: per-category mean (stratified)
# Assumes y_train/y_test are pandas Series and *_categories align with them.
category_means = y_train.groupby(train_categories).mean()  # category -> mean target
y_pred = test_categories.map(category_means).fillna(y_train.mean())
rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))
```
Ranking (e.g., recommendation):
```python
import numpy as np
from sklearn.metrics import ndcg_score

# Baseline: rank items by global popularity (exposure as a crude quality proxy)
popularity = df.groupby('item_id')['show_count'].sum()
ranking = popularity.sort_values(ascending=False).index

# Serve the same top-K popular items to every user; score with NDCG@10
ndcgs = []
for user_id in test_set['user_id'].unique():
    user_items = set(test_set.loc[test_set.user_id == user_id, 'item_id'])
    predicted_ranking = ranking[:10]
    relevance = [1 if item in user_items else 0 for item in predicted_ranking]
    # y_true = true relevance of the served list, y_score = baseline's ranking scores
    ndcgs.append(ndcg_score([relevance], [list(range(10, 0, -1))]))
mean_ndcg_at_10 = np.mean(ndcgs)
```
Baseline Performance Documentation
Create a results table:
| Model | Accuracy | AUC | Precision | Recall | Latency | Training Time | Notes |
|---|---|---|---|---|---|---|---|
| Majority class | 96.5% | 0.500 | — | — | <1ms | — | Binary: always predict 0 |
| Per-user avg | 97.2% | 0.620 | 0.38 | 0.25 | 5ms | 2 min | Simple personalization |
| Logistic Regression | 97.8% | 0.680 | 0.52 | 0.38 | 2ms | 30 min | Linear; interpretable |
| Decision Tree | 98.1% | 0.695 | 0.54 | 0.42 | 8ms | 45 min | Non-linear; starts overfitting |
| XGBoost | 98.5% | 0.735 | 0.58 | 0.48 | 25ms | 2 hours | Gradient boosting; more complex |
| Neural Network | 98.6% | 0.745 | 0.59 | 0.50 | 80ms | 4 hours | Highest accuracy; latency cost |
Decision Framework:
- Is the improvement real and meaningful? (verify with a significance test; as a rough heuristic, >1% absolute metric gain is noticeable, >5% is major)
- Is latency trade-off acceptable? (50ms acceptable? 200ms too slow for real-time?)
- Is model complexity justified? (XGBoost vs Neural Net: both 0.74 AUC, but XGBoost 3x faster)
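Whether an AUC gain is real can be checked with a paired bootstrap over the test set instead of an eyeballed threshold. A sketch on synthetic scores; the two "models" here are simulated, with B given genuinely more signal than A:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 5000
y = rng.integers(0, 2, size=n)
score_a = y * 0.3 + rng.normal(size=n)   # weaker model
score_b = y * 0.5 + rng.normal(size=n)   # stronger model

# Paired bootstrap: resample the SAME test rows and compare AUCs per replicate.
deltas = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    if y[idx].min() == y[idx].max():
        continue  # AUC needs both classes present
    deltas.append(roc_auc_score(y[idx], score_b[idx])
                  - roc_auc_score(y[idx], score_a[idx]))
deltas = np.array(deltas)
ci = np.percentile(deltas, [2.5, 97.5])
print(ci)  # if the interval excludes 0, the AUC gain is likely real
```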
The Cold-Start Problem
Cold-start: new items, users, or contexts with no historical data. A frequent cause of recommendation failures in production.
Cold-Start Categories
| Scenario | Problem | Data Available | Solution |
|---|---|---|---|
| New user | No history for personalization | Item features, global patterns | Recommend popular items; ask preferences |
| New item | No interaction history | Item metadata (title, genre, tags) | Content-based recommendations |
| New market | Different user behavior/language | Similar markets’ data | Transfer learning; hybrid approach |
| New model version | Previous data doesn’t reflect new logic | Old model predictions only | Exploration; accept lower initial performance |
Cold-Start Handling Strategies
Content-Based: Use item/user features directly (no history needed)
```text
New movie:  "Action sci-fi from 2025"
Algorithm:  Find users who've rated similar movies highly
            (match genre, director, release era)
Result:     Recommend to similar-taste users
Pros:       Works immediately for new items
Cons:       Needs good metadata; misses unpopular but good items
```
Popularity-Based: Recommend global top items (safe fallback)
```text
New user:    Show all-time top-10 items (e.g., top movies)
New country: Show top items from similar countries
Pros:        Always works; gives users best-of-category
Cons:        Not personalized; boring after a while
```
Hybrid (Best Practice):
```text
Ranking score = 0.4 * popularity + 0.3 * content_similarity + 0.3 * collaborative_signal
  - Popularity: safe default for unknowns
  - Content similarity: leverage metadata
  - Collaborative: use if any history is available
Result: More personalized than popularity alone, more robust than pure collaborative
```
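The blend is a weighted sum; when no history exists the collaborative term is zero, and it helps to redistribute its weight to the signals that remain. A sketch (the weights are the document's illustrative values, not tuned, and the fallback split is an assumption):

```python
def hybrid_score(popularity, content_sim, collab, has_history):
    """Blend ranking signals; fall back to popularity + content-only
    weights when the user/item has no interaction history."""
    if not has_history:
        # Redistribute the 0.3 collaborative weight across the other signals.
        return 0.55 * popularity + 0.45 * content_sim
    return 0.4 * popularity + 0.3 * content_sim + 0.3 * collab

print(hybrid_score(0.8, 0.6, 0.0, has_history=False))  # cold-start: ~0.71
print(hybrid_score(0.8, 0.6, 0.9, has_history=True))   # warm: ~0.77
```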
Active Learning: Actively gather feedback
```text
New user:  Show diverse items, ask "which do you like?"
Collect:   Top-5 items they rate highly
Bootstrap: Use rated items to find similar recommendations
Pros:      Quickly personalizes with few interactions (10-20)
Cons:      Requires user effort; abandonment risk
```
Proxy Metrics vs True Metrics
Often, the true metric is expensive/slow to measure. Use a proxy that correlates:
Example 1: Recommendation System
| Metric | Measurement | Cost | Latency | Use When? |
|---|---|---|---|---|
| Watch hours | Total minutes user watches recommendations | $$$ | Weeks (accumulate) | True metric; validate A/B test |
| CTR | Clicks on recommendations | $ | Hours | Proxy; optimize during development |
| NDCG@10 | Expert relevance judgments | $$ | Hours (offline) | Proxy; fast development feedback |
| Impression count | Recommendations shown (+ engagement binary) | $ | Real-time | Monitoring; leading indicator |
Best practice: Optimize for NDCG in offline eval (fast iteration) → validate against watch hours in A/B test (ground truth).
Example 2: Healthcare Diagnosis
| Metric | Measurement | Cost | Latency | Use When? |
|---|---|---|---|---|
| Patient outcome | Health after treatment (recovery, mortality) | $$$ | Months–years | True metric; validate annually |
| Expert agreement | Board-certified radiologist diagnosis | $$$ | Days | Gold standard for training |
| ROC-AUC | Diagnostic accuracy on labeled test set | $ | Hours | Development metric; fast iteration |
| Sensitivity/Specificity | True positive/negative rates | $ | Hours | Monitoring; public reporting |
Best practice: Train on expert labels (available, fast) → validate against true outcomes (periodic, expensive).
Data Leakage: The Silent Killer
Data leakage = information that would not be available at prediction time leaks into the training process, inflating offline metrics while production performance collapses.
Leakage Types and Examples
Temporal Leakage (Most Common):
```text
Problem: Training uses features computed from the future relative to the
         prediction point (effectively training on 2025 to "predict" 2024)
Result:  Metrics look great offline (95% accuracy)
         In production, accuracy collapses (~50%): the features don't exist yet
Example: Churn prediction
         Feature: "has_opened_support_ticket_in_next_7_days"
         Problem: We don't know this when predicting churn
         Fix:     Use only support tickets BEFORE the prediction point
```
Label Leakage (Feature uses target information):
```text
Problem: A feature directly encodes the target
         The model learns to "cheat"
Result:  Perfect training accuracy; fails completely in production
Example: Fraud detection
         Feature: "transaction_marked_as_fraud_in_system"
         Problem: We're trying to predict fraud; this IS the label
         Fix:     Remove any feature that requires knowing the label
```
Sample Leakage (Train/test contamination):
```text
Problem: The same user/transaction appears in BOTH train and test
         The model memorizes specific entities
Result:  Good test accuracy; poor on new users
Example: Click prediction
         User 12345 has 1000 clicks in the training set
         User 12345's clicks also appear in the test set
         Fix: Split by user_id, not randomly
              → ensures no user is in both train and test
```
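scikit-learn's GroupShuffleSplit implements the entity-level split directly (the arrays below are toy data):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

user_ids = np.array([101, 101, 101, 202, 202, 303, 303, 404])
X = np.arange(len(user_ids)).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 0, 1, 1])

# Split by entity (user), not by row: no user appears in both train and test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=user_ids))
assert set(user_ids[train_idx]).isdisjoint(user_ids[test_idx])
```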
Leakage Detection Checklist
- Using future data to predict past? (temporal split)
- Using test set to compute feature statistics? (compute on train only)
- Features that require knowing the target? (would be unavailable at serving)
- Same entity in train and test? (split by entity_id, not row_id)
- Information from evaluation period in training features? (time-series split)
Leakage Prevention
Best practice: Split data BEFORE feature engineering
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 1: split FIRST, before computing any feature statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: fit the preprocessor on the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # transform test with TRAIN statistics

# Don't do this: scaler.fit(np.vstack([X_train, X_test])) -> leakage!
```
For time-series: Use temporal split
```python
# Wrong: a random split mixes past and future
X_train, X_test = train_test_split(X, test_size=0.2)

# Right: sort by time, then split at a cutoff point
split_point = int(0.8 * len(X))            # X assumed sorted by timestamp
X_train, X_test = X[:split_point], X[split_point:]
# e.g., train on 2020-2024, test on 2025 only
```
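For cross-validation on time-ordered data, scikit-learn's TimeSeriesSplit generalizes the single cutoff into walk-forward folds, each training strictly on the past:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)   # rows assumed already sorted by time

# Walk-forward validation: every fold trains on the past and
# evaluates on the block of data immediately after it.
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    assert train_idx.max() < test_idx.min()   # never train on the future
    print(train_idx, "->", test_idx)
```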
Implementation Example: Problem Framing Document
```text
PROBLEM FRAMING: Click Prediction for Ad Recommendations
DATE: YYYY-MM-DD
OWNER: Data Science Lead

PROBLEM TYPE
============
Supervised classification (binary)
Definition: Predict probability user clicks recommended ad

FEATURE DEFINITION (X)
======================
Feature Set:
User Features:
  - user_avg_ctr_30d = mean(click=1) for user's ads shown past 30d
  - user_ads_seen_7d = count of ads shown to user in last 7d
Ad Features:
  - ad_ctr_7d        = mean(click=1) for this ad shown past 7d
  - ad_freshness     = hours since ad created
Context:
  - hour_of_day      = 0-23 (temporal)
  - day_of_week      = 0-6 (Mon-Sun)
  - is_mobile        = 1 if mobile device, 0 if desktop

Feature Availability Check:
  [x] All features available at serving time (<50ms latency)
  [x] No future data used
  [x] No label information in features
  [x] Feature values stable (not dependent on random seed)

TARGET DEFINITION (y)
=====================
Definition: Binary outcome
  y = 1 if user clicks recommended ad within 2 seconds
  y = 0 otherwise
Label Timing: Observable within 1 second of showing ad
Label Quality:
  Source: Production event logs (user click events)
  Accuracy: 100% (deterministic; user clicked or not)
  Volume: ~50M labeled examples/day available
Label Bias Risk:
  Selection bias: only see labels for ads we showed
  → Post-deployment data biased toward model's recommendations
  Mitigation: Use inverse propensity weighting in retraining
              Also collect A/B test data with random exposure

BASELINE MODEL
==============
Strategy: Per-ad historical CTR
  For each ad_id, predict y = ad_ctr (empirical click rate)
Expected performance:
  AUC: ~0.62 (weak signal; only uses item popularity)
  Precision: ~0.032 (3.2% CTR baseline)
Success criterion:
  New model must beat AUC 0.62 to be worth deploying

PROXY METRICS
=============
True metric: Ad revenue generated ($$); too slow to optimize directly
Proxy 1: CTR (clicks/impressions); fast, correlates with revenue
Proxy 2: eCTR = predicted click rate; the optimized metric
Proxy 3: Engagement score = CTR + hover_rate; broader signal
Validation:
  Offline: optimize eCTR in ML model
  Online (A/B test): measure true revenue lift

COLD-START STRATEGY
===================
New ads:   Recommend based on global popularity (top 100 ads overall)
           → After 1000 impressions, use estimated CTR
New users: Show high-popularity ads (safe default)
           → After 10 interactions, use collaborative signals

DATA SOURCES
============
Source 1: Event logs (BigQuery)
  - User exposure: which ads shown to which users
  - User clicks: clicks on shown ads
  - Update frequency: streaming (real-time, but batch consumed)
Source 2: Ad metadata (PostgreSQL)
  - Ad text, image URL, landing page
  - Update frequency: daily
Source 3: User profile (Hive)
  - User age, location, account age
  - Update frequency: static
Data Retention: 24-month lookback available (GDPR-compliant)
Volume: ~1B events/day; training set ~100B examples

TEMPORAL SPLIT
==============
Training data:   2024-01-01 to 2024-11-30 (11 months)
Validation data: 2024-12-01 to 2024-12-20 (20 days)
Test data:       2024-12-21 to 2024-12-31 (10 days)
Reason: Time-series split prevents leakage (train on past, test on future)

LEAKAGE CHECK
=============
[x] Features use only historical data (before prediction time)
[x] No test set information used for feature engineering
[x] No users/ads duplicated between train/test
[x] Time split enforced (no past-future mixing)
[x] Target computable only after showing ad (no cheating)
```
How Real Companies Use This
Google Translate (Pivoting from Sequence-to-Sequence to Quality Ranking): Google’s neural machine translation project initially framed the problem as end-to-end sequence-to-sequence: predict the target-language sentence given the source-language input. However, the team discovered that optimizing for BLEU score (the standard NMT metric) didn’t correlate with user satisfaction in production. Reframed as: rank candidate translations by cross-lingual similarity scores (using embeddings of the source and backtranslated target). This required redefining features (monolingual context becomes less important; semantic equivalence becomes critical) and the target (pairwise ranking instead of likelihood). Problem framing alone took 4 weeks of experimentation to validate the new formulation. Result: the new framing improved user-perceived translation quality by 15% without changing the underlying model architecture.
Amazon “Customers Also Bought” (Item-to-Item vs User-Based Collaborative Filtering): Amazon’s recommendation engine initially framed the problem as: predict which items user will buy next (user-based collaborative filtering). Data showed this performed poorly on new product categories (cold-start). Reframed to: predict items frequently bought together in same session (item-to-item CF). Key insight: user relationships are volatile (preferences change); item relationships are stable (complementary products don’t change). Features shifted from user history to item metadata and co-purchase statistics. Cold-start strategy: new items immediately get recommendations if similar (by metadata) items exist. This simple reframing gave 25% improvement in recommendation coverage and 5% lift in conversion rate.
DoorDash (Regression to Ordinal Classification for Delivery Time): DoorDash’s initial problem framing for delivery time prediction was continuous regression: predict exact minutes. However, customers care about ordinal promises (“guaranteed <30 min or free”). Reframed to: predict ordinal class (on-time / 5-min late / 10-min late / 15+ min late). This changed features (now focus on order complexity, traffic, driver history rather than precise geolocation) and target (multi-class labels instead of continuous values). Cold-start: new restaurants get conservative time estimates using restaurant type and neighborhood aggregates. Result: 8% improvement in on-time delivery rate, driven not by better model accuracy but by better problem formulation aligned with business reality.
Twitter/X (Content Ranking with Virality Signals): Twitter’s problem framing for feed ranking evolved over years. Initially: predict engagement (likes, retweets). Revealed hidden concept drift: virality signals (retweets from influential accounts) matter more than raw volume. Reframed to: rank by “engagement probability weighted by influencer reach” (combining CTR + virality into composite score). Feature engineering became critical: follower count of engager, tweet recency (tweets decay in interest), conversation depth (threaded replies). Proxy metric innovation: use 1-hour engagement as proxy for 24-hour engagement (can’t wait weeks for labels). Cold-start: use tweet age, author follower count, content type (media, link, quote) as default signals. Concept drift is managed by weekly retraining on fresh data.
Pinterest (Image Embeddings for Cold-Start Pins): Pinterest’s “visual search” problem framing required a pivot: instead of collaborative filtering (which fails for new pins with no saves), use image embeddings (trained via contrastive learning on pin saves) as features. Reframed from “predict user x pin interaction” to “rank pins by semantic similarity to user’s saved pins + visual attractiveness.” Labeling strategy: implicit (saved = relevant), but with propensity weighting (people save more from items they see). Cold-start pins use image content alone (CNN features) until enough engagement data accumulates. Baseline: random ranking from trending category. New framing gave 40% improvement in save rate for new pins, directly impacting user retention (fresh content discoverability).