Top 20 ML Algorithms
A taxonomy of essential machine learning algorithms, covering the methods that power production systems across industry.
Algorithm Landscape
| Category | Algorithm | Time | Space | When to Use | Production Use |
|---|---|---|---|---|---|
| Linear Models | Linear Regression | O(n×d×iter) | O(d) | Baseline for regression | Uber: delivery time prediction |
| | Logistic Regression | O(n×d×iter) | O(d) | Binary classification baseline | Stripe: fraud detection |
| | Ridge/Lasso | O(n×d×iter) | O(d) | Regularization, feature selection | Medical diagnosis with many features |
| Tree-Based | Decision Trees | O(n log n × d) | O(n) | Interpretability, small data | Credit approval (regulatory) |
| | Random Forest | O(trees×n log n×d) | O(n) | Balanced accuracy/speed | Airbnb: price prediction, booking |
| | Gradient Boosting (XGBoost) | O(trees×n log n×d) | O(n) | Kaggle winner, tabular data | Criteo: click-through rate prediction |
| | LightGBM | O(trees×n log n×d) | O(n) | Large datasets, speed | Microsoft: ranking in search |
| Distance-Based | k-Means Clustering | O(n×k×iter) | O(n) | Fast clustering, scalable | Spotify: playlist generation |
| | k-Nearest Neighbors | O(n×d) per query | O(n×d) | Non-parametric, small data | Recommendation (but slow at large scale) |
| | SVM | O(n² to n³) | O(sv) | High-dim data, kernel tricks | Text classification, face recognition |
| Ensemble | Bagging | O(trees×n×d) | O(n) | Reduce variance | Combines many weak learners |
| | Boosting | O(trees×n×d) | O(n) | Reduce bias, weighted samples | AdaBoost, Gradient Boosting |
| | Stacking | O(models×n×d) | O(n) | Combine diverse models | Kaggle: meta-learner |
| Clustering | Hierarchical Clustering | O(n²) | O(n²) | Dendrograms, interpretable | Gene expression analysis |
| | DBSCAN | O(n log n) | O(n) | Non-convex clusters, outliers | Geospatial: location clustering |
| Neural Nets | MLP | O(layers×n×hidden) | O(params) | Complex non-linear patterns | Recommendation systems |
| | CNN | O(filters×n×field) | O(params) | Images, computer vision | ResNet: ImageNet 76% top-1 accuracy |
| | RNN/LSTM | O(seq_len×n×hidden) | O(seq_len) | Sequential data, NLP | Google Translate, sentiment analysis |
| | Transformer | O(seq_len²×d) | O(seq_len²) | Parallel processing, long-range deps | BERT, GPT-4, modern NLP |
Linear Models (Interpretable Baselines)
Linear Regression
Used when: Continuous output, linear relationship, interpretability matters
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
Production: Uber predicts delivery time. Linear combination of distance, time-of-day, weather, traffic. ~90% R² on 10M daily predictions.
Tree-Based Models (Tabular Data Champions)
Random Forest
Used when: Balanced accuracy/interpretability/speed, non-linear relationships
```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, max_depth=15, n_jobs=-1)
model.fit(X_train, y_train)
importances = model.feature_importances_  # feature importance for interpretation
```
Key Properties: Trains 100 trees in parallel. Each tree uses random feature subset. Averaging reduces variance.
Production: Airbnb uses random forest for price prediction. Inputs: location, amenities, reviews. Accuracy: ±25% on 90% of listings.
Gradient Boosting (XGBoost)
Used when: State-of-the-art tabular data, Kaggle competitions, structured features
```python
import xgboost as xgb

model = xgb.XGBClassifier(n_estimators=100, max_depth=5, learning_rate=0.1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
How It Works: Sequentially builds trees. Each new tree predicts residuals of previous trees. Weighted emphasis on hard examples.
Production: Criteo: click-through rate (CTR) prediction for online ads. 1B+ daily predictions. XGBoost achieves 0.5% AUC improvement over logistic regression = millions in ad revenue.
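The residual-fitting loop can be sketched by hand with a toy dataset. This is an illustrative simplification for squared loss, not XGBoost's regularized objective or histogram-based tree building:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: noisy quadratic
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

# Start from the mean prediction, then repeatedly fit a shallow tree
# to the residuals and add a shrunken copy of its predictions.
learning_rate = 0.1
pred = np.full_like(y, y.mean())
trees = []
for _ in range(100):
    residuals = y - pred                     # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)  # each tree nudges predictions
    trees.append(tree)

mse_start = np.mean((y - y.mean()) ** 2)
mse_final = np.mean((y - pred) ** 2)
```

Each iteration fits what the ensemble so far got wrong, which is why training error drops steadily as trees are added.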
Distance-Based Methods
k-Nearest Neighbors (k-NN)
Used when: Small datasets, non-parametric, simple baseline
```python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)  # "training" just stores the data
y_pred = model.predict(X_test)  # per query: find 5 nearest neighbors, majority vote
```
Trade-offs: Essentially no training cost (fit just stores the data), but O(n×d) per query. Doesn’t scale to 1B examples.
Support Vector Machine (SVM)
Used when: High-dimensional data, clear margin separation
```python
from sklearn.svm import SVC

model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train, y_train)
```
Key Insight: Finds the maximum-margin hyperplane. The kernel trick computes inner products in an implicit high-dimensional space, yielding non-linear boundaries in the original feature space.
Production: Face recognition (Facebook DeepFace started with SVM before switching to deep learning).
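A quick way to see the kernel trick at work, using scikit-learn's `make_circles` to generate a dataset that no straight line can separate (illustrative synthetic data, not a production workload):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space,
# but the RBF kernel implicitly maps points where they are.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)  # near chance
rbf_acc = SVC(kernel='rbf', gamma='scale').fit(X, y).score(X, y)  # near perfect
```

The linear kernel is stuck around coin-flip accuracy while the RBF kernel separates the rings cleanly, without ever constructing the high-dimensional features explicitly.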
Clustering Algorithms
K-Means
Used when: Fast clustering, spherical clusters, scalable
```python
from sklearn.cluster import KMeans

model = KMeans(n_clusters=5, n_init=10, max_iter=300)
labels = model.fit_predict(X)
```
Production: Spotify clusters songs by audio features (tempo, pitch, energy). 100M songs grouped into playlists + similar track recommendations.
DBSCAN
Used when: Non-convex clusters, outlier detection, unknown number of clusters
```python
from sklearn.cluster import DBSCAN

model = DBSCAN(eps=0.5, min_samples=5)
labels = model.fit_predict(X)  # label -1 marks noise/outliers
```
Production: Geospatial: cluster user locations for local restaurant recommendations.
Ensemble Methods
Bagging (Bootstrap Aggregating)
Trains multiple models on random subsamples. Averaging predictions reduces variance.
Example: Random Forest is bagging of decision trees
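A minimal sketch with scikit-learn's `BaggingClassifier` on synthetic data (parameter values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bag 50 decision trees, each trained on its own bootstrap subsample.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = BaggingClassifier(
    DecisionTreeClassifier(),  # base learner: a high-variance deep tree
    n_estimators=50,           # number of bootstrap samples / models
    max_samples=0.8,           # each model sees 80% of the data
    random_state=0,
).fit(X, y)

train_acc = model.score(X, y)
```

Each tree overfits its own subsample in a slightly different way; averaging their votes cancels much of that variance.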
Boosting
Sequentially trains weak learners. Each focuses on previous mistakes. Reduces bias.
Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM
Production: Most Kaggle winners use gradient boosting (XGBoost, LightGBM, or CatBoost).
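For example, classic AdaBoost over depth-1 "stumps" in scikit-learn (synthetic data and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 100 decision stumps trained sequentially; each round upweights
# the examples the ensemble so far has misclassified.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak learner: a single split
    n_estimators=100,
    random_state=0,
).fit(X, y)

train_acc = model.score(X, y)
```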
Stacking
Trains multiple diverse models. Meta-learner combines predictions.
```text
Layer 1: Train logistic regression, random forest, SVM on X
Output:  3×n predictions matrix
Layer 2: Train meta-learner (e.g., linear regression) on Layer 1 output
Output:  Final predictions
```
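The same two-layer recipe is built into scikit-learn's `StackingClassifier` (synthetic data; the base-model choices here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[                               # Layer 1: diverse base models
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(probability=True)),
    ],
    final_estimator=LogisticRegression(),      # Layer 2: meta-learner
    cv=5,  # out-of-fold predictions feed layer 2, avoiding leakage
).fit(X, y)

train_acc = stack.score(X, y)
```

The `cv` argument matters: the meta-learner must be trained on out-of-fold base-model predictions, or it just learns to trust whichever base model memorized the training set.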
Neural Network Models
Multilayer Perceptron (MLP)
Used when: Complex non-linear patterns, moderate data (10K+ examples)
```python
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, alpha=0.0001)
model.fit(X_train, y_train)
```
Layers: Input → hidden layers (learn features) → output (task)
Convolutional Neural Network (CNN)
Used when: Images, video, spatial data
Key Innovation: Convolutional filters extract local features (edges, textures). Parameter sharing cuts parameter counts by orders of magnitude relative to fully connected layers.
Production: ResNet-50 achieves 76% top-1 accuracy on ImageNet (1M images, 1000 categories). Inference: ~100ms on CPU.
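The core operation can be sketched in a few lines of NumPy: one small kernel slides over the image, so the same handful of weights is reused at every spatial position. The Sobel filter below is a classic hand-crafted edge detector standing in for a learned filter:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: the same small kernel is applied
    at every position (this reuse is the parameter sharing)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Image with a sharp vertical boundary: left half dark, right half bright
image = np.zeros((8, 8))
image[:, 4:] = 1.0

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)  # vertical-edge detector

edges = conv2d(image, sobel_x)
# Response is nonzero only near the boundary around column 4
```

Nine weights detect the edge everywhere in the image; a fully connected layer would need a separate weight per input pixel per output unit.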
Recurrent Neural Network (RNN) / LSTM
Used when: Sequential data (text, time-series, speech)
Key: Hidden state carries memory. Can model long-range dependencies.
Production: Google Translate’s neural system (GNMT, 2016) used LSTM encoder-decoders across 100+ language pairs before migrating to Transformer models.
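A bare-bones Elman RNN cell in NumPy shows how the hidden state carries information forward step by step (weights are random and untrained, purely for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
input_dim, hidden_dim = 4, 8

# Recurrence: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def rnn_forward(xs):
    """Process a sequence one step at a time; the hidden state h is the
    'memory' that links earlier inputs to later outputs."""
    h = np.zeros(hidden_dim)
    states = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b)
        states.append(h)
    return np.array(states)

seq = rng.normal(size=(10, input_dim))  # a sequence of 10 input vectors
states = rnn_forward(seq)               # one hidden state per time step
```

The sequential dependence on `h_{t-1}` is exactly what prevents parallelizing over time steps, and (with plain tanh cells) what makes long-range gradients vanish; LSTM gating mitigates the latter.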
Transformer
Used when: NLP, long sequences, parallel training needed
Key Innovation: Self-attention replaces recurrence. Can attend to any position directly (no sequential bottleneck).
Production: BERT, GPT-4, Claude all use Transformers. BERT pre-trained on 3.3B words. Fine-tuning achieves 90%+ accuracy on 10+ NLP benchmarks with minimal task-specific data.
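Single-head scaled dot-product self-attention can be sketched in NumPy (random untrained weights; the seq_len × seq_len score matrix is the source of the quadratic cost in the table above):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every position computes a
    weighted average over ALL positions, with no sequential bottleneck."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len): the O(n^2) term
    # Softmax over key positions (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.RandomState(0)
seq_len, d_model = 6, 16
X = rng.normal(size=(seq_len, d_model))        # one embedding per position
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))

out, attn = self_attention(X, Wq, Wk, Wv)
# attn[i, j] is how much position i attends to position j; rows sum to 1
```

Because every position attends to every other in one matrix multiply, the whole sequence is processed in parallel, unlike an RNN's step-by-step recurrence.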
Choosing an Algorithm
Decision Framework:
- Is your output continuous or categorical?
  - Continuous → Regression (linear, tree-based, neural net)
  - Categorical → Classification (logistic regression, tree-based, SVM, neural net)
- Do you have labeled data?
  - Yes → Supervised learning (algorithms above)
  - No → Unsupervised (k-means, hierarchical, DBSCAN, autoencoders)
- What’s your data size?
  - <1000: Simple models (logistic regression, SVM, small tree) or transfer learning
  - 1K–1M: Random Forest, XGBoost (best power-to-effort ratio)
  - 1M+: Neural nets, gradient boosting with sampling
- Do you need interpretability?
  - Yes → Linear regression, decision trees, random forest (feature importance)
  - No → Neural nets, SVM with complex kernels
- What’s your computational budget?
  - Fast (<1s training): Logistic regression, small trees
  - Minutes: Random Forest, XGBoost
  - Hours–days: Neural networks, transfer learning
References
- 📖 Hands-On Machine Learning (Aurélien Géron) — practical, industry-focused textbook
- 📄 XGBoost paper (Chen & Guestrin, 2016)
- 📄 LightGBM paper (Ke et al., 2017)
- 📖 Scikit-learn documentation
- 🎥 Andrew Ng: Machine Learning (Coursera) — industry-standard intro course
- 🔗 Kaggle Learn — free practical courses