Deep Learning Architectures
The neural network zoo: Different architectures for different data modalities—images, sequences, and everything in between.
Architecture Landscape
| Architecture | Input | Task | Key Innovation | Example | Complexity |
|---|---|---|---|---|---|
| CNN | Images, grids | Classification, detection | Convolution: local feature extraction | ResNet-50: ImageNet 76% accuracy | Medium |
| RNN | Sequences | Language, time-series | Recurrence: memory of previous states | LSTM: language modeling, translation | High |
| Transformer | Sequences | NLP, vision | Self-attention: direct dependencies | BERT, GPT-4: state-of-the-art NLP | Very high |
| GAN | Noise vector | Image generation | Adversarial training | StyleGAN: photorealistic faces | Very high |
| Autoencoder | Data | Unsupervised learning | Bottleneck: compress and reconstruct | Anomaly detection, compression | Low-medium |
| Vision Transformer | Images | Classification, detection | Attention on image patches | ViT: competitive with CNNs | Very high |
Convolutional Neural Networks (CNN)
Core Idea: The convolution operator extracts local features with a sliding filter; sharing those filter weights across positions keeps the model small.
Building Blocks:
- Convolutional Layer: Sliding window applies learned filters
- Pooling: Downsampling (e.g., max pooling reduces H×W by 2x)
- Fully Connected: Final layers for classification
Advantage: Parameter sharing. A 3×3 filter has 9 weights regardless of image size (1000×1000).
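The parameter-sharing advantage can be checked directly. A minimal PyTorch sketch (layer sizes are illustrative) compares a 3×3 convolution's weight count with a fully connected layer mapping the same 1000×1000 RGB image:

```python
# Parameter sharing: a conv layer's weight count is independent of image size.
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, bias=False)
n_conv = sum(p.numel() for p in conv.parameters())  # 16 * 3 * 3 * 3 = 432

# A fully connected layer from a flattened 1000x1000 RGB image to 16 units:
n_fc = (3 * 1000 * 1000) * 16  # 48,000,000 weights

print(n_conv, n_fc)  # 432 vs 48000000
```

The convolution uses the same 432 weights at every spatial position, which is exactly why CNNs stay tractable on large images.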
Production Example:
ResNet-50: 50 layers, 25.5M parameters, 76% top-1 ImageNet accuracy
Training: 90 epochs, ~1 GPU-week
Inference: ~100ms on CPU, ~5ms on GPU
Used for: Product image search, medical image analysis, autonomous driving
Recurrent Neural Networks (RNN) & LSTM
RNN: Memory via Recurrence
RNN update: h_t = tanh(Wx_t + Uh_{t-1} + b)
Problem: Vanishing gradient—can’t learn long-range dependencies.
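The vanishing gradient can be seen in a toy scalar recurrence. This sketch (scalar weight u is an illustrative stand-in for the recurrent matrix) backpropagates through h_t = tanh(u·h_{t-1}): each step multiplies the gradient by u·(1 − h_t²), so with |u| ≤ 1 it shrinks geometrically:

```python
# Scalar sketch of the vanishing gradient over 50 time steps.
import math

u, h = 0.9, 0.5
grad = 1.0            # d h_T / d h_0, accumulated by the chain rule
for _ in range(50):
    h = math.tanh(u * h)
    grad *= u * (1 - h * h)  # d h_t / d h_{t-1} = u * (1 - tanh(...)^2)

print(f"{grad:.2e}")  # vanishingly small after 50 steps
```

After 50 steps the gradient is far below 1% of its initial value, so early inputs barely influence learning. This is the failure mode the LSTM's gated cell state is designed to avoid.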
LSTM (Long Short-Term Memory):
Introduces cell state (memory) + gates to control information flow.
Forget gate: f_t = sigmoid(Wf[h_{t-1}, x_t] + bf)
Input gate: i_t = sigmoid(Wi[h_{t-1}, x_t] + bi)
Output gate: o_t = sigmoid(Wo[h_{t-1}, x_t] + bo)
Candidate cell: C̃_t = tanh(Wc[h_{t-1}, x_t] + bc)
New cell state: C_t = f_t * C_{t-1} + i_t * C̃_t
Output: h_t = o_t * tanh(C_t)
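One LSTM cell step can be transcribed almost directly from these equations. A minimal NumPy sketch with illustrative (untrained, random) weights:

```python
# One LSTM cell step: forget/input/output gates plus candidate cell state.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """W maps the concatenated [h_{t-1}, x_t] to all 4 gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    C_t = f * C_prev + i * np.tanh(g)             # new cell state
    h_t = o * np.tanh(C_t)                        # new hidden state
    return h_t, C_t

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)
h, C = np.zeros(hidden), np.zeros(hidden)
h, C = lstm_step(rng.standard_normal(inputs), h, C, W, b)
print(h.shape, C.shape)  # each (4,)
```

Note how the cell state C_t is updated additively (f_t · C_{t-1} + i_t · C̃_t), which is what lets gradients flow across many time steps.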
Production Example:
Google Translate: LSTM encoder-decoder for 100+ language pairs. Improved translation quality by 5–10 BLEU points.
Transformer Architecture
Key Innovation: Self-Attention replaces recurrence. “Attend to any position directly.”
Components:
- Self-Attention: Each position attends to all positions
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
- Multi-Head Attention: Multiple attention heads in parallel (8–16 heads)
- Feed-Forward: Non-linear transformation per position
- Positional Encoding: Encodes token position (since there is no recurrence)
- Layer Normalization & Residual Connections: Enable deep stacking
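The self-attention formula above is compact enough to implement directly. A single-head NumPy sketch (sequence length and dimensions are arbitrary examples):

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, single head.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 8)
```

Every output position is a convex combination of all value vectors, which is why any position can attend to any other in a single layer.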
Advantages:
- ✅ Parallelizable (no sequential dependency like RNN)
- ✅ Handles long contexts (2048–4096 tokens in typical models), though attention cost grows as O(n²) in sequence length
- ✅ Captures long-range dependencies better than LSTM
Production Example:
BERT (Bidirectional Encoder Representations from Transformers):
- 12–24 layers, 110M–340M parameters
- Pre-trained on 3.3B words
- Fine-tuning achieves 90%+ on 10+ NLP benchmarks
- Inference: ~100ms per query on GPU
Generative Adversarial Networks (GAN)
Two-Network Adversarial Setup:
- Generator: Learns to create fake data from random noise
- Discriminator: Learns to distinguish real vs fake
Training Loop:
- Generator creates fake images
- Discriminator evaluates real + fake
- Backprop: Generator minimizes discriminator accuracy, discriminator maximizes accuracy
- Equilibrium: Generator produces indistinguishable fakes
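The training loop above can be sketched as one PyTorch step. Networks and sizes here are toy placeholders (nothing like StyleGAN), chosen only to show the two-optimizer structure:

```python
# One GAN training step: update D on real+fake, then update G to fool D.
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
D = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 4)                  # stand-in for a batch of real data
ones, zeros = torch.ones(8, 1), torch.zeros(8, 1)

# 1) Discriminator step: classify real as 1, fake as 0
fake = G(torch.randn(8, 16)).detach()     # detach: don't update G here
d_loss = bce(D(real), ones) + bce(D(fake), zeros)
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# 2) Generator step: make D label fakes as real
fake = G(torch.randn(8, 16))
g_loss = bce(D(fake), ones)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The `.detach()` in the discriminator step is the key detail: it blocks gradients from flowing into the generator while D is being updated.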
Production Example:
StyleGAN: Generates photorealistic human faces; human evaluators often cannot distinguish random samples from real photographs.
Challenge: Training instability—generator/discriminator can diverge.
Autoencoders
Structure: Encoder (compress) → Bottleneck → Decoder (reconstruct)
Objective: Minimize reconstruction loss ‖x − decode(encode(x))‖²
Applications:
- Anomaly Detection: High reconstruction error = anomaly
- Dimensionality Reduction: Bottleneck layer is compressed representation
- Data Augmentation: Decoder generates variations
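The anomaly-detection use case can be sketched end to end: train an autoencoder on "normal" data only, then flag inputs it cannot reconstruct. Data and layer sizes below are toy placeholders:

```python
# Anomaly detection via reconstruction error with a tiny autoencoder.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 2), nn.ReLU(), nn.Linear(2, 8))  # bottleneck = 2
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

normal = torch.randn(256, 8) * 0.1        # tight "normal" cluster near zero
for _ in range(200):                      # fit on normal data only
    loss = ((model(normal) - normal) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def score(x):                             # per-sample reconstruction error
    return ((model(x) - x) ** 2).mean(dim=1)

with torch.no_grad():
    normal_err = score(torch.randn(4, 8) * 0.1)
    anomaly_err = score(torch.randn(4, 8) + 5.0)  # far from training data
print(normal_err.mean().item(), anomaly_err.mean().item())
```

Samples drawn from the training distribution reconstruct well; out-of-distribution samples produce a much larger error and can be flagged with a simple threshold.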
Production Example:
Netflix: Autoencoders reduce high-dimensional user embeddings for recommendation. Faster similarity search.
Implementation Comparison
```python
# CNN (Image Classification)
from torchvision import models
model = models.resnet50(pretrained=True)
model.eval()
output = model(image_tensor)  # image_tensor: (N, 3, 224, 224); ~100ms CPU inference

# LSTM (Sequence to Sequence)
import torch.nn as nn
lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2, batch_first=True)
output, (h, c) = lstm(sequence)  # sequence: (N, T, 100); hidden state preserved

# Transformer
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
output = model(input_ids, attention_mask=attention)  # inputs from a tokenizer

# GAN
generator = nn.Sequential(
    nn.Linear(100, 256),
    nn.ReLU(),
    nn.Linear(256, 784),
)
fake_images = generator(random_noise)  # random_noise: (N, 100)

# Autoencoder
class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(784, 64)
        self.decoder = nn.Linear(64, 784)

    def forward(self, x):
        return self.decoder(self.encoder(x))
```
Choosing an Architecture
| Input Type | Task | Best Architecture |
|---|---|---|
| Images | Classification | CNN or Vision Transformer |
| Images | Generation | GAN or Diffusion |
| Text | Classification | Transformer (BERT-style) |
| Text | Generation | Transformer (GPT-style) |
| Sequences | Forecasting | LSTM or Transformer |
| Audio | Recognition | CNN or Transformer |
| Unsupervised | Compression | Autoencoder or VAE |
References
- 📄 LeNet: Gradient-Based Learning Applied to Document Recognition (LeCun et al., 1998)
- 📄 ResNet: Deep Residual Learning for Image Recognition (He et al., 2016)
- 📄 Attention Is All You Need (Vaswani et al., 2017)
- 📄 Generative Adversarial Networks (Goodfellow et al., 2014)
- 📖 Deep Learning (Goodfellow, Bengio, Courville)
- 🎥 Stanford CS231n: Computer Vision