
Deep Learning Architectures

The neural network zoo: Different architectures for different data modalities—images, sequences, and everything in between.


Architecture Landscape

| Architecture | Input | Task | Key Innovation | Example | Complexity |
|---|---|---|---|---|---|
| CNN | Images, grids | Classification, detection | Convolution: local feature extraction | ResNet-50: ImageNet 76% accuracy | Medium |
| RNN | Sequences | Language, time-series | Recurrence: memory of previous states | LSTM: language modeling, translation | High |
| Transformer | Sequences | NLP, vision | Self-attention: direct dependencies | BERT, GPT-4: state-of-the-art NLP | Very high |
| GAN | Noise vector | Image generation | Adversarial training | StyleGAN: photorealistic faces | Very high |
| Autoencoder | Data | Unsupervised learning | Bottleneck: compress and reconstruct | Anomaly detection, compression | Low-medium |
| Vision Transformer | Images | Classification, detection | Attention on image patches | ViT: competitive with CNNs | Very high |

Convolutional Neural Networks (CNN)

Core Idea: Convolution operator extracts local features. Sharing parameters reduces model size.

Building Blocks:

  1. Convolutional Layer: Sliding window applies learned filters
  2. Pooling: Downsampling (e.g., max pooling reduces H×W by 2x)
  3. Fully Connected: Final layers for classification

Advantage: Parameter sharing. A 3×3 filter has 9 weights regardless of image size (1000×1000).
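This parameter sharing can be checked directly in PyTorch. A minimal sketch (channel counts and image sizes are illustrative):

```python
import torch
import torch.nn as nn

# A 3x3 filter has the same 9 weights per input/output channel pair,
# no matter how large the input image is (parameter sharing).
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
print(sum(p.numel() for p in conv.parameters()))  # 9

# The same layer processes a 28x28 or a 1000x1000 image without growing.
small = conv(torch.randn(1, 1, 28, 28))    # -> (1, 1, 26, 26)
large = conv(torch.randn(1, 1, 1000, 1000))  # -> (1, 1, 998, 998)
print(small.shape, large.shape)
```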

Production Example:

ResNet-50: 50 layers, 25.5M parameters, 76% ImageNet accuracy
Training: 90 epochs, ~1 GPU-week
Inference: ~100ms on CPU, ~5ms on GPU
Used for: Product image search, medical image analysis, autonomous driving

Recurrent Neural Networks (RNN) & LSTM

RNN: Memory via Recurrence

RNN update: the hidden state carries memory forward: h_t = tanh(Wx_t + Uh_{t-1} + b)

Problem: Vanishing gradient—can’t learn long-range dependencies.
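The vanishing gradient can be sketched numerically: backpropagating through T steps multiplies T factors, each a tanh derivative (below 1) times a recurrent weight. A toy NumPy illustration, with an assumed weight magnitude of 0.9:

```python
import numpy as np

# Backprop through 50 time steps multiplies 50 Jacobian factors.
# tanh'(h) <= 1 and the weight factor is 0.9, so the product
# shrinks roughly exponentially in the number of steps.
rng = np.random.default_rng(0)
grad = 1.0
for _ in range(50):
    h = rng.normal()
    grad *= (1 - np.tanh(h) ** 2) * 0.9  # tanh'(h) times the recurrent weight
print(grad)  # a vanishingly small number
```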

LSTM (Long Short-Term Memory):

Introduces cell state (memory) + gates to control information flow.

Forget gate: f_t = sigmoid(Wf[h_{t-1}, x_t] + bf)
Input gate: i_t = sigmoid(Wi[h_{t-1}, x_t] + bi)
Output gate: o_t = sigmoid(Wo[h_{t-1}, x_t] + bo)
Candidate state: C̃_t = tanh(Wc[h_{t-1}, x_t] + bc)
New cell state: C_t = f_t * C_{t-1} + i_t * C̃_t
Output: h_t = o_t * tanh(C_t)
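The gate equations above translate directly into code. A minimal NumPy sketch of one LSTM step, using a single stacked weight matrix for all four gates (a compactness assumption; sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step. W maps [h_{t-1}; x_t] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f_t = sigmoid(f)        # forget gate: what to drop from C_{t-1}
    i_t = sigmoid(i)        # input gate: how much candidate to admit
    o_t = sigmoid(o)        # output gate: how much cell state to expose
    C_tilde = np.tanh(g)    # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

rng = np.random.default_rng(0)
H, X = 4, 3                                  # hidden and input sizes
W = rng.normal(size=(4 * H, H + X))
b = np.zeros(4 * H)
h, C = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, X)):            # run 5 time steps
    h, C = lstm_step(x, h, C, W, b)
print(h.shape)  # (4,)
```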

Production Example:

Google Translate: LSTM encoder-decoder for 100+ language pairs. Improved translation quality by 5–10 BLEU points.


Transformer Architecture

Key Innovation: Self-Attention replaces recurrence. “Attend to any position directly.”

Components:

  1. Self-Attention: Each position attends to all positions
    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
    
  2. Multi-Head: Multiple attention heads in parallel (8–16 heads)

  3. Feed-Forward: Non-linear transformation per position

  4. Positional Encoding: Encodes position (since no recurrence)

  5. Layer Normalization & Residual Connections: Enables deep stacking
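The attention formula in component 1 is only a few lines of NumPy. A self-attention sketch (sequence length and dimension are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                        # weighted mix of the values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # 5 positions, 8-dim; self-attention: Q = K = V
out = attention(x, x, x)
print(out.shape)  # (5, 8) — every position attends to all 5 positions
```

In a real Transformer, Q, K, and V are separate learned projections of the input, and multiple heads run this computation in parallel.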

Advantages:

  • ✅ Parallelizable (no sequential dependency like RNN)
  • ✅ Handles long contexts (2048–4096 tokens in GPT-3-era models), though attention cost grows quadratically with length
  • ✅ Captures long-range dependencies better than LSTM

Production Example:

BERT (Bidirectional Encoder Representations from Transformers):

  • 12–24 layers, 110M–340M parameters
  • Pre-trained on 3.3B words
  • Fine-tuning achieves 90%+ on 10+ NLP benchmarks
  • Inference: ~100ms per query on GPU

Generative Adversarial Networks (GAN)

Two-Network Adversarial Setup:

  1. Generator: Learns to create fake data from random noise
  2. Discriminator: Learns to distinguish real vs fake

Training Loop:

  1. Generator creates fake images
  2. Discriminator evaluates real + fake
  3. Backprop: Generator minimizes discriminator accuracy, discriminator maximizes accuracy
  4. Equilibrium: Generator produces indistinguishable fakes
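The loop above can be sketched in PyTorch. The network sizes, learning rates, and the random stand-in for real data are illustrative assumptions, not a production recipe:

```python
import torch
import torch.nn as nn

# Toy setup: generator maps 100-d noise to 784-d "images";
# discriminator outputs a real/fake logit.
torch.manual_seed(0)
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(64, 784)  # stand-in for a batch of real images

# Steps 1-2: discriminator learns real -> 1, fake -> 0
fake = G(torch.randn(64, 100)).detach()   # detach: don't update G here
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Step 3: generator tries to get its fakes labeled 1 (fool D)
fake = G(torch.randn(64, 100))
g_loss = bce(D(fake), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```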

Production Example:

StyleGAN: Generates photorealistic human faces; generated samples are frequently mistaken for real photographs by human raters.

Challenge: Training instability—generator/discriminator can diverge.


Autoencoders

Structure: Encoder (compress) → Bottleneck → Decoder (reconstruct)

Objective: Minimize reconstruction loss: ||x − decode(encode(x))||²

Applications:

  • Anomaly Detection: High reconstruction error = anomaly
  • Dimensionality Reduction: Bottleneck layer is compressed representation
  • Data Augmentation: Decoder generates variations
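The anomaly-detection use can be sketched end to end: train a toy autoencoder on "normal" data, then score inputs by reconstruction error. All sizes and the synthetic data below are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Toy autoencoder: "normal" data clusters near 1.0; anomalies far from
# the cluster reconstruct poorly, so their error is high.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 2), nn.ReLU(), nn.Linear(2, 8))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
normal = torch.randn(256, 8) * 0.1 + 1.0

for _ in range(200):  # brief training on normal data only
    loss = ((model(normal) - normal) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def score(x):
    """Per-sample reconstruction error; high error suggests an anomaly."""
    return ((model(x) - x) ** 2).mean(dim=1)

print(score(normal[:1]))                 # low error on in-distribution data
print(score(torch.full((1, 8), -5.0)))   # far higher error: flag as anomaly
```

In practice the threshold separating normal from anomalous error is tuned on a validation set.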

Production Example:

Netflix: Autoencoders reduce high-dimensional user embeddings for recommendation. Faster similarity search.


Implementation Comparison

# CNN (Image Classification)
from torchvision import models
model = models.resnet50(weights="IMAGENET1K_V1")  # pretrained=True is deprecated
model.eval()
output = model(image_tensor)  # image_tensor: (batch, 3, 224, 224); ~100ms on CPU

# LSTM (Sequence to Sequence)
import torch.nn as nn
lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2, batch_first=True)
output, (h, c) = lstm(sequence)  # sequence: (batch, seq_len, 100); h, c carry state

# Transformer
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
output = model(input_ids, attention_mask=attention)  # attention over all positions

# GAN (generator half of the adversarial pair)
generator = nn.Sequential(
    nn.Linear(100, 256),
    nn.ReLU(),
    nn.Linear(256, 784),
)
fake_images = generator(random_noise)  # random_noise: (batch, 100)

# Autoencoder
class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(784, 64)  # compress to a 64-d bottleneck
        self.decoder = nn.Linear(64, 784)  # reconstruct the input
    def forward(self, x):
        return self.decoder(self.encoder(x))

Choosing an Architecture

| Input Type | Task | Best Architecture |
|---|---|---|
| Images | Classification | CNN or Vision Transformer |
| Images | Generation | GAN or Diffusion |
| Text | Classification | Transformer (BERT-style) |
| Text | Generation | Transformer (GPT-style) |
| Sequences | Forecasting | LSTM or Transformer |
| Audio | Recognition | CNN or Transformer |
| Unsupervised | Compression | Autoencoder or VAE |

References

📄 LeNet: Gradient-Based Learning Applied to Document Recognition (LeCun et al., 1998)
📄 ResNet: Deep Residual Learning for Image Recognition (He et al., 2016)
📄 Attention Is All You Need (Vaswani et al., 2017)
📄 Generative Adversarial Networks (Goodfellow et al., 2014)
📖 Deep Learning (Goodfellow, Bengio, Courville)
🎥 Stanford CS231n: Computer Vision

This post is licensed under CC BY 4.0 by the author.