
Deep Learning Architectures

The neural network zoo: Different architectures for different data modalities—images, sequences, and everything in between.


Architecture Landscape

| Architecture | Input | Task | Key Innovation | Example | Complexity |
|---|---|---|---|---|---|
| CNN | Images, grids | Classification, detection | Convolution: local feature extraction | ResNet-50: ImageNet 76% accuracy | Medium |
| RNN | Sequences | Language, time-series | Recurrence: memory of previous states | LSTM: language modeling, translation | High |
| Transformer | Sequences | NLP, vision | Self-attention: direct dependencies | BERT, GPT-4: state-of-the-art NLP | Very high |
| GAN | Noise vector | Image generation | Adversarial training | StyleGAN: photorealistic faces | Very high |
| Autoencoder | Data | Unsupervised learning | Bottleneck: compress and reconstruct | Anomaly detection, compression | Low-medium |
| Vision Transformer | Images | Classification, detection | Attention on image patches | ViT: competitive with CNNs | Very high |

Convolutional Neural Networks (CNN)

Core Idea: Convolution operator extracts local features. Sharing parameters reduces model size.

Building Blocks:

  1. Convolutional Layer: Sliding window applies learned filters
  2. Pooling: Downsampling (e.g., max pooling reduces H×W by 2x)
  3. Fully Connected: Final layers for classification

Advantage: Parameter sharing. A 3×3 filter has 9 weights regardless of image size (1000×1000).
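This parameter sharing can be checked directly in PyTorch. A minimal sketch (channel counts and image sizes are illustrative):

```python
import torch
import torch.nn as nn

# A 3x3 filter has the same 9 weights per input/output channel pair,
# no matter how large the input image is (parameter sharing).
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
print(sum(p.numel() for p in conv.parameters()))  # 9

# The same layer processes a 28x28 or a 1000x1000 image without growing.
small = conv(torch.randn(1, 1, 28, 28))    # -> (1, 1, 26, 26)
large = conv(torch.randn(1, 1, 1000, 1000))  # -> (1, 1, 998, 998)
print(small.shape, large.shape)
```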

Production Example:

ResNet-50: 50 layers, 25.5M parameters, 76% ImageNet accuracy
Training: 90 epochs, ~1 GPU-week
Inference: ~100ms on CPU, ~5ms on GPU
Used for: Product image search, medical image analysis, autonomous driving

Recurrent Neural Networks (RNN) & LSTM

RNN: Memory via Recurrence

RNN update: the hidden state carries memory forward: h_t = tanh(Wx_t + Uh_{t-1} + b)

Problem: Vanishing gradient—can’t learn long-range dependencies.
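The vanishing gradient can be sketched numerically: backpropagating through T steps multiplies T factors, each a tanh derivative (below 1) times a recurrent weight. A toy NumPy illustration, with an assumed weight magnitude of 0.9:

```python
import numpy as np

# Backprop through 50 time steps multiplies 50 Jacobian factors.
# tanh'(h) <= 1 and the weight factor is 0.9, so the product
# shrinks roughly exponentially in the number of steps.
rng = np.random.default_rng(0)
grad = 1.0
for _ in range(50):
    h = rng.normal()
    grad *= (1 - np.tanh(h) ** 2) * 0.9  # tanh'(h) times the recurrent weight
print(grad)  # a vanishingly small number
```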

LSTM (Long Short-Term Memory):

Introduces cell state (memory) + gates to control information flow.

Forget gate: f_t = sigmoid(Wf[h_{t-1}, x_t] + bf)
Input gate: i_t = sigmoid(Wi[h_{t-1}, x_t] + bi)
Output gate: o_t = sigmoid(Wo[h_{t-1}, x_t] + bo)
Candidate state: C̃_t = tanh(Wc[h_{t-1}, x_t] + bc)
New cell state: C_t = f_t * C_{t-1} + i_t * C̃_t
Output: h_t = o_t * tanh(C_t)
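The gate equations above translate directly into code. A minimal NumPy sketch of one LSTM step, using a single stacked weight matrix for all four gates (a compactness assumption; sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step. W maps [h_{t-1}; x_t] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f_t = sigmoid(f)        # forget gate: what to drop from C_{t-1}
    i_t = sigmoid(i)        # input gate: how much candidate to admit
    o_t = sigmoid(o)        # output gate: how much cell state to expose
    C_tilde = np.tanh(g)    # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

rng = np.random.default_rng(0)
H, X = 4, 3                                  # hidden and input sizes
W = rng.normal(size=(4 * H, H + X))
b = np.zeros(4 * H)
h, C = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, X)):            # run 5 time steps
    h, C = lstm_step(x, h, C, W, b)
print(h.shape)  # (4,)
```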

Production Example:

Google Translate: LSTM encoder-decoder for 100+ language pairs. Improved translation quality by 5–10 BLEU points.


Transformer Architecture

Key Innovation: Self-Attention replaces recurrence. “Attend to any position directly.”

Components:

  1. Self-Attention: Each position attends to all positions
    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
    
  2. Multi-Head: Multiple attention heads in parallel (8–16 heads)

  3. Feed-Forward: Non-linear transformation per position

  4. Positional Encoding: Encodes position (since no recurrence)

  5. Layer Normalization & Residual Connections: Enables deep stacking
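The attention formula in component 1 is only a few lines of NumPy. A self-attention sketch (sequence length and dimension are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                        # weighted mix of the values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # 5 positions, 8-dim; self-attention: Q = K = V
out = attention(x, x, x)
print(out.shape)  # (5, 8) — every position attends to all 5 positions
```

In a real Transformer, Q, K, and V are separate learned projections of the input, and multiple heads run this computation in parallel.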

Advantages:

  • ✅ Parallelizable (no sequential dependency like RNN)
  • ✅ Handles long contexts (2048–4096 tokens in GPT-3-era models), though attention cost grows quadratically with length
  • ✅ Captures long-range dependencies better than LSTM

Production Example:

BERT (Bidirectional Encoder Representations from Transformers):

  • 12–24 layers, 110M–340M parameters
  • Pre-trained on 3.3B words
  • Fine-tuning achieves 90%+ on 10+ NLP benchmarks
  • Inference: ~100ms per query on GPU

Generative Adversarial Networks (GAN)

Two-Network Adversarial Setup:

  1. Generator: Learns to create fake data from random noise
  2. Discriminator: Learns to distinguish real vs fake

Training Loop:

  1. Generator creates fake images
  2. Discriminator evaluates real + fake
  3. Backprop: Generator minimizes discriminator accuracy, discriminator maximizes accuracy
  4. Equilibrium: Generator produces indistinguishable fakes
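The loop above can be sketched in PyTorch. The network sizes, learning rates, and the random stand-in for real data are illustrative assumptions, not a production recipe:

```python
import torch
import torch.nn as nn

# Toy setup: generator maps 100-d noise to 784-d "images";
# discriminator outputs a real/fake logit.
torch.manual_seed(0)
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(64, 784)  # stand-in for a batch of real images

# Steps 1-2: discriminator learns real -> 1, fake -> 0
fake = G(torch.randn(64, 100)).detach()   # detach: don't update G here
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Step 3: generator tries to get its fakes labeled 1 (fool D)
fake = G(torch.randn(64, 100))
g_loss = bce(D(fake), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```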

Production Example:

StyleGAN: Generates photorealistic human faces; generated samples are frequently mistaken for real photographs by human raters.

Challenge: Training instability—generator/discriminator can diverge.


Autoencoders

Structure: Encoder (compress) → Bottleneck → Decoder (reconstruct)

Objective: Minimize reconstruction loss: ||x − decode(encode(x))||²

Applications:

  • Anomaly Detection: High reconstruction error = anomaly
  • Dimensionality Reduction: Bottleneck layer is compressed representation
  • Data Augmentation: Decoder generates variations
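The anomaly-detection use can be sketched end to end: train a toy autoencoder on "normal" data, then score inputs by reconstruction error. All sizes and the synthetic data below are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Toy autoencoder: "normal" data clusters near 1.0; anomalies far from
# the cluster reconstruct poorly, so their error is high.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 2), nn.ReLU(), nn.Linear(2, 8))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
normal = torch.randn(256, 8) * 0.1 + 1.0

for _ in range(200):  # brief training on normal data only
    loss = ((model(normal) - normal) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def score(x):
    """Per-sample reconstruction error; high error suggests an anomaly."""
    return ((model(x) - x) ** 2).mean(dim=1)

print(score(normal[:1]))                 # low error on in-distribution data
print(score(torch.full((1, 8), -5.0)))   # far higher error: flag as anomaly
```

In practice the threshold separating normal from anomalous error is tuned on a validation set.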

Production Example:

Netflix: Autoencoders reduce high-dimensional user embeddings for recommendation. Faster similarity search.


Implementation Comparison

# CNN (Image Classification)
from torchvision import models
model = models.resnet50(weights="IMAGENET1K_V1")  # pretrained=True is deprecated
model.eval()
output = model(image_tensor)  # image_tensor: (batch, 3, 224, 224); ~100ms on CPU

# LSTM (Sequence to Sequence)
import torch.nn as nn
lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2, batch_first=True)
output, (h, c) = lstm(sequence)  # sequence: (batch, seq_len, 100); h, c carry state

# Transformer
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
output = model(input_ids, attention_mask=attention)  # attention over all positions

# GAN (generator half of the adversarial pair)
generator = nn.Sequential(
    nn.Linear(100, 256),
    nn.ReLU(),
    nn.Linear(256, 784),
)
fake_images = generator(random_noise)  # random_noise: (batch, 100)

# Autoencoder
class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(784, 64)  # compress to a 64-d bottleneck
        self.decoder = nn.Linear(64, 784)  # reconstruct the input
    def forward(self, x):
        return self.decoder(self.encoder(x))

Choosing an Architecture

| Input Type | Task | Best Architecture |
|---|---|---|
| Images | Classification | CNN or Vision Transformer |
| Images | Generation | GAN or Diffusion |
| Text | Classification | Transformer (BERT-style) |
| Text | Generation | Transformer (GPT-style) |
| Sequences | Forecasting | LSTM or Transformer |
| Audio | Recognition | CNN or Transformer |
| Unsupervised | Compression | Autoencoder or VAE |

References

📄 LeNet: Gradient-Based Learning Applied to Document Recognition (LeCun et al., 1998)
📄 ResNet: Deep Residual Learning for Image Recognition (He et al., 2016)
📄 Attention Is All You Need (Vaswani et al., 2017)
📄 Generative Adversarial Networks (Goodfellow et al., 2014)
📖 Deep Learning (Goodfellow, Bengio, Courville)
🎥 Stanford CS231n: Computer Vision

This post is licensed under CC BY 4.0 by the author.