Transformer Architecture


How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain. Learn more about how BloomWiki works.

The Transformer architecture is the foundational neural network design that powers virtually all modern large-scale AI systems — from GPT-4 and Claude to BERT, DALL-E, and AlphaFold. Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., the Transformer replaced recurrence with a mechanism called self-attention, enabling parallel processing of sequences and unlocking unprecedented scale. Understanding transformers is essential for understanding modern AI.

Remembering[edit]

  • Self-attention — A mechanism where each position in a sequence attends to all other positions to compute a weighted representation, capturing dependencies regardless of distance.
  • Multi-head attention — Running multiple self-attention operations in parallel (each called a "head"), allowing the model to attend to different aspects of the input simultaneously.
  • Query, Key, Value (Q, K, V) — Three linear projections of input vectors used to compute attention scores. Q and K determine attention weights; V determines what information is aggregated.
  • Attention score — The dot product of a query with a key, divided by √(d_k), indicating how much attention one position pays to another.
  • Softmax — Applied to attention scores to produce a probability distribution summing to 1, used as the weighting for value aggregation.
  • Positional encoding — A signal added to input embeddings to give the model information about token positions, since self-attention is permutation-invariant.
  • Feed-forward network (FFN) — A two-layer MLP applied to each position independently after attention, providing additional representational capacity.
  • Layer normalization — A normalization technique applied within each transformer block to stabilize training.
  • Residual connection — Adding a layer's input directly to its output (x + sublayer(x)), preserving gradient flow and making very deep networks trainable.
  • Encoder — The part of a transformer that processes input sequences into contextual representations.
  • Decoder — The part that generates output sequences autoregressively, attending to both its own previous outputs and the encoder's output.
  • Causal masking — Masking future tokens in the decoder so each position can only attend to previous positions, enforcing the autoregressive generation property.
  • Context window — The maximum number of tokens a transformer can process at once.
  • Token embedding — A learned dense vector representation for each token in the vocabulary.
  • Temperature — A parameter controlling the randomness of token sampling at inference time.

Understanding[edit]

The core innovation of the transformer is replacing sequential processing (as in RNNs) with parallel self-attention. Instead of processing tokens one at a time, all tokens attend to each other simultaneously — making transformers highly parallelizable and GPU-friendly.

How self-attention works: For each token, three vectors are computed: a Query (what I'm looking for), a Key (what I offer), and a Value (what I give if selected). Attention scores are computed as dot products between the query of one token and the keys of all others, scaled and softmaxed. These scores weight the values to produce a new representation for each position — one that is a context-aware blend of the entire sequence.

Think of it like a search engine within the sequence: each word "queries" for related words, "keys" advertise their content, and "values" are the actual information retrieved.

Multi-head attention runs h parallel attention operations with different learned projections, then concatenates the results. Different heads can specialize: one might track syntactic dependencies, another semantic relationships, another coreference.
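
A minimal PyTorch sketch of the split-into-heads, attend, and concatenate pattern described above; the class and parameter names here are illustrative rather than taken from any particular library.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Illustrative multi-head self-attention: d_model is split across h heads."""
    def __init__(self, d_model: int, h: int):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # Learned projections for queries, keys, values, and the final output
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, _ = x.shape
        # Project, then reshape to (batch, heads, seq_len, d_k)
        q = self.w_q(x).view(B, T, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(B, T, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(B, T, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention within each head
        scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        out = F.softmax(scores, dim=-1) @ v
        # Concatenate heads and mix them with the output projection
        out = out.transpose(1, 2).contiguous().view(B, T, self.h * self.d_k)
        return self.w_o(out)

mha = MultiHeadAttention(d_model=512, h=8)
x = torch.randn(2, 10, 512)        # (batch, seq_len, d_model)
print(mha(x).shape)                # torch.Size([2, 10, 512])
</syntaxhighlight>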

The transformer block is a repeating unit:

<syntaxhighlight lang="text">
Input → LayerNorm → Multi-Head Attention → Residual → LayerNorm → FFN → Residual → Output
</syntaxhighlight>

Stacking N of these blocks (N=12 for BERT-base, N=96 for GPT-3) gives the model increasing ability to compose and abstract information.
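
The block diagrammed above is the pre-norm arrangement used by most modern models (the original 2017 paper instead applied LayerNorm after each residual). A minimal sketch using PyTorch's built-in nn.MultiheadAttention, with illustrative BERT-base-scale dimensions:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm block: LayerNorm -> self-attention -> residual, then LayerNorm -> FFN -> residual."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # self-attention: queries, keys, values all come from x
        x = x + attn_out                   # residual around the attention sublayer
        x = x + self.ffn(self.norm2(x))    # residual around the position-wise FFN
        return x

# Stacking N=12 blocks at roughly BERT-base scale (d_model=768, 12 heads, d_ff=3072)
blocks = nn.Sequential(*[TransformerBlock(768, 12, 3072) for _ in range(12)])
x = torch.randn(2, 16, 768)                # (batch, seq_len, d_model)
print(blocks(x).shape)                     # torch.Size([2, 16, 768])
</syntaxhighlight>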

Encoder-only vs Decoder-only vs Encoder-Decoder:

  • Encoder-only (BERT, RoBERTa): Bidirectional attention, good for classification and embedding tasks
  • Decoder-only (GPT series, LLaMA): Causal attention, good for generation
  • Encoder-Decoder (T5, BART): Encoder processes input, decoder generates output; good for translation and summarization

Applying[edit]

Computing self-attention from scratch in PyTorch:

<syntaxhighlight lang="python"> import torch import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):

   """
   Q, K, V: shape (batch, heads, seq_len, d_k)
   """
   d_k = Q.size(-1)
   # Compute attention scores
   scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
   # Apply causal mask (for decoder)
   if mask is not None:
       scores = scores.masked_fill(mask == 0, float('-inf'))
   # Softmax to get attention weights
   attn_weights = F.softmax(scores, dim=-1)
   # Weighted sum of values
   return torch.matmul(attn_weights, V), attn_weights
  1. Example dimensions

batch, heads, seq_len, d_k = 2, 8, 512, 64 Q = torch.randn(batch, heads, seq_len, d_k) K = torch.randn(batch, heads, seq_len, d_k) V = torch.randn(batch, heads, seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V) print(output.shape) # torch.Size([2, 8, 512, 64]) </syntaxhighlight>
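
Passing a lower-triangular mask to the same function gives the causal (decoder-style) attention described earlier; a small usage sketch continuing from the example above:

<syntaxhighlight lang="python">
# Causal mask: position i may attend only to positions j <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)
causal_out, causal_weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(causal_weights[0, 0, 0, 1:].sum())   # tensor(0.): token 0 attends only to itself
</syntaxhighlight>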

Key transformer variants and their use cases:

  • BERT → Sentence classification, NER, question answering (fine-tuning on labeled data)
  • GPT-2/3/4 → Text generation, few-shot learning, instruction following
  • T5 → Any-to-any text tasks framed as "text to text"
  • LLaMA / Mistral → Open-weight generation models for local deployment
  • ViT (Vision Transformer) → Image classification by treating patches as tokens
  • Whisper → Speech-to-text with an encoder-decoder transformer over mel spectrograms

Analyzing[edit]

Transformer Architecture Trade-offs:

  • Self-attention: captures long-range dependencies in O(1) steps, but requires O(n²) memory and compute in sequence length.
  • Parallelism: trains much faster than RNNs on modern hardware, but requires large GPU memory for long contexts.
  • Scale: performance consistently improves with more parameters and data, but training cost is enormous (millions of dollars for frontier models).
  • Context window: modern models handle 100k+ tokens, but KV-cache memory grows linearly with context and long-context retrieval degrades.
  • Positional encoding: sinusoidal or RoPE encodings inject position information, but generalization beyond the training context length is degraded.

Failure modes and nuances:

  • Attention sink — Research shows early tokens receive disproportionate attention regardless of relevance, a consequence of the softmax function needing to sum to 1.
  • Quadratic scaling — Self-attention complexity is O(n²) in sequence length, making very long documents expensive. Mitigations: Flash Attention, sliding window attention (Longformer), linear attention approximations.
  • Position generalization — Transformers trained with absolute positional encodings often fail on sequences longer than seen in training. RoPE (Rotary Position Embeddings) and ALiBi improve this.
  • Repetition — Decoder-only models can fall into repetitive loops; nucleus sampling (top-p) and repetition penalties help.
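
As an illustration of the last point, a minimal nucleus (top-p) sampling step might look like the sketch below; the threshold and toy logits are arbitrary, and real decoders combine this with temperature scaling and repetition penalties.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def sample_top_p(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token once the mass accumulated *before* it already exceeds p
    # (this always keeps at least the most likely token)
    sorted_probs[(cumulative - sorted_probs) > p] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_ids[choice].item())

# Toy example with a 5-token vocabulary
next_token = sample_top_p(torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0]), p=0.9)
</syntaxhighlight>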

Evaluating[edit]

Expert-level evaluation of transformer systems goes beyond benchmark scores:

Scaling law analysis: Chinchilla scaling laws show that model size and training data should scale roughly in proportion. Experts understand that a 70B model trained on too little data makes worse use of its compute budget than a smaller model trained on a compute-optimal amount of data.
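
A rough back-of-the-envelope check, assuming the commonly cited Chinchilla heuristic of roughly 20 training tokens per parameter (the figures below are illustrative, not from the original paper):

<syntaxhighlight lang="python">
# Rough compute-optimal data estimate under the ~20 tokens-per-parameter heuristic
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

for n_params in (7e9, 70e9):
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params / 1e9:.0f}B params -> ~{tokens / 1e12:.2f}T training tokens")
# 7B params -> ~0.14T training tokens
# 70B params -> ~1.40T training tokens
</syntaxhighlight>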

Attention pattern visualization: Tools like BertViz allow visualization of attention heads. Experts use this to verify that heads are learning meaningful patterns (e.g., syntactic dependencies, coreference) rather than degenerate uniform attention.

Emergent capability tracking: Some capabilities appear suddenly at scale thresholds (chain-of-thought reasoning, for example, has been reported to emerge at roughly the 100B-parameter scale). Experts track these phase transitions to understand capability vs. scale relationships.

KV-cache profiling: For production deployment, the key-value cache is the dominant memory bottleneck. Expert engineers profile cache size, hit rates, and eviction strategies in serving infrastructure.
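
A first-order way to profile the cache footprint is a simple size estimate; the sketch below assumes a LLaMA-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128, fp16) purely for illustration.

<syntaxhighlight lang="python">
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Memory held by cached keys and values across all layers (factor of 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Illustrative LLaMA-2-7B-like configuration in fp16
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=8)
print(f"{size / 2**30:.1f} GiB")   # 16.0 GiB for a batch of 8 at 4k context
</syntaxhighlight>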

Creating[edit]

Designing a transformer-based system architecture:

1. Choose the right variant for the task

<syntaxhighlight lang="text">
Task Classification:
├── Generating text/code → Decoder-only (GPT/LLaMA family)
├── Encoding for search/classification → Encoder-only (BERT/RoBERTa/E5)
├── Translation/Summarization → Encoder-Decoder (T5/BART)
└── Images → ViT or CLIP (vision-language)
</syntaxhighlight>

2. Efficient attention for long contexts

  • Flash Attention 2 for memory-efficient exact attention
  • Grouped Query Attention (GQA) — reduces KV-cache size by sharing keys/values across groups of query heads (see the sketch after this list)
  • Sliding window attention for documents
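
A minimal sketch of how GQA shares keys and values across query heads, assuming the number of query heads is a multiple of the number of KV heads (all dimensions below are illustrative):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """
    q: (batch, n_q_heads, seq_len, d_k)
    k, v: (batch, n_kv_heads, seq_len, d_k), with n_q_heads a multiple of n_kv_heads
    """
    group = q.size(1) // k.size(1)
    # Each KV head serves `group` query heads: repeat K and V along the head dimension
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# 32 query heads sharing 8 KV heads: the KV-cache stores 8 heads instead of 32
q = torch.randn(1, 32, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
print(grouped_query_attention(q, k, v).shape)   # torch.Size([1, 32, 128, 64])
</syntaxhighlight>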

3. Production serving architecture

<syntaxhighlight lang="text">
User Request
    ↓
Load Balancer
    ↓
[Tokenizer service]
    ↓
[Inference cluster: vLLM / TGI / Triton]
    ↓ (continuous batching, PagedAttention)
[GPU cluster with tensor parallelism]
    ↓
[Detokenizer + safety filter]
    ↓
Response
</syntaxhighlight>

4. Key efficiency techniques

  • Quantization: INT8/INT4 weights reduce memory 2–4× with minimal quality loss (a minimal example follows this list)
  • Speculative decoding: a small "draft" model proposes tokens, a large model verifies them in parallel — 2–3× throughput
  • Prefix caching: cache shared system prompt KV-states across requests
  • Batching: group requests sharing prefixes to maximize GPU utilization
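
To make the first technique concrete, here is a minimal sketch of absmax (symmetric) INT8 weight quantization and dequantization; production systems use per-channel or group-wise scales and calibrated schemes, so this is purely illustrative.

<syntaxhighlight lang="python">
import torch

def quantize_int8(weights: torch.Tensor):
    """Symmetric absmax quantization: map [-max|w|, +max|w|] onto the int8 range [-127, 127]."""
    scale = weights.abs().max() / 127.0
    q = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)                 # a stand-in fp32 weight matrix
q, scale = quantize_int8(w)                 # 4 bytes/value -> 1 byte/value
error = (dequantize_int8(q, scale) - w).abs().mean()
print(q.dtype, float(error))                # torch.int8 and a small mean absolute error
</syntaxhighlight>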