== Understanding ==

The core innovation of the transformer is replacing sequential processing (as in RNNs) with '''parallel self-attention'''. Instead of processing tokens one at a time, all tokens attend to each other simultaneously, which makes transformers highly parallelizable and GPU-friendly.

'''How self-attention works''': For each token, three vectors are computed: a Query (what I'm looking for), a Key (what I offer), and a Value (what I give if selected). Attention scores are computed as dot products between the query of one token and the keys of all others, then scaled and softmaxed. These scores weight the values to produce a new representation for each position: a context-aware blend of the entire sequence. Think of it like a search engine within the sequence: each word "queries" for related words, "keys" advertise their content, and "values" are the actual information retrieved.

'''Multi-head attention''' runs ''h'' parallel attention operations with different learned projections, then concatenates the results. Different heads can specialize: one might track syntactic dependencies, another semantic relationships, another coreference. (Code sketches of both single- and multi-head attention appear at the end of this section.)

'''The transformer block''' is a repeating unit:

<syntaxhighlight lang="text">
Input → LayerNorm → Multi-Head Attention → Residual → LayerNorm → FFN → Residual → Output
</syntaxhighlight>

Stacking N of these blocks (N=12 for BERT-base, N=96 for GPT-3) gives the model increasing ability to compose and abstract information.

'''Encoder-only vs Decoder-only vs Encoder-Decoder''':
* Encoder-only (BERT, RoBERTa): Bidirectional attention, good for classification and embedding tasks
* Decoder-only (GPT series, LLaMA): Causal attention, good for generation
* Encoder-Decoder (T5, BART): Encoder processes input, decoder generates output; good for translation and summarization
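Below is a minimal NumPy sketch of the attention computation described above. The dimensions, weight matrices, and function names are illustrative assumptions for demonstration, not the parameters or API of any particular model.

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (..., seq, seq): query-key similarities
    weights = softmax(scores, axis=-1)              # each query's scores sum to 1
    return weights @ V                              # context-aware blend of the values

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project X into per-head Q/K/V, attend in each head, concatenate, project back."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq, d_model) -> (num_heads, seq, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    heads = scaled_dot_product_attention(Q, K, V)   # (num_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                             # final output projection

# Illustrative usage: 10 tokens, d_model = 64, 8 heads (all sizes chosen arbitrarily).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))
W_q, W_k, W_v, W_o = (0.1 * rng.normal(size=(64, 64)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=8).shape)  # (10, 64)
</syntaxhighlight>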
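To make the block diagram above concrete, here is a rough sketch of one pre-norm transformer block. It continues from the previous sketch (reusing <code>np</code>, <code>rng</code>, <code>X</code>, and <code>multi_head_attention</code>), and it simplifies deliberately: the learned LayerNorm gain/bias are omitted, ReLU stands in for the FFN nonlinearity, and all weights are random placeholders.

<syntaxhighlight lang="python">
def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN: expand, apply a nonlinearity, project back down.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, attn_weights, ffn_weights, num_heads):
    """One pre-norm block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    x = x + multi_head_attention(layer_norm(x), *attn_weights, num_heads=num_heads)
    x = x + feed_forward(layer_norm(x), *ffn_weights)
    return x

# Illustrative usage: stack N blocks, as in the diagram above.
# (Real models use separate weights per block; one shared set is used here for brevity.)
d_model, d_ff, N = 64, 256, 4
attn_weights = [0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4)]
ffn_weights = [0.1 * rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
               0.1 * rng.normal(size=(d_ff, d_model)), np.zeros(d_model)]
h = X
for _ in range(N):
    h = transformer_block(h, attn_weights, ffn_weights, num_heads=8)
print(h.shape)  # (10, 64)
</syntaxhighlight>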