= Transformer Architecture =
== Remembering ==
* '''Self-attention''' – A mechanism where each position in a sequence attends to all other positions to compute a weighted representation, capturing dependencies regardless of distance (see the attention sketch after this list).
* '''Multi-head attention''' – Running multiple self-attention operations in parallel (each called a "head"), allowing the model to attend to different aspects of the input simultaneously.
* '''Query, Key, Value (Q, K, V)''' – Three linear projections of the input vectors used to compute attention. Q and K determine the attention weights; V determines what information is aggregated.
* '''Attention score''' – The dot product of a query with a key, divided by √(d_k), indicating how much attention one position pays to another.
* '''Softmax''' – Applied to the attention scores to produce a probability distribution summing to 1, used as the weighting for value aggregation.
* '''Positional encoding''' – A signal added to the input embeddings to give the model information about token positions, since self-attention is permutation-invariant (see the positional-encoding sketch below).
* '''Feed-forward network (FFN)''' – A two-layer MLP applied to each position independently after attention, providing additional representational capacity.
* '''Layer normalization''' – A normalization technique applied within each transformer block to stabilize training.
* '''Residual connection''' – Adding a sublayer's input directly to its output (x + sublayer(x)), improving gradient flow and enabling very deep networks (the block sketch below combines the FFN, layer norm, and residuals).
* '''Encoder''' – The part of a transformer that processes input sequences into contextual representations.
* '''Decoder''' – The part that generates output sequences autoregressively, attending to both its own previous outputs and the encoder's output.
* '''Causal masking''' – Masking future tokens in the decoder so each position can only attend to earlier positions, enforcing the autoregressive generation property.
* '''Context window''' – The maximum number of tokens a transformer can process at once.
* '''Token embedding''' – A learned dense vector representation for each token in the vocabulary.
* '''Temperature''' – A parameter controlling the randomness of token sampling at inference time (see the sampling sketch at the end of this section).
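To ground the first few definitions, here is a minimal NumPy sketch of scaled dot-product attention with an optional causal mask, for a single unbatched sequence. The function names, shapes, and the NumPy formulation are illustrative assumptions, not a reference implementation.

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq, seq) attention scores
    if causal:
        # Causal masking: position i may only attend to positions j <= i,
        # so scores for future positions are set to -inf before softmax.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted aggregation of values
</syntaxhighlight>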
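Multi-head attention can then be sketched as running this single-head attention once per head on a split of the model dimension and concatenating the results. The weight matrices and shapes below are assumptions; this reuses the <code>attention</code> helper from the previous sketch.

<syntaxhighlight lang="python">
def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads, causal=False):
    # Project x to Q, K, V, split the model dimension into n_heads
    # independent heads, attend within each head, then concatenate
    # and apply the output projection Wo.
    seq, d_model = x.shape
    d_head = d_model // n_heads
    Q = (x @ Wq).reshape(seq, n_heads, d_head)
    K = (x @ Wk).reshape(seq, n_heads, d_head)
    V = (x @ Wv).reshape(seq, n_heads, d_head)
    heads = [attention(Q[:, h], K[:, h], V[:, h], causal)
             for h in range(n_heads)]
    return np.concatenate(heads, axis=-1) @ Wo
</syntaxhighlight>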
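For positional encoding, one well-known scheme is the fixed sinusoidal encoding from the original transformer paper; learned position embeddings are an equally common alternative. A sketch of the sinusoidal variant:

<syntaxhighlight lang="python">
def sinusoidal_positions(seq_len, d_model):
    # Fixed sinusoidal encodings: even dimensions use sin, odd use cos,
    # with wavelengths forming a geometric progression governed by 10000.
    pos = np.arange(seq_len)[:, None]   # (seq, 1) position indices
    i = np.arange(d_model)[None, :]     # (1, d_model) dimension indices
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
</syntaxhighlight>

These encodings are added to the token embeddings before the first block, which is what breaks the permutation invariance of self-attention.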
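The FFN, layer normalization, and residual connections combine into a single block as sketched below. This sketch assumes the pre-norm arrangement (normalize, then sublayer, then residual) with single-head attention for brevity, and omits layer norm's learnable gain and bias; the original paper used post-norm, so treat the layout as one choice among several rather than the canonical one.

<syntaxhighlight lang="python">
def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean and unit variance
    # (learnable scale and shift omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2, causal=True):
    # Attention sublayer with a residual connection: x + sublayer(x).
    h = layer_norm(x)
    x = x + attention(h @ Wq, h @ Wk, h @ Wv, causal=causal) @ Wo
    # Position-wise feed-forward network (two-layer MLP with ReLU),
    # again wrapped in a residual connection.
    h = layer_norm(x)
    return x + (np.maximum(0.0, h @ W1 + b1) @ W2 + b2)
</syntaxhighlight>

A tiny smoke test with random weights, where only the shapes matter:

<syntaxhighlight lang="python">
rng = np.random.default_rng(0)
seq, d_model, d_ff = 8, 16, 32
x = rng.normal(size=(seq, d_model)) + sinusoidal_positions(seq, d_model)
shapes = [(d_model, d_model)] * 4 + \
         [(d_model, d_ff), (d_ff,), (d_ff, d_model), (d_model,)]
params = [rng.normal(scale=0.1, size=s) for s in shapes]
print(transformer_block(x, *params).shape)  # (8, 16)
</syntaxhighlight>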
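Finally, temperature at inference time simply divides the logits before the softmax: temperatures below 1 sharpen the distribution toward the most likely token, and temperatures above 1 flatten it. A sketch using a hypothetical helper (reusing <code>softmax</code> from the first sketch):

<syntaxhighlight lang="python">
def sample_with_temperature(logits, temperature=1.0, rng=None):
    # T < 1 sharpens the distribution (more deterministic);
    # T > 1 flattens it (more random); T = 1 samples the raw softmax.
    rng = rng or np.random.default_rng()
    probs = softmax(logits / temperature)
    return rng.choice(len(probs), p=probs)  # sampled token id
</syntaxhighlight>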