= Transformer Architecture =
== Remembering ==
* '''Self-attention''' – A mechanism where each position in a sequence attends to all other positions to compute a weighted representation, capturing dependencies regardless of distance (see the attention sketch after this list).
* '''Multi-head attention''' – Running multiple self-attention operations in parallel (each called a "head"), allowing the model to attend to different aspects of the input simultaneously.
* '''Query, Key, Value (Q, K, V)''' – Three linear projections of the input vectors used to compute attention. Q and K determine the attention weights; V determines what information is aggregated.
* '''Attention score''' – The dot product of a query with a key, divided by √(d_k), indicating how much attention one position pays to another.
* '''Softmax''' – Applied to the attention scores to produce a probability distribution summing to 1, used as the weighting for value aggregation.
* '''Positional encoding''' – A signal added to the input embeddings to give the model information about token positions, since self-attention is permutation-invariant (see the positional-encoding sketch below).
* '''Feed-forward network (FFN)''' – A two-layer MLP applied to each position independently after attention, providing additional representational capacity.
* '''Layer normalization''' – A normalization technique applied within each transformer block to stabilize training.
* '''Residual connection''' – Adding a sublayer's input directly to its output (x + sublayer(x)), improving gradient flow and enabling very deep networks (the block sketch below combines the FFN, layer norm, and residuals).
* '''Encoder''' – The part of a transformer that processes input sequences into contextual representations.
* '''Decoder''' – The part that generates output sequences autoregressively, attending to both its own previous outputs and the encoder's output.
* '''Causal masking''' – Masking future tokens in the decoder so each position can only attend to earlier positions, enforcing the autoregressive generation property.
* '''Context window''' – The maximum number of tokens a transformer can process at once.
* '''Token embedding''' – A learned dense vector representation for each token in the vocabulary.
* '''Temperature''' – A parameter controlling the randomness of token sampling at inference time (see the sampling sketch at the end of this section).
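To ground the first few definitions, here is a minimal NumPy sketch of scaled dot-product attention with an optional causal mask, for a single unbatched sequence. The function names, shapes, and the NumPy formulation are illustrative assumptions, not a reference implementation.

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq, seq) attention scores
    if causal:
        # Causal masking: position i may only attend to positions j <= i,
        # so scores for future positions are set to -inf before softmax.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted aggregation of values
</syntaxhighlight>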
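Multi-head attention can then be sketched as running this single-head attention once per head on a split of the model dimension and concatenating the results. The weight matrices and shapes below are assumptions; this reuses the <code>attention</code> helper from the previous sketch.

<syntaxhighlight lang="python">
def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads, causal=False):
    # Project x to Q, K, V, split the model dimension into n_heads
    # independent heads, attend within each head, then concatenate
    # and apply the output projection Wo.
    seq, d_model = x.shape
    d_head = d_model // n_heads
    Q = (x @ Wq).reshape(seq, n_heads, d_head)
    K = (x @ Wk).reshape(seq, n_heads, d_head)
    V = (x @ Wv).reshape(seq, n_heads, d_head)
    heads = [attention(Q[:, h], K[:, h], V[:, h], causal)
             for h in range(n_heads)]
    return np.concatenate(heads, axis=-1) @ Wo
</syntaxhighlight>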
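For positional encoding, one well-known scheme is the fixed sinusoidal encoding from the original transformer paper; learned position embeddings are an equally common alternative. A sketch of the sinusoidal variant:

<syntaxhighlight lang="python">
def sinusoidal_positions(seq_len, d_model):
    # Fixed sinusoidal encodings: even dimensions use sin, odd use cos,
    # with wavelengths forming a geometric progression governed by 10000.
    pos = np.arange(seq_len)[:, None]   # (seq, 1) position indices
    i = np.arange(d_model)[None, :]     # (1, d_model) dimension indices
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
</syntaxhighlight>

These encodings are added to the token embeddings before the first block, which is what breaks the permutation invariance of self-attention.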
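The FFN, layer normalization, and residual connections combine into a single block as sketched below. This sketch assumes the pre-norm arrangement (normalize, then sublayer, then residual) with single-head attention for brevity, and omits layer norm's learnable gain and bias; the original paper used post-norm, so treat the layout as one choice among several rather than the canonical one.

<syntaxhighlight lang="python">
def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean and unit variance
    # (learnable scale and shift omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2, causal=True):
    # Attention sublayer with a residual connection: x + sublayer(x).
    h = layer_norm(x)
    x = x + attention(h @ Wq, h @ Wk, h @ Wv, causal=causal) @ Wo
    # Position-wise feed-forward network (two-layer MLP with ReLU),
    # again wrapped in a residual connection.
    h = layer_norm(x)
    return x + (np.maximum(0.0, h @ W1 + b1) @ W2 + b2)
</syntaxhighlight>

A tiny smoke test with random weights, where only the shapes matter:

<syntaxhighlight lang="python">
rng = np.random.default_rng(0)
seq, d_model, d_ff = 8, 16, 32
x = rng.normal(size=(seq, d_model)) + sinusoidal_positions(seq, d_model)
shapes = [(d_model, d_model)] * 4 + \
         [(d_model, d_ff), (d_ff,), (d_ff, d_model), (d_model,)]
params = [rng.normal(scale=0.1, size=s) for s in shapes]
print(transformer_block(x, *params).shape)  # (8, 16)
</syntaxhighlight>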
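Finally, temperature at inference time simply divides the logits before the softmax: temperatures below 1 sharpen the distribution toward the most likely token, and temperatures above 1 flatten it. A sketch using a hypothetical helper (reusing <code>softmax</code> from the first sketch):

<syntaxhighlight lang="python">
def sample_with_temperature(logits, temperature=1.0, rng=None):
    # T < 1 sharpens the distribution (more deterministic);
    # T > 1 flattens it (more random); T = 1 samples the raw softmax.
    rng = rng or np.random.default_rng()
    probs = softmax(logits / temperature)
    return rng.choice(len(probs), p=probs)  # sampled token id
</syntaxhighlight>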