Play with the components and see how transformers process language in real-time!
Interactive version of the transformer architecture
Before any processing can begin, the input text must be broken down into individual tokens. These may be words, subwords, or even characters, depending on the tokenization strategy. Each token becomes a discrete unit that the transformer can process.
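As a rough sketch, here is a toy word-level tokenizer that splits on whitespace and assigns each unique token an integer ID. Real systems typically use subword schemes like BPE or WordPiece; the function names here are illustrative, not from any particular library.

```python
def tokenize(text):
    # Toy word-level tokenization: lowercase and split on whitespace.
    # Production tokenizers (BPE, WordPiece) split into subwords instead.
    return text.lower().split()

def build_vocab(tokens):
    # Assign each unique token the next available integer ID.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

tokens = tokenize("The cat sat on the mat")
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ids)     # [0, 1, 2, 3, 0, 4]
```

Note that "the" appears twice and maps to the same ID both times: tokenization is a pure lookup, with no context yet.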
Each word token is converted into a dense numerical vector that captures its semantic meaning. Words with similar meanings will have similar vector representations, allowing the model to understand relationships and context between words mathematically.
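A minimal sketch of an embedding lookup, assuming a tiny hand-built vocabulary: each token ID indexes a row of a table of dense vectors. The vectors here are random for illustration; in a trained model they are learned, and cosine similarity between them reflects semantic relatedness.

```python
import math
import random

random.seed(0)

vocab = {"the": 0, "cat": 1, "dog": 2}  # toy vocabulary
d_model = 8                              # embedding dimension

# Embedding table: one d_model-dimensional vector per token ID.
# Random here; learned end-to-end in a real transformer.
embeddings = [[random.gauss(0, 1) for _ in range(d_model)] for _ in vocab]

def embed(token):
    # Embedding is just a table lookup by token ID.
    return embeddings[vocab[token]]

def cosine(u, v):
    # Cosine similarity: 1.0 for identical directions, ~0 for unrelated.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In a trained model, `cosine(embed("cat"), embed("dog"))` would be noticeably higher than `cosine(embed("cat"), embed("the"))`, which is exactly the "similar meanings, similar vectors" property described above.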
The self-attention mechanism allows each word to "look at" every other word in the sequence and determine how much attention to pay to each one. The attention weights in the matrix below show these relationships - darker colors indicate stronger connections between words.
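The weights in that matrix come from scaled dot-product attention: each position's query vector is compared against every key vector, the scores are softmax-normalized into weights, and those weights mix the value vectors. A minimal sketch in plain Python (toy 2-dimensional vectors, no batching):

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V,
    # computed row by row over lists of vectors.
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # one attention weight per position
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two positions with one-hot vectors: each attends most to itself.
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
```

Because `V` is one-hot here, each output row is exactly the attention weight vector, so you can read the "darker means stronger" relationship directly off `out`.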
Instead of using just one attention mechanism, transformers run multiple attention heads in parallel. Each head learns to focus on different aspects of the relationships between words - like having multiple experts each specializing in different types of linguistic patterns.
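A sketch of that parallelism, under the usual convention that the model dimension is split evenly across heads: each head gets its own query/key/value projections (random here, learned in practice), runs attention independently, and the head outputs are concatenated back to the full width.

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention over lists of vectors.
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

def matmul(X, W):
    return [[sum(x[k] * W[k][j] for k in range(len(W)))
             for j in range(len(W[0]))] for x in X]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

d_model, n_heads = 8, 2
d_head = d_model // n_heads          # each head works in a smaller subspace
X = rand_matrix(3, d_model)          # 3 toy token representations

heads_out = []
for _ in range(n_heads):
    # Per-head Q/K/V projections; random here, learned in a real model.
    Wq, Wk, Wv = (rand_matrix(d_model, d_head) for _ in range(3))
    heads_out.append(attention(matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)))

# Concatenate the head outputs back to d_model per token.
out = [sum((h[i] for h in heads_out), []) for i in range(len(X))]
```

The key point is that each head sees its own learned projection of the input, so different heads are free to specialize, e.g. one tracking syntactic dependencies while another tracks coreference. (Real implementations also apply a final output projection after concatenation, omitted here for brevity.)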
The original transformer architecture pairs an encoder with a decoder. The encoder processes the input sequence into rich representations, while the decoder generates the output sequence one token at a time. Each sub-layer is wrapped in a residual connection followed by layer normalization, which keeps gradients well-behaved and makes deep stacks trainable.
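The residual-plus-normalization wrapper can be sketched in a few lines. This follows the post-norm arrangement of the original paper (normalize after adding the residual); the `sublayer` helper name is illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance.
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def sublayer(x, fn):
    # Residual connection, then layer norm: LayerNorm(x + fn(x)).
    # fn stands in for an attention or feed-forward sub-layer.
    return layer_norm([xi + yi for xi, yi in zip(x, fn(x))])

x = [1.0, 2.0, 3.0, 4.0]
y = sublayer(x, lambda v: [0.5 * vi for vi in v])  # toy sub-layer
```

Because the sub-layer's output is added to its input rather than replacing it, each layer only has to learn a refinement, and gradients can flow straight through the addition, which is why this pattern appears around every attention and feed-forward block in both the encoder and the decoder.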