Attention Mechanisms & Transformers
The Transformer Revolution: Attention Is All You Need
The Transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need," has become the foundation of modern natural language processing. In this guide, we'll explore the self-attention mechanism, the full Transformer architecture, and how models like BERT and GPT build on it.
1. The Limitations of RNNs
While RNNs and LSTMs were state-of-the-art for sequence processing, they suffer from:
- Sequential computation: Can't parallelize training
- Memory bottlenecks: Struggle with long-range dependencies
- Implicit position handling: Order is encoded only through the recurrence, making it hard to relate distant positions directly
2. The Self-Attention Mechanism
Self-attention computes a weighted sum of all input elements, where weights are based on pairwise similarity:
Three learned matrices transform each word into components:
- Query (Q): What the word is looking for
- Key (K): What the word contains
- Value (V): The word's actual representation
Attention is calculated as:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ)V
Where dₖ is the dimension of the key vectors. Dividing by √dₖ keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with tiny gradients.
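As a concrete sketch, here is the formula above implemented directly in NumPy (shapes and values are illustrative toy data, not from a real model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √dₖ) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarity logits
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 3, 4))            # 3 "words", d_k = 4
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is a probability distribution over the input positions, and `out` is the corresponding weighted sum of the value vectors.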
3. Multi-Head Attention
Multi-head attention runs multiple attention mechanisms in parallel:
Benefits:
- Learns different types of relationships (syntax, semantics, etc.)
- More expressive than single attention
- Divides computation into parallelizable chunks
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.depth = d_model // num_heads
        # Linear projections for Q, K, V and the final output
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.dense = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = x.view(batch_size, -1, self.num_heads, self.depth)
        return x.permute(0, 2, 1, 3)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        # Project, then split into multiple heads
        q = self.split_heads(self.wq(q), batch_size)
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)
        # Scaled dot-product attention
        matmul_qk = torch.matmul(q, k.transpose(-2, -1))
        dk = torch.tensor(k.size(-1), dtype=torch.float32)
        scaled_attention_logits = matmul_qk / torch.sqrt(dk)
        if mask is not None:
            # mask holds 1 at positions that must not be attended to
            scaled_attention_logits += mask * -1e9
        attention_weights = F.softmax(scaled_attention_logits, dim=-1)
        output = torch.matmul(attention_weights, v)
        # Concatenate heads: (batch, num_heads, seq_len, depth) -> (batch, seq_len, d_model)
        output = output.permute(0, 2, 1, 3).contiguous()
        output = output.view(batch_size, -1, self.d_model)
        return self.dense(output)
```
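The reshaping done by `split_heads` (and its inverse when concatenating heads) is easy to verify on its own; this NumPy sketch uses small hypothetical sizes:

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 5, 8, 4
depth = d_model // num_heads

x = np.arange(batch_size * seq_len * d_model, dtype=float).reshape(batch_size, seq_len, d_model)
# Split: (batch, seq, d_model) -> (batch, heads, seq, depth)
heads = x.reshape(batch_size, seq_len, num_heads, depth).transpose(0, 2, 1, 3)
# Concatenate: invert the permutation and merge heads back into d_model
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)
```

The round trip is exact: no values are mixed, only regrouped so that each head sees its own `depth`-sized slice of every position.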
4. The Transformer Architecture
The complete Transformer consists of:
Encoder
- Stack of N identical layers
- Each layer has:
- Multi-head self-attention
- Position-wise feed-forward network
- Residual connections and layer normalization
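The residual-plus-normalization wiring above can be sketched in a few lines. Here `self_attention` and `feed_forward` are placeholder callables, and `layer_norm` is a minimal NumPy stand-in for the learned layer normalization:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward):
    # Post-norm residual wiring, as in the original paper
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + feed_forward(x))
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))                               # (seq_len, d_model)
out = encoder_layer(x, lambda t: 0.1 * t, lambda t: 0.1 * t)  # toy sublayers
```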
Decoder
- Also a stack of N identical layers
- Masked multi-head self-attention, so each position can attend only to earlier positions, plus cross-attention over the encoder output
- Final linear and softmax layer to produce output probabilities
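The decoder's masking can be sketched as an upper-triangular matrix: a 1 marks a future position that must be hidden, matching the `mask * -1e9` convention used in the attention code earlier:

```python
import numpy as np

seq_len = 4
# 1 above the diagonal: position i may not attend to positions j > i
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)

logits = np.zeros((seq_len, seq_len))
masked_logits = logits + causal_mask * -1e9  # future positions -> -1e9 -> ~0 after softmax
```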
5. Modern Transformer Models
BERT (Bidirectional Encoder Representations from Transformers)
- Encoder-only architecture
- Pre-trained with masked language modeling
- Contextual representations for downstream tasks
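Masked language modeling can be sketched with NumPy: hide a random fraction of (hypothetical) token ids behind a `[MASK]` id and train the model to recover them. The ids and the 15% rate here are illustrative; BERT's actual corruption scheme is slightly more involved:

```python
import numpy as np

rng = np.random.default_rng(42)
tokens = np.array([101, 2023, 2003, 1037, 7099, 6251, 102])  # hypothetical token ids
MASK_ID = 103

is_masked = rng.random(tokens.shape) < 0.15       # mask ~15% of positions
corrupted = np.where(is_masked, MASK_ID, tokens)
# Training objective: predict the original id at each masked position
```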
GPT (Generative Pre-trained Transformer)
- Decoder-only architecture
- Pre-trained with next-word prediction
- Strong generative capabilities
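Next-word prediction just pairs each prefix with the token that follows it; with a toy id sequence:

```python
import numpy as np

tokens = np.array([12, 7, 99, 4, 31])      # hypothetical token ids
inputs, targets = tokens[:-1], tokens[1:]  # model sees inputs, predicts targets
# At position i the model is trained to output tokens[i + 1]
```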
Vision Transformers (ViT)
- Apply Transformers to image data
- Split image into patches as "tokens"
- Competitive with CNNs on many tasks
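Splitting an image into patch "tokens" is a pure reshape. A NumPy sketch with an illustrative 32×32 RGB image and 8×8 patches:

```python
import numpy as np

H = W = 32; C = 3; P = 8                 # image size, channels, patch size
img = np.arange(H * W * C, dtype=float).reshape(H, W, C)

# (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (num_patches, P*P*C)
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(-1, P * P * C)  # each row is one flattened patch
```

Each of the 16 rows is then linearly projected to `d_model` and fed to a standard Transformer encoder.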
6. Implementing a Transformer in PyTorch
```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding from the original paper."""
    def __init__(self, d_model, max_seq_length):
        super().__init__()
        pos = torch.arange(max_seq_length).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_seq_length, d_model)
        pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))
    def forward(self, x):  # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, trg_vocab_size, d_model, num_heads,
                 num_layers, d_ff, max_seq_length, dropout):
        super().__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(trg_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)
        # batch_first=True so every tensor is (batch, seq_len, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, d_ff, dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        decoder_layer = nn.TransformerDecoderLayer(d_model, num_heads, d_ff, dropout, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.fc_out = nn.Linear(d_model, trg_vocab_size)

    def forward(self, src, trg, src_mask=None, trg_mask=None):
        src_embedded = self.positional_encoding(self.encoder_embedding(src))
        trg_embedded = self.positional_encoding(self.decoder_embedding(trg))
        encoder_output = self.encoder(src_embedded, mask=src_mask)
        output = self.decoder(trg_embedded, encoder_output, tgt_mask=trg_mask)
        return self.fc_out(output)
```
Conclusion
The Transformer architecture has fundamentally changed natural language processing and is making inroads into other domains. Its self-attention mechanism provides unparalleled ability to model long-range dependencies while remaining highly parallelizable. Models like BERT and GPT demonstrate the remarkable capabilities of this architecture when scaled to massive sizes and datasets.
In our next post, we'll explore generative models including Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).
If you found this helpful, leave a comment or share it with your peers. Let’s grow together in AI learning!