Attention Mechanisms & Transformers

Transformers Revolution: Self-Attention, BERT, GPT and the Future of NLP

The Transformer Revolution: Attention Is All You Need

Timeline of sequence models from RNNs to Transformers showing performance improvements
Figure 5. Transformers marked a paradigm shift in sequence modeling

The Transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need," has become the foundation of modern natural language processing. In this comprehensive guide, we'll explore the self-attention mechanism, Transformer architecture, and how models like BERT and GPT are changing AI.

1. The Limitations of RNNs

While RNNs and LSTMs were long the state of the art for sequence processing, they suffer from several limitations:

  • Sequential computation: Time steps must be processed one after another, so training can't be parallelized across the sequence
  • Memory bottlenecks: Information about distant tokens must pass through every intermediate hidden state, making long-range dependencies hard to learn
  • Indirect access: A word can only influence a distant word through the chain of states between them

Key Insight: Transformers replace recurrence with self-attention, allowing direct modeling of relationships between all words in a sequence, regardless of distance.

2. The Self-Attention Mechanism

Self-attention computes a weighted sum of all input elements, where weights are based on pairwise similarity:

Visualization of self-attention mechanism showing word relationships
Figure 5.1 Self-attention captures relationships between all words in a sequence

Three learned weight matrices project each word embedding into three components:

  1. Query (Q): What the word is looking for
  2. Key (K): What the word contains
  3. Value (V): The word's actual representation

Attention is calculated as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where d_k is the dimension of the key vectors. Dividing by √d_k keeps the dot products from growing large with dimension, which would otherwise push the softmax into regions with vanishingly small gradients.
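The formula maps directly to a few lines of PyTorch. A minimal sketch (the batch size, sequence length, and d_k below are arbitrary values for illustration):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    # pairwise similarity between every query and every key
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return torch.matmul(weights, v), weights

q = k = v = torch.randn(1, 4, 8)  # 4 tokens, d_k = 8
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.shape)  # torch.Size([1, 4, 8]) torch.Size([1, 4, 4])
```

Note that the output has the same shape as the input: each token is replaced by a weighted mixture of all value vectors.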

3. Multi-Head Attention

Multi-head attention runs multiple attention mechanisms in parallel:

Diagram of multi-head attention showing different attention patterns
Figure 5.2 Multi-head attention learns diverse relationship patterns

Benefits:

  • Learns different types of relationships (syntax, semantics, etc.)
  • More expressive than single attention
  • Divides computation into parallelizable chunks

# Multi-head attention implementation in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
  def __init__(self, d_model, num_heads):
    super().__init__()
    self.d_model = d_model
    self.num_heads = num_heads
    self.depth = d_model // num_heads
    # Linear layers for Q, K, V
    self.wq = nn.Linear(d_model, d_model)
    self.wk = nn.Linear(d_model, d_model)
    self.wv = nn.Linear(d_model, d_model)
    self.dense = nn.Linear(d_model, d_model)

  def split_heads(self, x, batch_size):
    x = x.view(batch_size, -1, self.num_heads, self.depth)
    return x.permute(0, 2, 1, 3)

  def forward(self, v, k, q, mask=None):
    batch_size = q.size(0)
    q = self.wq(q)
    k = self.wk(k)
    v = self.wv(v)
    # Split into multiple heads
    q = self.split_heads(q, batch_size)
    k = self.split_heads(k, batch_size)
    v = self.split_heads(v, batch_size)
    # Scaled dot-product attention
    matmul_qk = torch.matmul(q, k.transpose(-2, -1))
    dk = torch.tensor(k.size(-1), dtype=torch.float32)
    scaled_attention_logits = matmul_qk / torch.sqrt(dk)
    if mask is not None:
      scaled_attention_logits += (mask * -1e9)
    attention_weights = F.softmax(scaled_attention_logits, dim=-1)
    output = torch.matmul(attention_weights, v)
    # Concatenate heads
    output = output.permute(0, 2, 1, 3).contiguous()
    output = output.view(batch_size, -1, self.d_model)
    return self.dense(output)

4. The Transformer Architecture

The complete Transformer consists of:

Complete Transformer architecture showing encoder and decoder stacks
Figure 5.3 The original Transformer encoder-decoder architecture

Encoder

  • Stack of N identical layers
  • Each layer has:
    • Multi-head self-attention
    • Position-wise feed-forward network
    • Residual connections and layer normalization

Decoder

  • Also a stack of N identical layers
  • Additional masked multi-head self-attention, so each position can attend only to earlier output positions and cannot peek at future tokens during training
  • Final linear and softmax layer for output
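The look-ahead mask used by the decoder is just an upper-triangular matrix; a minimal sketch:

```python
import torch

def look_ahead_mask(size):
    # 1s above the diagonal mark future positions to be blocked
    return torch.triu(torch.ones(size, size), diagonal=1)

mask = look_ahead_mask(4)
print(mask)
# tensor([[0., 1., 1., 1.],
#         [0., 0., 1., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 0.]])
```

This follows the convention used in the MultiHeadAttention code above: positions marked 1 receive a large negative value before the softmax, driving their attention weights to zero, so position i can only attend to positions ≤ i.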

5. Modern Transformer Models

BERT (Bidirectional Encoder Representations)

  • Encoder-only architecture
  • Pre-trained with masked language modeling
  • Contextual representations for downstream tasks

GPT (Generative Pre-trained Transformer)

  • Decoder-only architecture
  • Pre-trained with next-word prediction
  • Strong generative capabilities
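Next-word prediction reduces to a cross-entropy loss between each position's logits and the token that follows it. A minimal sketch with random logits (the vocabulary size and sequence length are made-up values for the example):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 6
tokens = torch.randint(0, vocab_size, (1, seq_len))  # (batch, seq) token ids
logits = torch.randn(1, seq_len, vocab_size)         # stand-in for model output
# predict token t+1 from position t: drop the last prediction, shift targets left
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..seq-2
    tokens[:, 1:].reshape(-1),               # targets are the next tokens
)
print(loss.item())
```

During pre-training this loss is minimized over massive text corpora; at inference, sampling from the softmax over the final position's logits generates text one token at a time.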

Vision Transformers (ViT)

  • Apply Transformers to image data
  • Split image into patches as "tokens"
  • Competitive with CNNs on many tasks

Comparison of BERT, GPT and Vision Transformer architectures
Figure 5.4 Different Transformer architectures for different purposes
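The ViT patch-token idea is essentially a reshape. A minimal sketch, assuming a 224×224 RGB image and 16×16 patches (the standard ViT-Base configuration):

```python
import torch

img = torch.randn(1, 3, 224, 224)  # (batch, channels, H, W)
p = 16                             # patch size
# unfold H and W into a 14x14 grid of 16x16 patches
patches = img.unfold(2, p, p).unfold(3, p, p)  # (1, 3, 14, 14, 16, 16)
# flatten each patch into one "token" vector
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * p * p)
print(patches.shape)  # torch.Size([1, 196, 768]) -- 196 tokens of dim 768
```

A learned linear projection then maps each flattened patch to the model dimension, after which the sequence is processed by a standard Transformer encoder.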

6. Implementing a Transformer in PyTorch

# Simplified Transformer implementation
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
  # Standard sinusoidal positional encoding from the original paper
  def __init__(self, d_model, max_seq_length):
    super().__init__()
    position = torch.arange(max_seq_length).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_seq_length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    self.register_buffer('pe', pe.unsqueeze(0))

  def forward(self, x):  # x: (batch, seq_len, d_model)
    return x + self.pe[:, :x.size(1)]

class Transformer(nn.Module):
  def __init__(self, src_vocab_size, trg_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
    super().__init__()
    self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
    self.decoder_embedding = nn.Embedding(trg_vocab_size, d_model)
    self.positional_encoding = PositionalEncoding(d_model, max_seq_length)
    encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, d_ff, dropout, batch_first=True)
    self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
    decoder_layer = nn.TransformerDecoderLayer(d_model, num_heads, d_ff, dropout, batch_first=True)
    self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
    self.fc_out = nn.Linear(d_model, trg_vocab_size)

  def forward(self, src, trg, src_mask=None, trg_mask=None):
    # src, trg: (batch, seq_len) token indices
    src_embedded = self.positional_encoding(self.encoder_embedding(src))
    trg_embedded = self.positional_encoding(self.decoder_embedding(trg))
    encoder_output = self.encoder(src_embedded, mask=src_mask)
    output = self.decoder(trg_embedded, encoder_output, tgt_mask=trg_mask)
    return self.fc_out(output)
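PyTorch also ships a ready-made nn.Transformer module that bundles the encoder and decoder stacks. A minimal smoke test (the dimensions below are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=32, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, dim_feedforward=64,
                       batch_first=True)
src = torch.randn(2, 10, 32)  # (batch, src_len, d_model) -- already embedded
tgt = torch.randn(2, 7, 32)   # (batch, tgt_len, d_model)
# causal mask so each target position only sees earlier positions
tgt_mask = model.generate_square_subsequent_mask(7)
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 7, 32])
```

Note that nn.Transformer expects pre-embedded inputs; in practice you still supply your own embedding layers, positional encoding, and output projection, as in the class above.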

Conclusion

The Transformer architecture has fundamentally changed natural language processing and is making inroads into other domains. Its self-attention mechanism provides unparalleled ability to model long-range dependencies while remaining highly parallelizable. Models like BERT and GPT demonstrate the remarkable capabilities of this architecture when scaled to massive sizes and datasets.

In our next post, we'll explore generative models including Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).

Futuristic applications of Transformer models in various domains
Figure 5.5 Transformers are powering a new generation of AI applications


Follow DrASR Deep Learning for more in-depth tutorials, fundamentals, and research-backed content in Deep Learning.

If you found this helpful, leave a comment or share it with your peers. Let’s grow together in AI learning!
