Self-Supervised & Contrastive Learning

Self-Supervised Learning: Mastering SimCLR, MoCo and BYOL for Unlabeled Data

Self-Supervised Learning: Unlocking the Potential of Unlabeled Data

Comparison of supervised and self-supervised learning approaches
Figure 9. Self-supervised learning creates its own supervisory signals from data

With the vast majority of the world's data being unlabeled, self-supervised learning has emerged as a powerful paradigm for learning meaningful representations without manual annotations. In this comprehensive guide, we'll explore contrastive learning methods like SimCLR, MoCo, and BYOL that are closing the gap with supervised learning on many tasks.

1. The Self-Supervised Learning Paradigm

Self-supervised learning creates supervisory signals from the data itself through:

  • Pretext Tasks: define artificial tasks (e.g., image rotation prediction)
  • Contrastive Learning: compare similar/dissimilar pairs (e.g., SimCLR, MoCo)
  • Generative Methods: reconstruct the input (e.g., autoencoders, BERT)
Key Insight: The learned representations can then be transferred to downstream tasks with limited labeled data, often matching or surpassing supervised pre-training.
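The pretext-task idea is easy to sketch in code. Below is a minimal rotation-prediction example in PyTorch; `rotate_batch`, the linear head, and the tensor shapes are illustrative stand-ins, not a production setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotate_batch(images):
  # Build a 4-way rotation pretext task: each image is rotated by
  # 0/90/180/270 degrees and the rotation index becomes its label.
  rotations, labels = [], []
  for k in range(4):
    rotations.append(torch.rot90(images, k, dims=(2, 3)))
    labels.append(torch.full((images.size(0),), k, dtype=torch.long))
  return torch.cat(rotations), torch.cat(labels)

images = torch.randn(8, 3, 32, 32)                # unlabeled batch
x, y = rotate_batch(images)                       # 32 images, labels 0..3
head = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4))
loss = F.cross_entropy(head(x), y)                # train on the free labels
```

No human annotation is used anywhere: the supervisory signal (which rotation was applied) is generated from the data itself.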

2. Contrastive Learning Framework

Contrastive methods learn by pulling positive pairs together and pushing negatives apart in embedding space:

Contrastive learning process showing positive pairs pulled together and negatives pushed apart
Figure 9.1 Contrastive learning creates a structured embedding space

Key Components

  • Augmentations: Create positive pairs (two views of same image)
  • Encoder: Maps inputs to embeddings (typically CNN)
  • Projection Head: Small MLP for contrastive loss
  • Loss Function: NT-Xent (normalized temperature-scaled cross entropy)
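The augmentation step can be sketched with plain tensor ops. Real SimCLR pipelines use torchvision transforms (random resized crop, color jitter, Gaussian blur); the hypothetical `two_views` helper below only illustrates the idea of generating a positive pair:

```python
import torch

def two_views(images, crop=24):
  # Two independent stochastic augmentations (random crop + flip)
  # of the same batch form the positive pairs.
  def augment(x):
    h, w = x.shape[2], x.shape[3]
    top = torch.randint(0, h - crop + 1, (1,)).item()
    left = torch.randint(0, w - crop + 1, (1,)).item()
    x = x[:, :, top:top + crop, left:left + crop]
    if torch.rand(1).item() < 0.5:
      x = torch.flip(x, dims=(3,))
    return x
  return augment(images), augment(images)

batch = torch.randn(4, 3, 32, 32)
v1, v2 = two_views(batch)                # two views of every image
```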

The NT-Xent loss for a positive pair (i,j):

ℓ(i,j) = −log [ exp(sim(zᵢ, zⱼ)/τ) / Σ_{k≠i} exp(sim(zᵢ, zₖ)/τ) ]

where sim(·,·) is cosine similarity and τ is the temperature.
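To make the formula concrete, the snippet below evaluates it for one hypothetical positive pair (i = 0, j = 2) directly from the definition; the indices and embedding dimensions are arbitrary:

```python
import torch
import torch.nn.functional as F

tau = 0.5
z = F.normalize(torch.randn(4, 8), dim=1)   # 2 images x 2 views, unit norm
i, j = 0, 2                                 # hypothetical positive pair
sim = z @ z.T                               # cosine similarity matrix
num = torch.exp(sim[i, j] / tau)
den = sum(torch.exp(sim[i, k] / tau) for k in range(4) if k != i)
loss_ij = -torch.log(num / den)

# Same value via cross-entropy with the self-similarity masked out
logits = (sim[i] / tau).masked_fill(torch.arange(4) == i, float('-inf'))
loss_ce = F.cross_entropy(logits.unsqueeze(0), torch.tensor([j]))
```

The cross-entropy form is why the batched implementations below can lean on F.cross_entropy.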

3. SimCLR: A Simple Framework

SimCLR established several best practices:

SimCLR architecture showing dual augmentation paths and contrastive loss
Figure 9.2 The SimCLR framework for contrastive learning
  • Augmentations: random crop + color jitter + blur (creates meaningful positives)
  • Projection Head: 2-layer MLP with ReLU (improves representation quality)
  • Large Batch Size: 4096+ with the LARS optimizer (provides many negatives)
# SimCLR implementation in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimCLR(nn.Module):
  def __init__(self, encoder, projection_dim=128):
    super().__init__()
    self.encoder = encoder
    self.projector = nn.Sequential(
      nn.Linear(encoder.output_dim, encoder.output_dim),
      nn.ReLU(),
      nn.Linear(encoder.output_dim, projection_dim)
    )

  def forward(self, x1, x2):
    # Get representations
    h1 = self.encoder(x1)
    h2 = self.encoder(x2)
    # Project to latent space
    z1 = self.projector(h1)
    z2 = self.projector(h2)
    return h1, h2, z1, z2

def nt_xent_loss(z1, z2, temperature=0.5):
  batch_size = z1.size(0)
  # Concatenate and L2-normalize all embeddings (sim = cosine similarity)
  all_z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
  # Compute similarity matrix
  sim_matrix = torch.matmul(all_z, all_z.T) / temperature
  # Mask self-similarity so a sample cannot match itself (k ≠ i)
  mask = torch.eye(2 * batch_size, dtype=torch.bool, device=z1.device)
  sim_matrix = sim_matrix.masked_fill(mask, float('-inf'))
  # Positive for sample i is its other view, offset by batch_size
  labels = torch.arange(batch_size, device=z1.device)
  labels = torch.cat([labels + batch_size, labels])
  # Cross-entropy over similarities implements NT-Xent
  loss = F.cross_entropy(sim_matrix, labels)
  return loss
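The pieces fit together in a training step as sketched below. `ToyEncoder` is a hypothetical stand-in for the ResNet backbone used in the SimCLR paper, and the `nt_xent` here is a self-contained copy (normalized embeddings, self-similarity masked) so the snippet runs on its own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
  # Hypothetical stand-in for a ResNet backbone; exposes output_dim
  def __init__(self, output_dim=64):
    super().__init__()
    self.output_dim = output_dim
    self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, output_dim))
  def forward(self, x):
    return self.net(x)

def nt_xent(z1, z2, temperature=0.5):
  b = z1.size(0)
  z = F.normalize(torch.cat([z1, z2]), dim=1)
  sim = z @ z.T / temperature
  sim = sim.masked_fill(torch.eye(2 * b, dtype=torch.bool), float('-inf'))
  labels = torch.cat([torch.arange(b) + b, torch.arange(b)])
  return F.cross_entropy(sim, labels)

encoder = ToyEncoder()
projector = nn.Linear(encoder.output_dim, 32)
opt = torch.optim.SGD(list(encoder.parameters()) + list(projector.parameters()), lr=0.1)

# Stand-ins for two augmented views of the same image batch
x1, x2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
loss = nt_xent(projector(encoder(x1)), projector(encoder(x2)))
opt.zero_grad(); loss.backward(); opt.step()
```

After pre-training, the projector is discarded and the encoder's representations h are used downstream.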

4. Momentum Contrast (MoCo)

MoCo addresses the batch size limitation with:

MoCo architecture showing momentum encoder and queue of negatives
Figure 9.3 MoCo maintains a queue of negative samples for contrastive learning

Key Innovations

  • Momentum Encoder: Slowly updated version of main encoder
  • Dynamic Queue: Maintains large set of negatives
  • Shuffling BN: Prevents information leakage
# MoCo v2 implementation
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoCo(nn.Module):
  def __init__(self, encoder, dim=128, K=65536, m=0.999, T=0.2):
    super().__init__()
    self.K = K # Queue size
    self.m = m # Momentum
    self.T = T # Temperature
    # Encoders
    self.encoder_q = encoder # Query encoder
    self.encoder_k = copy.deepcopy(encoder) # Key encoder
    # Projection heads
    self.projector_q = nn.Sequential(
      nn.Linear(encoder.output_dim, encoder.output_dim),
      nn.ReLU(),
      nn.Linear(encoder.output_dim, dim)
    )
    self.projector_k = copy.deepcopy(self.projector_q)
    # Key encoder and projector receive no gradients; momentum updates only
    for param in list(self.encoder_k.parameters()) + list(self.projector_k.parameters()):
      param.requires_grad = False
    # Create the queue
    self.register_buffer("queue", torch.randn(dim, K))
    self.queue = F.normalize(self.queue, dim=0)
    self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

  @torch.no_grad()
  def _momentum_update_key_encoder(self):
    # Momentum update for the key encoder and its projection head
    for param_q, param_k in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
      param_k.data = param_k.data * self.m + param_q.data * (1. - self.m)
    for param_q, param_k in zip(self.projector_q.parameters(), self.projector_k.parameters()):
      param_k.data = param_k.data * self.m + param_q.data * (1. - self.m)

  @torch.no_grad()
  def _dequeue_and_enqueue(self, keys):
    batch_size = keys.shape[0]
    ptr = int(self.queue_ptr)
    assert self.K % batch_size == 0
    # Replace keys at ptr
    self.queue[:, ptr:ptr + batch_size] = keys.T
    ptr = (ptr + batch_size) % self.K
    self.queue_ptr[0] = ptr
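The class above omits the forward pass. Following the MoCo paper, the InfoNCE logits pair each query with its own key (the positive, placed at index 0) and with every entry in the queue (the negatives); the shapes below are illustrative stand-ins for the encoder outputs:

```python
import torch
import torch.nn.functional as F

# Sketch of MoCo's InfoNCE logits given a query batch, its positive
# keys, and the negative queue (same layout as the class above).
B, dim, K, T = 8, 128, 1024, 0.2
q = F.normalize(torch.randn(B, dim), dim=1)       # from encoder_q
k = F.normalize(torch.randn(B, dim), dim=1)       # from encoder_k (no grad)
queue = F.normalize(torch.randn(dim, K), dim=0)   # stored negatives

l_pos = (q * k).sum(dim=1, keepdim=True)          # B x 1: query vs its key
l_neg = q @ queue                                 # B x K: query vs queue
logits = torch.cat([l_pos, l_neg], dim=1) / T
labels = torch.zeros(B, dtype=torch.long)         # positive is always index 0
loss = F.cross_entropy(logits, labels)
```

After each step, the keys `k` are enqueued via `_dequeue_and_enqueue`, replacing the oldest negatives.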

5. Bootstrap Your Own Latent (BYOL)

BYOL achieves state-of-the-art without negative samples:

BYOL architecture showing online and target networks with predictor
Figure 9.4 BYOL's asymmetric architecture enables learning without negative pairs

Key Features

  • Two networks: online (with predictor) and target (momentum)
  • Predicts target network's representation
  • No contrastive loss: minimizes MSE between L2-normalized prediction and target projection (equivalent to 2 − 2·cosine similarity)
  • Surprisingly avoids collapsed solutions
# BYOL implementation
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

class BYOL(nn.Module):
  def __init__(self, encoder, projection_dim=256, hidden_dim=4096, m=0.996):
    super().__init__()
    self.m = m
    # Online network
    self.online_encoder = encoder
    self.online_projector = nn.Sequential(
      nn.Linear(encoder.output_dim, hidden_dim),
      nn.BatchNorm1d(hidden_dim),
      nn.ReLU(),
      nn.Linear(hidden_dim, projection_dim)
    )
    self.online_predictor = nn.Sequential(
      nn.Linear(projection_dim, hidden_dim),
      nn.BatchNorm1d(hidden_dim),
      nn.ReLU(),
      nn.Linear(hidden_dim, projection_dim)
    )
    # Target network
    self.target_encoder = copy.deepcopy(encoder)
    self.target_projector = copy.deepcopy(self.online_projector)
    # Target network receives no gradients; it is updated only by momentum
    for param in self.target_encoder.parameters():
      param.requires_grad = False
    for param in self.target_projector.parameters():
      param.requires_grad = False

  @torch.no_grad()
  def update_target(self):
    # Momentum update target networks
    for online_param, target_param in zip(
      self.online_encoder.parameters(), self.target_encoder.parameters()
    ):
      target_param.data = self.m * target_param.data + (1 - self.m) * online_param.data
    for online_param, target_param in zip(
      self.online_projector.parameters(), self.target_projector.parameters()
    ):
      target_param.data = self.m * target_param.data + (1 - self.m) * online_param.data

  def forward(self, x1, x2):
    # Online network forward (both augmented views)
    h1 = self.online_encoder(x1)
    z1 = self.online_projector(h1)
    q1 = self.online_predictor(z1)
    h2 = self.online_encoder(x2)
    z2 = self.online_projector(h2)
    q2 = self.online_predictor(z2)
    # Target network forward (stop gradient); update_target() should be
    # called once after each optimizer step, not inside forward
    with torch.no_grad():
      t1 = self.target_encoder(x2)
      t1 = self.target_projector(t1)
      t2 = self.target_encoder(x1)
      t2 = self.target_projector(t2)
    # Normalize
    q1 = F.normalize(q1, dim=1)
    q2 = F.normalize(q2, dim=1)
    t1 = F.normalize(t1, dim=1)
    t2 = F.normalize(t2, dim=1)
    # Symmetric loss
    loss = 2 - 2 * (q1 * t1).sum(dim=1).mean()
    loss += 2 - 2 * (q2 * t2).sum(dim=1).mean()
    return loss
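The 2 − 2·cosine expression in forward is exactly the mean squared error between L2-normalized vectors, which is easy to verify numerically (the dimensions below are arbitrary):

```python
import torch
import torch.nn.functional as F

# Sanity check: BYOL's 2 - 2*cosine loss equals the squared distance
# between unit-norm vectors, since ||q - t||^2 = 2 - 2*(q . t).
q = F.normalize(torch.randn(8, 32), dim=1)   # online predictions
t = F.normalize(torch.randn(8, 32), dim=1)   # target projections
byol_loss = 2 - 2 * (q * t).sum(dim=1)
mse = ((q - t) ** 2).sum(dim=1)
```

This is why BYOL is often described as "normalized MSE" despite using no negative pairs.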

6. Applications of Self-Supervised Learning

Computer Vision

  • Pre-training for object detection
  • Medical image analysis
  • Few-shot learning

Natural Language Processing

  • BERT-style pre-training
  • Cross-modal retrieval
  • Unsupervised translation

Other Domains

  • Audio representation learning
  • Graph neural networks
  • Reinforcement learning

Practical Tip: When working with limited labeled data, start with a model pre-trained using self-supervised learning on a related large dataset, then fine-tune on your specific task.
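This tip is commonly evaluated with a linear probe: freeze the pretrained encoder and fit only a linear classifier on the small labeled set. A sketch, using a hypothetical stand-in for the pretrained encoder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a self-supervised pretrained encoder (e.g., from SimCLR)
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
for p in encoder.parameters():
  p.requires_grad = False              # freeze: linear probe only

probe = nn.Linear(64, 10)              # trainable classifier head
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

x, y = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))
loss = F.cross_entropy(probe(encoder(x)), y)
opt.zero_grad(); loss.backward(); opt.step()
```

For full fine-tuning, unfreeze the encoder and pass its parameters to the optimizer as well, typically with a smaller learning rate.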

Conclusion

Self-supervised learning has emerged as a powerful paradigm for learning from unlabeled data, with contrastive methods like SimCLR, MoCo, and BYOL achieving remarkable results across domains. As these techniques continue to mature, they promise to reduce our reliance on costly labeled datasets while enabling more flexible and generalizable representations.

In our next post, we'll explore model deployment techniques to bring your trained models into production.

Collage of self-supervised learning applications across industries
Figure 9.5 Self-supervised learning enables breakthroughs across AI domains

🔍 Curious about Deep Learning? Read our next post on Model Deployment (ONNX, TensorRT, FastAPI)

Follow DrASR Deep Learning for more in-depth tutorials, fundamentals, and research-backed content in Deep Learning.

If you found this helpful, leave a comment or share it with your peers. Let’s grow together in AI learning!
