Diffusion Models: The New Frontier of Generative AI
Diffusion models have emerged as the new state-of-the-art in generative AI, powering systems like DALL-E 2 and Stable Diffusion. In this comprehensive guide, we'll explore how these models work, from the foundational Denoising Diffusion Probabilistic Models (DDPM) to the revolutionary latent diffusion architectures behind today's most impressive AI art generators.
1. The Diffusion Process
Diffusion models are inspired by non-equilibrium thermodynamics: noise is gradually added to data, and a neural network learns to reverse this process:
Forward Diffusion (Fixed Process)
Gradually adds Gaussian noise to data over T steps according to a variance schedule βₜ:

q(xₜ | xₜ₋₁) = N(xₜ; √(1−βₜ) xₜ₋₁, βₜI)
Reverse Diffusion (Learned Process)
A neural network learns to gradually denoise:
pθ(xₜ₋₁ | xₜ) = N(xₜ₋₁; μθ(xₜ, t), Σθ(xₜ, t))
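A useful property of the forward process is its closed form: with ᾱₜ = ∏ₛ (1−βₛ), a clean sample can be noised to any timestep in a single step. A minimal sketch (the schedule values here are illustrative):

```python
import torch

# Linear variance schedule (illustrative values)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    ab = alpha_bars[t]
    return torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * noise

x0 = torch.randn(3)            # a toy "image"
noise = torch.randn_like(x0)
xT = q_sample(x0, T - 1, noise)
# At t = T-1, alpha_bar is near zero, so x_T is almost pure noise.
```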
2. Denoising Diffusion Probabilistic Models (DDPM)
The foundational 2020 paper by Ho et al. introduced several key innovations:
| Component | Description | Impact |
|---|---|---|
| Noise Schedule | Carefully designed βt values | Balances speed and quality |
| Reparameterization | Predict noise instead of mean | Simplifies learning |
| Loss Function | Simple MSE on noise | Stable training |
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DDPM(nn.Module):
    def __init__(self, network, timesteps, beta_start, beta_end):
        super().__init__()
        self.network = network
        self.timesteps = timesteps
        # Define beta schedule; buffers move with the module on .to(device)
        self.register_buffer("betas", torch.linspace(beta_start, beta_end, timesteps))
        # Pre-calculate diffusion parameters
        self.register_buffer("alphas", 1.0 - self.betas)
        self.register_buffer("alpha_bars", torch.cumprod(self.alphas, dim=0))

    def forward(self, x, t, noise):
        # Closed-form forward diffusion: x_t = sqrt(a_bar)*x_0 + sqrt(1-a_bar)*eps
        alpha_bar = self.alpha_bars[t].view(-1, 1, 1, 1)
        noisy_x = torch.sqrt(alpha_bar) * x + torch.sqrt(1 - alpha_bar) * noise
        predicted_noise = self.network(noisy_x, t)
        # Simple DDPM objective: MSE between true and predicted noise
        return F.mse_loss(predicted_noise, noise)

    @torch.no_grad()
    def sample(self, shape, device):
        # Start from pure noise
        x = torch.randn(shape, device=device)
        for t in reversed(range(self.timesteps)):
            t_tensor = torch.full((shape[0],), t, device=device, dtype=torch.long)
            pred_noise = self.network(x, t_tensor)
            alpha = self.alphas[t]
            alpha_bar = self.alpha_bars[t]
            # Mean of the reverse step p(x_{t-1} | x_t)
            x = (x - (1 - alpha) / torch.sqrt(1 - alpha_bar) * pred_noise) / torch.sqrt(alpha)
            if t > 0:
                # Inject noise at every step except the last
                x += torch.sqrt(self.betas[t]) * torch.randn_like(x)
        return x
```
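Training then reduces to: pick a random timestep per sample, noise the batch in closed form, and regress the predicted noise. A hypothetical training loop, with a toy stand-in network and assumed hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the denoising network: ignores t, maps image -> image
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
    def forward(self, x, t):
        return self.conv(x)

# Minimal schedule inlined so the sketch is self-contained
T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

net = TinyNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x0 = torch.randn(8, 1, 16, 16)                 # a toy batch
for step in range(10):
    t = torch.randint(0, T, (x0.size(0),))     # random timestep per sample
    noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1 - ab) * noise
    loss = F.mse_loss(net(x_t, t), noise)      # the "simple" DDPM loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```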
3. Latent Diffusion Models (Stable Diffusion)
Stable Diffusion improved efficiency by operating in latent space:
Key Components
| Component | Purpose | Details |
|---|---|---|
| VAE Encoder/Decoder | Image ⇄ Latent Space | Compresses images to smaller latent representations |
| UNet | Denoising Model | Predicts noise in latent space with cross-attention |
| Text Encoder (CLIP) | Conditioning | Maps text prompts to embedding space |
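Putting the table together, generation works end to end as: encode the prompt, denoise a random latent under that conditioning, then decode the latent to pixels. A high-level sketch with stub components (all names and shapes here are illustrative, not the real Stable Diffusion API):

```python
import torch

latent_shape = (1, 4, 8, 8)

def text_encoder(prompt):              # CLIP in Stable Diffusion
    return torch.randn(1, 77, 768)     # toy token embeddings

def unet(z, t, cond):                  # predicts noise in latent space
    return torch.zeros_like(z)

def vae_decode(z):                     # latent -> image
    return torch.rand(1, 3, 64, 64)

def generate(prompt, steps=50):
    cond = text_encoder(prompt)
    z = torch.randn(latent_shape)      # start from latent-space noise
    for t in reversed(range(steps)):
        eps = unet(z, t, cond)
        z = z - 0.1 * eps              # simplified update in place of a real sampler
    return vae_decode(z)

img = generate("a photo of an astronaut")
```

The point is the data flow: all the expensive iteration happens on the small latent tensor, and the VAE decoder runs only once at the end.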
4. Implementing Stable Diffusion Components
VAE (Variational Autoencoder)
```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Toy VAE; the 128*8*8 flatten assumes 32x32 RGB inputs."""
    def __init__(self):
        super().__init__()
        # Encoder: two stride-2 convs, 32x32 -> 8x8
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Latent space: mean and log-variance heads
        self.fc_mu = nn.Linear(128 * 8 * 8, 256)
        self.fc_var = nn.Linear(128 * 8 * 8, 256)
        # Decoder: mirror of the encoder
        self.decoder = nn.Sequential(
            nn.Linear(256, 128 * 8 * 8),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        h = h.view(h.size(0), -1)
        return self.fc_mu(h), self.fc_var(h)  # mean, log-variance

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps keeps sampling differentiable
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
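Training a VAE combines a reconstruction term with a KL term that pulls q(z|x) toward N(0, I). A self-contained sketch of the loss, with toy tensors standing in for real model outputs:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, x, mu, logvar, beta=1.0):
    # Reconstruction term (BCE is a common choice with a Sigmoid decoder)
    recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
    # KL( N(mu, sigma^2) || N(0, I) ) in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl

# Toy values in place of real encoder/decoder outputs
x = torch.rand(4, 3, 32, 32)
recon = torch.rand(4, 3, 32, 32)
mu = torch.zeros(4, 256)
logvar = torch.zeros(4, 256)
loss = vae_loss(recon, x, mu, logvar)
# With mu = 0 and logvar = 0, the KL term is exactly zero.
```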
Conditional UNet
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UNetBlock(nn.Module):
    def __init__(self, in_c, out_c, time_emb_dim, text_emb_dim):
        super().__init__()
        self.time_mlp = nn.Linear(time_emb_dim, out_c)
        self.conv1 = nn.Conv2d(in_c, out_c, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_c, out_c, kernel_size=3, padding=1)
        # Cross-attention for text conditioning; batch_first keeps shapes (B, seq, C)
        self.attn = nn.MultiheadAttention(embed_dim=out_c, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(out_c)
        self.proj = nn.Linear(text_emb_dim, out_c)

    def forward(self, x, t, text_emb):
        h = self.conv1(x)
        # Add time embedding, broadcast over the spatial dimensions
        t = F.silu(self.time_mlp(t))
        h = h + t.view(-1, h.shape[1], 1, 1)
        # Cross-attention with text: image tokens query the text embedding
        batch, channels, height, width = h.shape
        h_flat = self.norm(h.view(batch, channels, -1).permute(0, 2, 1))
        text_proj = self.proj(text_emb).unsqueeze(1)  # single text token as key/value
        attn_out, _ = self.attn(h_flat, text_proj, text_proj)
        h = h + attn_out.permute(0, 2, 1).view(batch, channels, height, width)
        h = self.conv2(h)
        return h
```
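The cross-attention step is easiest to see in isolation: flatten the feature map into a token sequence, use it as the query, and use projected text embeddings as key and value. A shape check with toy dimensions (the 77-token text length mirrors CLIP, but all values here are assumptions):

```python
import torch
import torch.nn as nn

batch, channels, height, width = 2, 64, 8, 8
text_emb_dim = 512

# Flatten the spatial grid into a token sequence: (batch, H*W, channels)
h = torch.randn(batch, channels, height, width)
h_flat = h.view(batch, channels, -1).permute(0, 2, 1)

# Project text embeddings to the channel dimension; they act as key/value
proj = nn.Linear(text_emb_dim, channels)
text_tokens = proj(torch.randn(batch, 77, text_emb_dim))

attn = nn.MultiheadAttention(embed_dim=channels, num_heads=4, batch_first=True)
out, weights = attn(h_flat, text_tokens, text_tokens)  # query = image tokens

# Back to a feature map, same shape as the input
out_spatial = out.permute(0, 2, 1).view(batch, channels, height, width)
```

Each of the 64 spatial positions attends over all 77 text tokens, which is how the prompt steers every region of the latent.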
5. Practical Applications
Creative Tools
- AI art generation
- Photorealistic synthesis
- Style transfer
Design & Media
- Concept art creation
- Texture generation
- Advertising content
Scientific Applications
- Molecular design
- Medical imaging
- Data augmentation
Conclusion
Diffusion models represent a fundamental shift in generative AI, offering unprecedented quality and control compared to previous approaches. From the theoretical foundations of DDPM to the practical breakthroughs of Stable Diffusion, these models have opened new creative possibilities while presenting fascinating technical challenges.
In our next post, we'll explore how deep learning is revolutionizing reinforcement learning through techniques like Deep Q-Networks and Policy Gradient methods.