Deep Reinforcement Learning: Mastering DQN, Policy Gradients, and PPO

Figure 8. The reinforcement learning feedback loop enhanced with deep learning

Deep Reinforcement Learning (DRL) combines the representational power of neural networks with the goal-directed learning of reinforcement learning, enabling machines to master complex tasks from gameplay to robotics. In this comprehensive guide, we'll explore Q-learning, Deep Q-Networks (DQN), Policy Gradient methods, and how these techniques are pushing the boundaries of what AI can learn.

1. Reinforcement Learning Fundamentals

RL problems are formalized as Markov Decision Processes (MDPs) with:

Component | Notation | Description
----------|----------|------------
States    | s ∈ S    | Representation of the environment
Actions   | a ∈ A    | Choices available to the agent
Rewards   | r ∈ R    | Scalar feedback signal
Policy    | π(a|s)   | The agent's behavior strategy

The goal is to learn a policy π that maximizes expected return (sum of discounted rewards):

Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + ...

Key Challenge: The credit assignment problem - determining which actions led to eventual rewards in long sequences.
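The discounted return above folds up neatly from the end of an episode backwards; a minimal sketch (the function name is illustrative):

```python
def discounted_return(rewards, gamma=0.9):
  # Fold right to left: G = r + gamma * (return of the rest of the episode)
  G = 0.0
  for r in reversed(rewards):
    G = r + gamma * G
  return G

# For rewards [1, 1, 1] and gamma = 0.9: G = 1 + 0.9 + 0.81 = 2.71
G = discounted_return([1, 1, 1], gamma=0.9)
```

Note how a smaller γ makes the agent care less about distant rewards: with γ = 0 only the immediate reward counts.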

2. Value-Based Methods: Deep Q-Networks (DQN)

Q-Learning Basics

Q-learning estimates the value of taking action a in state s:

Q(s,a) = E[Rₜ₊₁ + γ maxₐ′ Q(s′,a′) | Sₜ=s, Aₜ=a]

The Q-function is updated via temporal difference learning:

Q(s,a) ← Q(s,a) + α[r + γ maxₐ′ Q(s′,a′) − Q(s,a)]
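In tabular form, this update is a few lines of code; a minimal sketch using a dictionary as the Q-table (the function and argument names are illustrative, not from the original):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
  # TD target: r + gamma * max_a' Q(s', a'), or just r at terminal states
  target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
  # Move Q(s, a) a step of size alpha toward the target
  Q[(s, a)] += alpha * (target - Q[(s, a)])

# With all entries initialized to 0: target = 1.0, so Q[(0, 1)] becomes 0.1 * 1.0 = 0.1
Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

This works when states and actions are few enough to enumerate; DQN, below, replaces the table with a neural network.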

Deep Q-Networks (DQN)

DQN approximates Q(s,a) with a neural network, introducing key innovations:

Figure 8.1 DQN replaces Q-tables with neural network function approximation

Innovation | Purpose | Impact
-----------|---------|-------
Experience Replay | Breaks correlations between consecutive samples | Stabilizes training
Target Network | Provides fixed Q-targets | Reduces oscillation
Frame Stacking | Captures temporal information | Handles partial observability
# DQN implementation in PyTorch
import random
from collections import deque

import torch
import torch.nn as nn

class DQN(nn.Module):
  def __init__(self, state_dim, action_dim):
    super().__init__()
    # Two hidden layers map the state vector to one Q-value per action
    self.net = nn.Sequential(
      nn.Linear(state_dim, 64),
      nn.ReLU(),
      nn.Linear(64, 64),
      nn.ReLU(),
      nn.Linear(64, action_dim)
    )

  def forward(self, x):
    return self.net(x)

class ReplayBuffer:
  def __init__(self, capacity):
    # Old transitions are evicted automatically once capacity is reached
    self.buffer = deque(maxlen=capacity)

  def push(self, state, action, reward, next_state, done):
    self.buffer.append((state, action, reward, next_state, done))

  def sample(self, batch_size):
    # Uniform random sampling breaks temporal correlations
    transitions = random.sample(self.buffer, batch_size)
    return zip(*transitions)
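The network, the replay buffer, and the target network come together in a single training step; a minimal sketch, where `policy_net`, `target_net`, and the batch layout (tensors of states, actions, rewards, next states, and done flags) are assumptions for illustration:

```python
import torch
import torch.nn as nn

def dqn_train_step(policy_net, target_net, optimizer, batch, gamma=0.99):
  states, actions, rewards, next_states, dones = batch
  # Q(s, a) for the actions actually taken
  q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
  # Targets come from the frozen target network (no gradients flow through it)
  with torch.no_grad():
    next_q = target_net(next_states).max(dim=1).values
    targets = rewards + gamma * next_q * (1 - dones)
  # Regress Q(s, a) toward r + gamma * max_a' Q_target(s', a')
  loss = nn.functional.mse_loss(q_values, targets)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  return loss.item()
```

In practice the target network's weights are copied from the policy network every few thousand steps, which is exactly the "fixed Q-targets" idea from the table above.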

3. Policy Gradient Methods

Instead of learning values, directly optimize the policy πθ(a|s) with parameters θ:

∇θJ(θ) = E[∇θ log πθ(a|s) · Qπ(s,a)]

Figure 8.2 Policy gradients optimize policies directly rather than via value functions

Advantages

  • Better for continuous action spaces
  • Can learn stochastic policies
  • More direct optimization

REINFORCE Algorithm

The simplest policy gradient method:

# REINFORCE implementation (classic Gym API: env.step returns 4 values)
import torch

def reinforce(policy, optimizer, env, episodes, gamma=0.99):
  for _ in range(episodes):
    states, actions, rewards = [], [], []
    state = env.reset()
    done = False
    # Collect one full trajectory under the current policy
    while not done:
      action_probs = policy(torch.FloatTensor(state))
      action = torch.multinomial(action_probs, 1).item()
      next_state, reward, done, _ = env.step(action)
      states.append(state)
      actions.append(action)
      rewards.append(reward)
      state = next_state
    # Compute discounted returns backwards through the episode
    returns = []
    G = 0
    for r in reversed(rewards):
      G = r + gamma * G
      returns.insert(0, G)
    # Policy gradient update: accumulate -log_prob * return, then step
    optimizer.zero_grad()
    for s, a, G in zip(states, actions, returns):
      action_probs = policy(torch.FloatTensor(s))
      log_prob = torch.log(action_probs[a])
      loss = -log_prob * G
      loss.backward()
    optimizer.step()

4. Actor-Critic Methods

Combine value-based and policy-based approaches:

Figure 8.3 Actor-critic methods combine policy and value learning
  • Actor: Policy πθ(a|s) - what to do
  • Critic: Value function Vφ(s) - how good it is

Advantage Actor-Critic (A2C)

Uses advantage estimate A(s,a) = Q(s,a) - V(s) to reduce variance:

∇θJ(θ) = E[∇θ log πθ(a|s) · A(s,a)]
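One A2C step can be sketched using the one-step TD error as the advantage estimate; a minimal illustration, where `actor`, `critic`, and their optimizers are hypothetical names (the actor outputs action probabilities, the critic a scalar value):

```python
import torch
import torch.nn as nn

def a2c_update(actor, critic, actor_opt, critic_opt,
               state, action, reward, next_state, done, gamma=0.99):
  # One-step TD target and advantage: A = r + gamma * V(s') - V(s)
  value = critic(state)
  with torch.no_grad():
    next_value = torch.zeros_like(value) if done else critic(next_state)
    target = reward + gamma * next_value
  advantage = (target - value).detach()

  # Critic: regress V(s) toward the TD target
  critic_loss = (target - value).pow(2).mean()
  critic_opt.zero_grad()
  critic_loss.backward()
  critic_opt.step()

  # Actor: policy gradient step weighted by the (detached) advantage
  log_prob = torch.log(actor(state)[action])
  actor_loss = -(log_prob * advantage).mean()
  actor_opt.zero_grad()
  actor_loss.backward()
  actor_opt.step()
```

Because the advantage subtracts a baseline V(s), the actor's gradient has lower variance than REINFORCE's raw-return estimate.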

5. Proximal Policy Optimization (PPO)

PPO is a state-of-the-art policy gradient method with:

Feature | Description | Benefit
--------|-------------|--------
Clipped Objective | Limits the size of each policy update | Prevents destructive updates
Multiple Epochs | Reuses collected samples | Improves sample efficiency
# PPO clipped objective implementation
import torch

def ppo_loss(policy, old_probs, states, actions, advantages, clip_ratio=0.2):
  # Probabilities of the taken actions under the current policy
  new_probs = policy(states).gather(1, actions.unsqueeze(1))
  # Probability ratio between current and old (data-collecting) policy
  ratio = new_probs / old_probs
  # Clip the ratio to keep each update inside a trust region
  clipped_ratio = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
  # Pessimistic (minimum) surrogate objective, negated for gradient descent
  policy_loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
  return policy_loss
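The "Multiple Epochs" feature from the table can be sketched as a self-contained update loop that reuses one collected batch several times; the `policy`, batch tensors, and epoch count are illustrative assumptions:

```python
import torch
import torch.nn as nn

def ppo_update(policy, optimizer, states, actions, advantages,
               epochs=4, clip_ratio=0.2):
  adv = advantages.unsqueeze(1)
  with torch.no_grad():
    # Action probabilities under the policy that collected the data
    old_probs = policy(states).gather(1, actions.unsqueeze(1))
  for _ in range(epochs):
    new_probs = policy(states).gather(1, actions.unsqueeze(1))
    ratio = new_probs / old_probs
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    # Pessimistic surrogate: take the worse of clipped and unclipped objectives
    loss = -torch.min(ratio * adv, clipped * adv).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Without the clipping, reusing the same batch for several epochs would let the policy drift far from the one that generated the data; the clip is what makes the extra epochs safe.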

6. Applications of Deep RL

Game Playing

  • AlphaGo/AlphaZero
  • OpenAI Five (Dota 2)
  • StarCraft II agents

Robotics

  • Manipulation tasks
  • Locomotion
  • Autonomous vehicles

Industrial Control

  • Energy optimization
  • Supply chain management
  • Process control

Practical Note: Deep RL requires careful hyperparameter tuning and significant compute. Starting with simpler environments (such as those in OpenAI Gym) is recommended before tackling complex problems.

Conclusion

Deep Reinforcement Learning represents one of the most promising avenues toward general AI, combining the representational power of deep learning with the goal-directed learning of reinforcement learning. From DQN's breakthrough in playing Atari games to PPO's success in complex robotic control, these techniques continue to push the boundaries of what machines can learn to do.

In our next post, we'll explore self-supervised learning techniques that allow models to learn from unlabeled data at scale.

Figure 8.4 Deep RL is transforming industries from gaming to manufacturing


Follow DrASR Deep Learning for more in-depth tutorials, fundamentals, and research-backed content in Deep Learning.

If you found this helpful, leave a comment or share it with your peers. Let’s grow together in AI learning!
