Training an RL agent to play Atari Pong with Deep Q-Networks (DQN) and Double DQN, with optional difficulty scaling (difficulties 2 and 3). Built on Gymnasium + ALE with a classic Atari-style CNN architecture.
Demo: `docs/videos/video.mp4`
- Project layout
- Install
- Environment
- Methods
- Experiments and results
- Run training
- Evaluate a saved model
- Hyperparameter sweeps
- License
```
pong-dqn/
├─ README.md
├─ LICENSE
├─ requirements.txt
├─ .gitignore
│
├─ docs/
│ ├─ report.md
│ ├─ figures/
│ └─ videos/
│
├─ src/
│ ├─ env.py
│ ├─ preprocess.py
│ ├─ replay_buffer.py
│ ├─ models.py
│ ├─ agent.py
│ ├─ train.py
│ ├─ evaluate.py
│ ├─ config.py
│ └─ utils.py
│
├─ scripts/
│ ├─ train_dqn.py
│ ├─ train_double_dqn.py
│ ├─ train_difficulty.py
│ ├─ eval.py
│ └─ sweep.py
│
├─ experiments/
│ ├─ README.md
│ └─ .gitkeep
│
├─ models/
│ ├─ README.md
│ └─ .gitkeep
│
└─ tests/
  ├─ test_preprocess.py
  ├─ test_replay_buffer.py
  └─ test_model_shapes.py
```
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pytest -q
```

- Gymnasium ALE environment: `ALE/Pong-v5`
- Observations: raw RGB frames (210×160×3)
- Actions: discrete (usually 6 for Pong)
- Preprocessing:
  - grayscale → 84×84
  - normalize to [0, 1]
  - stack 4 frames → `(4, 84, 84)` for temporal context
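The preprocessing pipeline above can be sketched with plain NumPy. This is illustrative only: the repo's actual implementation lives in `src/preprocess.py`, and the function and class names here (`preprocess`, `FrameStack`) are assumptions, not its API.

```python
import numpy as np

def to_grayscale(frame: np.ndarray) -> np.ndarray:
    # Luminance-weighted grayscale (ITU-R BT.601 coefficients)
    return frame @ np.array([0.299, 0.587, 0.114])

def resize_nearest(img: np.ndarray, out_h: int = 84, out_w: int = 84) -> np.ndarray:
    # Nearest-neighbor resize, dependency-free (real pipelines often use cv2)
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows[:, None], cols]

def preprocess(frame: np.ndarray) -> np.ndarray:
    # grayscale -> normalize to [0, 1] -> 84x84
    return resize_nearest(to_grayscale(frame.astype(np.float32)) / 255.0)

class FrameStack:
    """Keep the last k preprocessed frames as a (k, 84, 84) state."""
    def __init__(self, k: int = 4):
        self.k = k
        self.frames = None

    def reset(self, frame: np.ndarray) -> np.ndarray:
        f = preprocess(frame)
        self.frames = np.stack([f] * self.k)  # repeat first frame k times
        return self.frames

    def step(self, frame: np.ndarray) -> np.ndarray:
        # Drop the oldest frame, append the newest
        self.frames = np.concatenate([self.frames[1:], preprocess(frame)[None]])
        return self.frames

obs = (np.random.rand(210, 160, 3) * 255).astype(np.uint8)  # fake Pong frame
stack = FrameStack()
state = stack.reset(obs)
print(state.shape)  # (4, 84, 84)
```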
We approximate Q(s, a) with a convolutional neural network:
- Conv(32, 8×8, stride 4) + ReLU
- Conv(64, 4×4, stride 2) + ReLU
- Conv(64, 3×3, stride 1) + ReLU
- FC(256) + ReLU
- Output layer: Q-values for each action
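As a sanity check on the stack above, the flattened feature count feeding the FC(256) layer can be derived from the standard conv-output formula (assuming no padding, which is the classic Atari setup):

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    # Output spatial size of a valid (unpadded) convolution
    return (size - kernel) // stride + 1

h = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:  # the three conv layers
    h = conv_out(h, kernel, stride)
print(h)            # 7  (spatial size after the conv stack)
flat = 64 * h * h   # 64 channels * 7 * 7
print(flat)         # 3136 features into FC(256)
```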
Key training components:
- Experience replay
- Target network (periodic hard updates)
- Epsilon-greedy exploration with exponential decay
- Huber loss (`SmoothL1Loss`) for stability
- Gradient accumulation (default: 4 steps) to emulate larger effective batches
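The exponential epsilon schedule can be sketched as below. The constants match the hyperparameter tables later in this README, but the exact formula used in `src/agent.py` may differ; `epsilon_at` is an illustrative name.

```python
import math

def epsilon_at(step: int, eps_start: float = 1.0, eps_end: float = 0.1,
               decay: int = 300_000) -> float:
    # Decay exponentially from eps_start toward eps_end; `decay` sets the timescale
    return eps_end + (eps_start - eps_end) * math.exp(-step / decay)

print(epsilon_at(0))                   # 1.0 (fully random at the start)
print(round(epsilon_at(300_000), 3))   # 0.431 (one decay timescale in)
```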
Same network, but the target computation uses:
- action selection from the policy network
- action evaluation from the target network

This decoupling reduces the overestimation bias of vanilla DQN's max operator.
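The two target rules can be contrasted in a small NumPy sketch (the repo trains a CNN; this only illustrates the target math, and the function names are hypothetical). Because the target net's value at the policy-selected action can never exceed its own max, the Double DQN target is always less than or equal to the vanilla one:

```python
import numpy as np

def dqn_target(q_next_target, rewards, dones, gamma=0.99):
    # Vanilla DQN: max over the target network's own Q-values
    return rewards + gamma * (1 - dones) * q_next_target.max(axis=1)

def double_dqn_target(q_next_policy, q_next_target, rewards, dones, gamma=0.99):
    # Double DQN: policy net selects the action, target net evaluates it
    a_star = q_next_policy.argmax(axis=1)
    q_eval = q_next_target[np.arange(len(a_star)), a_star]
    return rewards + gamma * (1 - dones) * q_eval

rng = np.random.default_rng(0)
qp = rng.normal(size=(32, 6))   # policy-net Q-values for a batch of next states
qt = rng.normal(size=(32, 6))   # target-net Q-values for the same states
r = np.zeros(32)
d = np.zeros(32)
td = double_dqn_target(qp, qt, r, d)
print(td.shape)  # (32,)
```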
The progression focused on:
- Increasing gamma from 0.95 → 0.99 (more long-term value in Pong)
- Slower and broader exploration: epsilon from 1.0 → 0.1, decayed over 200k–300k steps
- Lower learning rate: 5e-4 → 1e-4 for stability
- Larger FC layer: 64/128 → 256 hidden units
- Larger replay: 10k → 50k transitions for diversity
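The gamma bump is worth a quick calculation: a reward t steps in the future is weighted by gamma^t, and a Pong rally can easily run on the order of a hundred frames before a point is scored (the "hundred steps" figure here is a rough illustration, not a measurement from the repo):

```python
# Weight a reward receives 100 steps in the future under each discount factor
for gamma in (0.95, 0.99):
    print(f"gamma={gamma}: weight after 100 steps ~ {gamma ** 100:.4f}")
# 0.95 discounts it to well under 1%, while 0.99 keeps roughly a third of it,
# so the paddle's early positioning is still credited for the eventual point.
```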
Exp1:

| FC Hidden Units | num_episodes | Replay Buffer | Gamma | Learning Rate | Initial Epsilon | Final Epsilon | Epsilon Decay |
|---|---|---|---|---|---|---|---|
| 128 | 1000 | 10000 | 0.95 | 5e-4 | 0.9 | 0.05 | 100000 |

Exp2:

| FC Hidden Units | num_episodes | Replay Buffer | Gamma | Learning Rate | Initial Epsilon | Final Epsilon | Epsilon Decay |
|---|---|---|---|---|---|---|---|
| 64 | 4000+ | 10000 | 0.99 | 1e-4 | 1.0 | 0.1 | 200000 |

Exp3:

| FC Hidden Units | num_episodes | Replay Buffer | Gamma | Learning Rate | Initial Epsilon | Final Epsilon | Epsilon Decay |
|---|---|---|---|---|---|---|---|
| 256 | 3000+ | 10000 | 0.99 | 1e-4 | 1.0 | 0.1 | 200000 |

Exp4:

| FC Hidden Units | num_episodes | Replay Buffer | Gamma | Learning Rate | Initial Epsilon | Final Epsilon | Epsilon Decay |
|---|---|---|---|---|---|---|---|
| 256 | 2500+ | 50000 | 0.99 | 1e-4 | 1.0 | 0.1 | 200000 |

Exp5:

| FC Hidden Units | num_episodes | Replay Buffer | Gamma | Learning Rate | Initial Epsilon | Final Epsilon | Epsilon Decay |
|---|---|---|---|---|---|---|---|
| 256 | 3000+ | 50000 | 0.99 | 1e-4 | 1.0 | 0.1 | 300000 |
Summary across experiments:

| Parameter | Exp1 | Exp2 | Exp3 | Exp4 | Exp5 |
|---|---|---|---|---|---|
| FC Hidden Units | 128 | 64 | 256 | 256 | 256 |
| num_episodes | 1000 | 4000+ | 3000+ | 2500+ | 3000+ |
| Replay Buffer | 10000 | 10000 | 10000 | 50000 | 50000 |
| Gamma | 0.95 | 0.99 | 0.99 | 0.99 | 0.99 |
| Learning Rate | 5e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 |
| Initial Epsilon | 0.9 | 1.0 | 1.0 | 1.0 | 1.0 |
| Final Epsilon | 0.05 | 0.1 | 0.1 | 0.1 | 0.1 |
| Epsilon Decay | 100000 | 200000 | 200000 | 200000 | 300000 |
GitHub renders videos inconsistently; link directly:
docs/videos/video.mp4
| num_episodes | replay_capacity | batch_size | start_training_steps | gamma | learning_rate | initial_epsilon | final_epsilon | epsilon_decay | target_update_freq |
|---|---|---|---|---|---|---|---|---|---|
| 10000 | 50000 | 64 | 10000 | 0.99 | 1e-4 | 1.0 | 0.1 | 300000 | 1000 |
docs/videos/video_double_dqn_episode_1271.mp4
Hyperparameters
| num_episodes | replay_capacity | batch_size | start_training_steps | gamma | learning_rate | initial_epsilon | final_epsilon | epsilon_decay | target_update_freq | difficulty |
|---|---|---|---|---|---|---|---|---|---|---|
| 10000 | 50000 | 64 | 10000 | 0.99 | 1e-4 | 1.0 | 0.1 | 300000 | 1000 | 2 |
Video demo
docs/videos/diff2_best_video_episode_785.mp4
Hyperparameters
| num_episodes | replay_capacity | batch_size | start_training_steps | gamma | learning_rate | initial_epsilon | final_epsilon | epsilon_decay | target_update_freq | difficulty |
|---|---|---|---|---|---|---|---|---|---|---|
| 10000 | 50000 | 64 | 10000 | 0.99 | 1e-4 | 1.0 | 0.1 | 300000 | 1000 | 3 |
Video demo
docs/videos/diff3_best_video_episode_755.mp4
Run from repo root.
```bash
python scripts/train_dqn.py --exp-dir experiments/dqn_run --episodes 10000 --device cuda
```

Optional: record videos (every N episodes):

```bash
python scripts/train_dqn.py --exp-dir experiments/dqn_run --episodes 10000 --record-video --record-every 100
```

Double DQN:

```bash
python scripts/train_double_dqn.py --exp-dir experiments/ddqn_run --episodes 10000 --device cuda
```

Difficulty scaling:

```bash
python scripts/train_difficulty.py --exp-dir experiments/diff2_run --difficulty 2 --algo dqn --episodes 10000
python scripts/train_difficulty.py --exp-dir experiments/diff3_run --difficulty 3 --algo dqn --episodes 10000
```

Evaluate a saved model:

```bash
python scripts/eval.py --checkpoint experiments/dqn_run/best.pth --episodes 20
```

Record evaluation video:

```bash
python scripts/eval.py --checkpoint experiments/dqn_run/best.pth --episodes 5 --record-video
```

A small (illustrative) sweep is included:

```bash
python scripts/sweep.py --base-exp-dir experiments/sweep --episodes 2000 --device cuda
```

This will generate experiment folders like:

```
experiments/sweep/dqn_g0.99_lr0.0001_decay300000/
experiments/sweep/double_dqn_g0.99_lr0.0001_decay200000/
```

…and so on.
MIT (see `LICENSE`).