This repository contains reinforcement learning experiments using preference data on environments like Acrobot and CartPole. The code supports both RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) approaches.
Environments:

- `Acrobot-v1`
- `CartPole-v0`
- `MountainCarContinuous-v0` (retained for reference but not used)
```
.
├── with-Acrobot-v1
│   ├── policies/
│   ├── pref_data/
│   └── scripts/
│
├── with-CartPole-v0
│   ├── policies/
│   ├── pref_data/
│   └── scripts/
│
├── with-MountainCarContinuous-v0
│   ├── policies/
│   ├── pref_data/
│   └── scripts/
│
├── .gitignore
└── README.md
```
In each environment folder, the notebooks are typically run in this order:
1. `generate_pairs.ipynb`: generates the preference pairs of trajectories used to train with RLHF and DPO. The dataset size is set with the `K` parameter.
2. `trainingRLHF.ipynb`: trains a reward model on the generated trajectory preferences and uses it to optimize the checkpoint policy with REINFORCE.
3. `training_DPO.ipynb`: optimizes the checkpoint policy with Direct Preference Optimization (DPO), directly from the generated trajectory preferences.
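The pair-generation step can be sketched as follows. This is a minimal illustration, not the notebook's code: the toy environment dynamics, reward, and the `rollout`/`generate_pairs` helper names are all hypothetical stand-ins (the notebooks roll out in the real Gym environments such as `Acrobot-v1`); the key idea shown is labeling each pair of trajectories by preferring the one with the higher return.

```python
import numpy as np

def rollout(policy, rng, horizon=50):
    """Collect one trajectory (states, actions) and its total return.

    Toy 4-d dynamics and reward stand in for a real Gym environment.
    """
    states, actions, total_return = [], [], 0.0
    s = rng.normal(size=4)                      # toy observation
    for _ in range(horizon):
        a = policy(s, rng)
        states.append(s)
        actions.append(a)
        total_return += -abs(s[0])              # toy reward signal
        s = s + rng.normal(scale=0.1, size=4)   # toy dynamics
    return (np.array(states), np.array(actions)), total_return

def generate_pairs(policy, K, seed=0):
    """Build K preference pairs, preferring the higher-return trajectory."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(K):
        (t1, r1), (t2, r2) = rollout(policy, rng), rollout(policy, rng)
        winner, loser = (t1, t2) if r1 >= r2 else (t2, t1)
        pairs.append((winner, loser))
    return pairs

random_policy = lambda s, rng: int(rng.integers(0, 2))
pairs = generate_pairs(random_policy, K=10)   # K controls the dataset size
```

Here `K` plays the same role as in the notebook: it fixes how many preference pairs end up in the dataset.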
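The reward-model half of the RLHF step amounts to fitting a model so that preferred trajectories score higher, typically via the Bradley-Terry loss `-log sigmoid(R(winner) - R(loser))`. The sketch below is an assumption-laden simplification: it uses a linear reward over summed trajectory features and plain gradient descent, whereas the notebook may use a neural network; the learned reward would then replace the environment reward inside REINFORCE.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_reward_model(winner_feats, loser_feats, lr=0.1, steps=200):
    """Fit linear reward weights w by gradient descent on the
    Bradley-Terry loss -log sigmoid(R(winner) - R(loser)),
    where R(traj) = w . feats is a (toy) linear trajectory reward."""
    w = np.zeros(winner_feats.shape[1])
    diff = winner_feats - loser_feats
    for _ in range(steps):
        delta = diff @ w                              # R(win) - R(lose)
        grad = -((1.0 - sigmoid(delta))[:, None] * diff).mean(axis=0)
        w -= lr * grad
    return w

# Toy data: "winning" trajectories have a larger first feature on average.
rng = np.random.default_rng(0)
winners = rng.normal(size=(64, 4)) + np.array([1.0, 0.0, 0.0, 0.0])
losers = rng.normal(size=(64, 4))
w = train_reward_model(winners, losers)
```

After training, `diff @ w > 0` holds for most pairs, i.e. the learned reward ranks winners above losers.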
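DPO skips the explicit reward model: it optimizes the policy directly on the preference pairs by raising the likelihood of preferred trajectories relative to a frozen reference policy (here, the checkpoint). A minimal sketch of the loss, assuming per-trajectory log-probabilities are available and using a `beta` temperature of 0.1 (both assumptions, not values from the notebooks):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss -log sigmoid(beta * margin), where the margin is the
    policy's log-likelihood gain on winners minus losers, each measured
    relative to the frozen reference (checkpoint) policy."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-margin)))))

# A policy identical to the reference sits at -log sigmoid(0) = log 2;
# a policy that already prefers the winners gets a lower loss.
loss_neutral = dpo_loss(np.array([-2.0]), np.array([-2.0]),
                        np.array([-2.0]), np.array([-2.0]))
loss_good = dpo_loss(np.array([-1.0]), np.array([-3.0]),
                     np.array([-2.0]), np.array([-2.0]))
```

Minimizing this loss by gradient descent on the policy parameters is what replaces the reward-model-plus-REINFORCE loop of the RLHF notebook.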