Relative Entropy Pathwise Policy Optimization
On-policy value-based reinforcement learning without endless hyperparameter tuning
This repository contains the official implementation of REPPO (Relative Entropy Pathwise Policy Optimization); see the accompanying arXiv paper.
We provide reference implementations of the REPPO algorithm, as well as the raw results for our experiments.
Our repo provides you with the core algorithm and the following features:
- JAX and PyTorch support: whichever framework you prefer, you can use the algorithm out of the box
- Modern installation: the algorithm and its environment dependencies install with a single command
- Fast and reliable learning: REPPO is competitive in wall-clock time with approaches such as FastTD3 and PPO, while learning reliably and with minimal hyperparameter tuning
HoReKa Cluster Setup
Added by Dominik
Installation on HoReKa
1. Clone the repository and navigate into it:

```bash
git clone <repository-url>
cd reppo
```

2. Create a virtual environment with Python 3.12:

```bash
python3.12 -m venv .venv
source .venv/bin/activate
```

3. Install the package and dependencies:

```bash
pip install --upgrade pip
pip install -e .
# Install playground from git (required for MJX environments)
pip install git+https://github.com/younggyoseo/mujoco_playground
```
Running on HoReKa
The repository includes pre-configured SLURM scripts with wandb integration:
Quick Start
```bash
# Submit a ManiSkill job
./submit_job.sh maniskill PickCube-v1 mjx_dmc_medium_data

# Submit a Brax job
./submit_job.sh brax ant mjx_dmc_small_data
```
Manual Job Submission
```bash
# Submit ManiSkill experiments
sbatch slurm/run_reppo_maniskill.sh

# Submit Brax experiments
sbatch slurm/run_reppo_brax.sh

# With a custom environment
ENV_NAME=PlaceApple-v1 EXPERIMENT_TYPE=mjx_dmc_large_data sbatch slurm/run_reppo_maniskill.sh
```
Supported Environments
ManiSkill environments: `PickCube-v1`, `PlaceApple-v1`, `StackCube-v1`, `PegInsertionSide-v1`

Brax environments: `ant`, `cheetah`, `hopper`, `walker2d`, `humanoid`

Experiment types:

- `mjx_dmc_small_data` (32k samples)
- `mjx_dmc_medium_data` (512k samples)
- `mjx_dmc_large_data` (1M samples)
Monitoring Jobs
```bash
# Check job status
squeue -u $USER

# View live logs
tail -f logs/reppo_maniskill_<job_id>.out
tail -f logs/reppo_brax_<job_id>.out
```
All experiments automatically log to wandb with your configured credentials. Results will appear in the `reppo_maniskill` and `reppo_brax` projects.
Critical Issues in Official Repository
⚠️ The official REPPO repository is not runnable due to a series of fatal bugs. These issues were discovered and fixed during HoReKa cluster deployment:
Fixes Applied to Original Repository Issues
1. Missing MUON Optimizer
- Issue: `ImportError: cannot import name 'muon'` on line 27 of `reppo_alg/jaxrl/reppo.py`
- Root cause: missing `muon.py` file in the repository
- Fix applied: replaced all `muon.muon(lr)` calls with `optax.adam(lr)`, as suggested in the code comments
2. Hydra Configuration Issues
- Issue: `Could not override 'env_name'` and `Could not override 'experiment_override'`
- Root cause: incorrect Hydra parameter paths for environment and experiment configuration
- Fix applied: use `env.name=<env>` instead of `env_name=<env>`, and direct hyperparameter overrides instead of `experiment_override`
3. BraxGymnaxWrapper Method Signatures
- Issue: `TypeError: BraxGymnaxWrapper.action_space() takes 1 positional argument but 2 were given`
- Root cause: inconsistent method signatures between different environment wrappers
- Fix applied: added an optional `params=None` parameter to the `action_space()` and `observation_space()` methods in `BraxGymnaxWrapper`
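The fix amounts to tolerating the extra positional argument that Gymnax-style callers pass. A minimal sketch of the idea (the class and return values below are illustrative stand-ins, not the repository's code):

```python
class BraxGymnaxWrapperSketch:
    """Illustrative stand-in for the fixed wrapper: accepts the `params`
    argument that Gymnax-style callers pass positionally."""

    def action_space(self, params=None):
        # `params` is unused by Brax but must be accepted so that callers
        # following the Gymnax convention `env.action_space(env_params)` work.
        return "Box(-1.0, 1.0)"

    def observation_space(self, params=None):
        return "Box(-inf, inf)"

env = BraxGymnaxWrapperSketch()
# Both calling conventions now work without a TypeError:
print(env.action_space() == env.action_space(None))  # True
```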
4. Training Loop Division by Zero
- Issue: `ZeroDivisionError: integer division or modulo by zero` in the training loop calculation
- Root cause: `eval_interval` is computed as 0 when `total_time_steps` is too small relative to the batch size
- Fix applied: increased the minimum `total_time_steps` to 1,000,000 to ensure proper evaluation intervals
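The failure mode is easy to reproduce with integer division. A sketch of the arithmetic (variable names and the `max(1, ...)` guard are illustrative; the applied fix simply raises the step budget):

```python
def eval_interval(total_time_steps: int, batch_size: int, num_evals: int) -> int:
    # With a small budget, total_time_steps // batch_size is 0, and a later
    # `step % eval_interval` raises ZeroDivisionError.
    num_updates = total_time_steps // batch_size
    return max(1, num_updates // num_evals)  # guard against 0

# Too-small budget: 10_000 // 32_768 == 0 updates, guarded to 1
print(eval_interval(10_000, 32_768, 10))     # 1
# A 1,000,000-step minimum yields 30 updates and a usable interval
print(eval_interval(1_000_000, 32_768, 10))  # 3
```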
5. Incorrect Algorithm Name in Wandb
- Issue: wandb runs are named "resampling-sac-ant" instead of "reppo-*"
- Root cause: the config file incorrectly set `name: "sac"` instead of `name: "reppo"`
- Fix applied: changed `name: "sac"` to `name: "reppo"` in `config/reppo.yaml`
6. JAX Shape Broadcasting Error in BraxGymnaxWrapper
- Issue: `ValueError: Incompatible shapes for broadcasting: shapes=[(8, 15), (8,)]` during vectorized environment operations
- Root cause: `BraxGymnaxWrapper` was not properly vectorized for multi-environment operations
- Fix applied: added vectorization support to the `reset()` and `step()` methods using `jax.vmap`, handling both single and batched operations
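The shape mismatch ((8, 15) observations vs. (8,) rewards) is characteristic of a per-environment function being fed a batch; `jax.vmap` lifts such a function over the leading environment axis. A minimal sketch of the pattern with toy dynamics (not the wrapper's actual `step()`):

```python
import jax
import jax.numpy as jnp

def step_single(state, action):
    # Toy per-environment step: 15-dim state in, scalar reward out.
    next_state = state + action
    reward = -jnp.sum(next_state ** 2)
    return next_state, reward

# Lift the single-env function to a batch of environments.
step_batched = jax.vmap(step_single)

states = jnp.zeros((8, 15))   # 8 envs, 15-dim observations
actions = jnp.ones((8, 15))
next_states, rewards = step_batched(states, actions)
print(next_states.shape, rewards.shape)  # (8, 15) (8,)
```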
Summary: Fixed 6 critical bugs that prevented the original repository from running. The algorithm now successfully runs with 256 parallel environments and proper wandb integration, achieving strong learning performance (episode returns improving from ~-100 to ~400+ in ant environment).
Original README
Installation
We strongly recommend using the uv tool for Python dependency management.
With uv installed, you can install the project and all dependencies into a local virtual environment under `.venv` with a single command:

```bash
uv sync
```
Our installation requires a GPU with CUDA 12 compatible drivers.
If you use other dependency management tools such as conda, create a new environment with Python 3.12 and install our package with:

```bash
pip install -e .
```
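After installing, a quick sanity check that JAX can see your accelerator may save a failed run later; this snippet only assumes jax is importable (it falls back to CPU when no CUDA device is visible):

```python
import jax

# Lists the devices JAX can use; with CUDA 12 drivers you should see a
# GPU entry, otherwise JAX silently falls back to CPU.
print(jax.devices())
print(jax.default_backend())
```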
Note
Several mujoco_playground environments, such as the Humanoid tasks, are currently unstable. If environments result in nans, we have simply rerun our experiments manually. As soon as these issues are solved upstream, we will update our dependencies.
Note
To provide a level comparison with prior work, we depend on the FastTD3 fork of mujoco_playground. As soon as proper terminal-state observation handling is merged into the main repository, we will update our dependencies.
Running Experiments
The main code for the algorithm is in `src/jaxrl/reppo.py` and `src/torchrl/reppo.py` respectively.
In our tests, both versions produce similar returns up to seed variance.
However, due to slight variations in the frameworks, we cannot always guarantee this.
For maximum speed, we highly recommend using our JAX version. The torch version can result in slow experiments depending on the CPU/GPU configuration, as sampling from a squashed Gaussian is not implemented efficiently in the torch framework. This can stall the GPU when the CPU cannot provide instructions and kernels fast enough.
Our configurations are handled with hydra.cc. This means parameters can be overridden using the syntax

```bash
python src/jaxrl/reppo.py PARAMETER=VALUE
```
By default, the environment type and name need to be provided.
Currently the JAX version supports `env=mjx_dmc`, `env=mjx_humanoid`, `env=brax`, and `env=humanoid_brax`. The latter is treated as a separate environment because its reward scale is much larger than that of the other Brax environments, and the min and max Q values need to be tracked per environment.
The torch version supports `env=mjx_dmc` and `env=maniskill`. We additionally provide wrappers for isaaclab, but these are still under development and might not work out of the box.
The paper experiments can be reproduced easily by using the `experiment_override` settings.
For example, by specifying `experiment_override=mjx_dmc_small_data`, you can run the variant of REPPO with a batch size of 32k samples.
Contributing
We welcome contributions! Please feel free to submit issues and pull requests.
License
This project is licensed under the MIT License; see the LICENCE file for details. The repository builds on prior code from the PureJaxRL and FastTD3 projects, and we thank the respective authors for making their work available as open source. We include the appropriate licences in ours.
Citation
```bibtex
@article{voelcker2025reppo,
  title     = {Relative Entropy Pathwise Policy Optimization},
  author    = {Voelcker, Claas and Brunnbauer, Axel and Hussing, Marcel and Nauman, Michal and Abbeel, Pieter and Eaton, Eric and Grosu, Radu and Farahmand, Amir-massoud and Gilitschenski, Igor},
  booktitle = {preprint},
  year      = {2025},
}
```