# Relative Entropy Pathwise Policy Optimization

## On-policy value-based reinforcement learning without endless hyperparameter tuning

This repository contains the official implementation of REPPO (Relative Entropy Pathwise Policy Optimization) [arXiv paper link](https://arxiv.org/abs/2507.11019). We provide reference implementations of the REPPO algorithm, as well as the raw results for our experiments.

Our repo provides you with the core algorithm and the following features:

- Jax and Torch support: no matter what your favorite framework is, you can use the algorithm out of the box
- Modern installation: our algorithm and environment dependencies can be installed with a single command
- Fast and reliable learning: REPPO is competitive in wall-clock time with approaches such as FastTD3 and PPO, while learning reliably and with minimal hyperparameter tuning

## HoReKa Cluster Setup

*Added by Dominik*

### Installation on HoReKa

The original repo recommends `uv`, but I prefer vanilla Python and that seems to work:

1. **Clone the repository and navigate to it:**

   ```bash
   git clone
   cd reppo
   ```

2. **Create a virtual environment with Python 3.12:**

   ```bash
   python3.12 -m venv .venv
   source .venv/bin/activate
   ```

3. **Install the package and dependencies:**

   ```bash
   pip install --upgrade pip
   pip install -e .
   ```
   ```bash
   # Install playground from git (required for MJX environments)
   pip install git+https://github.com/younggyoseo/mujoco_playground
   ```

### Running on HoReKa

The repository includes pre-configured SLURM scripts with wandb integration:

#### Quick Start

```bash
./submit_job.sh brax ant mjx_dmc_small_data
```

#### Manual Job Submission

```bash
# Submit Brax experiments
sbatch slurm/run_reppo_brax.sh

# Submit DMC experiments
python submit_dmc_experiments.py --seeds 3

# With custom environment
ENV_NAME=PlaceApple-v1 EXPERIMENT_TYPE=mjx_dmc_large_data sbatch slurm/run_reppo_maniskill.sh
```

#### Supported Environments

**ManiSkill environments:**
- `PickCube-v1`, `PlaceApple-v1`, `StackCube-v1`, `PegInsertionSide-v1`, ...

**Brax environments:**
- `ant`, `cheetah`, `hopper`, `walker2d`, `humanoid`

**Experiment types:**
- `mjx_dmc_small_data` (32k samples)
- `mjx_dmc_medium_data` (512k samples)
- `mjx_dmc_large_data` (1M samples)

#### Monitoring Jobs

```bash
# Check job status
squeue -u $USER

# View live logs
tail -f logs/reppo_maniskill_.out
tail -f logs/reppo_brax_.out
```

#### Critical Issues in Official Repository

⚠️ **The official REPPO repository is not runnable due to a series of fatal bugs.** These issues were discovered and fixed during HoReKa cluster deployment:

#### Fixes Applied to Original Repository Issues

**1. Missing MUON Optimizer**
- **Issue**: `ImportError: cannot import name 'muon'` on line 27 of `reppo_alg/jaxrl/reppo.py`
- **Root cause**: Missing `muon.py` file in the repository
- **Fix applied**: Replaced all `muon.muon(lr)` calls with `optax.adam(lr)` as suggested in code comments

**2. Hydra Configuration Issues**
- **Issue**: `Could not override 'env_name'` and `Could not override 'experiment_override'`
- **Root cause**: Incorrect Hydra parameter paths for environment and experiment configuration
- **Fix applied**: Use `env.name=` instead of `env_name=`, and direct hyperparameter overrides instead of `experiment_override`
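Taken together, the two configuration fixes above translate into an invocation roughly like the following. This is a sketch: the script path is taken from the import error reported in fix 1, and `total_time_steps` is the hyperparameter mentioned in the division-by-zero fix; exact names may differ per checkout.

```bash
# Select the environment group with env=..., set the environment via Hydra's
# dotted syntax (env.name=..., not env_name=...), and pass hyperparameters
# directly instead of using experiment_override.
python reppo_alg/jaxrl/reppo.py env=brax env.name=ant total_time_steps=1000000
```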
**3. BraxGymnaxWrapper Method Signatures**
- **Issue**: `TypeError: BraxGymnaxWrapper.action_space() takes 1 positional argument but 2 were given`
- **Root cause**: Inconsistent method signatures between different environment wrappers
- **Fix applied**: Added an optional `params=None` parameter to the `action_space()` and `observation_space()` methods in BraxGymnaxWrapper

**4. Training Loop Division by Zero**
- **Issue**: `ZeroDivisionError: integer division or modulo by zero` in the training loop calculation
- **Root cause**: `eval_interval` is calculated as 0 when `total_time_steps` is too small relative to the batch size
- **Fix applied**: Increased the minimum `total_time_steps` to 1,000,000 to ensure proper evaluation intervals

**5. Incorrect Algorithm Name in Wandb**
- **Issue**: Wandb runs show the name "resampling-sac-ant" instead of "reppo-*"
- **Root cause**: Config file incorrectly set `name: "sac"` instead of `name: "reppo"`
- **Fix applied**: Changed `name: "sac"` to `name: "reppo"` in `config/reppo.yaml`

**6. JAX Shape Broadcasting Error in BraxGymnaxWrapper**
- **Issue**: `ValueError: Incompatible shapes for broadcasting: shapes=[(8, 15), (8,)]` during vectorized environment operations
- **Root cause**: BraxGymnaxWrapper wasn't properly vectorized for multi-environment operations
- **Fix applied**: Added proper vectorization support to the `reset()` and `step()` methods using `jax.vmap`, handling both single and batched operations

---

## Original README

## Installation

We strongly recommend using the [uv tool](https://docs.astral.sh/uv/getting-started/installation/) for Python dependency management. With uv installed, you can install the project and all dependencies in a local virtual environment under `.venv` with one single command:

```bash
uv sync
```

Our installation requires a GPU with CUDA 12 compatible drivers. If you use other dependency management tools such as conda, create a new environment with Python 3.12 and install our package with

```bash
pip install -e .
```
> [!NOTE]
> Several mujoco_playground environments, such as the Humanoid tasks, are currently unstable. If environments produce NaNs, we have simply rerun our experiments manually. As soon as these issues are resolved upstream, we will update our dependencies.

> [!NOTE]
> To provide a level comparison with prior work, we depend on the FastTD3 fork of mujoco_playground. As soon as proper terminal-state observation handling is merged into the main repository, we will update our dependencies.

## Running Experiments

The main code for the algorithm is in `src/jaxrl/reppo.py` and `src/torchrl/reppo.py` respectively. In our tests, both versions produce similar returns up to seed variance. However, due to slight variations in the frameworks, we cannot always guarantee this.

For maximum speed, we highly recommend using our jax version. The torch version can result in slow experiments depending on the CPU/GPU configuration, as sampling from a squashed Gaussian is not implemented efficiently in the torch framework. This can result in cases where the GPU is stalled if the CPU cannot provide instructions and kernels fast enough.

Our configurations are handled with [hydra.cc](https://hydra.cc/). This means parameters can be overridden using the syntax

```bash
python src/jaxrl/reppo.py PARAMETER=VALUE
```

By default, the environment type and name need to be provided. Currently the jax version supports `env=mjx_dmc`, `env=mjx_humanoid`, `env=brax`, and `env=humanoid_brax`. The latter is treated as a separate environment, as the reward scale is much larger than in other brax environments, and the min and max Q values need to be tracked per environment. The torch version supports `env=mjx_dmc` and `env=maniskill`. We additionally provide wrappers for isaaclab, but these are still under development and might not work out of the box.

The paper experiments can be reproduced easily by using the `experiment_override` settings.
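The squashed-Gaussian sampling mentioned above amounts to drawing a Gaussian sample, squashing it with `tanh`, and applying a change-of-variables correction to the log-density. The following is a minimal pure-Python sketch of that distribution; it is an illustration only, not the repository's implementation, and the function name is ours.

```python
import math
import random

def sample_squashed_gaussian(mu, sigma, rng=random):
    """Sample a tanh-squashed Gaussian action and its log-density.

    Illustrative sketch: draws z ~ N(mu, sigma), squashes with tanh,
    and applies the change-of-variables correction log(1 - tanh(z)^2).
    """
    z = rng.gauss(mu, sigma)          # pre-squash Gaussian sample
    a = math.tanh(z)                  # action squashed into (-1, 1)
    # Gaussian log-density of z ...
    log_p = (-0.5 * ((z - mu) / sigma) ** 2
             - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))
    # ... minus log|da/dz| of the tanh squashing (small eps for stability)
    log_p -= math.log(1.0 - a * a + 1e-6)
    return a, log_p
```

The per-sample squash-and-correct step is the part that, per the note above, is not implemented efficiently in the torch framework and can leave the GPU waiting on the CPU.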
By specifying `experiment_override=mjx_dmc_small_data`, for example, you can run the variant of REPPO with a batch size of 32k samples.

## Contributing

We welcome contributions! Please feel free to submit issues and pull requests.

## License

This project is licensed under the MIT License -- see the [LICENSE](LICENSE) file for details.

The repository is built on prior code from the [PureJaxRL](https://github.com/luchris429/purejaxrl) and [FastTD3](https://github.com/younggyoseo/FastTD3) projects, and we thank the respective authors for making their work available as open source. We include the appropriate licenses in ours.

## Citation

```bibtex
@article{voelcker2025reppo,
  title     = {Relative Entropy Pathwise Policy Optimization},
  author    = {Voelcker, Claas and Brunnbauer, Axel and Hussing, Marcel and Nauman, Michal and Abbeel, Pieter and Eaton, Eric and Grosu, Radu and Farahmand, Amir-massoud and Gilitschenski, Igor},
  booktitle = {preprint},
  year      = {2025},
}
```