Add HoReKa cluster support with SLURM scripts and wandb integration

- Add SLURM job scripts for ManiSkill and Brax environments
- Add job submission helper script that validates the environment type
- Update README with HoReKa installation and usage instructions
- Create logs directory structure
- Configure wandb integration (requires external API key setup)
ys1087@partner.kit.edu 2025-07-22 16:15:36 +02:00
parent e2f99648ae
commit 137b9e80c9
4 changed files with 231 additions and 0 deletions


@@ -11,6 +11,83 @@ Our repo provides you with the core algorithm and the following features:
- Modern installation: Our algorithm and environment dependencies can be installed with a single command
- Fast and reliable learning: REPPO is wallclock time competitive with approaches such as FastTD3 and PPO, while learning reliably and with minimal hyperparameter tuning
## HoReKa Cluster Setup
*Added by Dominik*
### Installation on HoReKa
1. **Clone the repository and navigate to it:**
```bash
git clone <repository-url>
cd reppo
```
2. **Create virtual environment with Python 3.12:**
```bash
python3.12 -m venv .venv
source .venv/bin/activate
```
3. **Install the package and dependencies:**
```bash
pip install --upgrade pip
pip install -e .
```
### Running on HoReKa
The repository includes pre-configured SLURM scripts with wandb integration, plus a helper script (`submit_job.sh`) that submits them.
#### Quick Start
```bash
# Submit a ManiSkill job
./submit_job.sh maniskill PickCube-v1 mjx_dmc_medium_data
# Submit a Brax job
./submit_job.sh brax ant mjx_dmc_small_data
```
#### Manual Job Submission
```bash
# Submit ManiSkill experiments
sbatch slurm/run_reppo_maniskill.sh
# Submit Brax experiments
sbatch slurm/run_reppo_brax.sh
# With custom environment
ENV_NAME=PlaceApple-v1 EXPERIMENT_TYPE=mjx_dmc_large_data sbatch slurm/run_reppo_maniskill.sh
```
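The override in the last command works because the job scripts fall back to a default only when a variable is unset or empty. The sketch below reproduces that fallback pattern with the ManiSkill defaults; `show_job_config` is a hypothetical helper for illustration and is not part of the repository.

```bash
# Hypothetical helper: print the configuration a ManiSkill job would use.
# It mirrors the ${VAR:-default} fallbacks in slurm/run_reppo_maniskill.sh.
show_job_config() {
    local env_name="${ENV_NAME:-PickCube-v1}"
    local experiment_type="${EXPERIMENT_TYPE:-mjx_dmc_medium_data}"
    echo "env_name=$env_name experiment_override=$experiment_type"
}

# Usage:
#   show_job_config                          # uses both defaults
#   ENV_NAME=PlaceApple-v1 show_job_config   # overrides only the environment
```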
#### Supported Environments
**ManiSkill environments:**
- `PickCube-v1`, `PlaceApple-v1`, `StackCube-v1`, `PegInsertionSide-v1`
**Brax environments:**
- `ant`, `cheetah`, `hopper`, `walker2d`, `humanoid`
**Experiment types:**
- `mjx_dmc_small_data` (32k samples)
- `mjx_dmc_medium_data` (512k samples)
- `mjx_dmc_large_data` (1M samples)
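To run several of the environments listed above with one experiment type, the quick-start command can be wrapped in a loop. This is a minimal sketch: `submit_sweep` and the `DRY_RUN` variable are hypothetical names introduced here, and the function assumes it is called from the repository root where `submit_job.sh` lives.

```bash
# Hypothetical helper: submit one job per environment through submit_job.sh.
# With DRY_RUN=1 it only prints the commands instead of running them.
submit_sweep() {
    local env_type="$1"
    local experiment_type="$2"
    shift 2
    local env_name
    for env_name in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "./submit_job.sh $env_type $env_name $experiment_type"
        else
            ./submit_job.sh "$env_type" "$env_name" "$experiment_type"
        fi
    done
}

# Usage:
#   DRY_RUN=1 submit_sweep brax mjx_dmc_small_data ant hopper walker2d
```

Checking the printed commands with `DRY_RUN=1` first avoids queueing a batch of jobs with a mistyped environment name.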
#### Monitoring Jobs
```bash
# Check job status
squeue -u $USER
# View live logs
tail -f logs/reppo_maniskill_<job_id>.out
tail -f logs/reppo_brax_<job_id>.out
```
All experiments log to wandb automatically, provided `WANDB_API_KEY` and `WANDB_ENTITY` are set in the environment you submit from. Results appear in the `reppo_maniskill` and `reppo_brax` projects.
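The job scripts expect `WANDB_API_KEY` and `WANDB_ENTITY` to be present, and `sbatch` forwards the submitting shell's environment by default, so a missing credential would otherwise only surface once the job starts. A minimal sketch of a pre-submission check follows; `check_wandb_env` is a hypothetical helper, not part of this repository.

```bash
# Hypothetical helper: return non-zero if a wandb credential is missing.
check_wandb_env() {
    local missing=0
    if [ -z "${WANDB_API_KEY:-}" ]; then
        echo "WANDB_API_KEY is not set" >&2
        missing=1
    fi
    if [ -z "${WANDB_ENTITY:-}" ]; then
        echo "WANDB_ENTITY is not set" >&2
        missing=1
    fi
    return "$missing"
}

# Usage:
#   check_wandb_env && ./submit_job.sh brax ant mjx_dmc_small_data
```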
---
## Original README
## Installation
We strongly recommend using the [uv tool](https://docs.astral.sh/uv/getting-started/installation/) for python dependency management.

slurm/run_reppo_brax.sh Executable file

@@ -0,0 +1,52 @@
#!/bin/bash
#SBATCH --job-name=reppo_brax
#SBATCH --account=hk-project-p0022232
#SBATCH --partition=accelerated
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00
#SBATCH --mem=24G
#SBATCH --output=logs/reppo_brax_%j.out
#SBATCH --error=logs/reppo_brax_%j.err
# Load required modules
module load devel/cuda/12.4
# Set environment variables
export WANDB_MODE=online
export WANDB_PROJECT=reppo_brax
# Change to project directory (adjust this path to your own checkout)
cd /hkfs/home/project/hk-project-robolear/ys1087/Projects/reppo || exit 1
# Activate virtual environment
source .venv/bin/activate
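# WANDB_API_KEY and WANDB_ENTITY must be set in the submitting shell; abort early if not
: "${WANDB_API_KEY:?WANDB_API_KEY is not set}"
: "${WANDB_ENTITY:?WANDB_ENTITY is not set}"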
# Run REPPO with Brax environment
echo "Starting REPPO training with Brax..."
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "GPU: $CUDA_VISIBLE_DEVICES"
# Default environment: ant (can be overridden)
ENV_NAME=${ENV_NAME:-ant}
EXPERIMENT_TYPE=${EXPERIMENT_TYPE:-mjx_dmc_small_data}
echo "Environment: $ENV_NAME"
echo "Experiment type: $EXPERIMENT_TYPE"
# Run the experiment; exit with a non-zero status if training fails
python reppo_alg/jaxrl/reppo.py \
    env=brax \
    env_name="$ENV_NAME" \
    experiment_override="$EXPERIMENT_TYPE" \
    wandb.mode=online \
    wandb.entity="${WANDB_ENTITY}" \
    wandb.project="$WANDB_PROJECT" \
    wandb.name="reppo_${ENV_NAME}_${EXPERIMENT_TYPE}_${SLURM_JOB_ID}" \
    || { echo "Training failed" >&2; exit 1; }
echo "Training completed!"

slurm/run_reppo_maniskill.sh Executable file

@@ -0,0 +1,52 @@
#!/bin/bash
#SBATCH --job-name=reppo_maniskill
#SBATCH --account=hk-project-p0022232
#SBATCH --partition=accelerated
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=08:00:00
#SBATCH --mem=32G
#SBATCH --output=logs/reppo_maniskill_%j.out
#SBATCH --error=logs/reppo_maniskill_%j.err
# Load required modules
module load devel/cuda/12.4
# Set environment variables
export WANDB_MODE=online
export WANDB_PROJECT=reppo_maniskill
# Change to project directory (adjust this path to your own checkout)
cd /hkfs/home/project/hk-project-robolear/ys1087/Projects/reppo || exit 1
# Activate virtual environment
source .venv/bin/activate
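# WANDB_API_KEY and WANDB_ENTITY must be set in the submitting shell; abort early if not
: "${WANDB_API_KEY:?WANDB_API_KEY is not set}"
: "${WANDB_ENTITY:?WANDB_ENTITY is not set}"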
# Run REPPO with ManiSkill environment
echo "Starting REPPO training with ManiSkill..."
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "GPU: $CUDA_VISIBLE_DEVICES"
# Default environment: PickCube-v1 (can be overridden)
ENV_NAME=${ENV_NAME:-PickCube-v1}
EXPERIMENT_TYPE=${EXPERIMENT_TYPE:-mjx_dmc_medium_data}
echo "Environment: $ENV_NAME"
echo "Experiment type: $EXPERIMENT_TYPE"
# Run the experiment; exit with a non-zero status if training fails
python reppo_alg/jaxrl/reppo.py \
    env=maniskill \
    env_name="$ENV_NAME" \
    experiment_override="$EXPERIMENT_TYPE" \
    wandb.mode=online \
    wandb.entity="${WANDB_ENTITY}" \
    wandb.project="$WANDB_PROJECT" \
    wandb.name="reppo_${ENV_NAME}_${EXPERIMENT_TYPE}_${SLURM_JOB_ID}" \
    || { echo "Training failed" >&2; exit 1; }
echo "Training completed!"

submit_job.sh Executable file

@@ -0,0 +1,50 @@
#!/bin/bash
# Submit REPPO jobs to SLURM
# Usage: ./submit_job.sh [environment_type] [env_name] [experiment_type]
set -e
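SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"
# SLURM does not create the directory for --output/--error files, so make sure it exists
mkdir -p logs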
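# Default values (the environment and experiment defaults follow the chosen environment type)
ENV_TYPE=${1:-maniskill}
if [ "$ENV_TYPE" = "brax" ]; then
    ENV_NAME=${2:-ant}
    EXPERIMENT_TYPE=${3:-mjx_dmc_small_data}
else
    ENV_NAME=${2:-PickCube-v1}
    EXPERIMENT_TYPE=${3:-mjx_dmc_medium_data}
fi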
echo "Submitting REPPO job..."
echo "Environment type: $ENV_TYPE"
echo "Environment name: $ENV_NAME"
echo "Experiment type: $EXPERIMENT_TYPE"
case "$ENV_TYPE" in
    maniskill)
        echo "Submitting ManiSkill job..."
        ENV_NAME="$ENV_NAME" EXPERIMENT_TYPE="$EXPERIMENT_TYPE" sbatch slurm/run_reppo_maniskill.sh
        ;;
    brax)
        echo "Submitting Brax job..."
        ENV_NAME="$ENV_NAME" EXPERIMENT_TYPE="$EXPERIMENT_TYPE" sbatch slurm/run_reppo_brax.sh
        ;;
    *)
        # Report invalid input on stderr and stop before anything is submitted
        echo "Unknown environment type: $ENV_TYPE" >&2
        echo "Supported types: maniskill, brax" >&2
        exit 1
        ;;
esac
echo ""
echo "Job submitted! Check status with:"
echo " squeue -u $USER"
echo ""
echo "Check logs in: logs/ directory"
echo ""
echo "Available ManiSkill environments:"
echo " PickCube-v1, PlaceApple-v1, StackCube-v1, PegInsertionSide-v1"
echo ""
echo "Available Brax environments:"
echo " ant, cheetah, hopper, walker2d, humanoid"
echo ""
echo "Available experiment types:"
echo " mjx_dmc_small_data (32k), mjx_dmc_medium_data (512k), mjx_dmc_large_data (1M)"