Add HoReKa cluster support with SLURM scripts and wandb integration
- Add SLURM job scripts for ManiSkill and Brax environments
- Add job submission helper script with environment validation
- Update README with HoReKa installation and usage instructions
- Create logs directory structure
- Configure wandb integration (requires external API key setup)
parent e2f99648ae
commit 137b9e80c9
77
README.md
@@ -11,6 +11,83 @@ Our repo provides you with the core algorithm and the following features:
- Modern installation: Our algorithm and environment dependencies can be installed with a single command
- Fast and reliable learning: REPPO is wallclock time competitive with approaches such as FastTD3 and PPO, while learning reliably and with minimal hyperparameter tuning

## HoReKa Cluster Setup
*Added by Dominik*

### Installation on HoReKa

1. **Clone the repository and navigate to it:**
```bash
git clone <repository-url>
cd reppo
```

2. **Create virtual environment with Python 3.12:**
```bash
python3.12 -m venv .venv
source .venv/bin/activate
```

3. **Install the package and dependencies:**
```bash
pip install --upgrade pip
pip install -e .
```

### Running on HoReKa

The repository includes pre-configured SLURM scripts with wandb integration:

#### Quick Start
```bash
# Submit a ManiSkill job
./submit_job.sh maniskill PickCube-v1 mjx_dmc_medium_data

# Submit a Brax job
./submit_job.sh brax ant mjx_dmc_small_data
```

#### Manual Job Submission
```bash
# Submit ManiSkill experiments
sbatch slurm/run_reppo_maniskill.sh

# Submit Brax experiments
sbatch slurm/run_reppo_brax.sh

# With custom environment
ENV_NAME=PlaceApple-v1 EXPERIMENT_TYPE=mjx_dmc_large_data sbatch slurm/run_reppo_maniskill.sh
```
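
The `VAR=value sbatch ...` form relies on two things: `sbatch` passes the submitting shell's environment to the job by default, and the job scripts fall back to a built-in value when a variable is unset. The fallback can be sketched as follows; `resolve_setting` is a hypothetical helper used only for illustration and is not part of the repository.

```bash
#!/bin/bash
# Sketch only: resolve_setting is a hypothetical helper, not part of the repository.
# It mirrors the ${VAR:-default} expansion used in the job scripts.
resolve_setting() {
    local name="$1" default="$2"
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    printf '%s\n' "${!name:-$default}"
}

# With nothing set, the default wins.
unset ENV_NAME
resolve_setting ENV_NAME PickCube-v1

# A per-command assignment overrides it for that one invocation only.
ENV_NAME=PlaceApple-v1 resolve_setting ENV_NAME PickCube-v1
```

If a site configures `sbatch` not to export the environment, the variables would have to be passed explicitly instead.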

#### Supported Environments

**ManiSkill environments:**
- `PickCube-v1`, `PlaceApple-v1`, `StackCube-v1`, `PegInsertionSide-v1`

**Brax environments:**
- `ant`, `cheetah`, `hopper`, `walker2d`, `humanoid`

**Experiment types:**
- `mjx_dmc_small_data` (32k samples)
- `mjx_dmc_medium_data` (512k samples)
- `mjx_dmc_large_data` (1M samples)
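
To run several of the listed environments across all three experiment types, it can help to print the submission commands first and review them. The sketch below does only that; `print_sweep` is a hypothetical helper, not something the repository provides.

```bash
#!/bin/bash
# Sketch only: print_sweep is a hypothetical helper, not part of the repository.
# It prints one submit_job.sh command per (environment, experiment type) pair
# so the sweep can be reviewed before anything is submitted.
print_sweep() {
    local env_type="$1"; shift
    local experiment_types=(mjx_dmc_small_data mjx_dmc_medium_data mjx_dmc_large_data)
    local env_name experiment_type
    for env_name in "$@"; do
        for experiment_type in "${experiment_types[@]}"; do
            printf './submit_job.sh %s %s %s\n' "$env_type" "$env_name" "$experiment_type"
        done
    done
}

# Dry run over two of the Brax environments listed above.
print_sweep brax ant hopper
```

Once the printed commands look right, they can be piped to `bash` from the repository root to submit them.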

#### Monitoring Jobs
```bash
# Check job status
squeue -u $USER

# View live logs
tail -f logs/reppo_maniskill_<job_id>.out
tail -f logs/reppo_brax_<job_id>.out
```

All experiments automatically log to wandb with your configured credentials. Results will appear in projects `reppo_maniskill` and `reppo_brax`.
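
The job scripts note that `WANDB_API_KEY` and `WANDB_ENTITY` must be set before running, but they do not check for them. A small pre-submission check might look like the sketch below; `missing_vars` is a hypothetical helper, not part of the repository.

```bash
#!/bin/bash
# Sketch only: missing_vars is a hypothetical helper, not part of the repository.
# It prints the name of every listed variable that is unset or empty and
# returns non-zero if at least one is missing.
missing_vars() {
    local name status=0
    for name in "$@"; do
        if [ -z "${!name:-}" ]; then
            printf '%s\n' "$name"
            status=1
        fi
    done
    return "$status"
}

# Example: warn before submitting when wandb credentials are absent.
if ! missing=$(missing_vars WANDB_API_KEY WANDB_ENTITY); then
    echo "Set before submitting: $(printf '%s' "$missing" | tr '\n' ' ')" >&2
fi
```

The check only reports variable names, so the API key itself never appears in job logs.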

---

## Original README

## Installation

We strongly recommend using the [uv tool](https://docs.astral.sh/uv/getting-started/installation/) for python dependency management.

52
slurm/run_reppo_brax.sh
Executable file
@@ -0,0 +1,52 @@
#!/bin/bash
#SBATCH --job-name=reppo_brax
#SBATCH --account=hk-project-p0022232
#SBATCH --partition=accelerated
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00
#SBATCH --mem=24G
#SBATCH --output=logs/reppo_brax_%j.out
#SBATCH --error=logs/reppo_brax_%j.err

# Load required modules
module load devel/cuda/12.4

# Set environment variables
export WANDB_MODE=online
export WANDB_PROJECT=reppo_brax

# Change to project directory
cd /hkfs/home/project/hk-project-robolear/ys1087/Projects/reppo

# Activate virtual environment
source .venv/bin/activate

# Note: Ensure WANDB_API_KEY and WANDB_ENTITY are set before running

# Run REPPO with Brax environment
echo "Starting REPPO training with Brax..."
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "GPU: $CUDA_VISIBLE_DEVICES"

# Default environment: ant (can be overridden)
ENV_NAME=${ENV_NAME:-ant}
EXPERIMENT_TYPE=${EXPERIMENT_TYPE:-mjx_dmc_small_data}

echo "Environment: $ENV_NAME"
echo "Experiment type: $EXPERIMENT_TYPE"

# Run the experiment
python reppo_alg/jaxrl/reppo.py \
    env=brax \
    env_name=$ENV_NAME \
    experiment_override=$EXPERIMENT_TYPE \
    wandb.mode=online \
    wandb.entity=${WANDB_ENTITY} \
    wandb.project=$WANDB_PROJECT \
    wandb.name="reppo_${ENV_NAME}_${EXPERIMENT_TYPE}_${SLURM_JOB_ID}"

echo "Training completed!"
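
Review note: the script prints "Training completed!" whether or not the `python` command succeeded, and because that `echo` is the last command, the batch script exits with status 0 even after a failed run. One possible way to keep the exit status is sketched below; `run_and_report` is a hypothetical wrapper, not part of this commit.

```bash
#!/bin/bash
# Sketch only: run_and_report is a hypothetical wrapper, not part of this commit.
# It runs the given command, reports success or failure, and returns the
# command's exit status so the batch script can exit with it.
run_and_report() {
    local status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        echo "Training completed!"
    else
        echo "Training failed with exit code $status" >&2
    fi
    return "$status"
}

# Stand-in commands; in the job script this would wrap the python invocation.
run_and_report true
run_and_report false || echo "non-zero status propagated"
```

Placed as the final command of the job script, the wrapper's return value becomes the script's exit status, which is what SLURM records for the job.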
52
slurm/run_reppo_maniskill.sh
Executable file
@@ -0,0 +1,52 @@
#!/bin/bash
#SBATCH --job-name=reppo_maniskill
#SBATCH --account=hk-project-p0022232
#SBATCH --partition=accelerated
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=08:00:00
#SBATCH --mem=32G
#SBATCH --output=logs/reppo_maniskill_%j.out
#SBATCH --error=logs/reppo_maniskill_%j.err

# Load required modules
module load devel/cuda/12.4

# Set environment variables
export WANDB_MODE=online
export WANDB_PROJECT=reppo_maniskill

# Change to project directory
cd /hkfs/home/project/hk-project-robolear/ys1087/Projects/reppo

# Activate virtual environment
source .venv/bin/activate

# Note: Ensure WANDB_API_KEY and WANDB_ENTITY are set before running

# Run REPPO with ManiSkill environment
echo "Starting REPPO training with ManiSkill..."
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "GPU: $CUDA_VISIBLE_DEVICES"

# Default environment: PickCube-v1 (can be overridden)
ENV_NAME=${ENV_NAME:-PickCube-v1}
EXPERIMENT_TYPE=${EXPERIMENT_TYPE:-mjx_dmc_medium_data}

echo "Environment: $ENV_NAME"
echo "Experiment type: $EXPERIMENT_TYPE"

# Run the experiment
python reppo_alg/jaxrl/reppo.py \
    env=maniskill \
    env_name=$ENV_NAME \
    experiment_override=$EXPERIMENT_TYPE \
    wandb.mode=online \
    wandb.entity=${WANDB_ENTITY} \
    wandb.project=$WANDB_PROJECT \
    wandb.name="reppo_${ENV_NAME}_${EXPERIMENT_TYPE}_${SLURM_JOB_ID}"

echo "Training completed!"
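
Review note: both job scripts `cd` into one user's project path, so anyone else has to edit them before use. A possible alternative, sketched under the assumption that jobs are submitted from the repository root (which `submit_job.sh` arranges by changing to its own directory first), is to derive the path at run time. `resolve_project_dir` and `REPPO_DIR` are hypothetical names; `SLURM_SUBMIT_DIR` is the variable SLURM sets to the directory `sbatch` was invoked from.

```bash
#!/bin/bash
# Sketch only: resolve_project_dir and REPPO_DIR are hypothetical names, not
# part of this commit. SLURM_SUBMIT_DIR is set by SLURM to the directory
# from which sbatch was invoked.
resolve_project_dir() {
    # Preference order: explicit override, submission directory, current directory.
    printf '%s\n' "${REPPO_DIR:-${SLURM_SUBMIT_DIR:-$PWD}}"
}

# In a job script this could replace the hardcoded cd target:
#   cd "$(resolve_project_dir)" || exit 1
resolve_project_dir
```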
50
submit_job.sh
Executable file
@@ -0,0 +1,50 @@
#!/bin/bash

# Submit REPPO jobs to SLURM
# Usage: ./submit_job.sh [environment_type] [env_name] [experiment_type]

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"

# Default values
ENV_TYPE=${1:-maniskill}
ENV_NAME=${2:-PickCube-v1}
EXPERIMENT_TYPE=${3:-mjx_dmc_medium_data}

echo "Submitting REPPO job..."
echo "Environment type: $ENV_TYPE"
echo "Environment name: $ENV_NAME"
echo "Experiment type: $EXPERIMENT_TYPE"

case $ENV_TYPE in
    maniskill)
        echo "Submitting ManiSkill job..."
        ENV_NAME="$ENV_NAME" EXPERIMENT_TYPE="$EXPERIMENT_TYPE" sbatch slurm/run_reppo_maniskill.sh
        ;;
    brax)
        echo "Submitting Brax job..."
        ENV_NAME="$ENV_NAME" EXPERIMENT_TYPE="$EXPERIMENT_TYPE" sbatch slurm/run_reppo_brax.sh
        ;;
    *)
        echo "Unknown environment type: $ENV_TYPE"
        echo "Supported types: maniskill, brax"
        exit 1
        ;;
esac

echo ""
echo "Job submitted! Check status with:"
echo " squeue -u $USER"
echo ""
echo "Check logs in: logs/ directory"
echo ""
echo "Available ManiSkill environments:"
echo " PickCube-v1, PlaceApple-v1, StackCube-v1, PegInsertionSide-v1"
echo ""
echo "Available Brax environments:"
echo " ant, cheetah, hopper, walker2d, humanoid"
echo ""
echo "Available experiment types:"
echo " mjx_dmc_small_data (32k), mjx_dmc_medium_data (512k), mjx_dmc_large_data (1M)"
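
Review note: the commit message mentions environment validation, but the script as shown only validates the environment type; a mistyped environment name is still submitted and would fail only once the job starts. A name check could be sketched as below, using the lists the script already prints. `is_supported_env` is a hypothetical helper, not part of this commit.

```bash
#!/bin/bash
# Sketch only: is_supported_env is a hypothetical helper, not part of this
# commit. The names are copied from the lists that submit_job.sh prints.
is_supported_env() {
    local env_type="$1" env_name="$2" candidate
    local -a supported
    case "$env_type" in
        maniskill) supported=(PickCube-v1 PlaceApple-v1 StackCube-v1 PegInsertionSide-v1) ;;
        brax)      supported=(ant cheetah hopper walker2d humanoid) ;;
        *)         return 1 ;;
    esac
    for candidate in "${supported[@]}"; do
        [ "$candidate" = "$env_name" ] && return 0
    done
    return 1
}

is_supported_env brax ant && echo "ant is a listed Brax environment"
is_supported_env brax PickCube-v1 || echo "PickCube-v1 is not a Brax environment"
```

Called before the `case` statement, a failed check could print the supported names and exit without contacting the scheduler.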