Add HoReKa cluster support with SLURM scripts and wandb integration
- Add SLURM job scripts for ManiSkill and Brax environments
- Add job submission helper script with environment validation
- Update README with HoReKa installation and usage instructions
- Create logs directory structure
- Configure wandb integration (requires external API key setup)
parent e2f99648ae
commit 137b9e80c9
77
README.md
@@ -11,6 +11,83 @@ Our repo provides you with the core algorithm and the following features:
- Modern installation: Our algorithm and environment dependencies can be installed with a single command
- Fast and reliable learning: REPPO is wallclock time competitive with approaches such as FastTD3 and PPO, while learning reliably and with minimal hyperparameter tuning

## HoReKa Cluster Setup
*Added by Dominik*

### Installation on HoReKa

1. **Clone the repository and navigate to it:**
```bash
git clone <repository-url>
cd reppo
```

2. **Create a virtual environment with Python 3.12:**
```bash
python3.12 -m venv .venv
source .venv/bin/activate
```

3. **Install the package and dependencies:**
```bash
pip install --upgrade pip
pip install -e .
```

### Running on HoReKa

The repository includes pre-configured SLURM scripts with wandb integration:

#### Quick Start
```bash
# Submit a ManiSkill job
./submit_job.sh maniskill PickCube-v1 mjx_dmc_medium_data

# Submit a Brax job
./submit_job.sh brax ant mjx_dmc_small_data
```

#### Manual Job Submission
```bash
# Submit ManiSkill experiments
sbatch slurm/run_reppo_maniskill.sh

# Submit Brax experiments
sbatch slurm/run_reppo_brax.sh

# With custom environment
ENV_NAME=PlaceApple-v1 EXPERIMENT_TYPE=mjx_dmc_large_data sbatch slurm/run_reppo_maniskill.sh
```
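
The custom-environment form works because `sbatch` forwards the submitting shell's environment by default, and the SLURM scripts read `ENV_NAME` and `EXPERIMENT_TYPE` with the `${VAR:-default}` expansion. The sketch below isolates that fallback; `resolve_env_name` is a hypothetical helper used only for illustration and is not part of the repository.

```bash
#!/bin/bash
# Hypothetical helper (not in the repository): mirrors how the SLURM scripts
# pick ENV_NAME, falling back to a default when it is unset or empty.
resolve_env_name() {
    local default_name="$1"
    echo "${ENV_NAME:-$default_name}"
}

unset ENV_NAME
resolve_env_name ant                    # no override: prints "ant"
ENV_NAME=hopper resolve_env_name ant    # override for one call: prints "hopper"
```

Because `:-` also treats an empty value as missing, `ENV_NAME= sbatch ...` would still run the script's default environment.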

#### Supported Environments

**ManiSkill environments:**
- `PickCube-v1`, `PlaceApple-v1`, `StackCube-v1`, `PegInsertionSide-v1`

**Brax environments:**
- `ant`, `cheetah`, `hopper`, `walker2d`, `humanoid`

**Experiment types:**
- `mjx_dmc_small_data` (32k samples)
- `mjx_dmc_medium_data` (512k samples)
- `mjx_dmc_large_data` (1M samples)
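
A submission wrapper could check a requested name against the lists above before calling `sbatch`. The following is a minimal sketch under stated assumptions: `is_supported_env` is a hypothetical function that does not exist in `submit_job.sh`, and its name lists are copied from this README rather than queried from ManiSkill or Brax.

```bash
#!/bin/bash
# Hypothetical validator (not part of submit_job.sh).
# Returns 0 if env_name is listed for env_type, 1 otherwise.
is_supported_env() {
    local env_type="$1" env_name="$2" supported candidate
    case "$env_type" in
        maniskill) supported="PickCube-v1 PlaceApple-v1 StackCube-v1 PegInsertionSide-v1" ;;
        brax)      supported="ant cheetah hopper walker2d humanoid" ;;
        *)         return 1 ;;   # unknown environment type
    esac
    # Names contain no spaces, so plain word splitting is safe here.
    for candidate in $supported; do
        [ "$candidate" = "$env_name" ] && return 0
    done
    return 1
}

if is_supported_env brax ant; then
    echo "ok: brax/ant"
fi
if ! is_supported_env maniskill ant; then
    echo "rejected: maniskill/ant"
fi
```

Rejecting a mismatched pair at submission time avoids waiting in the queue for a job that fails as soon as it starts.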

#### Monitoring Jobs
```bash
# Check job status
squeue -u $USER

# View live logs
tail -f logs/reppo_maniskill_<job_id>.out
tail -f logs/reppo_brax_<job_id>.out
```

All experiments log to wandb automatically, provided `WANDB_API_KEY` and `WANDB_ENTITY` are set in your environment before submission. Results appear in the `reppo_maniskill` and `reppo_brax` projects.
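
The SLURM scripts only note that `WANDB_API_KEY` and `WANDB_ENTITY` must be set; they do not verify it. A pre-flight check along the following lines could catch a missing credential before the job is queued. This is a sketch, and `require_vars` is a hypothetical helper that is not part of the repository.

```bash
#!/bin/bash
# Hypothetical pre-flight check (not part of the repository).
# Reports every named variable that is unset or empty on stderr
# and returns non-zero if any is missing.
require_vars() {
    local name value missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"   # portable indirect lookup of the variable named by $name
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: only proceed when both credentials are present.
if require_vars WANDB_API_KEY WANDB_ENTITY; then
    echo "wandb credentials found"
else
    echo "set WANDB_API_KEY and WANDB_ENTITY before running sbatch" >&2
fi
```

Running the check in the submitting shell is sufficient as long as `sbatch` keeps its default behavior of exporting that shell's environment to the job.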

---

## Original README

## Installation

We strongly recommend using the [uv tool](https://docs.astral.sh/uv/getting-started/installation/) for python dependency management.
52
slurm/run_reppo_brax.sh
Executable file
@@ -0,0 +1,52 @@
#!/bin/bash
#SBATCH --job-name=reppo_brax
#SBATCH --account=hk-project-p0022232
#SBATCH --partition=accelerated
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00
#SBATCH --mem=24G
#SBATCH --output=logs/reppo_brax_%j.out
#SBATCH --error=logs/reppo_brax_%j.err

# Load required modules
module load devel/cuda/12.4

# Set environment variables
export WANDB_MODE=online
export WANDB_PROJECT=reppo_brax

# Change to project directory (abort if it does not exist)
cd /hkfs/home/project/hk-project-robolear/ys1087/Projects/reppo || exit 1

# Activate virtual environment
source .venv/bin/activate

# Note: Ensure WANDB_API_KEY and WANDB_ENTITY are set before running

# Run REPPO with Brax environment
echo "Starting REPPO training with Brax..."
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "GPU: $CUDA_VISIBLE_DEVICES"

# Default environment: ant (can be overridden)
ENV_NAME=${ENV_NAME:-ant}
EXPERIMENT_TYPE=${EXPERIMENT_TYPE:-mjx_dmc_small_data}

echo "Environment: $ENV_NAME"
echo "Experiment type: $EXPERIMENT_TYPE"

# Run the experiment (values quoted so unusual names are passed intact)
python reppo_alg/jaxrl/reppo.py \
    env=brax \
    env_name="$ENV_NAME" \
    experiment_override="$EXPERIMENT_TYPE" \
    wandb.mode=online \
    wandb.entity="${WANDB_ENTITY}" \
    wandb.project="$WANDB_PROJECT" \
    wandb.name="reppo_${ENV_NAME}_${EXPERIMENT_TYPE}_${SLURM_JOB_ID}"

echo "Training completed!"
52
slurm/run_reppo_maniskill.sh
Executable file
@@ -0,0 +1,52 @@
#!/bin/bash
#SBATCH --job-name=reppo_maniskill
#SBATCH --account=hk-project-p0022232
#SBATCH --partition=accelerated
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=08:00:00
#SBATCH --mem=32G
#SBATCH --output=logs/reppo_maniskill_%j.out
#SBATCH --error=logs/reppo_maniskill_%j.err

# Load required modules
module load devel/cuda/12.4

# Set environment variables
export WANDB_MODE=online
export WANDB_PROJECT=reppo_maniskill

# Change to project directory (abort if it does not exist)
cd /hkfs/home/project/hk-project-robolear/ys1087/Projects/reppo || exit 1

# Activate virtual environment
source .venv/bin/activate

# Note: Ensure WANDB_API_KEY and WANDB_ENTITY are set before running

# Run REPPO with ManiSkill environment
echo "Starting REPPO training with ManiSkill..."
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "GPU: $CUDA_VISIBLE_DEVICES"

# Default environment: PickCube-v1 (can be overridden)
ENV_NAME=${ENV_NAME:-PickCube-v1}
EXPERIMENT_TYPE=${EXPERIMENT_TYPE:-mjx_dmc_medium_data}

echo "Environment: $ENV_NAME"
echo "Experiment type: $EXPERIMENT_TYPE"

# Run the experiment (values quoted so unusual names are passed intact)
python reppo_alg/jaxrl/reppo.py \
    env=maniskill \
    env_name="$ENV_NAME" \
    experiment_override="$EXPERIMENT_TYPE" \
    wandb.mode=online \
    wandb.entity="${WANDB_ENTITY}" \
    wandb.project="$WANDB_PROJECT" \
    wandb.name="reppo_${ENV_NAME}_${EXPERIMENT_TYPE}_${SLURM_JOB_ID}"

echo "Training completed!"
50
submit_job.sh
Executable file
@@ -0,0 +1,50 @@
#!/bin/bash

# Submit REPPO jobs to SLURM
# Usage: ./submit_job.sh [environment_type] [env_name] [experiment_type]

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"

# Default values
ENV_TYPE=${1:-maniskill}
ENV_NAME=${2:-PickCube-v1}
EXPERIMENT_TYPE=${3:-mjx_dmc_medium_data}

echo "Submitting REPPO job..."
echo "Environment type: $ENV_TYPE"
echo "Environment name: $ENV_NAME"
echo "Experiment type: $EXPERIMENT_TYPE"

case $ENV_TYPE in
    maniskill)
        echo "Submitting ManiSkill job..."
        ENV_NAME="$ENV_NAME" EXPERIMENT_TYPE="$EXPERIMENT_TYPE" sbatch slurm/run_reppo_maniskill.sh
        ;;
    brax)
        echo "Submitting Brax job..."
        ENV_NAME="$ENV_NAME" EXPERIMENT_TYPE="$EXPERIMENT_TYPE" sbatch slurm/run_reppo_brax.sh
        ;;
    *)
        echo "Unknown environment type: $ENV_TYPE"
        echo "Supported types: maniskill, brax"
        exit 1
        ;;
esac

echo ""
echo "Job submitted! Check status with:"
echo "  squeue -u $USER"
echo ""
echo "Check logs in: logs/ directory"
echo ""
echo "Available ManiSkill environments:"
echo "  PickCube-v1, PlaceApple-v1, StackCube-v1, PegInsertionSide-v1"
echo ""
echo "Available Brax environments:"
echo "  ant, cheetah, hopper, walker2d, humanoid"
echo ""
echo "Available experiment types:"
echo "  mjx_dmc_small_data (32k), mjx_dmc_medium_data (512k), mjx_dmc_large_data (1M)"