Complete HoReKa README with experiment management tools

- Added reference to experiment_plan.md for current progress
- Updated running instructions with batch experiment commands
- Added monitoring tools and paper replication workflow
- Listed all available scripts and their purposes
- Complete install and run instructions for HoReKa users
ys1087@partner.kit.edu 2025-07-22 17:18:27 +02:00
parent e95c2c4e11
commit 69502c8911


@ -15,6 +15,8 @@ For more information, please see our [project webpage](https://younggyo.me/fast_
This repository includes optimized scripts for running FastTD3 on the HoReKa supercomputer cluster with SLURM job scheduling and wandb logging.

**📋 Current Progress**: See [experiment_plan.md](experiment_plan.md) for ongoing paper replication experiments and job tracking.
### Installation on HoReKa
```bash
@ -58,37 +60,55 @@ python test_setup.py
### Running on HoReKa

**Single job submission:**
```bash
python submit_job.py
```
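The diff does not show what `submit_job.py` does internally. As a minimal sketch, assuming a wrapper that checks wandb credentials and then calls `sbatch`, the hypothetical helpers below build the command with the real `sbatch` flags `--parsable` (print only the job ID) and `--export=ALL,...` (forward the environment plus extra variables). The credential check is deliberately simplified: it reads only `WANDB_MODE` and `WANDB_API_KEY` and ignores keys stored by `wandb login`.

```python
import os
import shlex
from typing import Dict, List, Mapping, Optional


def build_sbatch_command(
    script: str = "run_fasttd3.slurm",
    extra_env: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build (but do not run) an sbatch command line for one job."""
    # --parsable makes sbatch print only "<jobid>" or "<jobid>;<cluster>".
    cmd = ["sbatch", "--parsable"]
    if extra_env:
        # --export=ALL,... keeps the caller's environment and adds these variables.
        # Values must not contain commas, since commas separate the entries.
        pairs = ",".join(f"{key}={value}" for key, value in extra_env.items())
        cmd.append(f"--export=ALL,{pairs}")
    cmd.append(script)
    return cmd


def parse_job_id(parsable_output: str) -> int:
    """Extract the numeric job ID from `sbatch --parsable` output."""
    return int(parsable_output.strip().split(";")[0])


def wandb_key_available(env: Mapping[str, str] = os.environ) -> bool:
    """Simplified check: looks only at environment variables, not ~/.netrc."""
    mode = env.get("WANDB_MODE", "online")
    # Offline or disabled runs do not need credentials at submission time.
    return mode != "online" or bool(env.get("WANDB_API_KEY"))


if __name__ == "__main__":
    cmd = build_sbatch_command(extra_env={"WANDB_MODE": "online"})
    print(shlex.join(cmd))
    # Real submission (not executed in this sketch):
    #   out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    #   job_id = parse_job_id(out)
```

Printing the command instead of running it keeps the helper usable as a dry run on machines without SLURM.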
**Batch experiments (paper replication):**
```bash
# Submit Phase 1: MuJoCo Playground (4 tasks × 3 seeds)
python submit_experiment_batch.py --phase 1 --seeds 3
# Submit Phase 2: IsaacLab (6 tasks × 3 seeds)
python submit_experiment_batch.py --phase 2 --seeds 3
# Submit Phase 3: HumanoidBench (5 tasks × 3 seeds)
python submit_experiment_batch.py --phase 3 --seeds 3
```
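Each phase submits one job per task and seed, so the job count is tasks × seeds. The sketch below reproduces that arithmetic from the task counts in the commands above and shows one possible way to expand a task list into `sbatch` calls. It is an assumption, not the logic of `submit_experiment_batch.py`: the task names are placeholders, seeds are assumed to start at 0, and the `TASK`/`SEED` environment variables are hypothetical inputs to the SLURM script.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Task counts per phase, taken from the batch commands in this README.
PHASE_TASK_COUNTS: Dict[int, int] = {1: 4, 2: 6, 3: 5}


def jobs_per_phase(seeds: int = 3) -> Dict[int, int]:
    """Number of SLURM jobs each phase submits for a given seed count."""
    return {phase: n_tasks * seeds for phase, n_tasks in PHASE_TASK_COUNTS.items()}


def expand_jobs(tasks: Sequence[str], seeds: int) -> List[Tuple[str, int]]:
    """One (task, seed) pair per job; seeds numbered 0..seeds-1 (an assumption)."""
    return list(product(tasks, range(seeds)))


def batch_commands(
    tasks: Sequence[str], seeds: int, script: str = "run_fasttd3.slurm"
) -> List[List[str]]:
    """Build one sbatch command per (task, seed) pair without submitting anything."""
    commands = []
    for task, seed in expand_jobs(tasks, seeds):
        commands.append([
            "sbatch",
            f"--job-name=fasttd3_{task}_s{seed}",
            # Hypothetical: assumes the SLURM script reads TASK and SEED from the environment.
            f"--export=ALL,TASK={task},SEED={seed}",
            script,
        ])
    return commands


if __name__ == "__main__":
    counts = jobs_per_phase(seeds=3)
    print(counts, sum(counts.values()))
    for cmd in batch_commands(["task_a", "task_b"], seeds=2):
        print(" ".join(cmd))
```

With 3 seeds, the three phases come to 12, 18 and 15 jobs, or 45 jobs in total.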
**Monitor experiments:**
```bash
# Check job status
squeue -u $USER
# Monitor experiments with tracking
python monitor_experiments.py --watch
# View specific job output
tail -f logs/fasttd3_*.out
# Cancel job if needed
scancel <job_id>
```
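For a quick per-state overview without `monitor_experiments.py`, `squeue` can print a fixed set of fields (`%i` job ID, `%j` job name, `%T` extended state) that a script can count. The sketch below assumes FastTD3 jobs are named with a `fasttd3` prefix, which matches the log file pattern above but is not confirmed by the diff. The helper names are hypothetical.

```python
import getpass
import subprocess
from collections import Counter
from typing import List, Optional, Tuple

# %i = job ID, %j = job name, %T = job state in extended form (e.g. RUNNING).
SQUEUE_FORMAT = "%i %j %T"


def query_squeue(user: Optional[str] = None) -> str:
    """Return raw squeue output for one user (requires a SLURM login node)."""
    user = user or getpass.getuser()
    result = subprocess.run(
        ["squeue", "-u", user, "--noheader", f"--format={SQUEUE_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_squeue(text: str) -> List[Tuple[str, str, str]]:
    """Split each line into (job_id, name, state); names may contain spaces."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, rest = line.split(maxsplit=1)
        name, state = rest.rsplit(maxsplit=1)
        jobs.append((job_id, name, state))
    return jobs


def summarize(jobs: List[Tuple[str, str, str]], prefix: str = "fasttd3") -> Counter:
    """Count job states, keeping only jobs whose name starts with `prefix`."""
    return Counter(state for _, name, state in jobs if name.startswith(prefix))


if __name__ == "__main__":
    sample = "101 fasttd3_a_s0 RUNNING\n102 fasttd3_b_s1 PENDING\n103 other_job RUNNING\n"
    print(summarize(parse_squeue(sample)))
```

Parsing is kept separate from the `squeue` call so the logic can be checked against sample text off the cluster.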
**Manual submission:**
```bash
sbatch run_fasttd3.slurm
```
### Configuration

The setup includes:
- **SLURM scripts** (`run_fasttd3.slurm`, `run_fasttd3_full.slurm`) configured for HoReKa's `accelerated` GPU partition
- **Job helpers** (`submit_job.py`, `submit_experiment_batch.py`) for single and batch job submission
- **Monitoring tool** (`monitor_experiments.py`) for real-time experiment tracking
- **Test script** (`test_setup.py`) for environment verification
- **Experiment plan** (`experiment_plan.md`) with current progress and TODO tracking
- **MuJoCo Playground environment** (`T1JoystickFlatTerrain`) for humanoid control, working and tested
- **Automatic GPU detection** and CUDA 12.4 compatibility
- **Wandb logging**, online mode by default
- **Paper-accurate hyperparameters** for systematic replication
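The contents of `run_fasttd3.slurm` are not part of this diff. As a hedged sketch of what a job header for the `accelerated` partition could contain, the hypothetical `sbatch_header` helper below renders standard `#SBATCH` directives. The time limit, GPU count and output pattern are placeholder assumptions, not the repository's values.

```python
from typing import List


def sbatch_header(
    job_name: str = "fasttd3",
    gpus: int = 1,
    time_limit: str = "08:00:00",
    output: str = "logs/%x_%j.out",
) -> str:
    """Render #SBATCH directives for a GPU job (all values are placeholders)."""
    lines: List[str] = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --partition=accelerated",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --time={time_limit}",
        # %x expands to the job name and %j to the job ID, e.g. logs/fasttd3_12345.out.
        f"#SBATCH --output={output}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(sbatch_header(), end="")
```

Slurm does not create missing output directories, so with an output pattern like this the `logs/` directory has to exist before the job is submitted.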
### Wandb Integration