From 1f7ecc301f45768d1100668747b2c3cf89c15d32 Mon Sep 17 00:00:00 2001 From: Dominik Roth Date: Thu, 12 Mar 2026 17:38:20 +0100 Subject: [PATCH] docs: document NuconGoalEnv and HER training in README - Describe both NuconEnv and NuconGoalEnv with their obs/action spaces - Explain goal-conditioned approach and why HER is appropriate - Add SAC + HerReplayBuffer usage example with recommended hyperparams - Show how to inject a custom goal at inference time - List registered goal env presets Co-Authored-By: Claude Sonnet 4.6 --- README.md | 104 ++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 77 insertions(+), 27 deletions(-) diff --git a/README.md b/README.md index bf93189..6a796b6 100644 --- a/README.md +++ b/README.md @@ -123,18 +123,24 @@ To use you'll need to install `gymnasium` and `numpy`. You can do so via pip install -e '.[rl]' ``` -### RL Environment +### Environments -The `NuconEnv` class in `nucon/rl.py` provides a Gym-compatible environment for reinforcement learning tasks in the Nucleares simulation. Key features include: +Two environment classes are provided in `nucon/rl.py`: -- Observation space: Includes all readable parameters from the NuCon system. -- Action space: Encompasses all writable parameters in the NuCon system. -- Step function: Applies actions to the NuCon system and returns new observations. -- Objective function: Allows for predefined or custom objective functions to be defined for training. +**`NuconEnv`** — classic fixed-objective environment. You define one or more objectives at construction time (e.g. maximise power output, keep temperature in range). The agent always trains toward the same goal. -### Usage +- Observation space: all readable numeric parameters (~290 dims). +- Action space: all readable-back writable parameters (~30 dims): 9 individual rod bank positions, 3 MSCVs, 3 turbine bypass valves, 6 coolant pump speeds, condenser pump, freight/vent switches, resistor banks, and more. 
+- Objectives: predefined strings (`'max_power'`, `'episode_time'`) or arbitrary callables `(obs) -> float`. Multiple objectives are weighted-summed.
+
+**`NuconGoalEnv`** — goal-conditioned environment. The desired goal (e.g. target generator output) is sampled at the start of each episode and provided as part of the observation. A single policy learns to reach *any* goal in the specified range, making it far more useful than a fixed-objective agent. Designed for training with [Hindsight Experience Replay (HER)](https://arxiv.org/abs/1707.01495), which makes sparse-reward goal-conditioned training tractable.
+
+- Observation space: `Dict` with keys `observation` (non-goal params), `achieved_goal` (current goal param values, normalised to [0,1]), `desired_goal` (target, normalised to [0,1]).
+- Goals are sampled uniformly from the specified `goal_range` each episode.
+- Reward defaults to negative L2 distance in normalised goal space (dense). Pass `tolerance` for a sparse `{0, -1}` reward — this works particularly well with HER.
+
+### NuconEnv Usage
 
-Here's a basic example of how to use the RL environment:
 ```python
 from nucon.rl import NuconEnv, Parameterized_Objectives
@@ -154,44 +160,88 @@ env.close()
 
 The `objectives` argument takes either names of predefined objectives (as strings), or lambda functions which take an observation and return a scalar reward. Final rewards are (weighted) summed across all objectives. `info['objectives']` contains all objectives and their values.
 
-You can e.g. train an PPO agent using the [sb3](https://github.com/DLR-RM/stable-baselines3) implementation:
+You can e.g. train a PPO agent using the [sb3](https://github.com/DLR-RM/stable-baselines3) implementation:
 
 ```python
 from nucon.rl import NuconEnv
 from stable_baselines3 import PPO
 
 env = NuconEnv(objectives=['max_power'], seconds_per_step=5)
 
-# Create the PPO (Proximal Policy Optimization) model
 model = PPO(
-    "MlpPolicy", 
-    env, 
+    "MlpPolicy",
+    env,
     verbose=1,
-    learning_rate=3e-4,  # You can adjust hyperparameters as needed
-    n_steps=2048,
-    batch_size=64,
-    n_epochs=10,
-    gamma=0.99,
-    gae_lambda=0.95,
-    clip_range=0.2,
-    ent_coef=0.01
+    learning_rate=3e-4,
+    n_steps=2048,
+    batch_size=64,
+    n_epochs=10,
+    gamma=0.99,
+    gae_lambda=0.95,
+    clip_range=0.2,
+    ent_coef=0.01,
 )
+model.learn(total_timesteps=100_000)
 
-# Train the model
-model.learn(total_timesteps=100000)  # Adjust total_timesteps as needed
-
-# Test the trained model
 obs, info = env.reset()
 for _ in range(1000):
     action, _states = model.predict(obs, deterministic=True)
     obs, reward, terminated, truncated, info = env.step(action)
-    if terminated or truncated: obs, info = env.reset()
+    if terminated or truncated:
+        obs, info = env.reset()
-
-# Close the environment
 env.close()
 ```
 
+### NuconGoalEnv + HER Usage
+
+HER works by relabelling past trajectories with the goal that was *actually achieved*, turning every episode into useful training signal even when the agent never reaches the intended target. This makes it much more sample-efficient than standard RL for goal-reaching tasks — important given how slow the real game is.
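To make the relabelling concrete, here is a minimal sketch of the idea behind HER, using a hypothetical `sparse_reward` helper that mirrors the tolerance-based `{0, -1}` reward described above (sb3's `HerReplayBuffer` performs this relabelling inside the replay buffer automatically; you never write it yourself):

```python
import numpy as np

def sparse_reward(achieved, desired, tolerance=0.05):
    # 0 if the achieved goal is within tolerance of the desired goal
    # (distances measured in normalised [0, 1] goal space), else -1
    return 0.0 if np.linalg.norm(achieved - desired) <= tolerance else -1.0

# A transition recorded with the episode's original desired goal...
achieved = np.array([0.52, 0.50, 0.49])
original_goal = np.array([0.90, 0.90, 0.90])
print(sparse_reward(achieved, original_goal))    # -1.0: target missed, no signal

# ...is stored again with a goal the agent actually reached later in the
# episode, turning the same experience into a successful example.
relabelled_goal = np.array([0.53, 0.50, 0.48])
print(sparse_reward(achieved, relabelled_goal))  # 0.0: success under the new goal
```

With the `'future'` goal-selection strategy, relabelled goals are drawn from states visited later in the same episode, so every trajectory yields some successful transitions.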
+
+```python
+from nucon.rl import NuconGoalEnv
+from stable_baselines3 import SAC, HerReplayBuffer
+
+env = NuconGoalEnv(
+    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
+    goal_range={
+        'GENERATOR_0_KW': (0.0, 1200.0),
+        'GENERATOR_1_KW': (0.0, 1200.0),
+        'GENERATOR_2_KW': (0.0, 1200.0),
+    },
+    tolerance=0.05,  # sparse: within 5% of range counts as success (recommended with HER)
+    seconds_per_step=5,
+    simulator=simulator,  # use a pre-trained simulator for fast pre-training
+)
+# Or use a preset: env = gym.make('Nucon-goal_power-v0', simulator=simulator)
+
+model = SAC(
+    'MultiInputPolicy',
+    env,
+    replay_buffer_class=HerReplayBuffer,
+    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
+    verbose=1,
+    learning_rate=1e-3,
+    batch_size=256,
+    tau=0.005,
+    gamma=0.98,
+    train_freq=1,
+    gradient_steps=1,
+)
+model.learn(total_timesteps=500_000)
+```
+
+At inference time, inject any target by constructing the observation manually:
+```python
+import numpy as np
+
+obs, _ = env.reset()
+# Override the desired goal (values are normalised to [0,1] within goal_range)
+obs['desired_goal'] = np.array([0.8, 0.8, 0.8], dtype=np.float32)  # ~960 kW per generator
+action, _ = model.predict(obs, deterministic=True)
+```
+
+Predefined goal environments:
+- `Nucon-goal_power-v0`: target total generator output (3 × 0–1200 kW)
+- `Nucon-goal_temp-v0`: target core temperature (280–380 °C)
+
 But there's a problem: RL algorithms require a huge number of training steps to reach passable policies, and Nucleares is a very slow simulation that cannot be trivially parallelized. That's why NuCon also provides a
 
 ## Simulator (Work in Progress)