docs: document NuconGoalEnv and HER training in README
- Describe both NuconEnv and NuconGoalEnv with their obs/action spaces
- Explain goal-conditioned approach and why HER is appropriate
- Add SAC + HerReplayBuffer usage example with recommended hyperparams
- Show how to inject a custom goal at inference time
- List registered goal env presets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent 0dab7a6cec
commit 1f7ecc301f

README.md: 104 changed lines
@@ -123,18 +123,24 @@ To use you'll need to install `gymnasium` and `numpy`. You can do so via
```
pip install -e '.[rl]'
```

### Environments

Two environment classes are provided in `nucon/rl.py`:

**`NuconEnv`** — classic fixed-objective environment. You define one or more objectives at construction time (e.g. maximise power output, keep temperature in range). The agent always trains toward the same goal.
- Observation space: all readable numeric parameters (~290 dims).
- Action space: all readable-back writable parameters (~30 dims): 9 individual rod bank positions, 3 MSCVs, 3 turbine bypass valves, 6 coolant pump speeds, condenser pump, freight/vent switches, resistor banks, and more.
- Objectives: predefined strings (`'max_power'`, `'episode_time'`) or arbitrary callables `(obs) -> float`. Multiple objectives are weighted-summed.

**`NuconGoalEnv`** — goal-conditioned environment. The desired goal (e.g. target generator output) is sampled at the start of each episode and provided as part of the observation. A single policy learns to reach *any* goal in the specified range, making it far more useful than a fixed-objective agent. Designed for training with [Hindsight Experience Replay (HER)](https://arxiv.org/abs/1707.01495), which makes sparse-reward goal-conditioned training tractable.

- Observation space: `Dict` with keys `observation` (non-goal params), `achieved_goal` (current goal param values, normalised to [0,1]), `desired_goal` (target, normalised to [0,1]).
- Goals are sampled uniformly from the specified `goal_range` each episode.
- Reward defaults to negative L2 distance in normalised goal space (dense). Pass `tolerance` for a sparse `{0, -1}` reward — this works particularly well with HER.
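
The dense-vs-sparse reward rule described above can be sketched in a few lines. This is a toy illustration, not NuCon's actual implementation; the function name and the use of L2 distance for the tolerance check are assumptions:

```python
import numpy as np

def goal_reward(achieved, desired, tolerance=None):
    """Toy sketch: dense = negative L2 distance in normalised goal space;
    sparse = 0 within tolerance, else -1 (assumed semantics)."""
    dist = float(np.linalg.norm(achieved - desired))
    if tolerance is None:
        return -dist                                # dense reward
    return 0.0 if dist <= tolerance else -1.0       # sparse {0, -1}

achieved = np.array([0.50, 0.50, 0.50])
desired = np.array([0.52, 0.50, 0.50])
dense = goal_reward(achieved, desired)         # about -0.02
sparse = goal_reward(achieved, desired, 0.05)  # 0.0 (within tolerance)
```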
### NuconEnv Usage

Here's a basic example of how to use the RL environment:

```python
from nucon.rl import NuconEnv, Parameterized_Objectives
# ...
```
@@ -154,44 +160,88 @@ env.close()
`objectives` takes either strings naming predefined objectives, or lambda functions which take an observation and return a scalar reward. Final rewards are (weighted) summed across all objectives. `info['objectives']` contains all objectives and their values.

You can e.g. train a PPO agent using the [sb3](https://github.com/DLR-RM/stable-baselines3) implementation:
```python
from nucon.rl import NuconEnv
from stable_baselines3 import PPO

env = NuconEnv(objectives=['max_power'], seconds_per_step=5)

# Create the PPO (Proximal Policy Optimization) model
model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
)
model.learn(total_timesteps=100_000)

# Test the trained model
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        obs, info = env.reset()

# Close the environment
env.close()
```
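
The weighted-sum combination of objectives can be sketched generically. This is a standalone illustration, not NuCon's API: the `objectives` and `weights` dicts and both example callables are hypothetical, and `obs` stands in for the flat observation vector:

```python
import numpy as np

# Hypothetical objectives: each callable maps an observation to a scalar reward
objectives = {
    'max_power': lambda obs: obs[0],   # e.g. normalised generator output
    'low_temp': lambda obs: -obs[1],   # e.g. penalise high core temperature
}
weights = {'max_power': 1.0, 'low_temp': 0.5}

def combined_reward(obs):
    # final reward is the weighted sum across all objectives
    return sum(weights[name] * fn(obs) for name, fn in objectives.items())

obs = np.array([0.8, 0.4])
reward = combined_reward(obs)  # 1.0 * 0.8 + 0.5 * (-0.4) = 0.6
```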
### NuconGoalEnv + HER Usage

HER works by relabelling past trajectories with the goal that was *actually achieved*, turning every episode into useful training signal even when the agent never reaches the intended target. This makes it much more sample-efficient than standard RL for goal-reaching tasks — important given how slow the real game is.
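
The relabelling trick is easy to see on a toy trajectory. The sketch below is independent of sb3's actual `HerReplayBuffer` internals; the sparse reward mirrors the `{0, -1}` scheme described above:

```python
import numpy as np

def sparse_reward(achieved, desired, tolerance=0.05):
    # success iff the achieved goal is close enough to the desired one
    return 0.0 if np.linalg.norm(achieved - desired) <= tolerance else -1.0

# A failed episode: the agent was asked for 0.9 but only ever reached 0.5
desired = np.array([0.9])
achieved_per_step = [np.array([0.1]), np.array([0.3]), np.array([0.5])]

original = [sparse_reward(a, desired) for a in achieved_per_step]
# all -1: the episode carries no learning signal as-is

# 'future'-style relabelling: pretend the goal was what we actually reached
relabelled_goal = achieved_per_step[-1]
relabelled = [sparse_reward(a, relabelled_goal) for a in achieved_per_step]
# the final step now succeeds (reward 0), so the stored episode becomes useful
```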
```python
from nucon.rl import NuconGoalEnv
from stable_baselines3 import SAC, HerReplayBuffer

env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={
        'GENERATOR_0_KW': (0.0, 1200.0),
        'GENERATOR_1_KW': (0.0, 1200.0),
        'GENERATOR_2_KW': (0.0, 1200.0),
    },
    tolerance=0.05,  # sparse: within 5% of range counts as success (recommended with HER)
    seconds_per_step=5,
    simulator=simulator,  # use a pre-trained simulator for fast pre-training
)
# Or use a preset: env = gym.make('Nucon-goal_power-v0', simulator=simulator)

model = SAC(
    'MultiInputPolicy',
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
    verbose=1,
    learning_rate=1e-3,
    batch_size=256,
    tau=0.005,
    gamma=0.98,
    train_freq=1,
    gradient_steps=1,
)
model.learn(total_timesteps=500_000)
```
At inference time, inject any target by constructing the observation manually:

```python
import numpy as np

obs, _ = env.reset()
# Override the desired goal (values are normalised to [0,1] within goal_range)
obs['desired_goal'] = np.array([0.8, 0.8, 0.8], dtype=np.float32)  # ~960 kW per generator
action, _ = model.predict(obs, deterministic=True)
```
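
Mapping 0.8 to roughly 960 kW follows from the linear normalisation over `goal_range`. A minimal sketch, assuming the `(0.0, 1200.0)` range from the example above (these helpers are illustrative, not part of NuCon):

```python
def normalise(value, low=0.0, high=1200.0):
    # map a physical target (kW) into the [0, 1] goal space
    return (value - low) / (high - low)

def denormalise(x, low=0.0, high=1200.0):
    # map a normalised goal back to physical units
    return low + x * (high - low)

target = denormalise(0.8)  # 960.0 kW
back = normalise(960.0)    # 0.8
```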
Predefined goal environments:

- `Nucon-goal_power-v0`: target total generator output (3 × 0–1200 kW)
- `Nucon-goal_temp-v0`: target core temperature (280–380 °C)

But there's a problem: RL algorithms require a huge number of training steps to get passable policies, and Nucleares is a very slow simulation that cannot be trivially parallelized. That's why NuCon also provides a simulator.
## Simulator (Work in Progress)