docs: document NuconGoalEnv and HER training in README

- Describe both NuconEnv and NuconGoalEnv with their obs/action spaces
- Explain goal-conditioned approach and why HER is appropriate
- Add SAC + HerReplayBuffer usage example with recommended hyperparams
- Show how to inject a custom goal at inference time
- List registered goal env presets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Dominik Moritz Roth 2026-03-12 17:38:20 +01:00
parent 0dab7a6cec
commit 1f7ecc301f

README.md

@@ -123,18 +123,24 @@ To use you'll need to install `gymnasium` and `numpy`. You can do so via
pip install -e '.[rl]'
```

### Environments
Two environment classes are provided in `nucon/rl.py`:

**`NuconEnv`** — classic fixed-objective environment. You define one or more objectives at construction time (e.g. maximise power output, keep temperature in range). The agent always trains toward the same goal.

- Observation space: all readable numeric parameters (~290 dims).
- Action space: all readable-back writable parameters (~30 dims): 9 individual rod bank positions, 3 MSCVs, 3 turbine bypass valves, 6 coolant pump speeds, condenser pump, freight/vent switches, resistor banks, and more.
- Objectives: predefined strings (`'max_power'`, `'episode_time'`) or arbitrary callables `(obs) -> float`. Multiple objectives are weighted-summed.

**`NuconGoalEnv`** — goal-conditioned environment. The desired goal (e.g. target generator output) is sampled at the start of each episode and provided as part of the observation. A single policy learns to reach *any* goal in the specified range, making it far more useful than a fixed-objective agent. Designed for training with [Hindsight Experience Replay (HER)](https://arxiv.org/abs/1707.01495), which makes sparse-reward goal-conditioned training tractable.

- Observation space: `Dict` with keys `observation` (non-goal params), `achieved_goal` (current goal param values, normalised to [0,1]), `desired_goal` (target, normalised to [0,1]).
- Goals are sampled uniformly from the specified `goal_range` each episode.
- Reward defaults to negative L2 distance in normalised goal space (dense). Pass `tolerance` for a sparse `{0, -1}` reward — this works particularly well with HER.
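The reward scheme above can be sketched as follows. This is an illustrative toy, not the library's actual implementation; `normalise` and `goal_reward` are hypothetical names, but the normalisation and distance follow the description above:

```python
import numpy as np

def normalise(values, goal_range):
    """Map raw parameter values into [0, 1] using per-parameter (low, high) bounds."""
    lows = np.array([lo for lo, _ in goal_range], dtype=np.float64)
    highs = np.array([hi for _, hi in goal_range], dtype=np.float64)
    return (np.asarray(values, dtype=np.float64) - lows) / (highs - lows)

def goal_reward(achieved, desired, tolerance=None):
    """Dense: negative L2 distance in normalised goal space.
    Sparse (when a tolerance is given): 0 on success, -1 otherwise."""
    dist = np.linalg.norm(achieved - desired)
    if tolerance is None:
        return -dist
    return 0.0 if dist <= tolerance else -1.0
```

With the sparse variant, every step outside the tolerance looks identical to the agent (-1), which is exactly the setting HER is designed to make learnable.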
### NuconEnv Usage
Here's a basic example of how to use the RL environment:
```python
from nucon.rl import NuconEnv, Parameterized_Objectives
@@ -154,44 +160,88 @@ env.close()
The `objectives` argument takes either names of predefined objectives (strings) or callables that take an observation and return a scalar reward. Final rewards are (weighted-)summed across all objectives; `info['objectives']` contains each objective and its value.
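A minimal sketch of how such weighted objectives could combine into one scalar reward (illustrative only; `combine_objectives` and its signature are not part of the library):

```python
def combine_objectives(obs, objectives, weights=None):
    """Evaluate each objective callable on obs and return the weighted sum,
    plus the per-objective values (the kind of data exposed via info['objectives'])."""
    weights = weights if weights is not None else [1.0] * len(objectives)
    values = {name: fn(obs) for name, fn in objectives.items()}
    reward = sum(w * v for w, v in zip(weights, values.values()))
    return reward, values
```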
You can e.g. train a PPO agent using the [sb3](https://github.com/DLR-RM/stable-baselines3) implementation:
```python
from nucon.rl import NuconEnv
from stable_baselines3 import PPO

env = NuconEnv(objectives=['max_power'], seconds_per_step=5)

model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
)
model.learn(total_timesteps=100_000)

obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```
### NuconGoalEnv + HER Usage
HER works by relabelling past trajectories with the goal that was *actually achieved*, turning every episode into useful training signal even when the agent never reaches the intended target. This makes it much more sample-efficient than standard RL for goal-reaching tasks — important given how slow the real game is.
```python
from nucon.rl import NuconGoalEnv
from stable_baselines3 import SAC, HerReplayBuffer

env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={
        'GENERATOR_0_KW': (0.0, 1200.0),
        'GENERATOR_1_KW': (0.0, 1200.0),
        'GENERATOR_2_KW': (0.0, 1200.0),
    },
    tolerance=0.05,  # sparse: within 5% of range counts as success (recommended with HER)
    seconds_per_step=5,
    simulator=simulator,  # use a pre-trained simulator for fast pre-training
)
# Or use a preset: env = gym.make('Nucon-goal_power-v0', simulator=simulator)
model = SAC(
    'MultiInputPolicy',
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
    verbose=1,
    learning_rate=1e-3,
    batch_size=256,
    tau=0.005,
    gamma=0.98,
    train_freq=1,
    gradient_steps=1,
)
model.learn(total_timesteps=500_000)
```
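Conceptually, the relabelling that HER performs with the `'future'` strategy looks roughly like this (a toy sketch, not sb3's actual `HerReplayBuffer` implementation; `her_relabel` is a hypothetical name):

```python
import random

def her_relabel(episode, n_sampled_goal=4, seed=0):
    """For each transition, create n_sampled_goal extra copies whose desired
    goal is a goal actually achieved at a later step of the same episode."""
    rng = random.Random(seed)
    extra = []
    for t, transition in enumerate(episode):
        future = episode[t:]  # achieved goals from this step onward
        for _ in range(n_sampled_goal):
            new_goal = rng.choice(future)['achieved_goal']
            # the reward would be recomputed against the substituted goal here
            extra.append({**transition, 'desired_goal': new_goal})
    return extra
```

Because the substituted goals were actually reached, the relabelled transitions carry success signal even when the original episode never hit its intended target.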
At inference time, inject any target by constructing the observation manually:
```python
import numpy as np
obs, _ = env.reset()
# Override the desired goal (values are normalised to [0,1] within goal_range)
obs['desired_goal'] = np.array([0.8, 0.8, 0.8], dtype=np.float32) # ~960 kW per generator
action, _ = model.predict(obs, deterministic=True)
```
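To target a specific raw value rather than hand-picking normalised numbers, map it into [0, 1] with the same bounds used for `goal_range` (a hypothetical helper mirroring the normalisation described above):

```python
import numpy as np

def to_normalised_goal(targets_kw, low=0.0, high=1200.0):
    """Map raw kW targets into the [0, 1] goal space used by the env."""
    return ((np.asarray(targets_kw, dtype=np.float32) - low) / (high - low)).astype(np.float32)
```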
Predefined goal environments:
- `Nucon-goal_power-v0`: target total generator output (3 × 0–1200 kW)
- `Nucon-goal_temp-v0`: target core temperature (280–380 °C)
But there's a problem: RL algorithms require a huge number of training steps to reach passable policies, and Nucleares is a very slow simulation that cannot be trivially parallelized. That's why NuCon also provides a
## Simulator (Work in Progress)