docs: document NuconGoalEnv and HER training in README

- Describe both NuconEnv and NuconGoalEnv with their obs/action spaces
- Explain goal-conditioned approach and why HER is appropriate
- Add SAC + HerReplayBuffer usage example with recommended hyperparams
- Show how to inject a custom goal at inference time
- List registered goal env presets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Dominik Moritz Roth 2026-03-12 17:38:20 +01:00
parent 0dab7a6cec
commit 1f7ecc301f

README.md

@@ -123,18 +123,24 @@ To use you'll need to install `gymnasium` and `numpy`. You can do so via
pip install -e '.[rl]'
```

### Environments
Two environment classes are provided in `nucon/rl.py`:

**`NuconEnv`** — classic fixed-objective environment. You define one or more objectives at construction time (e.g. maximise power output, keep temperature in range). The agent always trains toward the same goal.

- Observation space: all readable numeric parameters (~290 dims).
- Action space: all readable-back writable parameters (~30 dims): 9 individual rod bank positions, 3 MSCVs, 3 turbine bypass valves, 6 coolant pump speeds, condenser pump, freight/vent switches, resistor banks, and more.
- Objectives: predefined strings (`'max_power'`, `'episode_time'`) or arbitrary callables `(obs) -> float`. Multiple objectives are weighted-summed.

**`NuconGoalEnv`** — goal-conditioned environment. The desired goal (e.g. target generator output) is sampled at the start of each episode and provided as part of the observation. A single policy learns to reach *any* goal in the specified range, making it far more useful than a fixed-objective agent. Designed for training with [Hindsight Experience Replay (HER)](https://arxiv.org/abs/1707.01495), which makes sparse-reward goal-conditioned training tractable.

- Observation space: `Dict` with keys `observation` (non-goal params), `achieved_goal` (current goal param values, normalised to [0,1]), `desired_goal` (target, normalised to [0,1]).
- Goals are sampled uniformly from the specified `goal_range` each episode.
- Reward defaults to negative L2 distance in normalised goal space (dense). Pass `tolerance` for a sparse `{0, -1}` reward — this works particularly well with HER.
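The reward scheme above can be sketched as follows. This is an illustrative toy, not the library's actual implementation; `normalise` and `goal_reward` are hypothetical names, but the normalisation and distance follow the description above:

```python
import numpy as np

def normalise(values, goal_range):
    """Map raw parameter values into [0, 1] using per-parameter (low, high) bounds."""
    lows = np.array([lo for lo, _ in goal_range], dtype=np.float64)
    highs = np.array([hi for _, hi in goal_range], dtype=np.float64)
    return (np.asarray(values, dtype=np.float64) - lows) / (highs - lows)

def goal_reward(achieved, desired, tolerance=None):
    """Dense: negative L2 distance in normalised goal space.
    Sparse (when a tolerance is given): 0 on success, -1 otherwise."""
    dist = np.linalg.norm(achieved - desired)
    if tolerance is None:
        return -dist
    return 0.0 if dist <= tolerance else -1.0
```

With the sparse variant, every step outside the tolerance looks identical to the agent (-1), which is exactly the setting HER is designed to make learnable.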
### NuconEnv Usage
Here's a basic example of how to use the RL environment:
```python
from nucon.rl import NuconEnv, Parameterized_Objectives
@@ -154,44 +160,88 @@ env.close()
The `objectives` argument takes either names of predefined objectives (strings) or callables that take an observation and return a scalar reward. Final rewards are (weighted-)summed across all objectives; `info['objectives']` contains each objective and its value.
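A minimal sketch of how such weighted objectives could combine into one scalar reward (illustrative only; `combine_objectives` and its signature are not part of the library):

```python
def combine_objectives(obs, objectives, weights=None):
    """Evaluate each objective callable on obs and return the weighted sum,
    plus the per-objective values (the kind of data exposed via info['objectives'])."""
    weights = weights if weights is not None else [1.0] * len(objectives)
    values = {name: fn(obs) for name, fn in objectives.items()}
    reward = sum(w * v for w, v in zip(weights, values.values()))
    return reward, values
```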
You can e.g. train a PPO agent using the [sb3](https://github.com/DLR-RM/stable-baselines3) implementation:
```python
from nucon.rl import NuconEnv
from stable_baselines3 import PPO

env = NuconEnv(objectives=['max_power'], seconds_per_step=5)

model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
)
model.learn(total_timesteps=100_000)

obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```
### NuconGoalEnv + HER Usage
HER works by relabelling past trajectories with the goal that was *actually achieved*, turning every episode into useful training signal even when the agent never reaches the intended target. This makes it much more sample-efficient than standard RL for goal-reaching tasks — important given how slow the real game is.
```python
from nucon.rl import NuconGoalEnv
from stable_baselines3 import SAC, HerReplayBuffer

env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={
        'GENERATOR_0_KW': (0.0, 1200.0),
        'GENERATOR_1_KW': (0.0, 1200.0),
        'GENERATOR_2_KW': (0.0, 1200.0),
    },
    tolerance=0.05,  # sparse: within 5% of range counts as success (recommended with HER)
    seconds_per_step=5,
    simulator=simulator,  # use a pre-trained simulator for fast pre-training
)
# Or use a preset: env = gym.make('Nucon-goal_power-v0', simulator=simulator)
model = SAC(
    'MultiInputPolicy',
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
    verbose=1,
    learning_rate=1e-3,
    batch_size=256,
    tau=0.005,
    gamma=0.98,
    train_freq=1,
    gradient_steps=1,
)
model.learn(total_timesteps=500_000)
```
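Conceptually, the relabelling that HER performs with the `'future'` strategy looks roughly like this (a toy sketch, not sb3's actual `HerReplayBuffer` implementation; `her_relabel` is a hypothetical name):

```python
import random

def her_relabel(episode, n_sampled_goal=4, seed=0):
    """For each transition, create n_sampled_goal extra copies whose desired
    goal is a goal actually achieved at a later step of the same episode."""
    rng = random.Random(seed)
    extra = []
    for t, transition in enumerate(episode):
        future = episode[t:]  # achieved goals from this step onward
        for _ in range(n_sampled_goal):
            new_goal = rng.choice(future)['achieved_goal']
            # the reward would be recomputed against the substituted goal here
            extra.append({**transition, 'desired_goal': new_goal})
    return extra
```

Because the substituted goals were actually reached, the relabelled transitions carry success signal even when the original episode never hit its intended target.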
At inference time, inject any target by constructing the observation manually:
```python
import numpy as np
obs, _ = env.reset()
# Override the desired goal (values are normalised to [0,1] within goal_range)
obs['desired_goal'] = np.array([0.8, 0.8, 0.8], dtype=np.float32) # ~960 kW per generator
action, _ = model.predict(obs, deterministic=True)
```
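To target a specific raw value rather than hand-picking normalised numbers, map it into [0, 1] with the same bounds used for `goal_range` (a hypothetical helper mirroring the normalisation described above):

```python
import numpy as np

def to_normalised_goal(targets_kw, low=0.0, high=1200.0):
    """Map raw kW targets into the [0, 1] goal space used by the env."""
    return ((np.asarray(targets_kw, dtype=np.float32) - low) / (high - low)).astype(np.float32)
```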
Predefined goal environments:
- `Nucon-goal_power-v0`: target total generator output (3 × 0–1200 kW)
- `Nucon-goal_temp-v0`: target core temperature (280–380 °C)
But there's a problem: RL algorithms require a huge number of training steps to reach passable policies, and Nucleares is a very slow simulation that cannot be trivially parallelized. That's why NuCon also provides a
## Simulator (Work in Progress)