From 1f7ecc301f45768d1100668747b2c3cf89c15d32 Mon Sep 17 00:00:00 2001 From: Dominik Roth Date: Thu, 12 Mar 2026 17:38:20 +0100 Subject: [PATCH] docs: document NuconGoalEnv and HER training in README - Describe both NuconEnv and NuconGoalEnv with their obs/action spaces - Explain goal-conditioned approach and why HER is appropriate - Add SAC + HerReplayBuffer usage example with recommended hyperparams - Show how to inject a custom goal at inference time - List registered goal env presets Co-Authored-By: Claude Sonnet 4.6 --- README.md | 104 ++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 77 insertions(+), 27 deletions(-) diff --git a/README.md b/README.md index bf93189..6a796b6 100644 --- a/README.md +++ b/README.md @@ -123,18 +123,24 @@ To use you'll need to install `gymnasium` and `numpy`. You can do so via pip install -e '.[rl]' ``` -### RL Environment +### Environments -The `NuconEnv` class in `nucon/rl.py` provides a Gym-compatible environment for reinforcement learning tasks in the Nucleares simulation. Key features include: +Two environment classes are provided in `nucon/rl.py`: -- Observation space: Includes all readable parameters from the NuCon system. -- Action space: Encompasses all writable parameters in the NuCon system. -- Step function: Applies actions to the NuCon system and returns new observations. -- Objective function: Allows for predefined or custom objective functions to be defined for training. +**`NuconEnv`** — classic fixed-objective environment. You define one or more objectives at construction time (e.g. maximise power output, keep temperature in range). The agent always trains toward the same goal. -### Usage +- Observation space: all readable numeric parameters (~290 dims). +- Action space: all readable-back writable parameters (~30 dims): 9 individual rod bank positions, 3 MSCVs, 3 turbine bypass valves, 6 coolant pump speeds, condenser pump, freight/vent switches, resistor banks, and more. 
+- Objectives: predefined strings (`'max_power'`, `'episode_time'`) or arbitrary callables `(obs) -> float`. Multiple objectives are weighted-summed.
+
+**`NuconGoalEnv`** — goal-conditioned environment. The desired goal (e.g. target generator output) is sampled at the start of each episode and provided as part of the observation. A single policy learns to reach *any* goal in the specified range, making it far more useful than a fixed-objective agent. Designed for training with [Hindsight Experience Replay (HER)](https://arxiv.org/abs/1707.01495), which makes sparse-reward goal-conditioned training tractable.
+
+- Observation space: `Dict` with keys `observation` (non-goal params), `achieved_goal` (current goal param values, normalised to [0,1]), `desired_goal` (target, normalised to [0,1]).
+- Goals are sampled uniformly from the specified `goal_range` each episode.
+- Reward defaults to negative L2 distance in normalised goal space (dense). Pass `tolerance` for a sparse `{0, -1}` reward — this works particularly well with HER.
+
+### NuconEnv Usage
 
-Here's a basic example of how to use the RL environment:
 ```python
 from nucon.rl import NuconEnv, Parameterized_Objectives
@@ -154,44 +160,88 @@ env.close()
 
 The `objectives` argument takes either names of predefined objectives (as strings), or lambda functions which take an observation and return a scalar reward. Final rewards are (weighted) summed across all objectives. `info['objectives']` contains all objectives and their values.
 
-You can e.g. train an PPO agent using the [sb3](https://github.com/DLR-RM/stable-baselines3) implementation:
+You can e.g. train a PPO agent using the [sb3](https://github.com/DLR-RM/stable-baselines3) implementation:
 
 ```python
 from nucon.rl import NuconEnv
 from stable_baselines3 import PPO
 
 env = NuconEnv(objectives=['max_power'], seconds_per_step=5)
 
-# Create the PPO (Proximal Policy Optimization) model
 model = PPO(
-    "MlpPolicy", 
-    env, 
+    "MlpPolicy",
+    env,
     verbose=1,
-    learning_rate=3e-4,  # You can adjust hyperparameters as needed
-    n_steps=2048,
-    batch_size=64,
-    n_epochs=10,
-    gamma=0.99,
-    gae_lambda=0.95,
-    clip_range=0.2,
-    ent_coef=0.01
+    learning_rate=3e-4,
+    n_steps=2048,
+    batch_size=64,
+    n_epochs=10,
+    gamma=0.99,
+    gae_lambda=0.95,
+    clip_range=0.2,
+    ent_coef=0.01,
 )
+model.learn(total_timesteps=100_000)
 
-# Train the model
-model.learn(total_timesteps=100000)  # Adjust total_timesteps as needed
-
-# Test the trained model
 obs, info = env.reset()
 for _ in range(1000):
     action, _states = model.predict(obs, deterministic=True)
     obs, reward, terminated, truncated, info = env.step(action)
-    if terminated or truncated: obs, info = env.reset()
+    if terminated or truncated:
+        obs, info = env.reset()
-
-# Close the environment
 env.close()
 ```
 
+### NuconGoalEnv + HER Usage
+
+HER works by relabelling past trajectories with the goal that was *actually achieved*, turning every episode into useful training signal even when the agent never reaches the intended target. This makes it much more sample-efficient than standard RL for goal-reaching tasks — important given how slow the real game is.
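To make the relabelling concrete, here is a minimal sketch of the idea behind HER, using a hypothetical `sparse_reward` helper that mirrors the tolerance-based `{0, -1}` reward described above (sb3's `HerReplayBuffer` performs this relabelling inside the replay buffer automatically; you never write it yourself):

```python
import numpy as np

def sparse_reward(achieved, desired, tolerance=0.05):
    # 0 if the achieved goal is within tolerance of the desired goal
    # (distances measured in normalised [0, 1] goal space), else -1
    return 0.0 if np.linalg.norm(achieved - desired) <= tolerance else -1.0

# A transition recorded with the episode's original desired goal...
achieved = np.array([0.52, 0.50, 0.49])
original_goal = np.array([0.90, 0.90, 0.90])
print(sparse_reward(achieved, original_goal))    # -1.0: target missed, no signal

# ...is stored again with a goal the agent actually reached later in the
# episode, turning the same experience into a successful example.
relabelled_goal = np.array([0.53, 0.50, 0.48])
print(sparse_reward(achieved, relabelled_goal))  # 0.0: success under the new goal
```

With the `'future'` goal-selection strategy, relabelled goals are drawn from states visited later in the same episode, so every trajectory yields some successful transitions.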
+
+```python
+from nucon.rl import NuconGoalEnv
+from stable_baselines3 import SAC, HerReplayBuffer
+
+env = NuconGoalEnv(
+    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
+    goal_range={
+        'GENERATOR_0_KW': (0.0, 1200.0),
+        'GENERATOR_1_KW': (0.0, 1200.0),
+        'GENERATOR_2_KW': (0.0, 1200.0),
+    },
+    tolerance=0.05,  # sparse: within 5% of range counts as success (recommended with HER)
+    seconds_per_step=5,
+    simulator=simulator,  # use a pre-trained simulator for fast pre-training
+)
+# Or use a preset: env = gym.make('Nucon-goal_power-v0', simulator=simulator)
+
+model = SAC(
+    'MultiInputPolicy',
+    env,
+    replay_buffer_class=HerReplayBuffer,
+    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
+    verbose=1,
+    learning_rate=1e-3,
+    batch_size=256,
+    tau=0.005,
+    gamma=0.98,
+    train_freq=1,
+    gradient_steps=1,
+)
+model.learn(total_timesteps=500_000)
+```
+
+At inference time, inject any target by constructing the observation manually:
+```python
+import numpy as np
+
+obs, _ = env.reset()
+# Override the desired goal (values are normalised to [0,1] within goal_range)
+obs['desired_goal'] = np.array([0.8, 0.8, 0.8], dtype=np.float32)  # ~960 kW per generator
+action, _ = model.predict(obs, deterministic=True)
+```
+
+Predefined goal environments:
+- `Nucon-goal_power-v0`: target total generator output (3 × 0–1200 kW)
+- `Nucon-goal_temp-v0`: target core temperature (280–380 °C)
+
 But there's a problem: RL algorithms require a huge number of training steps to reach passable policies, and Nucleares is a very slow simulation that cannot be trivially parallelized. That's why NuCon also provides a
 
 ## Simulator (Work in Progress)