docs: document NuconGoalEnv and HER training in README
- Describe both NuconEnv and NuconGoalEnv with their obs/action spaces
- Explain goal-conditioned approach and why HER is appropriate
- Add SAC + HerReplayBuffer usage example with recommended hyperparams
- Show how to inject a custom goal at inference time
- List registered goal env presets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent 0dab7a6cec
commit 1f7ecc301f

README.md: 104 changed lines
@@ -123,18 +123,24 @@ To use you'll need to install `gymnasium` and `numpy`. You can do so via
```
pip install -e '.[rl]'
```

### Environments

Two environment classes are provided in `nucon/rl.py`:

**`NuconEnv`** — classic fixed-objective environment. You define one or more objectives at construction time (e.g. maximise power output, keep temperature in range). The agent always trains toward the same goal.
- Observation space: all readable numeric parameters (~290 dims).
- Action space: all readable-back writable parameters (~30 dims): 9 individual rod bank positions, 3 MSCVs, 3 turbine bypass valves, 6 coolant pump speeds, condenser pump, freight/vent switches, resistor banks, and more.
- Objectives: predefined strings (`'max_power'`, `'episode_time'`) or arbitrary callables `(obs) -> float`. Multiple objectives are weighted-summed.

**`NuconGoalEnv`** — goal-conditioned environment. The desired goal (e.g. target generator output) is sampled at the start of each episode and provided as part of the observation. A single policy learns to reach *any* goal in the specified range, making it far more useful than a fixed-objective agent. Designed for training with [Hindsight Experience Replay (HER)](https://arxiv.org/abs/1707.01495), which makes sparse-reward goal-conditioned training tractable.

- Observation space: `Dict` with keys `observation` (non-goal params), `achieved_goal` (current goal param values, normalised to [0,1]), `desired_goal` (target, normalised to [0,1]).
- Goals are sampled uniformly from the specified `goal_range` each episode.
- Reward defaults to negative L2 distance in normalised goal space (dense). Pass `tolerance` for a sparse `{0, -1}` reward — this works particularly well with HER.
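
The dense-vs-sparse reward rule described above can be sketched in a few lines. This is a toy illustration, not NuCon's actual implementation; the function name and the use of L2 distance for the tolerance check are assumptions:

```python
import numpy as np

def goal_reward(achieved, desired, tolerance=None):
    """Toy sketch: dense = negative L2 distance in normalised goal space;
    sparse = 0 within tolerance, else -1 (assumed semantics)."""
    dist = float(np.linalg.norm(achieved - desired))
    if tolerance is None:
        return -dist                                # dense reward
    return 0.0 if dist <= tolerance else -1.0       # sparse {0, -1}

achieved = np.array([0.50, 0.50, 0.50])
desired = np.array([0.52, 0.50, 0.50])
dense = goal_reward(achieved, desired)         # about -0.02
sparse = goal_reward(achieved, desired, 0.05)  # 0.0 (within tolerance)
```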
### NuconEnv Usage

Here's a basic example of how to use the RL environment:

```python
from nucon.rl import NuconEnv, Parameterized_Objectives
# ...
```
@@ -154,44 +160,88 @@ env.close()
`objectives` takes either strings naming predefined objectives, or lambda functions which take an observation and return a scalar reward. Final rewards are (weighted) summed across all objectives. `info['objectives']` contains all objectives and their values.

You can e.g. train a PPO agent using the [sb3](https://github.com/DLR-RM/stable-baselines3) implementation:
```python
from nucon.rl import NuconEnv
from stable_baselines3 import PPO

env = NuconEnv(objectives=['max_power'], seconds_per_step=5)

# Create the PPO (Proximal Policy Optimization) model
model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
)
model.learn(total_timesteps=100_000)

# Test the trained model
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        obs, info = env.reset()

# Close the environment
env.close()
```
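
The weighted-sum combination of objectives can be sketched generically. This is a standalone illustration, not NuCon's API: the `objectives` and `weights` dicts and both example callables are hypothetical, and `obs` stands in for the flat observation vector:

```python
import numpy as np

# Hypothetical objectives: each callable maps an observation to a scalar reward
objectives = {
    'max_power': lambda obs: obs[0],   # e.g. normalised generator output
    'low_temp': lambda obs: -obs[1],   # e.g. penalise high core temperature
}
weights = {'max_power': 1.0, 'low_temp': 0.5}

def combined_reward(obs):
    # final reward is the weighted sum across all objectives
    return sum(weights[name] * fn(obs) for name, fn in objectives.items())

obs = np.array([0.8, 0.4])
reward = combined_reward(obs)  # 1.0 * 0.8 + 0.5 * (-0.4) = 0.6
```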
### NuconGoalEnv + HER Usage

HER works by relabelling past trajectories with the goal that was *actually achieved*, turning every episode into useful training signal even when the agent never reaches the intended target. This makes it much more sample-efficient than standard RL for goal-reaching tasks — important given how slow the real game is.
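
The relabelling trick is easy to see on a toy trajectory. The sketch below is independent of sb3's actual `HerReplayBuffer` internals; the sparse reward mirrors the `{0, -1}` scheme described above:

```python
import numpy as np

def sparse_reward(achieved, desired, tolerance=0.05):
    # success iff the achieved goal is close enough to the desired one
    return 0.0 if np.linalg.norm(achieved - desired) <= tolerance else -1.0

# A failed episode: the agent was asked for 0.9 but only ever reached 0.5
desired = np.array([0.9])
achieved_per_step = [np.array([0.1]), np.array([0.3]), np.array([0.5])]

original = [sparse_reward(a, desired) for a in achieved_per_step]
# all -1: the episode carries no learning signal as-is

# 'future'-style relabelling: pretend the goal was what we actually reached
relabelled_goal = achieved_per_step[-1]
relabelled = [sparse_reward(a, relabelled_goal) for a in achieved_per_step]
# the final step now succeeds (reward 0), so the stored episode becomes useful
```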
```python
from nucon.rl import NuconGoalEnv
from stable_baselines3 import SAC, HerReplayBuffer

env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={
        'GENERATOR_0_KW': (0.0, 1200.0),
        'GENERATOR_1_KW': (0.0, 1200.0),
        'GENERATOR_2_KW': (0.0, 1200.0),
    },
    tolerance=0.05,  # sparse: within 5% of range counts as success (recommended with HER)
    seconds_per_step=5,
    simulator=simulator,  # use a pre-trained simulator for fast pre-training
)
# Or use a preset: env = gym.make('Nucon-goal_power-v0', simulator=simulator)

model = SAC(
    'MultiInputPolicy',
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
    verbose=1,
    learning_rate=1e-3,
    batch_size=256,
    tau=0.005,
    gamma=0.98,
    train_freq=1,
    gradient_steps=1,
)
model.learn(total_timesteps=500_000)
```
At inference time, inject any target by constructing the observation manually:

```python
import numpy as np

obs, _ = env.reset()
# Override the desired goal (values are normalised to [0,1] within goal_range)
obs['desired_goal'] = np.array([0.8, 0.8, 0.8], dtype=np.float32)  # ~960 kW per generator
action, _ = model.predict(obs, deterministic=True)
```
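
Mapping 0.8 to roughly 960 kW follows from the linear normalisation over `goal_range`. A minimal sketch, assuming the `(0.0, 1200.0)` range from the example above (these helpers are illustrative, not part of NuCon):

```python
def normalise(value, low=0.0, high=1200.0):
    # map a physical target (kW) into the [0, 1] goal space
    return (value - low) / (high - low)

def denormalise(x, low=0.0, high=1200.0):
    # map a normalised goal back to physical units
    return low + x * (high - low)

target = denormalise(0.8)  # 960.0 kW
back = normalise(960.0)    # 0.8
```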
Predefined goal environments:

- `Nucon-goal_power-v0`: target total generator output (3 × 0–1200 kW)
- `Nucon-goal_temp-v0`: target core temperature (280–380 °C)

But there's a problem: RL algorithms require a huge number of training steps to get passable policies, and Nucleares is a very slow simulation that cannot be trivially parallelized. That's why NuCon also provides a simulator.
## Simulator (Work in Progress)