docs: add full training loop section to README

Documents the iterative sim-to-real workflow:
1. Human data collection during gameplay
2. Initial model fitting (kNN or NN)
3. RL training in simulator (SAC + HER)
4. Eval in game while collecting new data
5. Refit model, repeat

Includes ASCII flow diagram, code for each step, and a convergence
criterion (low kNN uncertainty throughout episode).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dominik Moritz Roth 2026-03-12 18:13:12 +01:00
parent c3111ad5be
commit a4f898c3ad

README.md
The trained models can be integrated into the NuconSimulator to provide accurate dynamics based on real game data.
## Full Training Loop
The recommended end-to-end workflow for training an RL operator is an iterative cycle of real-game data collection, model fitting, and simulated training. The real game is slow and cannot be parallelised, so the bulk of RL training happens in the simulator; the game serves only as an oracle for data collection and evaluation.
```
┌─────────────────────────────────────────────────────────────┐
│ 1. Human dataset collection                                 │
│    Play the game: start up the reactor, operate it across   │
│    a range of states. NuCon records state transitions.      │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Initial model fitting                                    │
│    Fit NN or kNN dynamics model to the collected dataset.   │
│    kNN is instant; NN needs gradient steps but generalises  │
│    better with more data.                                   │
└───────────────────────┬─────────────────────────────────────┘
                        │
              ┌─────────▼──────────┐
              │ 3. Train RL        │◄──────────────────────┐
              │    in simulator    │                       │
              │    (fast, many     │                       │
              │    trajectories)   │                       │
              └─────────┬──────────┘                       │
                        │                                  │
                        ▼                                  │
              ┌─────────────────────┐                      │
              │ 4. Eval in game     │                      │
              │ + collect new data  │                      │
              │ (merge & prune      │                      │
              │ dataset)            │                      │
              └─────────┬───────────┘                      │
                        │                                  │
                        ▼                                  │
              ┌─────────────────────┐  model improved?     │
              │ 5. Refit model      ├──────── yes ─────────┘
              │ on expanded data    │
              └─────────────────────┘
```
### Step 1 — Human dataset collection
Start `NuconModelLearner` before or during your play session. Try to cover a wide range of reactor states — startup from cold, ramping power up and down, adjusting individual rod banks, pump speed changes. Diversity in the dataset directly determines how accurate the simulator will be.
```python
from nucon.model import NuconModelLearner
learner = NuconModelLearner(
    dataset_path='reactor_dataset.pkl',
    time_delta=10.0,  # 10 game-seconds per sample
)

learner.collect_data(num_steps=500, save_every=10)
```
The collector saves every 10 steps, retries automatically on game crashes, and scales wall-clock sleep with `GAME_SIM_SPEED` so samples are always 10 game-seconds apart regardless of simulation speed.
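The speed scaling amounts to dividing the desired game-time interval by the current simulation speed. A minimal sketch of that logic (a hypothetical helper, not NuCon's actual implementation):

```python
def wall_clock_sleep(time_delta: float, game_sim_speed: float) -> float:
    """Seconds of real time to wait so consecutive samples stay
    `time_delta` game-seconds apart when the game runs at
    `game_sim_speed` times real time."""
    if game_sim_speed <= 0:
        raise ValueError("simulation speed must be positive")
    return time_delta / game_sim_speed

# At 2x speed, 10 game-seconds pass in 5 wall-clock seconds.
print(wall_clock_sleep(10.0, 2.0))  # → 5.0
```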
### Step 2 — Initial model fitting
```python
from nucon.model import NuconModelLearner
learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
# Option A: kNN + GP (instant fit, built-in uncertainty estimation)
learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
learner.fit_knn(k=10)
learner.save_model('reactor_knn.pkl')

# Option B: Neural network (better extrapolation with larger datasets)
learner.train_model(batch_size=32, num_epochs=50)
learner.drop_well_fitted(error_threshold=1.0)  # keep hard samples for next round
learner.save_model('reactor_nn.pth')
```
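For intuition, the kNN option's built-in uncertainty can be pictured as the distance from a query state to its nearest stored samples: predictions far from all recorded data are flagged as unreliable. An illustrative, self-contained sketch (not the actual `NuconModelLearner` internals, which use a GP on top of kNN):

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Predict an output as the distance-weighted mean of the k nearest
    stored samples; return the mean neighbour distance as a crude
    uncertainty signal (large = query is outside the known data)."""
    dists = np.linalg.norm(train_X - query, axis=1)
    idx = np.argsort(dists)[:k]
    weights = 1.0 / (dists[idx] + 1e-8)
    pred = (weights[:, None] * train_y[idx]).sum(axis=0) / weights.sum()
    uncertainty = dists[idx].mean()
    return pred, uncertainty
```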
### Step 3 — Train RL in simulator
Load the fitted model into the simulator and train with SAC + HER. The simulator runs orders of magnitude faster than the real game, allowing millions of steps in reasonable time.
```python
from nucon.sim import NuconSimulator, OperatingState
from nucon.rl import NuconGoalEnv
from stable_baselines3 import SAC
from stable_baselines3.common.buffers import HerReplayBuffer
simulator = NuconSimulator()
simulator.load_model('reactor_knn.pkl')
simulator.set_state(OperatingState.NOMINAL)

env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
    tolerance=0.05,
    simulator=simulator,
    seconds_per_step=10,
)

model = SAC(
    'MultiInputPolicy', env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
    verbose=1,
)
model.learn(total_timesteps=500_000)
model.save('rl_policy.zip')
```
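HER makes the sparse goal-reaching reward learnable by relabelling stored transitions with goals that were actually achieved later in the same episode (the `'future'` strategy configured above). A simplified sketch of the idea, independent of stable-baselines3:

```python
import random

def her_relabel(episode, n_sampled_goal=4, rng=None):
    """For each transition (obs, action, achieved_goal, desired_goal,
    reward), add n_sampled_goal copies whose desired goal is the
    achieved goal of a random *future* step in the episode. Copies
    that hit their new goal get reward 0.0 (success) instead of -1.0,
    turning failed episodes into useful training signal."""
    rng = rng or random.Random(0)
    relabelled = []
    for t, (obs, act, achieved, desired, reward) in enumerate(episode):
        relabelled.append((obs, act, achieved, desired, reward))
        future = episode[t:]  # this step or any later one
        for _ in range(n_sampled_goal):
            new_goal = rng.choice(future)[2]  # a future achieved goal
            new_reward = 0.0 if new_goal == achieved else -1.0
            relabelled.append((obs, act, achieved, new_goal, new_reward))
    return relabelled
```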
### Step 4 — Eval in game + collect new data
Run the trained policy against the real game. This validates whether the simulator was accurate enough, and simultaneously collects new data covering states the policy visits — which may be regions the original dataset missed.
```python
from nucon.rl import NuconGoalEnv
from nucon.model import NuconModelLearner
from stable_baselines3 import SAC
# Load policy and run in real game
env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
    seconds_per_step=10,
)
policy = SAC.load('rl_policy.zip')

# Simultaneously collect new data
new_data_learner = NuconModelLearner(
    dataset_path='reactor_dataset_new.pkl',
    time_delta=10.0,
)

obs, _ = env.reset()
for _ in range(200):
    action, _ = policy.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```
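The `tolerance=0.05` setting above presumably counts a goal as reached when every goal parameter lands within 5% of its configured range. A sketch of such a check (assumed semantics, not the actual `NuconGoalEnv` code):

```python
def goal_reached(achieved, desired, goal_range, tolerance=0.05):
    """True when every goal parameter is within `tolerance` of its
    desired value, measured relative to that parameter's full range.
    `achieved`/`desired` map parameter name -> value."""
    for name, target in desired.items():
        lo, hi = goal_range[name]
        if abs(achieved[name] - target) > tolerance * (hi - lo):
            return False
    return True
```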
### Step 5 — Refit model on expanded data
Merge the new data into the original dataset and refit:
```python
learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
learner.merge_datasets('reactor_dataset_new.pkl')
# Prune redundant samples before refitting
learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
print(f"Dataset size after pruning: {len(learner.dataset)}")
learner.fit_knn(k=10)
learner.save_model('reactor_knn.pkl')
```
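The pruning step can be pictured as a greedy filter that drops samples whose state and output are both nearly identical to an already-kept sample, since near-duplicates add nothing to a kNN model. An illustrative sketch (not the actual `drop_redundant` implementation):

```python
import numpy as np

def drop_redundant(states, outputs, min_state_distance, min_output_distance):
    """Return indices of samples to keep: a sample is dropped only if
    some kept sample is closer than min_state_distance in state space
    AND closer than min_output_distance in output space."""
    kept = []
    for i in range(len(states)):
        redundant = False
        for j in kept:
            if (np.linalg.norm(states[i] - states[j]) < min_state_distance
                    and np.linalg.norm(outputs[i] - outputs[j]) < min_output_distance):
                redundant = True
                break
        if not redundant:
            kept.append(i)
    return kept
```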
Then return to Step 3 with the improved model. With each iteration the simulator becomes more accurate, the policy improves, and data collection explores increasingly interesting regions of the state space.
**When to stop**: when the policy performs well in the real game and the kNN uncertainty stays low throughout an episode (indicating the policy stays within the known data distribution).
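That stopping rule can be expressed as a simple check over the per-step uncertainties logged during a real-game episode (illustrative only; the threshold value is an assumption):

```python
def iteration_converged(episode_uncertainties, policy_succeeded,
                        max_uncertainty=0.5):
    """Stop iterating when the policy reached its goal in the real game
    AND the kNN uncertainty never spiked during the episode, i.e. the
    policy stayed inside the region covered by real data."""
    return policy_succeeded and max(episode_uncertainties) <= max_uncertainty
```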
## Testing
NuCon includes a test suite to verify its functionality and compatibility with the Nucleares game.