docs: add full training loop section to README

Documents the iterative sim-to-real workflow:
1. Human data collection during gameplay
2. Initial model fitting (kNN or NN)
3. RL training in simulator (SAC + HER)
4. Eval in game while collecting new data
5. Refit model, repeat

Includes ASCII flow diagram, code for each step, and a convergence
criterion (low kNN uncertainty throughout episode).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dominik Moritz Roth 2026-03-12 18:13:12 +01:00
parent c3111ad5be
commit a4f898c3ad

README.md

@@ -356,6 +356,165 @@ knn_learner.save_model('reactor_knn.pkl')
The trained models can be integrated into the NuconSimulator to provide accurate dynamics based on real game data.
## Full Training Loop
The recommended end-to-end workflow for training an RL operator is an iterative cycle of real-game data collection, model fitting, and simulated training. The real game is slow and cannot be parallelised, so the bulk of RL training happens in the simulator — the game is used only as an oracle for data and evaluation.
```
┌─────────────────────────────────────────────────────────────┐
│ 1. Human dataset collection                                 │
│    Play the game: start up the reactor, operate it across   │
│    a range of states. NuCon records state transitions.      │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Initial model fitting                                    │
│    Fit NN or kNN dynamics model to the collected dataset.   │
│    kNN is instant; NN needs gradient steps but generalises  │
│    better with more data.                                   │
└───────────────────────┬─────────────────────────────────────┘
                        │
              ┌─────────▼───────────┐
              │ 3. Train RL         │◄──────────────────────┐
              │    in simulator     │                       │
              │    (fast, many      │                       │
              │    trajectories)    │                       │
              └─────────┬───────────┘                       │
                        │                                   │
              ┌─────────▼───────────┐                       │
              │ 4. Eval in game     │                       │
              │    + collect new    │                       │
              │    data (merge &    │                       │
              │    prune dataset)   │                       │
              └─────────┬───────────┘                       │
                        │                                   │
              ┌─────────▼───────────┐    model improved?    │
              │ 5. Refit model      ├──────── yes ──────────┘
              │    on expanded data │
              └─────────────────────┘
```
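Put together, the cycle reads as a simple driver loop. The sketch below is hypothetical: every helper name is a stub standing in for the code shown in Steps 1–5 and is not part of the NuCon API.

```python
# Hypothetical driver for the loop above. All helpers are stubs standing in
# for the real code shown in Steps 1-5 below, so the control flow runs end
# to end without the game or simulator.
def collect_human_data():                 # Step 1: play, record transitions
    return ['batch0']

def fit_model(dataset):                   # Steps 2 and 5: kNN or NN fit
    return ('model', len(dataset))

def train_rl_in_sim(model):               # Step 3: SAC + HER in the simulator
    return 'policy'

def eval_in_game(policy, iteration):      # Step 4: real-game eval + new data
    converged = iteration >= 2            # stand-in for the real criterion
    return [f'batch{iteration + 1}'], converged

def training_loop(max_iterations=10):
    dataset = collect_human_data()
    model = fit_model(dataset)
    policy = None
    for i in range(max_iterations):
        policy = train_rl_in_sim(model)
        new_data, converged = eval_in_game(policy, i)
        dataset = dataset + new_data      # merge (pruning omitted in this stub)
        model = fit_model(dataset)        # refit on the expanded dataset
        if converged:
            break
    return policy, dataset
```

The loop exits as soon as the real-game evaluation signals convergence; the criterion itself is discussed at the end of this section.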
### Step 1 — Human dataset collection
Start `NuconModelLearner` before or during your play session. Try to cover a wide range of reactor states — startup from cold, ramping power up and down, adjusting individual rod banks, pump speed changes. Diversity in the dataset directly determines how accurate the simulator will be.
```python
from nucon.model import NuconModelLearner

learner = NuconModelLearner(
    dataset_path='reactor_dataset.pkl',
    time_delta=10.0,  # 10 game-seconds per sample
)
learner.collect_data(num_steps=500, save_every=10)
```
The collector saves every 10 steps, retries automatically on game crashes, and scales wall-clock sleep with `GAME_SIM_SPEED` so samples are always 10 game-seconds apart regardless of simulation speed.
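The speed scaling amounts to dividing the desired game-time interval by the current simulation speed. A minimal sketch of that arithmetic (NuCon's actual handling of `GAME_SIM_SPEED` may differ):

```python
def wall_clock_sleep(time_delta, game_sim_speed):
    """Real seconds to wait so that consecutive samples stay `time_delta`
    game-seconds apart, regardless of the game's simulation speed."""
    return time_delta / game_sim_speed

# At 4x speed, a 10 game-second interval needs only 2.5 real seconds.
print(wall_clock_sleep(10.0, 4.0))  # -> 2.5
```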
### Step 2 — Initial model fitting
```python
from nucon.model import NuconModelLearner
learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
# Option A: kNN + GP (instant fit, built-in uncertainty estimation)
learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
learner.fit_knn(k=10)
learner.save_model('reactor_knn.pkl')
# Option B: Neural network (better extrapolation with larger datasets)
learner.train_model(batch_size=32, num_epochs=50)
learner.drop_well_fitted(error_threshold=1.0) # keep hard samples for next round
learner.save_model('reactor_nn.pth')
```
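For intuition, the kNN half of Option A boils down to averaging the outputs of the nearest recorded transitions, with neighbour distance doubling as a cheap uncertainty signal. A minimal NumPy sketch of that idea (illustrative only; NuCon's `fit_knn` and its GP-based uncertainty estimation are not shown here):

```python
import numpy as np

def knn_predict(X_train, Y_train, x, k=3):
    """Predict the output for query x as the mean of its k nearest
    neighbours' outputs; also return the mean neighbour distance as a
    crude uncertainty signal (far neighbours = out of distribution)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k]
    return Y_train[idx].mean(axis=0), dists[idx].mean()

# Toy 1-D dynamics: output = 2 * input.
X = np.arange(10.0).reshape(-1, 1)
Y = 2.0 * X
pred, unc = knn_predict(X, Y, np.array([4.0]), k=3)
print(pred, unc)  # neighbours 3, 4, 5 -> mean output 8.0
```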
### Step 3 — Train RL in simulator
Load the fitted model into the simulator and train with SAC + HER. The simulator runs orders of magnitude faster than the real game, allowing millions of steps in reasonable time.
```python
from nucon.sim import NuconSimulator, OperatingState
from nucon.rl import NuconGoalEnv
from stable_baselines3 import SAC
from stable_baselines3.common.buffers import HerReplayBuffer
simulator = NuconSimulator()
simulator.load_model('reactor_knn.pkl')
simulator.set_state(OperatingState.NOMINAL)

env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
    tolerance=0.05,
    simulator=simulator,
    seconds_per_step=10,
)

model = SAC(
    'MultiInputPolicy', env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
    verbose=1,
)
model.learn(total_timesteps=500_000)
model.save('rl_policy.zip')
```
### Step 4 — Eval in game + collect new data
Run the trained policy against the real game. This validates whether the simulator was accurate enough, and simultaneously collects new data covering states the policy visits — which may be regions the original dataset missed.
```python
from nucon.rl import NuconGoalEnv
from nucon.model import NuconModelLearner
from stable_baselines3 import SAC
# Load the trained policy and run it against the real game
env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
    seconds_per_step=10,
)
policy = SAC.load('rl_policy.zip')

# Collect new data alongside the evaluation (e.g. run collect_data from a
# second process while the policy drives the game)
new_data_learner = NuconModelLearner(
    dataset_path='reactor_dataset_new.pkl',
    time_delta=10.0,
)

obs, _ = env.reset()
for _ in range(200):
    action, _ = policy.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```
### Step 5 — Refit model on expanded data
Merge the new data into the original dataset and refit:
```python
learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
learner.merge_datasets('reactor_dataset_new.pkl')
# Prune redundant samples before refitting
learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
print(f"Dataset size after pruning: {len(learner.dataset)}")
learner.fit_knn(k=10)
learner.save_model('reactor_knn.pkl')
```
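`drop_redundant` keeps the dataset compact by discarding samples that are close to an already-kept sample in both state and output space. A greedy sketch of that idea (the actual pruning logic in NuCon may differ):

```python
import numpy as np

def drop_redundant(states, outputs, min_state_distance, min_output_distance):
    """Greedy pruning: keep a sample only if no already-kept sample is
    within min_state_distance in state space AND min_output_distance in
    output space. Returns the indices of the kept samples."""
    keep = []
    for i in range(len(states)):
        redundant = any(
            np.linalg.norm(states[i] - states[j]) < min_state_distance
            and np.linalg.norm(outputs[i] - outputs[j]) < min_output_distance
            for j in keep
        )
        if not redundant:
            keep.append(i)
    return keep

states = np.array([[0.0], [0.01], [1.0]])
outputs = np.array([[0.0], [0.0], [5.0]])
print(drop_redundant(states, outputs, 0.1, 0.05))  # sample 1 duplicates sample 0
```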
Then return to Step 3 with the improved model. With each iteration the simulator becomes more accurate, the policy improves, and the new data collection explores increasingly interesting regions of the state space.
**When to stop**: when the policy performs well in the real game and the kNN uncertainty stays low throughout an episode (indicating the policy stays within the known data distribution).
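The uncertainty half of this criterion can be checked by logging a per-step uncertainty value during a real-game episode and looking at its worst case. A sketch, assuming per-step uncertainties are available (e.g. the neighbour-distance signal a kNN model can expose):

```python
def stays_in_distribution(uncertainties, threshold):
    """True if kNN uncertainty stayed below the threshold for the whole
    episode, i.e. the policy never left the known data distribution."""
    return max(uncertainties) < threshold

episode_unc = [0.02, 0.03, 0.01, 0.04]   # logged once per env step
print(stays_in_distribution(episode_unc, threshold=0.05))            # -> True
print(stays_in_distribution(episode_unc + [0.2], threshold=0.05))    # -> False
```

A single high-uncertainty step is enough to fail the check, since it means the simulator was extrapolating for part of the episode.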
## Testing
NuCon includes a test suite to verify its functionality and compatibility with the Nucleares game.