docs: add full training loop section to README

Documents the iterative sim-to-real workflow:
1. Human data collection during gameplay
2. Initial model fitting (kNN or NN)
3. RL training in simulator (SAC + HER)
4. Eval in game while collecting new data
5. Refit model, repeat

Includes ASCII flow diagram, code for each step, and a convergence
criterion (low kNN uncertainty throughout episode).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dominik Moritz Roth 2026-03-12 18:13:12 +01:00
parent c3111ad5be
commit a4f898c3ad

README.md
The trained models can be integrated into the NuconSimulator to provide accurate dynamics based on real game data.
## Full Training Loop
The recommended end-to-end workflow for training an RL operator is an iterative cycle of real-game data collection, model fitting, and simulated training. The real game is slow and cannot be parallelised, so the bulk of RL training happens in the simulator; the game serves only as an oracle for data collection and evaluation.
```
┌─────────────────────────────────────────────────────────────┐
│ 1. Human dataset collection                                 │
│    Play the game: start up the reactor, operate it across   │
│    a range of states. NuCon records state transitions.      │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Initial model fitting                                    │
│    Fit NN or kNN dynamics model to the collected dataset.   │
│    kNN is instant; NN needs gradient steps but generalises  │
│    better with more data.                                   │
└───────────────────────┬─────────────────────────────────────┘
                        │
              ┌─────────▼──────────┐
              │ 3. Train RL        │◄──────────────────────┐
              │    in simulator    │                       │
              │    (fast, many     │                       │
              │    trajectories)   │                       │
              └─────────┬──────────┘                       │
                        │                                  │
                        ▼                                  │
              ┌─────────────────────┐                      │
              │ 4. Eval in game     │                      │
              │ + collect new data  │                      │
              │ (merge & prune      │                      │
              │ dataset)            │                      │
              └─────────┬───────────┘                      │
                        │                                  │
                        ▼                                  │
              ┌─────────────────────┐  model improved?     │
              │ 5. Refit model      ├──────── yes ─────────┘
              │ on expanded data    │
              └─────────────────────┘
```
### Step 1 — Human dataset collection
Start `NuconModelLearner` before or during your play session. Try to cover a wide range of reactor states — startup from cold, ramping power up and down, adjusting individual rod banks, pump speed changes. Diversity in the dataset directly determines how accurate the simulator will be.
```python
from nucon.model import NuconModelLearner
learner = NuconModelLearner(
    dataset_path='reactor_dataset.pkl',
    time_delta=10.0,  # 10 game-seconds per sample
)

learner.collect_data(num_steps=500, save_every=10)
```
The collector saves every 10 steps, retries automatically on game crashes, and scales wall-clock sleep with `GAME_SIM_SPEED` so samples are always 10 game-seconds apart regardless of simulation speed.
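The speed scaling amounts to dividing the desired game-time interval by the current simulation speed. A minimal sketch of that logic (a hypothetical helper, not NuCon's actual implementation):

```python
def wall_clock_sleep(time_delta: float, game_sim_speed: float) -> float:
    """Seconds of real time to wait so consecutive samples stay
    `time_delta` game-seconds apart when the game runs at
    `game_sim_speed` times real time."""
    if game_sim_speed <= 0:
        raise ValueError("simulation speed must be positive")
    return time_delta / game_sim_speed

# At 2x speed, 10 game-seconds pass in 5 wall-clock seconds.
print(wall_clock_sleep(10.0, 2.0))  # → 5.0
```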
### Step 2 — Initial model fitting
```python
from nucon.model import NuconModelLearner
learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
# Option A: kNN + GP (instant fit, built-in uncertainty estimation)
learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
learner.fit_knn(k=10)
learner.save_model('reactor_knn.pkl')

# Option B: Neural network (better extrapolation with larger datasets)
learner.train_model(batch_size=32, num_epochs=50)
learner.drop_well_fitted(error_threshold=1.0)  # keep hard samples for next round
learner.save_model('reactor_nn.pth')
```
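For intuition, the kNN option's built-in uncertainty can be pictured as the distance from a query state to its nearest stored samples: predictions far from all recorded data are flagged as unreliable. An illustrative, self-contained sketch (not the actual `NuconModelLearner` internals, which use a GP on top of kNN):

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Predict an output as the distance-weighted mean of the k nearest
    stored samples; return the mean neighbour distance as a crude
    uncertainty signal (large = query is outside the known data)."""
    dists = np.linalg.norm(train_X - query, axis=1)
    idx = np.argsort(dists)[:k]
    weights = 1.0 / (dists[idx] + 1e-8)
    pred = (weights[:, None] * train_y[idx]).sum(axis=0) / weights.sum()
    uncertainty = dists[idx].mean()
    return pred, uncertainty
```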
### Step 3 — Train RL in simulator
Load the fitted model into the simulator and train with SAC + HER. The simulator runs orders of magnitude faster than the real game, allowing millions of steps in reasonable time.
```python
from nucon.sim import NuconSimulator, OperatingState
from nucon.rl import NuconGoalEnv
from stable_baselines3 import SAC
from stable_baselines3.common.buffers import HerReplayBuffer
simulator = NuconSimulator()
simulator.load_model('reactor_knn.pkl')
simulator.set_state(OperatingState.NOMINAL)

env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
    tolerance=0.05,
    simulator=simulator,
    seconds_per_step=10,
)

model = SAC(
    'MultiInputPolicy', env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
    verbose=1,
)
model.learn(total_timesteps=500_000)
model.save('rl_policy.zip')
```
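HER makes the sparse goal-reaching reward learnable by relabelling stored transitions with goals that were actually achieved later in the same episode (the `'future'` strategy configured above). A simplified sketch of the idea, independent of stable-baselines3:

```python
import random

def her_relabel(episode, n_sampled_goal=4, rng=None):
    """For each transition (obs, action, achieved_goal, desired_goal,
    reward), add n_sampled_goal copies whose desired goal is the
    achieved goal of a random *future* step in the episode. Copies
    that hit their new goal get reward 0.0 (success) instead of -1.0,
    turning failed episodes into useful training signal."""
    rng = rng or random.Random(0)
    relabelled = []
    for t, (obs, act, achieved, desired, reward) in enumerate(episode):
        relabelled.append((obs, act, achieved, desired, reward))
        future = episode[t:]  # this step or any later one
        for _ in range(n_sampled_goal):
            new_goal = rng.choice(future)[2]  # a future achieved goal
            new_reward = 0.0 if new_goal == achieved else -1.0
            relabelled.append((obs, act, achieved, new_goal, new_reward))
    return relabelled
```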
### Step 4 — Eval in game + collect new data
Run the trained policy against the real game. This validates whether the simulator was accurate enough, and simultaneously collects new data covering states the policy visits — which may be regions the original dataset missed.
```python
from nucon.rl import NuconGoalEnv
from nucon.model import NuconModelLearner
from stable_baselines3 import SAC
# Load policy and run in real game
env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
    seconds_per_step=10,
)
policy = SAC.load('rl_policy.zip')

# Simultaneously collect new data
new_data_learner = NuconModelLearner(
    dataset_path='reactor_dataset_new.pkl',
    time_delta=10.0,
)

obs, _ = env.reset()
for _ in range(200):
    action, _ = policy.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```
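The `tolerance=0.05` setting above presumably counts a goal as reached when every goal parameter lands within 5% of its configured range. A sketch of such a check (assumed semantics, not the actual `NuconGoalEnv` code):

```python
def goal_reached(achieved, desired, goal_range, tolerance=0.05):
    """True when every goal parameter is within `tolerance` of its
    desired value, measured relative to that parameter's full range.
    `achieved`/`desired` map parameter name -> value."""
    for name, target in desired.items():
        lo, hi = goal_range[name]
        if abs(achieved[name] - target) > tolerance * (hi - lo):
            return False
    return True
```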
### Step 5 — Refit model on expanded data
Merge the new data into the original dataset and refit:
```python
learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
learner.merge_datasets('reactor_dataset_new.pkl')
# Prune redundant samples before refitting
learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
print(f"Dataset size after pruning: {len(learner.dataset)}")
learner.fit_knn(k=10)
learner.save_model('reactor_knn.pkl')
```
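The pruning step can be pictured as a greedy filter that drops samples whose state and output are both nearly identical to an already-kept sample, since near-duplicates add nothing to a kNN model. An illustrative sketch (not the actual `drop_redundant` implementation):

```python
import numpy as np

def drop_redundant(states, outputs, min_state_distance, min_output_distance):
    """Return indices of samples to keep: a sample is dropped only if
    some kept sample is closer than min_state_distance in state space
    AND closer than min_output_distance in output space."""
    kept = []
    for i in range(len(states)):
        redundant = False
        for j in kept:
            if (np.linalg.norm(states[i] - states[j]) < min_state_distance
                    and np.linalg.norm(outputs[i] - outputs[j]) < min_output_distance):
                redundant = True
                break
        if not redundant:
            kept.append(i)
    return kept
```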
Then return to Step 3 with the improved model. With each iteration the simulator becomes more accurate, the policy improves, and data collection explores increasingly interesting regions of the state space.
**When to stop**: when the policy performs well in the real game and the kNN uncertainty stays low throughout an episode (indicating the policy stays within the known data distribution).
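That stopping rule can be expressed as a simple check over the per-step uncertainties logged during a real-game episode (illustrative only; the threshold value is an assumption):

```python
def iteration_converged(episode_uncertainties, policy_succeeded,
                        max_uncertainty=0.5):
    """Stop iterating when the policy reached its goal in the real game
    AND the kNN uncertainty never spiked during the episode, i.e. the
    policy stayed inside the region covered by real data."""
    return policy_succeeded and max(episode_uncertainties) <= max_uncertainty
```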
## Testing
NuCon includes a test suite to verify its functionality and compatibility with the Nucleares game.