docs: add full training loop section to README

Documents the iterative sim-to-real workflow:
1. Human data collection during gameplay
2. Initial model fitting (kNN or NN)
3. RL training in simulator (SAC + HER)
4. Eval in game while collecting new data
5. Refit model, repeat

Includes ASCII flow diagram, code for each step, and a convergence
criterion (low kNN uncertainty throughout episode).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dominik Moritz Roth 2026-03-12 18:13:12 +01:00
parent c3111ad5be
commit a4f898c3ad

README.md

@@ -356,6 +356,165 @@ knn_learner.save_model('reactor_knn.pkl')
The trained models can be integrated into the NuconSimulator to provide accurate dynamics based on real game data.
## Full Training Loop
The recommended end-to-end workflow for training an RL operator is an iterative cycle of real-game data collection, model fitting, and simulated training. The real game is slow and cannot be parallelised, so the bulk of RL training happens in the simulator — the game is used only as an oracle for data and evaluation.
```
┌─────────────────────────────────────────────────────────────┐
│ 1. Human dataset collection                                 │
│    Play the game: start up the reactor, operate it across   │
│    a range of states. NuCon records state transitions.      │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Initial model fitting                                    │
│    Fit NN or kNN dynamics model to the collected dataset.   │
│    kNN is instant; NN needs gradient steps but generalises  │
│    better with more data.                                   │
└───────────────────────┬─────────────────────────────────────┘
                        │
              ┌─────────▼───────────┐
              │ 3. Train RL         │◄──────────────────────┐
              │    in simulator     │                       │
              │    (fast, many      │                       │
              │    trajectories)    │                       │
              └─────────┬───────────┘                       │
                        │                                   │
              ┌─────────▼───────────┐                       │
              │ 4. Eval in game     │                       │
              │    + collect new    │                       │
              │    data (merge &    │                       │
              │    prune dataset)   │                       │
              └─────────┬───────────┘                       │
                        │                                   │
              ┌─────────▼───────────┐    model improved?    │
              │ 5. Refit model      ├──────── yes ──────────┘
              │    on expanded data │
              └─────────────────────┘
```
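Put together, the cycle reads as a simple driver loop. The sketch below is hypothetical: every helper name is a stub standing in for the code shown in Steps 1–5 and is not part of the NuCon API.

```python
# Hypothetical driver for the loop above. All helpers are stubs standing in
# for the real code shown in Steps 1-5 below, so the control flow runs end
# to end without the game or simulator.
def collect_human_data():                 # Step 1: play, record transitions
    return ['batch0']

def fit_model(dataset):                   # Steps 2 and 5: kNN or NN fit
    return ('model', len(dataset))

def train_rl_in_sim(model):               # Step 3: SAC + HER in the simulator
    return 'policy'

def eval_in_game(policy, iteration):      # Step 4: real-game eval + new data
    converged = iteration >= 2            # stand-in for the real criterion
    return [f'batch{iteration + 1}'], converged

def training_loop(max_iterations=10):
    dataset = collect_human_data()
    model = fit_model(dataset)
    policy = None
    for i in range(max_iterations):
        policy = train_rl_in_sim(model)
        new_data, converged = eval_in_game(policy, i)
        dataset = dataset + new_data      # merge (pruning omitted in this stub)
        model = fit_model(dataset)        # refit on the expanded dataset
        if converged:
            break
    return policy, dataset
```

The loop exits as soon as the real-game evaluation signals convergence; the criterion itself is discussed at the end of this section.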
### Step 1 — Human dataset collection
Start `NuconModelLearner` before or during your play session. Try to cover a wide range of reactor states — startup from cold, ramping power up and down, adjusting individual rod banks, pump speed changes. Diversity in the dataset directly determines how accurate the simulator will be.
```python
from nucon.model import NuconModelLearner

learner = NuconModelLearner(
    dataset_path='reactor_dataset.pkl',
    time_delta=10.0,  # 10 game-seconds per sample
)
learner.collect_data(num_steps=500, save_every=10)
```
The collector saves every 10 steps, retries automatically on game crashes, and scales wall-clock sleep with `GAME_SIM_SPEED` so samples are always 10 game-seconds apart regardless of simulation speed.
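The speed scaling amounts to dividing the desired game-time interval by the current simulation speed. A minimal sketch of that arithmetic (NuCon's actual handling of `GAME_SIM_SPEED` may differ):

```python
def wall_clock_sleep(time_delta, game_sim_speed):
    """Real seconds to wait so that consecutive samples stay `time_delta`
    game-seconds apart, regardless of the game's simulation speed."""
    return time_delta / game_sim_speed

# At 4x speed, a 10 game-second interval needs only 2.5 real seconds.
print(wall_clock_sleep(10.0, 4.0))  # -> 2.5
```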
### Step 2 — Initial model fitting
```python
from nucon.model import NuconModelLearner
learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
# Option A: kNN + GP (instant fit, built-in uncertainty estimation)
learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
learner.fit_knn(k=10)
learner.save_model('reactor_knn.pkl')
# Option B: Neural network (better extrapolation with larger datasets)
learner.train_model(batch_size=32, num_epochs=50)
learner.drop_well_fitted(error_threshold=1.0) # keep hard samples for next round
learner.save_model('reactor_nn.pth')
```
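For intuition, the kNN half of Option A boils down to averaging the outputs of the nearest recorded transitions, with neighbour distance doubling as a cheap uncertainty signal. A minimal NumPy sketch of that idea (illustrative only; NuCon's `fit_knn` and its GP-based uncertainty estimation are not shown here):

```python
import numpy as np

def knn_predict(X_train, Y_train, x, k=3):
    """Predict the output for query x as the mean of its k nearest
    neighbours' outputs; also return the mean neighbour distance as a
    crude uncertainty signal (far neighbours = out of distribution)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k]
    return Y_train[idx].mean(axis=0), dists[idx].mean()

# Toy 1-D dynamics: output = 2 * input.
X = np.arange(10.0).reshape(-1, 1)
Y = 2.0 * X
pred, unc = knn_predict(X, Y, np.array([4.0]), k=3)
print(pred, unc)  # neighbours 3, 4, 5 -> mean output 8.0
```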
### Step 3 — Train RL in simulator
Load the fitted model into the simulator and train with SAC + HER. The simulator runs orders of magnitude faster than the real game, allowing millions of steps in reasonable time.
```python
from nucon.sim import NuconSimulator, OperatingState
from nucon.rl import NuconGoalEnv
from stable_baselines3 import SAC
from stable_baselines3.common.buffers import HerReplayBuffer
simulator = NuconSimulator()
simulator.load_model('reactor_knn.pkl')
simulator.set_state(OperatingState.NOMINAL)

env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
    tolerance=0.05,
    simulator=simulator,
    seconds_per_step=10,
)

model = SAC(
    'MultiInputPolicy', env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
    verbose=1,
)
model.learn(total_timesteps=500_000)
model.save('rl_policy.zip')
```
### Step 4 — Eval in game + collect new data
Run the trained policy against the real game. This validates whether the simulator was accurate enough, and simultaneously collects new data covering states the policy visits — which may be regions the original dataset missed.
```python
from nucon.rl import NuconGoalEnv
from nucon.model import NuconModelLearner
from stable_baselines3 import SAC
# Load the trained policy and run it against the real game
env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
    seconds_per_step=10,
)
policy = SAC.load('rl_policy.zip')

# Collect new data alongside the evaluation (e.g. run collect_data from a
# second process while the policy drives the game)
new_data_learner = NuconModelLearner(
    dataset_path='reactor_dataset_new.pkl',
    time_delta=10.0,
)

obs, _ = env.reset()
for _ in range(200):
    action, _ = policy.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```
### Step 5 — Refit model on expanded data
Merge the new data into the original dataset and refit:
```python
learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
learner.merge_datasets('reactor_dataset_new.pkl')
# Prune redundant samples before refitting
learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
print(f"Dataset size after pruning: {len(learner.dataset)}")
learner.fit_knn(k=10)
learner.save_model('reactor_knn.pkl')
```
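`drop_redundant` keeps the dataset compact by discarding samples that are close to an already-kept sample in both state and output space. A greedy sketch of that idea (the actual pruning logic in NuCon may differ):

```python
import numpy as np

def drop_redundant(states, outputs, min_state_distance, min_output_distance):
    """Greedy pruning: keep a sample only if no already-kept sample is
    within min_state_distance in state space AND min_output_distance in
    output space. Returns the indices of the kept samples."""
    keep = []
    for i in range(len(states)):
        redundant = any(
            np.linalg.norm(states[i] - states[j]) < min_state_distance
            and np.linalg.norm(outputs[i] - outputs[j]) < min_output_distance
            for j in keep
        )
        if not redundant:
            keep.append(i)
    return keep

states = np.array([[0.0], [0.01], [1.0]])
outputs = np.array([[0.0], [0.0], [5.0]])
print(drop_redundant(states, outputs, 0.1, 0.05))  # sample 1 duplicates sample 0
```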
Then return to Step 3 with the improved model. With each iteration the simulator becomes more accurate, the policy improves, and the new data collection explores increasingly interesting regions of the state space.
**When to stop**: when the policy performs well in the real game and the kNN uncertainty stays low throughout an episode (indicating the policy stays within the known data distribution).
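The uncertainty half of this criterion can be checked by logging a per-step uncertainty value during a real-game episode and looking at its worst case. A sketch, assuming per-step uncertainties are available (e.g. the neighbour-distance signal a kNN model can expose):

```python
def stays_in_distribution(uncertainties, threshold):
    """True if kNN uncertainty stayed below the threshold for the whole
    episode, i.e. the policy never left the known data distribution."""
    return max(uncertainties) < threshold

episode_unc = [0.02, 0.03, 0.01, 0.04]   # logged once per env step
print(stays_in_distribution(episode_unc, threshold=0.05))            # -> True
print(stays_in_distribution(episode_unc + [0.2], threshold=0.05))    # -> False
```

A single high-uncertainty step is enough to fail the check, since it means the simulator was extrapolating for part of the episode.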
## Testing
NuCon includes a test suite to verify its functionality and compatibility with the Nucleares game.