docs: replace step-by-step code blocks in training loop with prose

The prior sections already have full code examples; the training loop section now just describes each step concisely and links back to them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
parent f0cc7ba9c4
commit 7ee8272034

README.md 118
@@ -397,123 +397,17 @@ The recommended end-to-end workflow for training an RL operator is an iterative
 └─────────────────────┘
 ```
 
-### Step 1 — Human dataset collection
-
-Start `NuconModelLearner` before or during your play session. Try to cover a wide range of reactor states: startup from cold, ramping power up and down, adjusting individual rod banks, pump speed changes. Diversity in the dataset directly determines how accurate the simulator will be.
-
-```python
-from nucon.model import NuconModelLearner
-
-learner = NuconModelLearner(
-    dataset_path='reactor_dataset.pkl',
-    time_delta=10.0,  # 10 game-seconds per sample
-)
-learner.collect_data(num_steps=500, save_every=10)
-```
-
-The collector saves every 10 steps, retries automatically on game crashes, and scales wall-clock sleep with `GAME_SIM_SPEED` so samples are always 10 game-seconds apart regardless of simulation speed.
-
-### Step 2 — Initial model fitting
-
-```python
-from nucon.model import NuconModelLearner
-
-learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
-
-# Option A: kNN + GP (instant fit, built-in uncertainty estimation)
-learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
-learner.fit_knn(k=10)
-learner.save_model('reactor_knn.pkl')
-
-# Option B: Neural network (better extrapolation with larger datasets)
-learner.train_model(batch_size=32, num_epochs=50)
-learner.drop_well_fitted(error_threshold=1.0)  # keep hard samples for next round
-learner.save_model('reactor_nn.pth')
-```
-
-### Step 3 — Train RL in simulator
-
-Load the fitted model into the simulator and train with SAC + HER. The simulator runs orders of magnitude faster than the real game, allowing millions of steps in reasonable time.
-
-```python
-from nucon.sim import NuconSimulator, OperatingState
-from nucon.rl import NuconGoalEnv
-from stable_baselines3 import SAC
-from stable_baselines3.common.buffers import HerReplayBuffer
-
-simulator = NuconSimulator()
-simulator.load_model('reactor_knn.pkl')
-simulator.set_state(OperatingState.NOMINAL)
-
-env = NuconGoalEnv(
-    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
-    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
-    tolerance=0.05,
-    simulator=simulator,
-    seconds_per_step=10,
-)
-
-model = SAC(
-    'MultiInputPolicy', env,
-    replay_buffer_class=HerReplayBuffer,
-    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
-    verbose=1,
-)
-model.learn(total_timesteps=500_000)
-model.save('rl_policy.zip')
-```
-
-### Step 4 — Eval in game + collect new data
-
-Run the trained policy against the real game. This validates whether the simulator was accurate enough, and simultaneously collects new data covering states the policy visits, which may be regions the original dataset missed.
-
-```python
-from nucon.rl import NuconGoalEnv
-from nucon.model import NuconModelLearner
-from stable_baselines3 import SAC
-import numpy as np
-
-# Load policy and run in real game
-env = NuconGoalEnv(
-    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
-    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
-    seconds_per_step=10,
-)
-policy = SAC.load('rl_policy.zip')
-
-# Simultaneously collect new data
-new_data_learner = NuconModelLearner(
-    dataset_path='reactor_dataset_new.pkl',
-    time_delta=10.0,
-)
-
-obs, _ = env.reset()
-for _ in range(200):
-    action, _ = policy.predict(obs, deterministic=True)
-    obs, reward, terminated, truncated, _ = env.step(action)
-    if terminated or truncated:
-        obs, _ = env.reset()
-```
-
-### Step 5 — Refit model on expanded data
-
-Merge the new data into the original dataset and refit:
-
-```python
-learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
-learner.merge_datasets('reactor_dataset_new.pkl')
-
-# Prune redundant samples before refitting
-learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
-print(f"Dataset size after pruning: {len(learner.dataset)}")
-
-learner.fit_knn(k=10)
-learner.save_model('reactor_knn.pkl')
-```
-
-Then go back to Step 3 with the improved model. Each iteration the simulator gets more accurate, the policy gets better, and the new data collection explores increasingly interesting regions of state space.
-
-**When to stop**: when the policy performs well in the real game and the kNN uncertainty stays low throughout an episode (indicating the policy stays within the known data distribution).
+**Step 1 — Human dataset collection**: Run `NuconModelLearner.collect_data()` during your play session. Cover a wide range of states: startup from cold, ramping power, individual rod bank adjustments. Diversity in the dataset directly determines simulator accuracy. See [Model Learning](#model-learning-work-in-progress) for collection details.
+
+**Step 2 — Initial model fitting**: Fit a kNN model (instant) or NN (better extrapolation with larger datasets) using `fit_knn()` or `train_model()`. Prune near-duplicate samples with `drop_redundant()` before fitting. See [Model Learning](#model-learning-work-in-progress).
+
+**Step 3 — Train RL in simulator**: Load the fitted model into `NuconSimulator`, then train a `NuconGoalEnv` policy with SAC + HER. The simulator runs far faster than the real game, allowing many trajectories in reasonable time. See [NuconGoalEnv + HER Usage](#nucongoalenv--her-usage).
+
+**Step 4 — Eval in game + collect new data**: Run the trained policy against the real game. This validates simulator accuracy and simultaneously collects new data from states the policy visits, which may be regions the original dataset missed. Run a second `NuconModelLearner` in a background thread to collect concurrently.
+
+**Step 5 — Refit model on expanded data**: Merge new data into the original dataset with `merge_datasets()`, prune with `drop_redundant()`, and refit. Then return to Step 3 with the improved model. Each iteration the simulator gets more accurate and the policy improves.
+
+Stop when the policy performs well in the real game and kNN uncertainty stays low throughout an episode, indicating the policy stays within the known data distribution.
 
 ## Testing
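For intuition about the pruning step, the kind of near-duplicate filtering that `drop_redundant(min_state_distance=…, min_output_distance=…)` performs can be sketched in plain Python. This is an illustrative greedy sketch, not nucon's actual implementation:

```python
from math import dist

def drop_redundant(states, outputs, min_state_distance, min_output_distance):
    """Greedily keep a sample only if no already-kept sample has BOTH a
    nearby state and a nearby output. Illustrative sketch, not nucon's code."""
    kept = []
    for i in range(len(states)):
        redundant = any(
            dist(states[i], states[j]) < min_state_distance
            and dist(outputs[i], outputs[j]) < min_output_distance
            for j in kept
        )
        if not redundant:
            kept.append(i)
    return kept

# Three near-identical samples plus one distinct sample: two survive.
states = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.02), (5.0, 5.0)]
outputs = [(1.0,), (1.01,), (0.99,), (9.0,)]
print(drop_redundant(states, outputs, 0.1, 0.05))  # → [0, 3]
```

Note that a sample whose state is near a kept one but whose output differs by more than `min_output_distance` survives, since it carries new dynamics information.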
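As background for the SAC + HER training step, the 'future' goal-relabelling that `HerReplayBuffer` applies can be sketched independently. This is an illustrative sketch of the idea, not stable-baselines3's implementation; the transition tuple layout here is hypothetical:

```python
import random

def her_relabel(episode, n_sampled_goal=4):
    """For each transition (obs, action, achieved_goal, goal), also emit
    n_sampled_goal copies whose goal is replaced by an achieved goal from a
    later step of the same episode (the 'future' strategy). Failed episodes
    thus still yield transitions with positive reward signal."""
    relabeled = []
    for t, (obs, action, achieved, goal) in enumerate(episode):
        relabeled.append((obs, action, achieved, goal, achieved == goal))
        for _ in range(n_sampled_goal):
            future_t = random.randint(t, len(episode) - 1)
            new_goal = episode[future_t][2]  # goal achieved later in the episode
            relabeled.append((obs, action, achieved, new_goal, achieved == new_goal))
    return relabeled
```

Each stored transition becomes `1 + n_sampled_goal` transitions, which matches the `n_sampled_goal=4` ratio used in the Step 3 example.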