docs: add full training loop section to README
Documents the iterative sim-to-real workflow:

1. Human data collection during gameplay
2. Initial model fitting (kNN or NN)
3. RL training in simulator (SAC + HER)
4. Eval in game while collecting new data
5. Refit model, repeat

Includes ASCII flow diagram, code for each step, and a convergence criterion (low kNN uncertainty throughout episode).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent c3111ad5be
commit a4f898c3ad
159 README.md
@@ -356,6 +356,165 @@ knn_learner.save_model('reactor_knn.pkl')
The trained models can be integrated into the NuconSimulator to provide accurate dynamics based on real game data.

## Full Training Loop
The recommended end-to-end workflow for training an RL operator is an iterative cycle of real-game data collection, model fitting, and simulated training. The real game is slow and cannot be parallelised, so the bulk of RL training happens in the simulator — the game is used only as an oracle for data and evaluation.

```
┌─────────────────────────────────────────────────────────────┐
│ 1. Human dataset collection                                 │
│    Play the game: start up the reactor, operate it across   │
│    a range of states. NuCon records state transitions.      │
└───────────────────────┬─────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Initial model fitting                                    │
│    Fit NN or kNN dynamics model to the collected dataset.   │
│    kNN is instant; NN needs gradient steps but generalises  │
│    better with more data.                                   │
└───────────────────────┬─────────────────────────────────────┘
                        │
              ┌─────────▼──────────┐
              │ 3. Train RL        │◄──────────────────────┐
              │    in simulator    │                       │
              │    (fast, many     │                       │
              │    trajectories)   │                       │
              └─────────┬──────────┘                       │
                        │                                  │
                        ▼                                  │
              ┌─────────────────────┐                      │
              │ 4. Eval in game     │                      │
              │ + collect new data  │                      │
              │ (merge & prune      │                      │
              │    dataset)         │                      │
              └─────────┬───────────┘                      │
                        │                                  │
                        ▼                                  │
              ┌─────────────────────┐    model improved?   │
              │ 5. Refit model      ├──────── yes ─────────┘
              │ on expanded data    │
              └─────────────────────┘
```

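The cycle can also be sketched as an outer loop. This is a schematic skeleton only, not NuCon API: every callable here is a hypothetical placeholder standing in for the real calls shown in Steps 1–5 below.

```python
# Hypothetical skeleton of the iterative sim-to-real cycle.
# Each placeholder callable corresponds to one of Steps 1-5 below.

def run_training_cycle(collect_human_data, fit_model, train_rl,
                       eval_and_collect, merge_and_prune,
                       max_iterations=10, uncertainty_threshold=0.1):
    dataset = collect_human_data()              # Step 1: play, record transitions
    model = fit_model(dataset)                  # Step 2: kNN or NN fit
    policy = None
    for _ in range(max_iterations):
        policy = train_rl(model)                # Step 3: SAC + HER in simulator
        ok, max_unc, new_data = eval_and_collect(policy)  # Step 4: real game
        dataset = merge_and_prune(dataset, new_data)
        model = fit_model(dataset)              # Step 5: refit on expanded data
        if ok and max_unc < uncertainty_threshold:
            break                               # policy stays in-distribution
    return policy, model
```

Passing the steps in as callables keeps the skeleton independent of the concrete NuCon and stable-baselines3 calls used in each step.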
### Step 1 — Human dataset collection
Start `NuconModelLearner` before or during your play session. Try to cover a wide range of reactor states — startup from cold, ramping power up and down, adjusting individual rod banks, pump speed changes. Diversity in the dataset directly determines how accurate the simulator will be.

```python
from nucon.model import NuconModelLearner

learner = NuconModelLearner(
    dataset_path='reactor_dataset.pkl',
    time_delta=10.0,  # 10 game-seconds per sample
)
learner.collect_data(num_steps=500, save_every=10)
```

The collector saves every 10 steps, retries automatically on game crashes, and scales wall-clock sleep with `GAME_SIM_SPEED` so samples are always 10 game-seconds apart regardless of simulation speed.
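That scaling is simple arithmetic, shown below for illustration; the helper function is hypothetical, and only `GAME_SIM_SPEED` and the 10-second `time_delta` come from the text above.

```python
def wall_clock_sleep(time_delta_game_seconds: float, game_sim_speed: float) -> float:
    """Wall-clock seconds to wait so that consecutive samples stay a fixed
    number of *game* seconds apart, whatever the simulation speed."""
    return time_delta_game_seconds / game_sim_speed

print(wall_clock_sleep(10.0, 1.0))  # real-time game: 10.0 s of wall clock
print(wall_clock_sleep(10.0, 4.0))  # game at 4x speed: 2.5 s of wall clock
```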
### Step 2 — Initial model fitting

```python
from nucon.model import NuconModelLearner

learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')

# Option A: kNN + GP (instant fit, built-in uncertainty estimation)
learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
learner.fit_knn(k=10)
learner.save_model('reactor_knn.pkl')

# Option B: Neural network (better extrapolation with larger datasets)
learner.train_model(batch_size=32, num_epochs=50)
learner.drop_well_fitted(error_threshold=1.0)  # keep hard samples for next round
learner.save_model('reactor_nn.pth')
```

### Step 3 — Train RL in simulator
Load the fitted model into the simulator and train with SAC + HER. The simulator runs orders of magnitude faster than the real game, allowing millions of steps in reasonable time.

```python
from nucon.sim import NuconSimulator, OperatingState
from nucon.rl import NuconGoalEnv
from stable_baselines3 import SAC
from stable_baselines3.common.buffers import HerReplayBuffer

simulator = NuconSimulator()
simulator.load_model('reactor_knn.pkl')
simulator.set_state(OperatingState.NOMINAL)

env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
    tolerance=0.05,
    simulator=simulator,
    seconds_per_step=10,
)

model = SAC(
    'MultiInputPolicy', env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
    verbose=1,
)
model.learn(total_timesteps=500_000)
model.save('rl_policy.zip')
```

### Step 4 — Eval in game + collect new data
Run the trained policy against the real game. This validates whether the simulator was accurate enough, and simultaneously collects new data covering states the policy visits — which may be regions the original dataset missed.

```python
from nucon.rl import NuconGoalEnv
from nucon.model import NuconModelLearner
from stable_baselines3 import SAC

# Load policy and run in real game
env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
    seconds_per_step=10,
)
policy = SAC.load('rl_policy.zip')

# Simultaneously collect new data
new_data_learner = NuconModelLearner(
    dataset_path='reactor_dataset_new.pkl',
    time_delta=10.0,
)

obs, _ = env.reset()
for _ in range(200):
    action, _ = policy.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```

### Step 5 — Refit model on expanded data
Merge the new data into the original dataset and refit:

```python
from nucon.model import NuconModelLearner

learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
learner.merge_datasets('reactor_dataset_new.pkl')

# Prune redundant samples before refitting
learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
print(f"Dataset size after pruning: {len(learner.dataset)}")

learner.fit_knn(k=10)
learner.save_model('reactor_knn.pkl')
```

Then go back to Step 3 with the improved model. With each iteration the simulator gets more accurate, the policy gets better, and the new data collection explores increasingly interesting regions of the state space.
**When to stop**: when the policy performs well in the real game and the kNN uncertainty stays low throughout an episode (indicating the policy stays within the known data distribution).
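One possible shape for that check is sketched below. This is an assumption-laden sketch: `predict_with_uncertainty` returning a `(prediction, std)` pair is a hypothetical interface, not confirmed NuCon API.

```python
import numpy as np

def episode_in_distribution(model, states, actions, uncertainty_threshold=0.1):
    """True if the dynamics model is confident about every transition the
    policy visited, i.e. the episode stayed inside the known data."""
    stds = [
        model.predict_with_uncertainty(s, a)[1]  # (prediction, std) assumed
        for s, a in zip(states, actions)
    ]
    return float(np.max(stds)) < uncertainty_threshold
```

If the maximum uncertainty over an episode stays below the threshold while the policy also hits its goals in the real game, further iterations are unlikely to change the model much.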

## Testing

NuCon includes a test suite to verify its functionality and compatibility with the Nucleares game.