docs: replace step-by-step code blocks in training loop with prose

The prior sections already have full code examples; the training loop section now just describes each step concisely and links back to them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
parent f0cc7ba9c4
commit 7ee8272034

README.md 118
@@ -397,123 +397,17 @@ The recommended end-to-end workflow for training an RL operator is an iterative
 └─────────────────────┘
 ```
 
-### Step 1 — Human dataset collection
-
-Start `NuconModelLearner` before or during your play session. Try to cover a wide range of reactor states: startup from cold, ramping power up and down, adjusting individual rod banks, pump speed changes. Diversity in the dataset directly determines how accurate the simulator will be.
-
-```python
-from nucon.model import NuconModelLearner
-
-learner = NuconModelLearner(
-    dataset_path='reactor_dataset.pkl',
-    time_delta=10.0,  # 10 game-seconds per sample
-)
-learner.collect_data(num_steps=500, save_every=10)
-```
-
-The collector saves every 10 steps, retries automatically on game crashes, and scales wall-clock sleep with `GAME_SIM_SPEED` so samples are always 10 game-seconds apart regardless of simulation speed.
-
-### Step 2 — Initial model fitting
-
-```python
-from nucon.model import NuconModelLearner
-
-learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
-
-# Option A: kNN + GP (instant fit, built-in uncertainty estimation)
-learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
-learner.fit_knn(k=10)
-learner.save_model('reactor_knn.pkl')
-
-# Option B: Neural network (better extrapolation with larger datasets)
-learner.train_model(batch_size=32, num_epochs=50)
-learner.drop_well_fitted(error_threshold=1.0)  # keep hard samples for next round
-learner.save_model('reactor_nn.pth')
-```
-
-### Step 3 — Train RL in simulator
-
-Load the fitted model into the simulator and train with SAC + HER. The simulator runs orders of magnitude faster than the real game, allowing millions of steps in reasonable time.
-
-```python
-from nucon.sim import NuconSimulator, OperatingState
-from nucon.rl import NuconGoalEnv
-from stable_baselines3 import SAC
-from stable_baselines3.common.buffers import HerReplayBuffer
-
-simulator = NuconSimulator()
-simulator.load_model('reactor_knn.pkl')
-simulator.set_state(OperatingState.NOMINAL)
-
-env = NuconGoalEnv(
-    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
-    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
-    tolerance=0.05,
-    simulator=simulator,
-    seconds_per_step=10,
-)
-
-model = SAC(
-    'MultiInputPolicy', env,
-    replay_buffer_class=HerReplayBuffer,
-    replay_buffer_kwargs={'n_sampled_goal': 4, 'goal_selection_strategy': 'future'},
-    verbose=1,
-)
-model.learn(total_timesteps=500_000)
-model.save('rl_policy.zip')
-```
-
-### Step 4 — Eval in game + collect new data
-
-Run the trained policy against the real game. This validates whether the simulator was accurate enough, and simultaneously collects new data covering states the policy visits, which may be regions the original dataset missed.
-
-```python
-from nucon.rl import NuconGoalEnv
-from nucon.model import NuconModelLearner
-from stable_baselines3 import SAC
-import numpy as np
-
-# Load policy and run in real game
-env = NuconGoalEnv(
-    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
-    goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
-    seconds_per_step=10,
-)
-policy = SAC.load('rl_policy.zip')
-
-# Simultaneously collect new data
-new_data_learner = NuconModelLearner(
-    dataset_path='reactor_dataset_new.pkl',
-    time_delta=10.0,
-)
-
-obs, _ = env.reset()
-for _ in range(200):
-    action, _ = policy.predict(obs, deterministic=True)
-    obs, reward, terminated, truncated, _ = env.step(action)
-    if terminated or truncated:
-        obs, _ = env.reset()
-```
-
-### Step 5 — Refit model on expanded data
-
-Merge the new data into the original dataset and refit:
-
-```python
-learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
-learner.merge_datasets('reactor_dataset_new.pkl')
-
-# Prune redundant samples before refitting
-learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
-print(f"Dataset size after pruning: {len(learner.dataset)}")
-
-learner.fit_knn(k=10)
-learner.save_model('reactor_knn.pkl')
-```
-
-Then go back to Step 3 with the improved model. Each iteration the simulator gets more accurate, the policy gets better, and the new data collection explores increasingly interesting regions of state space.
-
-**When to stop**: when the policy performs well in the real game and the kNN uncertainty stays low throughout an episode (indicating the policy stays within the known data distribution).
+**Step 1 — Human dataset collection**: Run `NuconModelLearner.collect_data()` during your play session. Cover a wide range of states: startup from cold, ramping power, individual rod bank adjustments. Diversity in the dataset directly determines simulator accuracy. See [Model Learning](#model-learning-work-in-progress) for collection details.
+
+**Step 2 — Initial model fitting**: Fit a kNN model (instant) or NN (better extrapolation with larger datasets) using `fit_knn()` or `train_model()`. Prune near-duplicate samples with `drop_redundant()` before fitting. See [Model Learning](#model-learning-work-in-progress).
+
+**Step 3 — Train RL in simulator**: Load the fitted model into `NuconSimulator`, then train a `NuconGoalEnv` policy with SAC + HER. The simulator runs far faster than the real game, allowing many trajectories in reasonable time. See [NuconGoalEnv + HER Usage](#nucongoalenv--her-usage).
+
+**Step 4 — Eval in game + collect new data**: Run the trained policy against the real game. This validates simulator accuracy and simultaneously collects new data from states the policy visits, which may be regions the original dataset missed. Run a second `NuconModelLearner` in a background thread to collect concurrently.
+
+**Step 5 — Refit model on expanded data**: Merge new data into the original dataset with `merge_datasets()`, prune with `drop_redundant()`, and refit. Then return to Step 3 with the improved model. Each iteration the simulator gets more accurate and the policy improves.
+
+Stop when the policy performs well in the real game and kNN uncertainty stays low throughout an episode, indicating the policy stays within the known data distribution.
 
 ## Testing
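For intuition about the pruning step, the kind of near-duplicate filtering that `drop_redundant(min_state_distance=…, min_output_distance=…)` performs can be sketched in plain Python. This is an illustrative greedy sketch, not nucon's actual implementation:

```python
from math import dist

def drop_redundant(states, outputs, min_state_distance, min_output_distance):
    """Greedily keep a sample only if no already-kept sample has BOTH a
    nearby state and a nearby output. Illustrative sketch, not nucon's code."""
    kept = []
    for i in range(len(states)):
        redundant = any(
            dist(states[i], states[j]) < min_state_distance
            and dist(outputs[i], outputs[j]) < min_output_distance
            for j in kept
        )
        if not redundant:
            kept.append(i)
    return kept

# Three near-identical samples plus one distinct sample: two survive.
states = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.02), (5.0, 5.0)]
outputs = [(1.0,), (1.01,), (0.99,), (9.0,)]
print(drop_redundant(states, outputs, 0.1, 0.05))  # → [0, 3]
```

Note that a sample whose state is near a kept one but whose output differs by more than `min_output_distance` survives, since it carries new dynamics information.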
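As background for the SAC + HER training step, the 'future' goal-relabelling that `HerReplayBuffer` applies can be sketched independently. This is an illustrative sketch of the idea, not stable-baselines3's implementation; the transition tuple layout here is hypothetical:

```python
import random

def her_relabel(episode, n_sampled_goal=4):
    """For each transition (obs, action, achieved_goal, goal), also emit
    n_sampled_goal copies whose goal is replaced by an achieved goal from a
    later step of the same episode (the 'future' strategy). Failed episodes
    thus still yield transitions with positive reward signal."""
    relabeled = []
    for t, (obs, action, achieved, goal) in enumerate(episode):
        relabeled.append((obs, action, achieved, goal, achieved == goal))
        for _ in range(n_sampled_goal):
            future_t = random.randint(t, len(episode) - 1)
            new_goal = episode[future_t][2]  # goal achieved later in the episode
            relabeled.append((obs, action, achieved, new_goal, achieved == new_goal))
    return relabeled
```

Each stored transition becomes `1 + n_sampled_goal` transitions, which matches the `n_sampled_goal=4` ratio used in the Step 3 example.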