fix: add objectives support to NuconGoalEnv; fix README uncertainty example

- NuconGoalEnv now accepts objectives/objective_weights; additive on top
  of the goal reward, same interface as NuconEnv
- README: use UncertaintyPenalty/UncertaintyAbort correctly (via objectives
  and terminators, not as constructor params that don't exist)
- Step 3 prose updated to reference composable callables

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dominik Moritz Roth 2026-03-12 18:55:16 +01:00
parent f4d45d3cfd
commit 36a33e74e5
2 changed files with 14 additions and 2 deletions

@@ -196,10 +196,11 @@ env.close()
 HER works by relabelling past trajectories with the goal that was *actually achieved*, turning every episode into useful training signal even when the agent never reaches the intended target. This makes it much more sample-efficient than standard RL for goal-reaching tasks. This matters a lot given how slow the real game is.
 ```python
-from nucon.rl import NuconGoalEnv
+from nucon.rl import NuconGoalEnv, UncertaintyPenalty, UncertaintyAbort
 from stable_baselines3 import SAC
 from stable_baselines3.common.buffers import HerReplayBuffer
 env = NuconGoalEnv(
     goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
     goal_range={
@@ -210,6 +211,11 @@ env = NuconGoalEnv(
     tolerance=0.05, # sparse: within 5% of range counts as success (recommended with HER)
     seconds_per_step=5,
     simulator=simulator, # use a pre-trained simulator for fast pre-training
+    # Keep policy within the simulator's known data distribution.
+    # SIM_UNCERTAINTY (kNN-GP posterior std) is injected into obs when a simulator is active.
+    # Tune start/scale/threshold to taste.
+    objectives=[UncertaintyPenalty(start=0.3, scale=1.0)], # L2 penalty above soft threshold
+    terminators=[UncertaintyAbort(threshold=0.7)], # abort episode at hard threshold
 )
 # Or use a preset: env = gym.make('Nucon-goal_power-v0', simulator=simulator)
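The relabelling trick the README paragraph describes can be sketched in a few lines of plain Python (this is illustrative only, not nucon or stable-baselines3 code; SB3's `HerReplayBuffer` does this internally on stored transitions):

```python
import numpy as np

def sparse_reward(achieved, desired, tolerance=0.05):
    # Success iff every goal dimension is within `tolerance` of the target,
    # mirroring the sparse/tolerance reward style used above.
    return 0.0 if np.all(np.abs(achieved - desired) <= tolerance) else -1.0

desired = np.array([0.9, 0.9])
achieved = np.array([0.4, 0.5])  # the episode missed the intended target

# Against the original goal the transition carries no learning signal...
original = sparse_reward(achieved, desired)      # -1.0
# ...but relabelled with the goal that was actually achieved, it becomes
# a successful example of reaching *some* goal.
relabelled = sparse_reward(achieved, achieved.copy())  # 0.0
```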
@@ -405,7 +411,7 @@ The recommended end-to-end workflow for training an RL operator is an iterative
 **Step 2 — Initial model fitting**: Fit a kNN-GP model (instant) or NN (better extrapolation with larger datasets) using `fit_knn()` or `train_model()`. Prune near-duplicate samples with `drop_redundant()` before fitting. See [Model Learning](#model-learning).
-**Step 3 — Train RL in simulator**: Load the fitted model into `NuconSimulator`, then train a `NuconGoalEnv` policy with SAC + HER. The simulator runs far faster than the real game, allowing many trajectories in reasonable time. Use `uncertainty_penalty_start` and `uncertainty_abort` on the env to discourage the policy from wandering into regions the model hasn't seen: a linear penalty kicks in above the soft threshold, and the episode is truncated at the hard threshold. This keeps training within the reliable part of the model's knowledge. See [NuconGoalEnv + HER Usage](#nucongoalenv--her-usage).
+**Step 3 — Train RL in simulator**: Load the fitted model into `NuconSimulator`, then train a `NuconGoalEnv` policy with SAC + HER. The simulator runs far faster than the real game, allowing many trajectories in reasonable time. Pass `UncertaintyPenalty` and `UncertaintyAbort` as objectives/terminators to discourage the policy from wandering into regions the model hasn't seen; `SIM_UNCERTAINTY` is automatically injected into the obs dict when a simulator is active. See [NuconGoalEnv + HER Usage](#nucongoalenv--her-usage).
 **Step 4 — Eval in game + collect new data**: Run the trained policy against the real game. This validates simulator accuracy and simultaneously collects new data from states the policy visits, which may be regions the original dataset missed. Run a second `NuconModelLearner` in a background thread to collect concurrently.
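As a mock-up of the two uncertainty callables used in Step 3 (the real classes live in `nucon.rl`; the names and parameters mirror the README example above, but the exact internal semantics here are an assumption — a quadratic penalty above a soft threshold, and a hard-threshold abort signal):

```python
class UncertaintyPenalty:
    """Assumed semantics: zero below `start`, negative L2 penalty above it."""
    def __init__(self, start=0.3, scale=1.0):
        self.start, self.scale = start, scale
    def __call__(self, obs):
        u = obs.get('SIM_UNCERTAINTY', 0.0)
        return -self.scale * max(0.0, u - self.start) ** 2

class UncertaintyAbort:
    """Assumed semantics: returns > 0 above the hard threshold, so an env
    configured with `terminate_above=0` ends the episode."""
    def __init__(self, threshold=0.7):
        self.threshold = threshold
    def __call__(self, obs):
        return float(obs.get('SIM_UNCERTAINTY', 0.0) > self.threshold)

pen, abort = UncertaintyPenalty(), UncertaintyAbort()
# Below the soft threshold there is no penalty; above it, the penalty grows
# quadratically; past the hard threshold the terminator fires.
```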

@@ -280,6 +280,8 @@ class NuconGoalEnv(gym.Env):
         seconds_per_step=5,
         terminators=None,
         terminate_above=0,
+        objectives=None,
+        objective_weights=None,
     ):
         super().__init__()
@@ -349,6 +351,9 @@ class NuconGoalEnv(gym.Env):
         self.action_space = spaces.Dict(action_spaces)
         self._terminators = terminators or []
+        _objs = objectives or []
+        self._objectives = [Objectives[o] if isinstance(o, str) else o for o in _objs]
+        self._objective_weights = objective_weights or [1.0] * len(self._objectives)
         self._desired_goal = np.zeros(n_goals, dtype=np.float32)
         self._total_steps = 0
@@ -410,6 +415,7 @@ class NuconGoalEnv(gym.Env):
         info = {'achieved_goal': obs['achieved_goal'], 'desired_goal': obs['desired_goal'],
                 'obs': obs['observation']}
         reward = float(self.compute_reward(obs['achieved_goal'], obs['desired_goal'], info))
+        reward += sum(w * o(obs['observation']) for o, w in zip(self._objectives, self._objective_weights))
         terminated = any(t(obs['observation']) > self.terminate_above for t in self._terminators)
         truncated = False
         return obs, reward, terminated, truncated, info
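The added reward line can be exercised in isolation; a minimal sketch of the additive composition (toy objectives and numbers, not nucon code — weights default to 1.0 per objective, matching the `objective_weights or [1.0] * len(...)` fallback in the constructor):

```python
# Two toy objectives: a constant penalty and one that reads the observation.
objectives = [lambda obs: -0.25, lambda obs: obs['bonus']]
objective_weights = None  # not supplied, so each objective gets weight 1.0
weights = objective_weights or [1.0] * len(objectives)

obs = {'bonus': 0.5}
base_reward = -1.0  # e.g. a sparse goal reward for a missed goal

# Weighted objective terms are summed on top of the base goal reward.
reward = base_reward + sum(w * o(obs) for o, w in zip(objectives, weights))
# -1.0 + (-0.25) + 0.5 = -0.75
```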