feat: NuconGoalEnv, composable uncertainty helpers, kNN-GP naming

- Add NuconGoalEnv for goal-conditioned HER training (SAC + HER)
- Add UncertaintyPenalty and UncertaintyAbort composable callables;
  SIM_UNCERTAINTY injected into obs dict when simulator is active
- Fix rl.py: str-typed params crash, missing Enum import, write-only
  params in action space, broken step() iteration order
- Remove uncertainty state from sim (return value from update() instead)
- Rename kNN -> kNN-GP throughout README; add model selection note

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author: Dominik Moritz Roth
Date:   2026-03-12 18:51:13 +01:00
Parent: 1b93699501
Commit: f4d45d3cfd
2 changed files with 255 additions and 209 deletions


@@ -300,8 +300,8 @@ To address the challenge of unknown game dynamics, NuCon provides tools for coll
 - **Data Collection**: Gathers state transitions from human play or automated agents. `time_delta` is specified in game-time seconds; wall-clock sleep is automatically adjusted for `GAME_SIM_SPEED` so collected deltas are uniform regardless of simulation speed.
 - **Automatic param filtering**: Junk params (GAME_VERSION, TIME, ALARMS_ACTIVE, …) and params from uninstalled subsystems (returns `None`) are automatically excluded from model inputs/outputs.
-- **Two model backends**: Neural network (NN) or k-Nearest Neighbours with GP interpolation (kNN).
+- **Two model backends**: Neural network (NN) or a local Gaussian Process approximated via k-Nearest Neighbours (kNN-GP).
-- **Uncertainty estimation**: The kNN backend returns a GP posterior standard deviation alongside each prediction; 0 means the query lies on known data, ~1 means it is out of distribution.
+- **Uncertainty estimation**: The kNN-GP backend returns a GP posterior standard deviation alongside each prediction; 0 means the query lies on known data, ~1 means it is out of distribution.
 - **Dataset management**: Tools for saving, loading, merging, and pruning datasets.
 ### Additional Dependencies
@@ -310,12 +310,16 @@ To address the challenge of unknown game dynamics, NuCon provides tools for coll
 pip install -e '.[model]'
 ```
+### Model selection
+**kNN-GP** (the `ReactorKNNModel` backend) is a local Gaussian Process: it finds the `k` nearest neighbours in the training set, fits an RBF kernel on them, and returns a prediction plus a GP posterior std as uncertainty. It works well from a few hundred samples and requires no training. **NN** needs input normalisation and several thousand samples to generalise; use it once you have a large dataset. For initial experiments, start with kNN-GP (`k=10`).
 ### Usage
 ```python
 from nucon.model import NuconModelLearner
-# --- Data collection (model_type not needed here) ---
+# --- Data collection ---
 learner = NuconModelLearner(
     time_delta=10.0,             # 10 game-seconds per step (wall sleep auto-scales with sim speed)
     include_valve_states=False,  # set True to include all 53 valve positions as model inputs
@@ -333,13 +337,13 @@ nn_learner.train_model(batch_size=32, num_epochs=50) # creates NN model on firs
 nn_learner.drop_well_fitted(error_threshold=1.0)
 nn_learner.save_model('reactor_nn.pth')
-# --- kNN + GP backend ---
+# --- kNN-GP backend ---
 knn_learner = NuconModelLearner(dataset_path='reactor_dataset.pkl')
 # Drop near-duplicate samples before fitting (keeps diverse coverage).
 # A sample is dropped only if BOTH its input state AND output transition
 # are within the given distances of an already-kept sample.
 knn_learner.drop_redundant(min_state_distance=0.1, min_output_distance=0.05)
-knn_learner.fit_knn(k=10)  # creates kNN model on first call
+knn_learner.fit_knn(k=10)  # creates kNN-GP model on first call
 # Point prediction
 state = knn_learner._get_state()
@@ -399,7 +403,7 @@ The recommended end-to-end workflow for training an RL operator is an iterative
 **Step 1 — Human dataset collection**: Run `NuconModelLearner.collect_data()` during your play session. Cover a wide range of states: startup from cold, ramping power, individual rod bank adjustments. Diversity in the dataset directly determines simulator accuracy. See [Model Learning](#model-learning-work-in-progress) for collection details.
-**Step 2 — Initial model fitting**: Fit a kNN model (instant) or NN (better extrapolation with larger datasets) using `fit_knn()` or `train_model()`. Prune near-duplicate samples with `drop_redundant()` before fitting. See [Model Learning](#model-learning-work-in-progress).
+**Step 2 — Initial model fitting**: Fit a kNN-GP model (instant) or NN (better extrapolation with larger datasets) using `fit_knn()` or `train_model()`. Prune near-duplicate samples with `drop_redundant()` before fitting. See [Model Learning](#model-learning).
 **Step 3 — Train RL in simulator**: Load the fitted model into `NuconSimulator`, then train a `NuconGoalEnv` policy with SAC + HER. The simulator runs far faster than the real game, allowing many trajectories in reasonable time. Use `uncertainty_penalty_start` and `uncertainty_abort` on the env to discourage the policy from wandering into regions the model hasn't seen: a linear penalty kicks in above the soft threshold, and the episode is truncated at the hard threshold. This keeps training within the reliable part of the model's knowledge. See [NuconGoalEnv + HER Usage](#nucongoalenv--her-usage).
@@ -407,7 +411,7 @@ The recommended end-to-end workflow for training an RL operator is an iterative
 **Step 5 — Refit model on expanded data**: Merge new data into the original dataset with `merge_datasets()`, prune with `drop_redundant()`, and refit. Then return to Step 3 with the improved model. Each iteration the simulator gets more accurate and the policy improves.
-Stop when the policy performs well in the real game and kNN uncertainty stays low throughout an episode, indicating the policy stays within the known data distribution.
+Stop when the policy performs well in the real game and kNN-GP uncertainty stays low throughout an episode, indicating the policy stays within the known data distribution.
 ## Testing
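The local-GP mechanism the README attributes to the kNN-GP backend can be sketched as follows. This is a minimal illustration assuming scikit-learn; `knn_gp_predict` is a hypothetical stand-in, not NuCon's actual `ReactorKNNModel` API:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def knn_gp_predict(X_train, Y_train, x_query, k=10):
    """Fit a GP on only the k nearest neighbours of the query; return
    (mean prediction, posterior std averaged over output dims)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x_query.reshape(1, -1))
    X_local, Y_local = X_train[idx[0]], Y_train[idx[0]]
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(1e-6), normalize_y=True)
    gp.fit(X_local, Y_local)
    mean, std = gp.predict(x_query.reshape(1, -1), return_std=True)
    return mean[0], float(np.mean(std))

# A query on the training data yields low posterior std; a far-away
# (out-of-distribution) query yields a much higher one.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
Y = np.sin(X) @ np.ones((3, 2))  # arbitrary smooth 2-output "dynamics"
_, std_in = knn_gp_predict(X, Y, X[0])
_, std_out = knn_gp_predict(X, Y, np.array([10.0, 10.0, 10.0]))
```

This is also why the backend needs no training pass: all work happens at query time, at the cost of keeping the whole dataset in memory.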


@@ -1,11 +1,17 @@
+import inspect
 import gymnasium as gym
 from gymnasium import spaces
 import numpy as np
 import time
-from typing import Dict, Any
+from typing import Dict, Any, Callable, List, Optional
 from enum import Enum
 from nucon import Nucon, BreakerStatus, PumpStatus, PumpDryStatus, PumpOverloadStatus
+
+# ---------------------------------------------------------------------------
+# Reward / objective helpers
+# ---------------------------------------------------------------------------
 Objectives = {
     "null": lambda obs: 0,
     "max_power": lambda obs: obs["GENERATOR_0_KW"] + obs["GENERATOR_1_KW"] + obs["GENERATOR_2_KW"],
@@ -20,148 +26,58 @@ Parameterized_Objectives = {
     "constant": lambda constant: lambda obs: constant,
 }

-class NuconEnv(gym.Env):
-    metadata = {'render_modes': ['human']}
-
-    def __init__(self, nucon=None, simulator=None, render_mode=None, seconds_per_step=5, objectives=['null'], terminators=['null'], objective_weights=None, terminate_above=0,
-                 uncertainty_penalty_start: float = None, uncertainty_abort: float = None, uncertainty_penalty_scale: float = 1.0):
-        super().__init__()
-        self.render_mode = render_mode
-        self.seconds_per_step = seconds_per_step
-        if objective_weights is None:
-            objective_weights = [1.0 for objective in objectives]
-        self.objective_weights = objective_weights
-        self.terminate_above = terminate_above
-        self.simulator = simulator
-        self.uncertainty_penalty_start = uncertainty_penalty_start
-        self.uncertainty_abort = uncertainty_abort
-        self.uncertainty_penalty_scale = uncertainty_penalty_scale
-        if nucon is None:
-            if simulator:
-                nucon = Nucon(port=simulator.port)
-            else:
-                nucon = Nucon()
-        self.nucon = nucon
-
-        # Define observation space
-        obs_spaces = {'EPISODE_TIME': spaces.Box(low=0, high=np.inf, shape=(1,), dtype=np.float32)}
-        for param_id, param in self.nucon.get_all_readable().items():
-            sp = _build_param_space(param)
-            if sp is not None:
-                obs_spaces[param_id] = sp
-        self.observation_space = spaces.Dict(obs_spaces)
-
-        # Define action space (only controllable, non-cheat, readable-back params)
-        action_spaces = {}
-        for param_id, param in self.nucon.get_all_writable().items():
-            if not param.is_readable or param.is_cheat:
-                continue  # write-only (VALVE_OPEN/CLOSE, SCRAM, etc.) and cheat params excluded
-            sp = _build_param_space(param)
-            if sp is not None:
-                action_spaces[param_id] = sp
-        self.action_space = spaces.Dict(action_spaces)
-
-        self.objectives = []
-        self.terminators = []
-        for objective in objectives:
-            if objective in Objectives:
-                self.objectives.append(Objectives[objective])
-            elif callable(objective):
-                self.objectives.append(objective)
-            else:
-                raise ValueError(f"Unsupported objective: {objective}")
-        for terminator in terminators:
-            if terminator in Objectives:
-                self.terminators.append(Objectives[terminator])
-            elif callable(terminator):
-                self.terminators.append(terminator)
-            else:
-                raise ValueError(f"Unsupported terminator: {terminator}")
-
-    def _get_obs(self):
-        obs = {}
-        for param_id, param in self.nucon.get_all_readable().items():
-            if param.param_type == str or param_id not in self.observation_space.spaces:
-                continue
-            value = self.nucon.get(param_id)
-            if isinstance(value, Enum):
-                value = value.value
-            obs[param_id] = value
-        obs["EPISODE_TIME"] = self._total_steps * self.seconds_per_step
-        return obs
-
-    def _get_info(self):
-        info = {'objectives': {}, 'objectives_weighted': {}}
-        for objective, weight in zip(self.objectives, self.objective_weights):
-            obj = objective(self._get_obs())
-            info['objectives'][objective.__name__] = obj
-            info['objectives_weighted'][objective.__name__] = obj * weight
-        return info
-
-    def reset(self, seed=None, options=None):
-        super().reset(seed=seed)
-        self._total_steps = 0
-        observation = self._get_obs()
-        info = self._get_info()
-        return observation, info
-
-    def step(self, action):
-        # Apply the action to the Nucon system
-        for param_id, value in action.items():
-            param = self.nucon._parameters[param_id]
-            if issubclass(param.param_type, Enum):
-                value = param.param_type(int(np.asarray(value).flat[0]))
-            else:
-                value = param.param_type(np.asarray(value).flat[0])
-            if param.min_val is not None and param.max_val is not None:
-                value = np.clip(value, param.min_val, param.max_val)
-            self.nucon.set(param, value)
-
-        observation = self._get_obs()
-        terminated = np.sum([terminator(observation) for terminator in self.terminators]) > self.terminate_above
-        truncated = False
-        info = self._get_info()
-        reward = sum(obj for obj in info['objectives_weighted'].values())
-        self._total_steps += 1
-
-        if self.simulator:
-            needs_uncertainty = self.uncertainty_penalty_start is not None or self.uncertainty_abort is not None
-            uncertainty = self.simulator.update(self.seconds_per_step, return_uncertainty=needs_uncertainty)
-            if uncertainty is not None:
-                if self.uncertainty_abort is not None and uncertainty >= self.uncertainty_abort:
-                    truncated = True
-                if self.uncertainty_penalty_start is not None and uncertainty > self.uncertainty_penalty_start:
-                    reward -= self.uncertainty_penalty_scale * (uncertainty - self.uncertainty_penalty_start)
-        else:
-            sim_speed = self.nucon.GAME_SIM_SPEED.value or 1.0
-            time.sleep(self.seconds_per_step / sim_speed)
-
-        return observation, reward, terminated, truncated, info
-
-    def render(self):
-        if self.render_mode == "human":
-            pass
-
-    def close(self):
-        pass
-
-    def _flatten_action(self, action):
-        return np.concatenate([v.flatten() for v in action.values()])
-
-    def _unflatten_action(self, flat_action):
-        return {k: v.reshape(1, -1) for k, v in self.action_space.items()}
-
-    def _flatten_observation(self, observation):
-        return np.concatenate([v.flatten() for v in observation.values()])
-
-    def _unflatten_observation(self, flat_observation):
-        return {k: v.reshape(1, -1) for k, v in self.observation_space.items()}
+def UncertaintyPenalty(start: float = 0.3, scale: float = 1.0, mode: str = 'l2') -> Callable:
+    """Objective that penalises high simulator uncertainty.
+
+    Returns a callable ``(obs) -> float`` suitable for use as an objective or
+    terminator in NuconEnv / NuconGoalEnv. Works because ``SIM_UNCERTAINTY``
+    is injected into the obs dict whenever a simulator is active.
+
+    Args:
+        start: uncertainty level at which the penalty starts (default 0.3).
+        scale: penalty coefficient.
+        mode: ``'l2'`` (quadratic, default) or ``'linear'``.
+
+    Example::
+
+        env = NuconEnv(
+            objectives=['max_power', UncertaintyPenalty(start=0.3, scale=2.0)],
+            objective_weights=[1.0, 1.0],
+            simulator=simulator,
+        )
+    """
+    excess = lambda obs: max(0.0, obs.get('SIM_UNCERTAINTY', 0.0) - start)
+    if mode == 'l2':
+        return lambda obs: -scale * excess(obs) ** 2
+    elif mode == 'linear':
+        return lambda obs: -scale * excess(obs)
+    else:
+        raise ValueError(f"Unknown mode '{mode}'. Use 'l2' or 'linear'.")
+
+
+def UncertaintyAbort(threshold: float = 0.7) -> Callable:
+    """Terminator that aborts the episode when simulator uncertainty is too high.
+
+    Returns a callable ``(obs) -> float`` for use as a *terminator*. When
+    the GP posterior std exceeds ``threshold`` the episode is ended
+    (``terminated=True``).
+
+    Example::
+
+        env = NuconEnv(
+            objectives=['max_power'],
+            terminators=[UncertaintyAbort(threshold=0.7)],
+            terminate_above=0,
+            simulator=simulator,
+        )
+    """
+    return lambda obs: 1.0 if obs.get('SIM_UNCERTAINTY', 0.0) >= threshold else 0.0
+
+
+# ---------------------------------------------------------------------------
+# Internal helpers
+# ---------------------------------------------------------------------------

 def _build_param_space(param):
     """Return a gymnasium Box for a single NuconParameter, or None if unsupported."""
@@ -180,30 +96,174 @@ def _build_param_space(param):
     return None

+def _apply_action(nucon, action):
+    for param_id, value in action.items():
+        param = nucon._parameters[param_id]
+        if issubclass(param.param_type, Enum):
+            value = param.param_type(int(np.asarray(value).flat[0]))
+        else:
+            value = param.param_type(np.asarray(value).flat[0])
+        if param.min_val is not None and param.max_val is not None:
+            value = np.clip(value, param.min_val, param.max_val)
+        nucon.set(param, value)
+
+
+# ---------------------------------------------------------------------------
+# NuconEnv
+# ---------------------------------------------------------------------------
+
+class NuconEnv(gym.Env):
+    metadata = {'render_modes': ['human']}
+
+    def __init__(self, nucon=None, simulator=None, render_mode=None, seconds_per_step=5,
+                 objectives=['null'], terminators=['null'], objective_weights=None, terminate_above=0):
+        super().__init__()
+        self.render_mode = render_mode
+        self.seconds_per_step = seconds_per_step
+        if objective_weights is None:
+            objective_weights = [1.0 for _ in objectives]
+        self.objective_weights = objective_weights
+        self.terminate_above = terminate_above
+        self.simulator = simulator
+        if nucon is None:
+            nucon = Nucon(port=simulator.port) if simulator else Nucon()
+        self.nucon = nucon
+
+        # Observation space — SIM_UNCERTAINTY included when a simulator is present
+        obs_spaces = {'EPISODE_TIME': spaces.Box(low=0, high=np.inf, shape=(1,), dtype=np.float32)}
+        if simulator is not None:
+            obs_spaces['SIM_UNCERTAINTY'] = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)
+        for param_id, param in self.nucon.get_all_readable().items():
+            sp = _build_param_space(param)
+            if sp is not None:
+                obs_spaces[param_id] = sp
+        self.observation_space = spaces.Dict(obs_spaces)
+
+        # Action space
+        action_spaces = {}
+        for param_id, param in self.nucon.get_all_writable().items():
+            if not param.is_readable or param.is_cheat:
+                continue
+            sp = _build_param_space(param)
+            if sp is not None:
+                action_spaces[param_id] = sp
+        self.action_space = spaces.Dict(action_spaces)
+
+        self.objectives = []
+        self.terminators = []
+        for objective in objectives:
+            if objective in Objectives:
+                self.objectives.append(Objectives[objective])
+            elif callable(objective):
+                self.objectives.append(objective)
+            else:
+                raise ValueError(f"Unsupported objective: {objective}")
+        for terminator in terminators:
+            if terminator in Objectives:
+                self.terminators.append(Objectives[terminator])
+            elif callable(terminator):
+                self.terminators.append(terminator)
+            else:
+                raise ValueError(f"Unsupported terminator: {terminator}")
+
+    def _get_obs(self, sim_uncertainty=None):
+        obs = {}
+        for param_id, param in self.nucon.get_all_readable().items():
+            if param.param_type == str or param_id not in self.observation_space.spaces:
+                continue
+            value = self.nucon.get(param_id)
+            if isinstance(value, Enum):
+                value = value.value
+            obs[param_id] = value
+        obs['EPISODE_TIME'] = self._total_steps * self.seconds_per_step
+        if 'SIM_UNCERTAINTY' in self.observation_space.spaces:
+            obs['SIM_UNCERTAINTY'] = sim_uncertainty if sim_uncertainty is not None else 0.0
+        return obs
+
+    def _get_info(self, obs):
+        info = {'objectives': {}, 'objectives_weighted': {}}
+        for objective, weight in zip(self.objectives, self.objective_weights):
+            obj = objective(obs)
+            name = getattr(objective, '__name__', repr(objective))
+            info['objectives'][name] = obj
+            info['objectives_weighted'][name] = obj * weight
+        return info
+
+    def reset(self, seed=None, options=None):
+        super().reset(seed=seed)
+        self._total_steps = 0
+        observation = self._get_obs()
+        return observation, self._get_info(observation)
+
+    def step(self, action):
+        _apply_action(self.nucon, action)
+
+        # Advance sim (or sleep) — get uncertainty for obs injection
+        truncated = False
+        uncertainty = None
+        if self.simulator:
+            uncertainty = self.simulator.update(self.seconds_per_step, return_uncertainty=True)
+        else:
+            sim_speed = self.nucon.GAME_SIM_SPEED.value or 1.0
+            time.sleep(self.seconds_per_step / sim_speed)
+
+        self._total_steps += 1
+        observation = self._get_obs(sim_uncertainty=uncertainty)
+        info = self._get_info(observation)
+        reward = sum(obj for obj in info['objectives_weighted'].values())
+        terminated = np.sum([t(observation) for t in self.terminators]) > self.terminate_above
+        return observation, reward, terminated, truncated, info
+
+    def render(self):
+        pass
+
+    def close(self):
+        pass
+
+    def _flatten_observation(self, observation):
+        return np.concatenate([np.asarray(v).flatten() for v in observation.values()])
+
+
+# ---------------------------------------------------------------------------
+# NuconGoalEnv
+# ---------------------------------------------------------------------------

 class NuconGoalEnv(gym.Env):
     """
     Goal-conditioned reactor environment compatible with SB3 HER (Hindsight Experience Replay).

-    The observation is a Dict with three keys as required by GoalEnv / HER:
-    - 'observation': all readable non-goal, non-str params (same encoding as NuconEnv)
+    Observation is a Dict with three keys:
+    - 'observation': all readable non-goal, non-str params + SIM_UNCERTAINTY (when sim active)
     - 'achieved_goal': current values of goal_params, normalised to [0, 1] within goal_range
     - 'desired_goal': target values sampled each episode, normalised to [0, 1]

-    Reward defaults to negative L2 distance in the normalised goal space (dense).
-    Pass ``tolerance`` for a sparse {0, -1} reward (0 = within tolerance).
+    ``SIM_UNCERTAINTY`` in 'observation' lets reward_fn / terminators reference uncertainty directly.
+    reward_fn signature: ``(achieved, desired)`` or ``(achieved, desired, obs)``; the 3-arg form
+    receives the full observation dict (including SIM_UNCERTAINTY) for uncertainty-aware shaping.

    Usage with SB3 HER::

        from stable_baselines3 import SAC
        from stable_baselines3.common.buffers import HerReplayBuffer
+       from nucon.rl import NuconGoalEnv, UncertaintyPenalty, UncertaintyAbort

        env = NuconGoalEnv(
            goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
            goal_range={'GENERATOR_0_KW': (0, 1200), 'GENERATOR_1_KW': (0, 1200), 'GENERATOR_2_KW': (0, 1200)},
+           tolerance=0.05,
            simulator=simulator,
+           # uncertainty-aware reward: penalise OOD, abort if too far out
+           reward_fn=lambda ag, dg, obs: (
+               -(np.linalg.norm(ag - dg) ** 2)
+               - 2.0 * max(0, obs.get('SIM_UNCERTAINTY', 0) - 0.3) ** 2
+           ),
+           terminators=[UncertaintyAbort(threshold=0.7)],
        )
        model = SAC('MultiInputPolicy', env, replay_buffer_class=HerReplayBuffer)
-       model.learn(total_timesteps=200_000)
+       model.learn(total_timesteps=500_000)
    """

    metadata = {'render_modes': ['human']}
@@ -220,9 +280,6 @@ class NuconGoalEnv(gym.Env):
         seconds_per_step=5,
         terminators=None,
         terminate_above=0,
-        uncertainty_penalty_start: float = None,
-        uncertainty_abort: float = None,
-        uncertainty_penalty_scale: float = 1.0,
     ):
         super().__init__()
@@ -238,8 +295,6 @@ class NuconGoalEnv(gym.Env):
         self.nucon = nucon
         all_readable = self.nucon.get_all_readable()

-        # Validate goal params and build per-param range arrays
         for pid in self.goal_params:
             if pid not in all_readable:
                 raise ValueError(f"Goal param '{pid}' is not a readable parameter")
@@ -254,13 +309,21 @@ class NuconGoalEnv(gym.Env):
             for pid in self.goal_params
         ], dtype=np.float32)
         self._goal_range = self._goal_high - self._goal_low
-        self._goal_range[self._goal_range == 0] = 1.0  # avoid div-by-zero
-        self._reward_fn = reward_fn  # callable(achieved_norm, desired_norm) -> float, or None
+        self._goal_range[self._goal_range == 0] = 1.0
+        # Detect reward_fn arity for backward compat (2-arg vs 3-arg)
+        self._reward_fn = reward_fn
+        if reward_fn is not None:
+            n_args = len(inspect.signature(reward_fn).parameters)
+            self._reward_fn_wants_obs = n_args >= 3
+        else:
+            self._reward_fn_wants_obs = False

-        # Observation subspace: all readable non-str non-goal params
+        # Observation subspace
         goal_set = set(self.goal_params)
         obs_spaces = {'EPISODE_TIME': spaces.Box(low=0, high=np.inf, shape=(1,), dtype=np.float32)}
+        if simulator is not None:
+            obs_spaces['SIM_UNCERTAINTY'] = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)
         for param_id, param in all_readable.items():
             if param_id in goal_set:
                 continue
@@ -275,7 +338,7 @@ class NuconGoalEnv(gym.Env):
             'desired_goal': spaces.Box(low=0.0, high=1.0, shape=(n_goals,), dtype=np.float32),
         })

-        # Action space: readable-back, non-cheat writable params
+        # Action space
         action_spaces = {}
         for param_id, param in self.nucon.get_all_writable().items():
             if not param.is_readable or param.is_cheat:
@@ -285,26 +348,16 @@ class NuconGoalEnv(gym.Env):
                 action_spaces[param_id] = sp
         self.action_space = spaces.Dict(action_spaces)

-        # Terminators
         self._terminators = terminators or []
-        self.uncertainty_penalty_start = uncertainty_penalty_start
-        self.uncertainty_abort = uncertainty_abort
-        self.uncertainty_penalty_scale = uncertainty_penalty_scale

         self._desired_goal = np.zeros(n_goals, dtype=np.float32)
         self._total_steps = 0

-    # ------------------------------------------------------------------
-    # GoalEnv interface
-    # ------------------------------------------------------------------
     def compute_reward(self, achieved_goal, desired_goal, info):
-        """
-        Dense: negative L2 in normalised goal space (each dim in [0,1]).
-        Sparse when tolerance is set: 0 if within tolerance, -1 otherwise.
-        Custom reward_fn overrides both.
-        """
+        """Dense negative L2, sparse with tolerance, or custom reward_fn."""
+        obs = info.get('obs', {}) if isinstance(info, dict) else {}
         if self._reward_fn is not None:
+            if self._reward_fn_wants_obs:
+                return self._reward_fn(achieved_goal, desired_goal, obs)
             return self._reward_fn(achieved_goal, desired_goal)
         dist = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
         if self.tolerance is not None:
@@ -312,13 +365,13 @@ class NuconGoalEnv(gym.Env):
         return -dist

     def _read_goal_values(self):
-        raw = np.array([
-            self.nucon.get(pid) or 0.0 for pid in self.goal_params
-        ], dtype=np.float32)
+        raw = np.array([self.nucon.get(pid) or 0.0 for pid in self.goal_params], dtype=np.float32)
         return np.clip((raw - self._goal_low) / self._goal_range, 0.0, 1.0)

-    def _get_obs_dict(self):
+    def _get_obs_dict(self, sim_uncertainty=None):
         obs = {'EPISODE_TIME': float(self._total_steps * self.seconds_per_step)}
+        if 'SIM_UNCERTAINTY' in self.observation_space['observation'].spaces:
+            obs['SIM_UNCERTAINTY'] = sim_uncertainty if sim_uncertainty is not None else 0.0
         goal_set = set(self.goal_params)
         for param_id, param in self.nucon.get_all_readable().items():
             if param_id in goal_set or param_id not in self.observation_space['observation'].spaces:
@@ -337,44 +390,28 @@ class NuconGoalEnv(gym.Env):
     def reset(self, seed=None, options=None):
         super().reset(seed=seed)
         self._total_steps = 0
-        # Sample a new goal uniformly from the goal range
         rng = np.random.default_rng(seed)
         self._desired_goal = rng.uniform(0.0, 1.0, size=len(self.goal_params)).astype(np.float32)
-
-        obs = self._get_obs_dict()
-        return obs, {}
+        return self._get_obs_dict(), {}

     def step(self, action):
-        for param_id, value in action.items():
-            param = self.nucon._parameters[param_id]
-            if issubclass(param.param_type, Enum):
-                value = param.param_type(int(np.asarray(value).flat[0]))
-            else:
-                value = param.param_type(np.asarray(value).flat[0])
-            if param.min_val is not None and param.max_val is not None:
-                value = np.clip(value, param.min_val, param.max_val)
-            self.nucon.set(param, value)
-
-        obs = self._get_obs_dict()
-        reward = float(self.compute_reward(obs['achieved_goal'], obs['desired_goal'], {}))
-        terminated = any(t(obs['observation']) > self.terminate_above for t in self._terminators)
-        truncated = False
-        info = {'achieved_goal': obs['achieved_goal'], 'desired_goal': obs['desired_goal']}
-        self._total_steps += 1
+        _apply_action(self.nucon, action)

+        # Advance sim (or sleep)
+        uncertainty = None
         if self.simulator:
-            needs_uncertainty = self.uncertainty_penalty_start is not None or self.uncertainty_abort is not None
-            uncertainty = self.simulator.update(self.seconds_per_step, return_uncertainty=needs_uncertainty)
-            if uncertainty is not None:
-                if self.uncertainty_abort is not None and uncertainty >= self.uncertainty_abort:
-                    truncated = True
-                if self.uncertainty_penalty_start is not None and uncertainty > self.uncertainty_penalty_start:
-                    reward -= self.uncertainty_penalty_scale * (uncertainty - self.uncertainty_penalty_start)
+            uncertainty = self.simulator.update(self.seconds_per_step, return_uncertainty=True)
         else:
             sim_speed = self.nucon.GAME_SIM_SPEED.value or 1.0
             time.sleep(self.seconds_per_step / sim_speed)

+        self._total_steps += 1
+        obs = self._get_obs_dict(sim_uncertainty=uncertainty)
+        info = {'achieved_goal': obs['achieved_goal'], 'desired_goal': obs['desired_goal'],
+                'obs': obs['observation']}
+        reward = float(self.compute_reward(obs['achieved_goal'], obs['desired_goal'], info))
+        terminated = any(t(obs['observation']) > self.terminate_above for t in self._terminators)
+        truncated = False
         return obs, reward, terminated, truncated, info

     def render(self):
@@ -384,6 +421,10 @@ class NuconGoalEnv(gym.Env):
         pass

+
+# ---------------------------------------------------------------------------
+# Registration
+# ---------------------------------------------------------------------------

 def register_nucon_envs():
     gym.register(
         id='Nucon-max_power-v0',
@@ -398,9 +439,11 @@ def register_nucon_envs():
     gym.register(
         id='Nucon-safe_max_power-v0',
         entry_point='nucon.rl:NuconEnv',
-        kwargs={'seconds_per_step': 5, 'objectives': [Parameterized_Objectives['temp_above'](min_temp=310), Parameterized_Objectives['temp_below'](max_temp=365), 'max_power'], 'objective_weights': [1, 10, 1/100_000]}
+        kwargs={'seconds_per_step': 5,
+                'objectives': [Parameterized_Objectives['temp_above'](min_temp=310),
+                               Parameterized_Objectives['temp_below'](max_temp=365), 'max_power'],
+                'objective_weights': [1, 10, 1/100_000]}
     )
-    # Goal-conditioned: target total generator output (train with HER)
     gym.register(
         id='Nucon-goal_power-v0',
         entry_point='nucon.rl:NuconGoalEnv',
@@ -410,7 +453,6 @@ def register_nucon_envs():
         'seconds_per_step': 5,
         }
     )
-    # Goal-conditioned: target core temperature (train with HER)
     gym.register(
         id='Nucon-goal_temp-v0',
         entry_point='nucon.rl:NuconGoalEnv',
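As a concrete check of the shapes the commit introduces (a penalty growing above the soft threshold, a hard cut-off at the abort threshold), the two helpers from this diff reduce to the following standalone restatement with worked numbers:

```python
# Standalone restatement of UncertaintyPenalty / UncertaintyAbort from this
# commit, to make the reward/termination shapes concrete.
def uncertainty_penalty(start=0.3, scale=1.0, mode='l2'):
    excess = lambda obs: max(0.0, obs.get('SIM_UNCERTAINTY', 0.0) - start)
    if mode == 'l2':
        return lambda obs: -scale * excess(obs) ** 2
    if mode == 'linear':
        return lambda obs: -scale * excess(obs)
    raise ValueError(f"Unknown mode '{mode}'. Use 'l2' or 'linear'.")

def uncertainty_abort(threshold=0.7):
    return lambda obs: 1.0 if obs.get('SIM_UNCERTAINTY', 0.0) >= threshold else 0.0

penalty = uncertainty_penalty(start=0.3, scale=2.0)
abort = uncertainty_abort(threshold=0.7)

p_low = penalty({'SIM_UNCERTAINTY': 0.2})   # 0.0: below the soft threshold, no penalty
p_high = penalty({'SIM_UNCERTAINTY': 0.5})  # about -0.08: -2.0 * (0.5 - 0.3)**2
tripped = abort({'SIM_UNCERTAINTY': 0.8})   # 1.0: terminator fires, episode ends
```

Because both helpers read `SIM_UNCERTAINTY` from the obs dict via `.get(..., 0.0)`, they are no-ops when no simulator is active, which is what makes them safe to leave in an env config used against the real game.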