fix: add objectives support to NuconGoalEnv; fix README uncertainty example
- NuconGoalEnv now accepts objectives/objective_weights; additive on top of the goal reward, same interface as NuconEnv
- README: use UncertaintyPenalty/UncertaintyAbort correctly (via objectives and terminators, not as constructor params that don't exist)
- Step 3 prose updated to reference composable callables

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent f4d45d3cfd
commit 36a33e74e5
README.md (10 changes)
@@ -196,10 +196,11 @@ env.close()
HER works by relabelling past trajectories with the goal that was *actually achieved*, turning every episode into useful training signal even when the agent never reaches the intended target. This makes it much more sample-efficient than standard RL for goal-reaching tasks, which matters a lot given how slow the real game is.

```python
-from nucon.rl import NuconGoalEnv
+from nucon.rl import NuconGoalEnv, UncertaintyPenalty, UncertaintyAbort
from stable_baselines3 import SAC
from stable_baselines3.common.buffers import HerReplayBuffer

env = NuconGoalEnv(
    goal_params=['GENERATOR_0_KW', 'GENERATOR_1_KW', 'GENERATOR_2_KW'],
    goal_range={
@@ -210,6 +211,11 @@ env = NuconGoalEnv(
    tolerance=0.05,  # sparse: within 5% of range counts as success (recommended with HER)
    seconds_per_step=5,
    simulator=simulator,  # use a pre-trained simulator for fast pre-training
+    # Keep policy within the simulator's known data distribution.
+    # SIM_UNCERTAINTY (kNN-GP posterior std) is injected into obs when a simulator is active.
+    # Tune start/scale/threshold to taste.
+    objectives=[UncertaintyPenalty(start=0.3, scale=1.0)],  # L2 penalty above soft threshold
+    terminators=[UncertaintyAbort(threshold=0.7)],  # abort episode at hard threshold
)
# Or use a preset: env = gym.make('Nucon-goal_power-v0', simulator=simulator)
@@ -405,7 +411,7 @@ The recommended end-to-end workflow for training an RL operator is an iterative

**Step 2 — Initial model fitting**: Fit a kNN-GP model (instant) or NN (better extrapolation with larger datasets) using `fit_knn()` or `train_model()`. Prune near-duplicate samples with `drop_redundant()` before fitting. See [Model Learning](#model-learning).
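As a rough illustration of what near-duplicate pruning accomplishes before fitting (a hypothetical stand-in, not the library's `drop_redundant()` implementation), a greedy distance filter might look like:

```python
import numpy as np

# Hypothetical stand-in for drop_redundant(): greedily keep a sample only if
# it lies farther than min_dist from every sample already kept.
def prune_near_duplicates(X, min_dist=0.1):
    kept = []
    for i, x in enumerate(X):
        if all(np.linalg.norm(x - X[j]) > min_dist for j in kept):
            kept.append(i)
    return X[kept]

X = np.array([[0.0, 0.0], [0.0, 0.001], [1.0, 1.0]])
pruned = prune_near_duplicates(X)  # the second row is dropped as a near-duplicate
```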
-**Step 3 — Train RL in simulator**: Load the fitted model into `NuconSimulator`, then train a `NuconGoalEnv` policy with SAC + HER. The simulator runs far faster than the real game, allowing many trajectories in reasonable time. Use `uncertainty_penalty_start` and `uncertainty_abort` on the env to discourage the policy from wandering into regions the model hasn't seen: a linear penalty kicks in above the soft threshold, and the episode is truncated at the hard threshold. This keeps training within the reliable part of the model's knowledge. See [NuconGoalEnv + HER Usage](#nucongoalenv--her-usage).
+**Step 3 — Train RL in simulator**: Load the fitted model into `NuconSimulator`, then train a `NuconGoalEnv` policy with SAC + HER. The simulator runs far faster than the real game, allowing many trajectories in reasonable time. Pass `UncertaintyPenalty` and `UncertaintyAbort` as objectives/terminators to discourage the policy from wandering into regions the model hasn't seen; `SIM_UNCERTAINTY` is automatically injected into the obs dict when a simulator is active. See [NuconGoalEnv + HER Usage](#nucongoalenv--her-usage).
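Since objectives and terminators are plain callables on the observation (as the `step()` change in this commit shows, each is called as `o(obs['observation'])` and returns a float), a custom one is just a function or small class. This sketch mirrors the quadratic-penalty idea; the class name and the scalar-input assumption are illustrative, not the library's API:

```python
# Illustrative custom objective: any callable mapping an observation to a
# float works. For the sketch we pass the uncertainty value directly as a
# scalar; in the env it would be read out of the observation vector.
class QuadraticUncertaintyPenalty:
    def __init__(self, start=0.3, scale=1.0):
        self.start = start  # soft threshold: no penalty below this
        self.scale = scale  # penalty strength

    def __call__(self, uncertainty):
        excess = max(0.0, float(uncertainty) - self.start)
        return -self.scale * excess ** 2  # quadratic penalty above the soft threshold

penalty = QuadraticUncertaintyPenalty(start=0.3, scale=1.0)
```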
**Step 4 — Eval in game + collect new data**: Run the trained policy against the real game. This validates simulator accuracy and simultaneously collects new data from states the policy visits, which may be regions the original dataset missed. Run a second `NuconModelLearner` in a background thread to collect concurrently.
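The concurrent collection in Step 4 can be pictured with a plain background thread; `collect_sample` below is a hypothetical stand-in for `NuconModelLearner`'s collection loop, not the real API:

```python
import threading
import time

samples = []
stop = threading.Event()

def collect_sample():
    return 0.0  # hypothetical stand-in for reading the current game state

def collector_loop():
    # Background collection keeps running while the main thread evaluates the policy.
    while not stop.is_set():
        samples.append(collect_sample())
        time.sleep(0.001)

t = threading.Thread(target=collector_loop, daemon=True)
t.start()
time.sleep(0.05)  # stand-in for running the evaluation episode
stop.set()
t.join()
```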
@@ -280,6 +280,8 @@ class NuconGoalEnv(gym.Env):
        seconds_per_step=5,
        terminators=None,
        terminate_above=0,
+        objectives=None,
+        objective_weights=None,
    ):
        super().__init__()
@@ -349,6 +351,9 @@ class NuconGoalEnv(gym.Env):
        self.action_space = spaces.Dict(action_spaces)

        self._terminators = terminators or []
+        _objs = objectives or []
+        self._objectives = [Objectives[o] if isinstance(o, str) else o for o in _objs]
+        self._objective_weights = objective_weights or [1.0] * len(self._objectives)
        self._desired_goal = np.zeros(n_goals, dtype=np.float32)
        self._total_steps = 0
@@ -410,6 +415,7 @@ class NuconGoalEnv(gym.Env):
        info = {'achieved_goal': obs['achieved_goal'], 'desired_goal': obs['desired_goal'],
                'obs': obs['observation']}
        reward = float(self.compute_reward(obs['achieved_goal'], obs['desired_goal'], info))
+        reward += sum(w * o(obs['observation']) for o, w in zip(self._objectives, self._objective_weights))
        terminated = any(t(obs['observation']) > self.terminate_above for t in self._terminators)
        truncated = False
        return obs, reward, terminated, truncated, info