Add PPO example to README
parent 5dfd85a5af
commit e4c9f047d0
README.md | 38
@@ -138,6 +138,44 @@ env.close()
Objectives can be passed either as strings naming predefined objectives, or as lambda functions that take an observation and return a scalar reward. The final reward is the (weighted) sum across all objectives. `info['objectives']` contains all objectives and their values.
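
For illustration, a custom lambda objective could be combined with a predefined one. This is only a sketch: the lambda body (indexing the observation) is an assumption made for the example, not something documented here.

```python
from nucon.rl import NuconEnv

# A sketch: combine a predefined objective (by name) with a custom lambda.
# The lambda body is an assumption for illustration; it treats the observation
# as indexable and penalizes the magnitude of its first element.
env = NuconEnv(
    objectives=[
        'max_power',
        lambda obs: -abs(obs[0]),
    ],
    seconds_per_step=5,
)

obs, info = env.reset()
# Assuming the standard Gymnasium step/reset API used in the PPO example below.
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print(info['objectives'])  # per-objective values; the reward is their (weighted) sum
env.close()
```
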
You can e.g. train a PPO agent using the [sb3](https://github.com/DLR-RM/stable-baselines3) implementation:

```python
from nucon.rl import NuconEnv
from stable_baselines3 import PPO

env = NuconEnv(objectives=['max_power'], seconds_per_step=5)

# Create the PPO (Proximal Policy Optimization) model
model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    learning_rate=3e-4,  # You can adjust hyperparameters as needed
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01
)

# Train the model
model.learn(total_timesteps=100000)  # Adjust total_timesteps as needed

# Test the trained model
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        obs, info = env.reset()

# Close the environment
env.close()
```

But there's a problem: RL algorithms require a huge number of training steps to get passable policies, and Nucleares is a very slow simulation that cannot be trivially parallelized. That's why NuCon also provides a

## Simulator (Work in Progress)