Add PPO example to README

parent 5dfd85a5af
commit e4c9f047d0

README.md (+38)
@@ -138,6 +138,44 @@ env.close()
`objectives` takes either strings naming predefined objectives, or lambda functions that take an observation and return a scalar reward. Final rewards are the (weighted) sum across all objectives. `info['objectives']` contains all objectives and their values.
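
For illustration, here is a minimal sketch mixing a predefined objective with a custom lambda; the observation field `core_temp`, the dict-style observation access, and the setpoint value are assumptions for the example, not guaranteed parts of NuconEnv:

```python
from nucon.rl import NuconEnv

# Sketch only: combine a predefined objective (by name) with a custom lambda.
# 'core_temp' and dict-style observations are assumptions for illustration.
env = NuconEnv(
    objectives=[
        'max_power',                               # predefined objective, by name
        lambda obs: -abs(obs['core_temp'] - 350),  # custom scalar reward from the observation
    ],
    seconds_per_step=5,
)

obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print(info['objectives'])  # per-objective values that were (weighted-)summed into `reward`
```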

You can, for example, train a PPO agent using the [sb3](https://github.com/DLR-RM/stable-baselines3) implementation:
```python
from nucon.rl import NuconEnv
from stable_baselines3 import PPO

env = NuconEnv(objectives=['max_power'], seconds_per_step=5)

# Create the PPO (Proximal Policy Optimization) model
model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    learning_rate=3e-4,  # You can adjust hyperparameters as needed
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01
)

# Train the model
model.learn(total_timesteps=100000)  # Adjust total_timesteps as needed

# Test the trained model
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        obs, info = env.reset()

# Close the environment
env.close()
```
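
Because training against the live game is slow (see below), it can be worth checkpointing the trained policy between sessions. A minimal sketch using stable-baselines3's standard save/load API (the file name is a placeholder):

```python
# Save the trained policy to disk (file name is a placeholder)
model.save("ppo_nucon")

# Later, or in a fresh process, restore it to keep training or evaluating
from stable_baselines3 import PPO
model = PPO.load("ppo_nucon", env=env)
```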

But there's a problem: RL algorithms require a huge number of training steps to reach passable policies, and Nucleares is a very slow simulation that cannot be trivially parallelized. That's why NuCon also provides a

## Simulator (Work in Progress)