# Metastable Baselines 2

<p align='center'>
  <img src='./icon.svg'>
</p>

An extension to Stable Baselines 3. Based on Metastable Baselines 1.

This repo provides:

- An implementation of ["Differentiable Trust Region Layers for Deep Reinforcement Learning" by Fabian Otto et al. (TRPL)](https://arxiv.org/abs/2101.09207)
- Support for Contextual Covariances
- Support for Full Covariances

## Installation

#### Install dependency: Metastable Projections

Follow the installation instructions for [Metastable Projections](https://git.dominik-roth.eu/dodox/metastable-projections) ([GitHub Mirror](https://github.com/D-o-d-o-x/metastable-projections)).
KL projections require ALR's ITPAL as an additional dependency.

#### Install as a package

Then install this repo as a package:

```bash
pip install -e .
```

To use full or contextual covariances, install with the optional `pca` dependency:

```bash
pip install -e '.[pca]'
```

## Usage

### TRPL

TRPL can be used just like SB3's PPO:

```python
import gymnasium as gym
from metastable_baselines2 import TRPL

env_id = 'LunarLanderContinuous-v2'
projection = 'Wasserstein' # or Frobenius or KL

model = TRPL("MlpPolicy", env_id, n_steps=128, seed=0, policy_kwargs=dict(net_arch=[16]), projection_class=projection, verbose=1)

model.learn(total_timesteps=256)
```

Configure the projection by passing `projection_kwargs` to TRPL:

```python
mean_bound, cov_bound = 0.03, 1e-3  # illustrative values only; tune for your task
model = TRPL("MlpPolicy", env_id, n_steps=128, seed=0, policy_kwargs=dict(net_arch=[16]), projection_class=projection, projection_kwargs={'mean_bound': mean_bound, 'cov_bound': cov_bound}, verbose=1)
```

For the available `projection_kwargs`, see [Metastable Projections](https://git.dominik-roth.eu/dodox/metastable-projections).
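
To give an intuition for what a bound such as `mean_bound` does, here is a minimal sketch of a trust-region projection for the mean alone. It is not the code used by Metastable Projections: the function name `project_mean` is hypothetical, and the sketch assumes the squared Euclidean distance as the mean distance, which may differ from the distance used by a given projection.

```python
import math


def project_mean(mean, old_mean, mean_bound):
    """Pull `mean` back toward `old_mean` if it left the trust region.

    Hypothetical helper for illustration. The trust region is assumed to be
    ||mean - old_mean||^2 <= mean_bound (squared Euclidean distance).
    """
    diff = [m - o for m, o in zip(mean, old_mean)]
    dist = sum(d * d for d in diff)
    if dist <= mean_bound:
        # Already inside the trust region: leave the mean unchanged.
        return list(mean)
    # Shrink the step so the projected mean lies exactly on the boundary.
    scale = math.sqrt(mean_bound / dist)
    return [o + scale * d for o, d in zip(old_mean, diff)]


old_mean = [0.0, 0.0]
new_mean = [3.0, 4.0]  # squared distance to old_mean is 25
projected = project_mean(new_mean, old_mean, mean_bound=1.0)
print(projected)
```

The projected mean keeps the direction of the proposed update but limits its size. The covariance is handled by a separate bound (`cov_bound`) in the same spirit.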

### Full Covariance

SB3 only supports diagonal covariances. We provide support for full covariances via the separate [PCA](https://git.dominik-roth.eu/dodox/PriorConditionedAnnealing) ('Prior Conditioned Annealing') package. Since we don't actually want to use PCA itself, we pass `skip_conditioning=True`, which causes the underlying noise to be used directly.

We therefore pass `use_pca=True` and `policy_kwargs.dist_kwargs = {'Base_Noise': 'WHITE', 'par_strength': 'FULL', 'skip_conditioning': True}`:

```python
# PPO and TRPL are supported (SAC is untested; PRs fixing issues are welcome)
model = TRPL("MlpPolicy", env_id, n_steps=128, seed=0, use_pca=True, policy_kwargs=dict(net_arch=[16], dist_kwargs={'par_strength': 'FULL', 'skip_conditioning': True}), projection_class=projection, verbose=1)

model.learn(total_timesteps=256)
```

The supported values for `par_strength` are:
- `SCALAR`: We learn only a single scalar value that is used along the whole diagonal. No covariance is modeled.

- `DIAG`: We learn a diagonal covariance matrix (i.e. only variances).

- `FULL`: We learn a full covariance matrix, induced via its Cholesky decomposition (except when the Wasserstein projection is used; then we use the Cholesky decomposition of the SPD matrix square root of the covariance matrix).

- `CONT_SCALAR`: Same as `SCALAR`, but the scalar is not global; it is parameterized by the policy net (contextual).

- `CONT_DIAG`: Same as `DIAG`, but the values are not global; they are parameterized by the policy net.

- `CONT_HYBRID`: We learn a parametric diagonal that is scaled by the policy net.

- `CONT_FULL`: Same as `FULL`, but parameterized by the policy net.
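
The difference between the non-contextual options above can be sketched in plain Python. This is an illustration only, not the package's implementation: `covariance_from_params` is a hypothetical helper, and it assumes that `SCALAR` and `DIAG` parameters are standard deviations and that `FULL` parameters fill a lower-triangular Cholesky factor row by row. The Wasserstein special case is not covered.

```python
def covariance_from_params(par_strength, params, dim):
    """Build a dim x dim covariance matrix (list of lists) from flat parameters.

    Hypothetical helper; the parameter layout is an assumption of this sketch.
    """
    if par_strength == 'SCALAR':
        # One shared standard deviation for every action dimension.
        (std,) = params
        return [[std * std if i == j else 0.0 for j in range(dim)]
                for i in range(dim)]
    if par_strength == 'DIAG':
        # One standard deviation per action dimension, no correlations.
        return [[params[i] ** 2 if i == j else 0.0 for j in range(dim)]
                for i in range(dim)]
    if par_strength == 'FULL':
        # Fill the lower-triangular Cholesky factor L row by row.
        chol = [[0.0] * dim for _ in range(dim)]
        values = iter(params)
        for i in range(dim):
            for j in range(i + 1):
                chol[i][j] = next(values)
        # Covariance = L L^T, which is symmetric positive semi-definite.
        return [[sum(chol[i][k] * chol[j][k] for k in range(dim))
                 for j in range(dim)] for i in range(dim)]
    raise ValueError(f'unsupported par_strength: {par_strength}')


print(covariance_from_params('SCALAR', [2.0], dim=2))
print(covariance_from_params('DIAG', [1.0, 3.0], dim=2))
print(covariance_from_params('FULL', [1.0, 0.5, 2.0], dim=2))
```

For an action dimension `d`, this layout needs 1, `d`, and `d * (d + 1) / 2` parameters respectively. In the `CONT_*` variants, the parameters would be produced by the policy net for each observation instead of being global.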


## License

Since this repo is an extension to [Stable Baselines 3 by DLR-RM](https://github.com/DLR-RM/stable-baselines3), it contains some of its code. SB3 is licensed under the [MIT License](https://github.com/DLR-RM/stable-baselines3/blob/master/LICENSE), and so are our extensions.