Refactoring of the DMP environments to better fit the gym interface.
This commit is contained in: parent 6233c85904, commit 7ceadeff0a
56 README.md
@ -1,34 +1,50 @@
|
||||
## ALR Custom Environments
|
||||
## ALR Environments
|
||||
|
||||
This repository collects custom RL envs not included in Suits like OpenAI gym, rllab, etc.
|
||||
Creating a custom (Mujoco) gym environement can be done according to [this guide](https://github.com/openai/gym/blob/master/docs/creating-environments.md).
|
||||
For stochastic search problems with gym interace use the Rosenbrock reference implementation.
|
||||
This repository collects custom robotics environments not included in benchmark suites like OpenAI gym, rllab, etc.
|
||||
Creating a custom (Mujoco) gym environment can be done according to [this guide](https://github.com/openai/gym/blob/master/docs/creating-environments.md).
|
||||
For stochastic search problems with a gym interface, use the `Rosenbrock-v0` reference implementation.
|
||||
We also support solving environments with DMPs. When adding new DMP tasks, check the `ViaPointReacherDMP-v0` reference implementation.
|
||||
When simply using the tasks, you can also leverage the wrapper class `DmpWrapper` to turn normal gym environments into DMP tasks, as sketched below.
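For example, a minimal sketch of wrapping the step-based `HoleReacher-v0` task (parameter values copied from the `HoleReacherDMP-v0` factory added in this commit):

```python
import gym

from alr_envs.utils.wrapper.dmp_wrapper import DmpWrapper

# Wrap the step-based HoleReacher task into a DMP task.
env = gym.make("alr_envs:HoleReacher-v0")
env = DmpWrapper(env, num_dof=5, num_basis=5, duration=2, dt=env.dt,
                 learn_goal=True, alpha_phase=3.5, start_pos=env.start_pos,
                 policy_type="velocity", weights_scale=100, goal_scale=0.1)

obs = env.reset()
# A single step now executes one full DMP rollout and returns the episodic reward.
obs, episode_reward, done, infos = env.step(env.action_space.sample())
```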
|
||||
|
||||
## Environments
|
||||
Currently we have the following environements:
|
||||
Currently we have the following environments:
|
||||
|
||||
### Mujoco
|
||||
|
||||
|Name| Description|
|
||||
|---|---|
|
||||
|`ALRReacher-v0`|Modified (5 links) Mujoco gym's `Reacher-v2` (2 links)|
|
||||
|`ALRReacherSparse-v0`|Same as `ALRReacher-v0`, but the distance penalty is only provided in the last time step.|
|
||||
|`ALRReacherSparseBalanced-v0`|Same as `ALRReacherSparse-v0`, but the the end effector has to stay upright.|
|
||||
|`ALRReacherShort-v0`|Same as `ALRReacher-v0`, but the episode length is reduced to 50.|
|
||||
|`ALRReacherShortSparse-v0`|Combination of `ALRReacherSparse-v0` and `ALRReacherShort-v0`.|
|
||||
|`ALRReacher7-v0`|Modified (7 links) Mujoco gym's `Reacher-v2` (2 links)|
|
||||
|`ALRReacher7Sparse-v0`|Same as `ALRReacher7-v0`, but the distance penalty is only provided in the last time step.|
|
||||
|Name| Description|Horizon|Action Dimension|Observation Dimension
|
||||
|---|---|---|---|---|
|
||||
|`ALRReacher-v0`|Modified version of Mujoco gym's `Reacher-v2` with 5 links instead of the original 2.| 200 | 5 | 21
|
||||
|`ALRReacherSparse-v0`|Same as `ALRReacher-v0`, but the distance penalty is only provided in the last time step.| 200 | 5 | 21
|
||||
|`ALRReacherSparseBalanced-v0`|Same as `ALRReacherSparse-v0`, but the end-effector has to remain upright.| 200 | 5 | 21
|
||||
|`ALRLongReacher-v0`|Modified version of Mujoco gym's `Reacher-v2` with 7 links instead of the original 2.| 200 | 7 | 27
|
||||
|`ALRLongReacherSparse-v0`|Same as `ALRLongReacher-v0`, but the distance penalty is only provided in the last time step.| 200 | 7 | 27
|
||||
|`ALRLongReacherSparseBalanced-v0`|Same as `ALRLongReacherSparse-v0`, but the end-effector has to remain upright.| 200 | 7 | 27
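A minimal sketch of running one of these tasks with random actions (following `example.py` from this commit):

```python
import gym

env = gym.make('alr_envs:ALRReacher-v0')
obs = env.reset()
rewards = 0

# number of environment steps
for i in range(2000):
    obs, reward, done, info = env.step(env.action_space.sample())
    rewards += reward
    env.render()

    if done:
        print(rewards)
        rewards = 0
        obs = env.reset()
```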
|
||||
|
||||
### Classic Control
|
||||
|
||||
|Name| Description|
|
||||
|---|---|
|
||||
|`SimpleReacher-v0`| Simple Reaching Task without any physics simulation. Returns no reward until 150 time steps. This allows the agent to explore the space, but requires precise actions towards the end of the trajectory.|
|
||||
|Name| Description|Horizon|Action Dimension|Observation Dimension
|
||||
|---|---|---|---|---|
|
||||
|`SimpleReacher-v0`| Simple reaching task (2 links) without any physics simulation. Provides no reward until 150 time steps. This allows the agent to explore the space, but requires precise actions towards the end of the trajectory.| 200 | 2 | 9
|
||||
|`LongSimpleReacher-v0`| Simple reaching task (5 links) without any physics simulation. Provides no reward until 150 time steps. This allows the agent to explore the space, but requires precise actions towards the end of the trajectory.| 200 | 5 | 18
|
||||
|`ViaPointReacher-v0`| Simple reaching task leveraging a via point, with support for self-collision detection. Provides a reward only at time steps 100 and 199 for reaching the via point and goal point, respectively.| 200 |
|
||||
|`HoleReacher-v0`|
|
||||
|
||||
### DMP Environments
|
||||
These environments are closer to stochastic search. They always execute a full trajectory, which is computed by a DMP and executed by a controller, e.g. a PD controller.
|
||||
The goal is to learn the parameters of this DMP to generate a suitable trajectory.
|
||||
All environments provide the full episode reward and additional information about early terminations, e.g. due to collisions.
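A minimal usage sketch for the DMP tasks (following `example.py` from this commit):

```python
import gym

env = gym.make("alr_envs:HoleReacherDMP-v0")
obs = env.reset()

# number of samples/full trajectories (multiple environment steps)
for i in range(10):
    obs, episode_reward, done, infos = env.step(env.action_space.sample())

    # Render the full DMP trajectory; mode=None disables rendering.
    env.render(mode="partial")

    if done:
        obs = env.reset()
```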
|
||||
|
||||
|Name| Description|Horizon|Action Dimension|Observation Dimension
|
||||
|---|---|---|---|---|
|
||||
|`ViaPointReacherDMP-v0`| Simple reaching task leveraging a via point, with support for self-collision detection. Provides a reward only at time steps 100 and 199 for reaching the via point and goal point, respectively.| 200 |
|
||||
|`HoleReacherDMP-v0`|
|
||||
|`HoleReacherFixedGoalDMP-v0`|
|
||||
|`HoleReacherDetPMP-v0`|
|
||||
|
||||
### Stochastic Search
|
||||
|Name| Description|
|
||||
|---|---|
|
||||
|`Rosenbrock{dim}-v0`| Gym interface for Rosenbrock function. `{dim}` is one of 5, 10, 25, 50 or 100. |
|
||||
|Name| Description|Horizon|Action Dimension|Observation Dimension
|
||||
|---|---|---|---|---|
|
||||
|`Rosenbrock{dim}-v0`| Gym interface for the Rosenbrock function. `{dim}` is one of 5, 10, 25, 50 or 100. | 1 | `{dim}` | 0
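One environment id is registered per dimension. A short usage sketch (this assumes the environment's `step` accepts a parameter vector of length `{dim}`, as the table above indicates):

```python
import gym
import numpy as np

# e.g. the 10-dimensional variant; Rosenbrock5-v0 ... Rosenbrock100-v0 are also registered
env = gym.make("alr_envs:Rosenbrock10-v0")
env.reset()

# The horizon is 1: a single step evaluates the Rosenbrock function for the given parameters.
params = np.zeros(10)
obs, reward, done, info = env.step(params)
```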
|
||||
|
||||
|
||||
## Install
|
||||
|
@ -1,6 +1,9 @@
|
||||
from gym.envs.registration import register
|
||||
|
||||
from alr_envs.stochastic_search.functions.f_rosenbrock import Rosenbrock
|
||||
from alr_envs.utils.wrapper.dmp_wrapper import DmpWrapper
|
||||
|
||||
# Mujoco
|
||||
|
||||
register(
|
||||
id='ALRReacher-v0',
|
||||
@ -9,6 +12,7 @@ register(
|
||||
kwargs={
|
||||
"steps_before_reward": 0,
|
||||
"n_links": 5,
|
||||
"balance": False,
|
||||
}
|
||||
)
|
||||
|
||||
@ -19,6 +23,7 @@ register(
|
||||
kwargs={
|
||||
"steps_before_reward": 200,
|
||||
"n_links": 5,
|
||||
"balance": False,
|
||||
}
|
||||
)
|
||||
|
||||
@ -34,65 +39,40 @@ register(
|
||||
)
|
||||
|
||||
register(
|
||||
id='ALRReacherShort-v0',
|
||||
entry_point='alr_envs.mujoco:ALRReacherEnv',
|
||||
max_episode_steps=50,
|
||||
kwargs={
|
||||
"steps_before_reward": 0,
|
||||
"n_links": 5,
|
||||
}
|
||||
)
|
||||
|
||||
register(
|
||||
id='ALRReacherShortSparse-v0',
|
||||
entry_point='alr_envs.mujoco:ALRReacherEnv',
|
||||
max_episode_steps=50,
|
||||
kwargs={
|
||||
"steps_before_reward": 50,
|
||||
"n_links": 5,
|
||||
}
|
||||
)
|
||||
|
||||
register(
|
||||
id='ALRReacher7-v0',
|
||||
id='ALRLongReacher-v0',
|
||||
entry_point='alr_envs.mujoco:ALRReacherEnv',
|
||||
max_episode_steps=200,
|
||||
kwargs={
|
||||
"steps_before_reward": 0,
|
||||
"n_links": 7,
|
||||
}
|
||||
)
|
||||
|
||||
# register(
|
||||
# id='ALRReacherSparse-v0',
|
||||
# entry_point='alr_envs.mujoco:ALRReacherEnv',
|
||||
# max_episode_steps=200,
|
||||
# kwargs={
|
||||
# "steps_before_reward": 200,
|
||||
# "n_links": 7,
|
||||
# }
|
||||
# )
|
||||
|
||||
register(
|
||||
id='ALRReacher7Short-v0',
|
||||
entry_point='alr_envs.mujoco:ALRReacherEnv',
|
||||
max_episode_steps=50,
|
||||
kwargs={
|
||||
"steps_before_reward": 0,
|
||||
"n_links": 7,
|
||||
"balance": False,
|
||||
}
|
||||
)
|
||||
|
||||
register(
|
||||
id='ALRReacher7ShortSparse-v0',
|
||||
id='ALRLongReacherSparse-v0',
|
||||
entry_point='alr_envs.mujoco:ALRReacherEnv',
|
||||
max_episode_steps=50,
|
||||
max_episode_steps=200,
|
||||
kwargs={
|
||||
"steps_before_reward": 50,
|
||||
"steps_before_reward": 200,
|
||||
"n_links": 7,
|
||||
"balance": False,
|
||||
}
|
||||
)
|
||||
|
||||
register(
|
||||
id='ALRLongReacherSparseBalanced-v0',
|
||||
entry_point='alr_envs.mujoco:ALRReacherEnv',
|
||||
max_episode_steps=200,
|
||||
kwargs={
|
||||
"steps_before_reward": 200,
|
||||
"n_links": 7,
|
||||
"balance": True,
|
||||
}
|
||||
)
|
||||
|
||||
# Classic control
|
||||
|
||||
register(
|
||||
id='SimpleReacher-v0',
|
||||
entry_point='alr_envs.classic_control:SimpleReacherEnv',
|
||||
@ -103,7 +83,7 @@ register(
|
||||
)
|
||||
|
||||
register(
|
||||
id='SimpleReacher5-v0',
|
||||
id='LongSimpleReacher-v0',
|
||||
entry_point='alr_envs.classic_control:SimpleReacherEnv',
|
||||
max_episode_steps=200,
|
||||
kwargs={
|
||||
@ -111,12 +91,66 @@ register(
|
||||
}
|
||||
)
|
||||
|
||||
register(
|
||||
id='ViaPointReacher-v0',
|
||||
entry_point='alr_envs.classic_control.viapoint_reacher:ViaPointReacher',
|
||||
max_episode_steps=200,
|
||||
kwargs={
|
||||
"n_links": 5,
|
||||
"allow_self_collision": False,
|
||||
"collision_penalty": 1000
|
||||
}
|
||||
)
|
||||
|
||||
register(
|
||||
id='HoleReacher-v0',
|
||||
entry_point='alr_envs.classic_control.hole_reacher:HoleReacher',
|
||||
max_episode_steps=200,
|
||||
kwargs={
|
||||
"n_links": 5,
|
||||
"allow_self_collision": False,
|
||||
"allow_wall_collision": False,
|
||||
"hole_width": 0.15,
|
||||
"hole_depth": 1,
|
||||
"hole_x": 1,
|
||||
"collision_penalty": 100,
|
||||
}
|
||||
)
|
||||
|
||||
# DMP environments
|
||||
|
||||
register(
|
||||
id='ViaPointReacherDMP-v0',
|
||||
entry_point='alr_envs.classic_control.viapoint_reacher:viapoint_dmp',
|
||||
# max_episode_steps=1,
|
||||
)
|
||||
|
||||
register(
|
||||
id='HoleReacherDMP-v0',
|
||||
entry_point='alr_envs.classic_control.hole_reacher:holereacher_dmp',
|
||||
# max_episode_steps=1,
|
||||
)
|
||||
|
||||
register(
|
||||
id='HoleReacherFixedGoalDMP-v0',
|
||||
entry_point='alr_envs.classic_control.hole_reacher:holereacher_fix_goal_dmp',
|
||||
# max_episode_steps=1,
|
||||
)
|
||||
|
||||
register(
|
||||
id='HoleReacherDetPMP-v0',
|
||||
entry_point='alr_envs.classic_control.hole_reacher:holereacher_detpmp',
|
||||
# max_episode_steps=1,
|
||||
)
|
||||
|
||||
# BBO functions
|
||||
|
||||
for dim in [5, 10, 25, 50, 100]:
|
||||
register(
|
||||
id=f'Rosenbrock{dim}-v0',
|
||||
entry_point='alr_envs.stochastic_search:StochasticSearchEnv',
|
||||
max_episode_steps=1,
|
||||
kwargs={
|
||||
"cost_f": Rosenbrock,
|
||||
"cost_f": Rosenbrock(dim),
|
||||
}
|
||||
)
|
||||
|
@ -1 +1,3 @@
|
||||
from alr_envs.classic_control.simple_reacher import SimpleReacherEnv
|
||||
from alr_envs.classic_control.viapoint_reacher import ViaPointReacher
|
||||
from alr_envs.classic_control.hole_reacher import HoleReacher
|
@ -3,6 +3,9 @@ import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
from matplotlib import patches
|
||||
|
||||
from alr_envs import DmpWrapper
|
||||
from alr_envs.utils.wrapper.detpmp_wrapper import DetPMPWrapper
|
||||
|
||||
|
||||
def ccw(A, B, C):
|
||||
return (C[1] - A[1]) * (B[0] - A[0]) - (B[1] - A[1]) * (C[0] - A[0]) > 1e-12
|
||||
@ -13,37 +16,66 @@ def intersect(A, B, C, D):
|
||||
return ccw(A, C, D) != ccw(B, C, D) and ccw(A, B, C) != ccw(A, B, D)
|
||||
|
||||
|
||||
def holereacher_dmp(**kwargs):
|
||||
_env = gym.make("alr_envs:HoleReacher-v0")
|
||||
# _env = HoleReacher(**kwargs)
|
||||
return DmpWrapper(_env, num_dof=5, num_basis=5, duration=2, dt=_env.dt, learn_goal=True, alpha_phase=3.5,
|
||||
start_pos=_env.start_pos, policy_type="velocity", weights_scale=100, goal_scale=0.1)
|
||||
|
||||
|
||||
def holereacher_fix_goal_dmp(**kwargs):
|
||||
_env = gym.make("alr_envs:HoleReacher-v0")
|
||||
# _env = HoleReacher(**kwargs)
|
||||
return DmpWrapper(_env, num_dof=5, num_basis=5, duration=2, dt=_env.dt, learn_goal=False, alpha_phase=3.5,
|
||||
start_pos=_env.start_pos, policy_type="velocity", weights_scale=50, goal_scale=1,
|
||||
final_pos=np.array([2.02669572, -1.25966385, -1.51618198, -0.80946476, 0.02012344]))
|
||||
|
||||
|
||||
def holereacher_detpmp(**kwargs):
|
||||
_env = gym.make("alr_envs:HoleReacher-v0")
|
||||
# _env = HoleReacher(**kwargs)
|
||||
return DetPMPWrapper(_env, num_dof=5, num_basis=5, width=0.005, policy_type="velocity", start_pos=_env.start_pos,
|
||||
duration=2, post_traj_time=0, dt=_env.dt, weights_scale=0.25, zero_start=True, zero_goal=False)
|
||||
|
||||
|
||||
class HoleReacher(gym.Env):
|
||||
|
||||
def __init__(self, num_links, hole_x, hole_width, hole_depth, allow_self_collision=False,
|
||||
def __init__(self, n_links, hole_x, hole_width, hole_depth, allow_self_collision=False,
|
||||
allow_wall_collision=False, collision_penalty=1000):
|
||||
|
||||
self.n_links = n_links
|
||||
self.link_lengths = np.ones((n_links, 1))
|
||||
|
||||
# task
|
||||
self.hole_x = hole_x # x-position of center of hole
|
||||
self.hole_width = hole_width # width of hole
|
||||
self.hole_depth = hole_depth # depth of hole
|
||||
self.num_links = num_links
|
||||
self.link_lengths = np.ones((num_links, 1))
|
||||
|
||||
self.bottom_center_of_hole = np.hstack([hole_x, -hole_depth])
|
||||
self.top_center_of_hole = np.hstack([hole_x, 0])
|
||||
self.left_wall_edge = np.hstack([hole_x - self.hole_width / 2, 0])
|
||||
self.right_wall_edge = np.hstack([hole_x + self.hole_width / 2, 0])
|
||||
|
||||
# collision
|
||||
self.allow_self_collision = allow_self_collision
|
||||
self.allow_wall_collision = allow_wall_collision
|
||||
self.collision_penalty = collision_penalty
|
||||
|
||||
# state
|
||||
self._joints = None
|
||||
self._joint_angles = None
|
||||
self._angle_velocity = None
|
||||
self.start_pos = np.hstack([[np.pi/2], np.zeros(self.num_links - 1)])
|
||||
self.start_vel = np.zeros(self.num_links)
|
||||
self.start_pos = np.hstack([[np.pi / 2], np.zeros(self.n_links - 1)])
|
||||
self.start_vel = np.zeros(self.n_links)
|
||||
|
||||
self.dt = 0.01
|
||||
self.time_limit = 2
|
||||
# self.time_limit = 2
|
||||
|
||||
action_bound = np.pi * np.ones((self.num_links,))
|
||||
action_bound = np.pi * np.ones((self.n_links,))
|
||||
state_bound = np.hstack([
|
||||
[np.pi] * self.num_links, # cos
|
||||
[np.pi] * self.num_links, # sin
|
||||
[np.inf] * self.num_links, # velocity
|
||||
[np.pi] * self.n_links, # cos
|
||||
[np.pi] * self.n_links, # sin
|
||||
[np.inf] * self.n_links, # velocity
|
||||
[np.inf] * 2, # x-y coordinates of target distance
|
||||
[np.inf] # env steps, because reward starts after n steps TODO: Maybe
|
||||
])
|
||||
@ -51,11 +83,11 @@ class HoleReacher(gym.Env):
|
||||
self.observation_space = gym.spaces.Box(low=-state_bound, high=state_bound, shape=state_bound.shape)
|
||||
|
||||
self.fig = None
|
||||
rect_1 = patches.Rectangle((-self.num_links, -1),
|
||||
self.num_links + self.hole_x - self.hole_width / 2, 1,
|
||||
rect_1 = patches.Rectangle((-self.n_links, -1),
|
||||
self.n_links + self.hole_x - self.hole_width / 2, 1,
|
||||
fill=True, edgecolor='k', facecolor='k')
|
||||
rect_2 = patches.Rectangle((self.hole_x + self.hole_width / 2, -1),
|
||||
self.num_links - self.hole_x + self.hole_width / 2, 1,
|
||||
self.n_links - self.hole_x + self.hole_width / 2, 1,
|
||||
fill=True, edgecolor='k', facecolor='k')
|
||||
rect_3 = patches.Rectangle((self.hole_x - self.hole_width / 2, -1), self.hole_width,
|
||||
1 - self.hole_depth,
|
||||
@ -65,7 +97,7 @@ class HoleReacher(gym.Env):
|
||||
|
||||
@property
|
||||
def end_effector(self):
|
||||
return self._joints[self.num_links].T
|
||||
return self._joints[self.n_links].T
|
||||
|
||||
def configure(self, context):
|
||||
pass
|
||||
@ -73,13 +105,13 @@ class HoleReacher(gym.Env):
|
||||
def reset(self):
|
||||
self._joint_angles = self.start_pos
|
||||
self._angle_velocity = self.start_vel
|
||||
self._joints = np.zeros((self.num_links + 1, 2))
|
||||
self._joints = np.zeros((self.n_links + 1, 2))
|
||||
self._update_joints()
|
||||
self._steps = 0
|
||||
|
||||
return self._get_obs().copy()
|
||||
|
||||
def step(self, action):
|
||||
def step(self, action: np.ndarray):
|
||||
"""
|
||||
a single step with an action in joint velocity space
|
||||
"""
|
||||
@ -90,16 +122,12 @@ class HoleReacher(gym.Env):
|
||||
|
||||
self._update_joints()
|
||||
|
||||
# rew = self._reward()
|
||||
|
||||
# compute reward directly in step function
|
||||
|
||||
reward = 0
|
||||
if not self._is_collided:
|
||||
if self._is_collided:
|
||||
reward = -self.collision_penalty
|
||||
else:
|
||||
if self._steps == 199:
|
||||
reward = - np.linalg.norm(self.end_effector - self.bottom_center_of_hole) ** 2
|
||||
else:
|
||||
reward = -self.collision_penalty
|
||||
|
||||
reward -= 5e-8 * np.sum(acc ** 2)
|
||||
|
||||
@ -107,7 +135,8 @@ class HoleReacher(gym.Env):
|
||||
|
||||
self._steps += 1
|
||||
|
||||
done = self._steps * self.dt > self.time_limit or self._is_collided
|
||||
# done = self._steps * self.dt > self.time_limit or self._is_collided
|
||||
done = self._is_collided
|
||||
|
||||
return self._get_obs().copy(), reward, done, info
|
||||
|
||||
@ -145,18 +174,6 @@ class HoleReacher(gym.Env):
|
||||
self._steps
|
||||
])
|
||||
|
||||
# def _reward(self):
|
||||
# dist_reward = 0
|
||||
# if not self._is_collided:
|
||||
# if self._steps == 180:
|
||||
# dist_reward = np.linalg.norm(self.end_effector - self.bottom_center_of_hole)
|
||||
# else:
|
||||
# dist_reward = np.linalg.norm(self.end_effector - self.bottom_center_of_hole)
|
||||
#
|
||||
# out = - dist_reward ** 2
|
||||
#
|
||||
# return out
|
||||
|
||||
def get_forward_kinematics(self, num_points_per_link=1):
|
||||
theta = self._joint_angles[:, None]
|
||||
|
||||
@ -167,7 +184,7 @@ class HoleReacher(gym.Env):
|
||||
|
||||
accumulated_theta = np.cumsum(theta, axis=0)
|
||||
|
||||
endeffector = np.zeros(shape=(self.num_links, num_points_per_link, 2))
|
||||
endeffector = np.zeros(shape=(self.n_links, num_points_per_link, 2))
|
||||
|
||||
x = np.cos(accumulated_theta) * self.link_lengths * intermediate_points
|
||||
y = np.sin(accumulated_theta) * self.link_lengths * intermediate_points
|
||||
@ -175,7 +192,7 @@ class HoleReacher(gym.Env):
|
||||
endeffector[0, :, 0] = x[0, :]
|
||||
endeffector[0, :, 1] = y[0, :]
|
||||
|
||||
for i in range(1, self.num_links):
|
||||
for i in range(1, self.n_links):
|
||||
endeffector[i, :, 0] = x[i, :] + endeffector[i - 1, -1, 0]
|
||||
endeffector[i, :, 1] = y[i, :] + endeffector[i - 1, -1, 1]
|
||||
|
||||
@ -273,7 +290,7 @@ class HoleReacher(gym.Env):
|
||||
# Add the patch to the Axes
|
||||
[plt.gca().add_patch(rect) for rect in self.patches]
|
||||
|
||||
plt.xlim(-self.num_links, self.num_links), plt.ylim(-1, self.num_links)
|
||||
plt.xlim(-self.n_links, self.n_links), plt.ylim(-1, self.n_links)
|
||||
# Arm
|
||||
plt.plot(self._joints[:, 0], self._joints[:, 1], 'ro-', markerfacecolor='k')
|
||||
|
||||
@ -287,7 +304,8 @@ class HoleReacher(gym.Env):
|
||||
if __name__ == '__main__':
|
||||
nl = 5
|
||||
render_mode = "human" # "human" or "partial" or "final"
|
||||
env = HoleReacher(num_links=nl, allow_self_collision=False, allow_wall_collision=False, hole_width=0.15, hole_depth=1, hole_x=1)
|
||||
env = HoleReacher(n_links=nl, allow_self_collision=False, allow_wall_collision=False, hole_width=0.15,
|
||||
hole_depth=1, hole_x=1)
|
||||
env.reset()
|
||||
# env.render(mode=render_mode)
|
||||
|
||||
|
@ -1,11 +1,9 @@
|
||||
import os
|
||||
|
||||
import gym
|
||||
import matplotlib as mpl
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
from gym import spaces
|
||||
from gym.utils import seeding
|
||||
|
||||
from alr_envs.utils.utils import angle_normalize
|
||||
|
||||
|
||||
@ -33,7 +31,7 @@ class SimpleReacherEnv(gym.Env):
|
||||
self._angle_velocity = None
|
||||
|
||||
self.max_torque = 1 # 10
|
||||
self.steps_before_reward = 180
|
||||
self.steps_before_reward = 199
|
||||
|
||||
action_bound = np.ones((self.n_links,))
|
||||
state_bound = np.hstack([
|
||||
@ -92,7 +90,7 @@ class SimpleReacherEnv(gym.Env):
|
||||
|
||||
def _update_joints(self):
|
||||
"""
|
||||
update _joints to get new end effector position. The other links are only required for rendering.
|
||||
update joints to get new end-effector position. The other links are only required for rendering.
|
||||
Returns:
|
||||
|
||||
"""
|
||||
@ -106,7 +104,7 @@ class SimpleReacherEnv(gym.Env):
|
||||
|
||||
# TODO: Is this the best option
|
||||
if self._steps >= self.steps_before_reward:
|
||||
reward_dist = - np.linalg.norm(diff)
|
||||
reward_dist -= np.linalg.norm(diff)
|
||||
# reward_dist = np.exp(-0.1 * diff ** 2).mean()
|
||||
# reward_dist = - (diff ** 2).mean()
|
||||
|
||||
|
@ -1,7 +1,7 @@
|
||||
from alr_envs.classic_control.hole_reacher import HoleReacher
|
||||
from alr_envs.classic_control.viapoint_reacher import ViaPointReacher
|
||||
from alr_envs.utils.dmp_env_wrapper import DmpEnvWrapper
|
||||
from alr_envs.utils.detpmp_env_wrapper import DetPMPEnvWrapper
|
||||
from alr_envs.utils.wrapper.dmp_wrapper import DmpWrapper
|
||||
from alr_envs.utils.wrapper.detpmp_wrapper import DetPMPWrapper
|
||||
import numpy as np
|
||||
|
||||
|
||||
@ -17,11 +17,11 @@ def make_viapointreacher_env(rank, seed=0):
|
||||
"""
|
||||
|
||||
def _init():
|
||||
_env = ViaPointReacher(num_links=5,
|
||||
_env = ViaPointReacher(n_links=5,
|
||||
allow_self_collision=False,
|
||||
collision_penalty=1000)
|
||||
|
||||
_env = DmpEnvWrapper(_env,
|
||||
_env = DmpWrapper(_env,
|
||||
num_dof=5,
|
||||
num_basis=5,
|
||||
duration=2,
|
||||
@ -49,7 +49,7 @@ def make_holereacher_env(rank, seed=0):
|
||||
"""
|
||||
|
||||
def _init():
|
||||
_env = HoleReacher(num_links=5,
|
||||
_env = HoleReacher(n_links=5,
|
||||
allow_self_collision=False,
|
||||
allow_wall_collision=False,
|
||||
hole_width=0.15,
|
||||
@ -57,7 +57,7 @@ def make_holereacher_env(rank, seed=0):
|
||||
hole_x=1,
|
||||
collision_penalty=100)
|
||||
|
||||
_env = DmpEnvWrapper(_env,
|
||||
_env = DmpWrapper(_env,
|
||||
num_dof=5,
|
||||
num_basis=5,
|
||||
duration=2,
|
||||
@ -88,7 +88,7 @@ def make_holereacher_fix_goal_env(rank, seed=0):
|
||||
"""
|
||||
|
||||
def _init():
|
||||
_env = HoleReacher(num_links=5,
|
||||
_env = HoleReacher(n_links=5,
|
||||
allow_self_collision=False,
|
||||
allow_wall_collision=False,
|
||||
hole_width=0.15,
|
||||
@ -96,7 +96,7 @@ def make_holereacher_fix_goal_env(rank, seed=0):
|
||||
hole_x=1,
|
||||
collision_penalty=100)
|
||||
|
||||
_env = DmpEnvWrapper(_env,
|
||||
_env = DmpWrapper(_env,
|
||||
num_dof=5,
|
||||
num_basis=5,
|
||||
duration=2,
|
||||
@ -128,7 +128,7 @@ def make_holereacher_env_pmp(rank, seed=0):
|
||||
"""
|
||||
|
||||
def _init():
|
||||
_env = HoleReacher(num_links=5,
|
||||
_env = HoleReacher(n_links=5,
|
||||
allow_self_collision=False,
|
||||
allow_wall_collision=False,
|
||||
hole_width=0.15,
|
||||
@ -136,7 +136,7 @@ def make_holereacher_env_pmp(rank, seed=0):
|
||||
hole_x=1,
|
||||
collision_penalty=1000)
|
||||
|
||||
_env = DetPMPEnvWrapper(_env,
|
||||
_env = DetPMPWrapper(_env,
|
||||
num_dof=5,
|
||||
num_basis=5,
|
||||
width=0.005,
|
||||
|
@ -1,30 +1,33 @@
|
||||
import gym
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
from matplotlib import patches
|
||||
import numpy as np
|
||||
|
||||
from alr_envs import DmpWrapper
|
||||
from alr_envs.utils.utils import check_self_collision
|
||||
|
||||
|
||||
def ccw(A, B, C):
|
||||
return (C[1]-A[1]) * (B[0]-A[0]) - (B[1]-A[1]) * (C[0]-A[0]) > 1e-12
|
||||
|
||||
|
||||
# Return true if line segments AB and CD intersect
|
||||
def intersect(A, B, C, D):
|
||||
return ccw(A, C, D) != ccw(B, C, D) and ccw(A, B, C) != ccw(A, B, D)
|
||||
def viapoint_dmp(**kwargs):
|
||||
_env = gym.make("alr_envs:ViaPointReacher-v0")
|
||||
# _env = ViaPointReacher(**kwargs)
|
||||
return DmpWrapper(_env, num_dof=5, num_basis=5, duration=2, alpha_phase=2.5, dt=_env.dt,
|
||||
start_pos=_env.start_pos, learn_goal=False, policy_type="velocity", weights_scale=50)
|
||||
|
||||
|
||||
class ViaPointReacher(gym.Env):
|
||||
|
||||
def __init__(self, num_links, allow_self_collision=False,
|
||||
collision_penalty=1000):
|
||||
self.num_links = num_links
|
||||
self.link_lengths = np.ones((num_links, 1))
|
||||
def __init__(self, n_links, allow_self_collision=False, collision_penalty=1000):
|
||||
self.num_links = n_links
|
||||
self.link_lengths = np.ones((n_links, 1))
|
||||
|
||||
# task
|
||||
self.via_point = np.ones(2)
|
||||
self.goal_point = np.array((n_links, 0))
|
||||
|
||||
# collision
|
||||
self.allow_self_collision = allow_self_collision
|
||||
self.collision_penalty = collision_penalty
|
||||
|
||||
self.via_point = np.ones(2)
|
||||
self.goal_point = np.array((num_links, 0))
|
||||
|
||||
# state
|
||||
self._joints = None
|
||||
self._joint_angles = None
|
||||
self._angle_velocity = None
|
||||
@ -32,8 +35,9 @@ class ViaPointReacher(gym.Env):
|
||||
self.start_vel = np.zeros(self.num_links)
|
||||
self.weight_matrix_scale = 1
|
||||
|
||||
self._steps = 0
|
||||
self.dt = 0.01
|
||||
self.time_limit = 2
|
||||
# self.time_limit = 2
|
||||
|
||||
action_bound = np.pi * np.ones((self.num_links,))
|
||||
state_bound = np.hstack([
|
||||
@ -64,7 +68,7 @@ class ViaPointReacher(gym.Env):
|
||||
|
||||
return self._get_obs().copy()
|
||||
|
||||
def step(self, action):
|
||||
def step(self, action: np.ndarray):
|
||||
"""
|
||||
a single step with an action in joint velocity space
|
||||
"""
|
||||
@ -75,17 +79,14 @@ class ViaPointReacher(gym.Env):
|
||||
|
||||
self._update_joints()
|
||||
|
||||
# rew = self._reward()
|
||||
|
||||
# compute reward directly in step function
|
||||
|
||||
dist_reward = 0
|
||||
if not self._is_collided:
|
||||
if self._steps == 100:
|
||||
dist_reward = np.linalg.norm(self.end_effector - self.via_point)
|
||||
if self._steps == 199:
|
||||
elif self._steps == 199:
|
||||
dist_reward = np.linalg.norm(self.end_effector - self.goal_point)
|
||||
|
||||
# TODO: Do we need that?
|
||||
reward = - dist_reward ** 2
|
||||
|
||||
reward -= 1e-6 * np.sum(acc ** 2)
|
||||
@ -100,7 +101,8 @@ class ViaPointReacher(gym.Env):
|
||||
|
||||
self._steps += 1
|
||||
|
||||
done = self._steps * self.dt > self.time_limit or self._is_collided
|
||||
# done = self._steps * self.dt > self.time_limit or self._is_collided
|
||||
done = self._is_collided
|
||||
|
||||
return self._get_obs().copy(), reward, done, info
|
||||
|
||||
@ -118,8 +120,8 @@ class ViaPointReacher(gym.Env):
|
||||
self_collision = False
|
||||
|
||||
if not self.allow_self_collision:
|
||||
self_collision = self.check_self_collision(line_points_in_taskspace)
|
||||
if np.any(np.abs(self._joint_angles) > np.pi) and not self.allow_self_collision:
|
||||
self_collision = check_self_collision(line_points_in_taskspace)
|
||||
if np.any(np.abs(self._joint_angles) > np.pi):
|
||||
self_collision = True
|
||||
|
||||
self._is_collided = self_collision
|
||||
@ -135,25 +137,10 @@ class ViaPointReacher(gym.Env):
|
||||
self._steps
|
||||
])
|
||||
|
||||
# def _reward(self):
|
||||
# dist_reward = 0
|
||||
# if not self._is_collided:
|
||||
# if self._steps == 180:
|
||||
# dist_reward = np.linalg.norm(self.end_effector - self.bottom_center_of_hole)
|
||||
# else:
|
||||
# dist_reward = np.linalg.norm(self.end_effector - self.bottom_center_of_hole)
|
||||
#
|
||||
# out = - dist_reward ** 2
|
||||
#
|
||||
# return out
|
||||
|
||||
def get_forward_kinematics(self, num_points_per_link=1):
|
||||
theta = self._joint_angles[:, None]
|
||||
|
||||
if num_points_per_link > 1:
|
||||
intermediate_points = np.linspace(0, 1, num_points_per_link)
|
||||
else:
|
||||
intermediate_points = 1
|
||||
intermediate_points = np.linspace(0, 1, num_points_per_link) if num_points_per_link > 1 else 1
|
||||
|
||||
accumulated_theta = np.cumsum(theta, axis=0)
|
||||
|
||||
@ -171,46 +158,6 @@ class ViaPointReacher(gym.Env):
|
||||
|
||||
return np.squeeze(endeffector + self._joints[0, :])
|
||||
|
||||
def check_self_collision(self, line_points):
|
||||
for i, line1 in enumerate(line_points):
|
||||
for line2 in line_points[i+2:, :, :]:
|
||||
# if line1 != line2:
|
||||
if intersect(line1[0], line1[-1], line2[0], line2[-1]):
|
||||
return True
|
||||
return False
|
||||
|
||||
def check_wall_collision(self, line_points):
|
||||
|
||||
# all points that are before the hole in x
|
||||
r, c = np.where(line_points[:, :, 0] < (self.hole_x - self.hole_width / 2))
|
||||
|
||||
# check if any of those points are below surface
|
||||
nr_line_points_below_surface_before_hole = np.sum(line_points[r, c, 1] < 0)
|
||||
|
||||
if nr_line_points_below_surface_before_hole > 0:
|
||||
return True
|
||||
|
||||
# all points that are after the hole in x
|
||||
r, c = np.where(line_points[:, :, 0] > (self.hole_x + self.hole_width / 2))
|
||||
|
||||
# check if any of those points are below surface
|
||||
nr_line_points_below_surface_after_hole = np.sum(line_points[r, c, 1] < 0)
|
||||
|
||||
if nr_line_points_below_surface_after_hole > 0:
|
||||
return True
|
||||
|
||||
# all points that are above the hole
|
||||
r, c = np.where((line_points[:, :, 0] > (self.hole_x - self.hole_width / 2)) & (
|
||||
line_points[:, :, 0] < (self.hole_x + self.hole_width / 2)))
|
||||
|
||||
# check if any of those points are below surface
|
||||
nr_line_points_below_surface_in_hole = np.sum(line_points[r, c, 1] < -self.hole_depth)
|
||||
|
||||
if nr_line_points_below_surface_in_hole > 0:
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def render(self, mode='human'):
|
||||
if self.fig is None:
|
||||
self.fig = plt.figure()
|
||||
@ -274,7 +221,7 @@ class ViaPointReacher(gym.Env):
|
||||
if __name__ == '__main__':
|
||||
nl = 5
|
||||
render_mode = "human" # "human" or "partial" or "final"
|
||||
env = ViaPointReacher(num_links=nl, allow_self_collision=False)
|
||||
env = ViaPointReacher(n_links=nl, allow_self_collision=False)
|
||||
env.reset()
|
||||
env.render(mode=render_mode)
|
||||
|
||||
|
@ -1,4 +1,4 @@
|
||||
from alr_envs.utils.detpmp_env_wrapper import DetPMPEnvWrapper
|
||||
from alr_envs.utils.wrapper.detpmp_wrapper import DetPMPWrapper
|
||||
from alr_envs.mujoco.ball_in_a_cup.ball_in_a_cup import ALRBallInACupEnv
|
||||
from alr_envs.mujoco.ball_in_a_cup.ball_in_a_cup_simple import ALRBallInACupEnv as ALRBallInACupEnvSimple
|
||||
|
||||
@ -17,7 +17,7 @@ def make_contextual_env(rank, seed=0):
|
||||
def _init():
|
||||
env = ALRBallInACupEnv()
|
||||
|
||||
env = DetPMPEnvWrapper(env,
|
||||
env = DetPMPWrapper(env,
|
||||
num_dof=7,
|
||||
num_basis=5,
|
||||
width=0.005,
|
||||
@ -51,7 +51,7 @@ def make_env(rank, seed=0):
|
||||
def _init():
|
||||
env = ALRBallInACupEnvSimple()
|
||||
|
||||
env = DetPMPEnvWrapper(env,
|
||||
env = DetPMPWrapper(env,
|
||||
num_dof=7,
|
||||
num_basis=5,
|
||||
width=0.005,
|
||||
@ -85,7 +85,7 @@ def make_simple_env(rank, seed=0):
|
||||
def _init():
|
||||
env = ALRBallInACupEnvSimple()
|
||||
|
||||
env = DetPMPEnvWrapper(env,
|
||||
env = DetPMPWrapper(env,
|
||||
num_dof=3,
|
||||
num_basis=5,
|
||||
width=0.005,
|
||||
|
@ -1,4 +1,4 @@
|
||||
from alr_envs.utils.detpmp_env_wrapper import DetPMPEnvWrapper
|
||||
from alr_envs.utils.wrapper.detpmp_wrapper import DetPMPWrapper
|
||||
from alr_envs.mujoco.beerpong.beerpong import ALRBeerpongEnv
|
||||
from alr_envs.mujoco.beerpong.beerpong_simple import ALRBeerpongEnv as ALRBeerpongEnvSimple
|
||||
|
||||
@ -17,7 +17,7 @@ def make_contextual_env(rank, seed=0):
|
||||
def _init():
|
||||
env = ALRBeerpongEnv()
|
||||
|
||||
env = DetPMPEnvWrapper(env,
|
||||
env = DetPMPWrapper(env,
|
||||
num_dof=7,
|
||||
num_basis=5,
|
||||
width=0.005,
|
||||
@ -51,7 +51,7 @@ def make_env(rank, seed=0):
|
||||
def _init():
|
||||
env = ALRBeerpongEnvSimple()
|
||||
|
||||
env = DetPMPEnvWrapper(env,
|
||||
env = DetPMPWrapper(env,
|
||||
num_dof=7,
|
||||
num_basis=5,
|
||||
width=0.005,
|
||||
@ -85,7 +85,7 @@ def make_simple_env(rank, seed=0):
|
||||
def _init():
|
||||
env = ALRBeerpongEnvSimple()
|
||||
|
||||
env = DetPMPEnvWrapper(env,
|
||||
env = DetPMPWrapper(env,
|
||||
num_dof=3,
|
||||
num_basis=5,
|
||||
width=0.005,
|
||||
|
@ -1,87 +0,0 @@
|
||||
from alr_envs.utils.policies import get_policy_class
|
||||
from mp_lib import det_promp
|
||||
import numpy as np
|
||||
import gym
|
||||
|
||||
|
||||
class DetPMPEnvWrapper(gym.Wrapper):
|
||||
def __init__(self,
|
||||
env,
|
||||
num_dof,
|
||||
num_basis,
|
||||
width,
|
||||
start_pos=None,
|
||||
duration=1,
|
||||
dt=0.01,
|
||||
post_traj_time=0.,
|
||||
policy_type=None,
|
||||
weights_scale=1,
|
||||
zero_start=False,
|
||||
zero_goal=False,
|
||||
):
|
||||
super(DetPMPEnvWrapper, self).__init__(env)
|
||||
self.num_dof = num_dof
|
||||
self.num_basis = num_basis
|
||||
self.dim = num_dof * num_basis
|
||||
self.pmp = det_promp.DeterministicProMP(n_basis=num_basis, n_dof=num_dof, width=width, off=0.01,
|
||||
zero_start=zero_start, zero_goal=zero_goal)
|
||||
weights = np.zeros(shape=(num_basis, num_dof))
|
||||
self.pmp.set_weights(duration, weights)
|
||||
self.weights_scale = weights_scale
|
||||
|
||||
self.duration = duration
|
||||
self.dt = dt
|
||||
self.post_traj_steps = int(post_traj_time / dt)
|
||||
|
||||
self.start_pos = start_pos
|
||||
self.zero_start = zero_start
|
||||
|
||||
policy_class = get_policy_class(policy_type)
|
||||
self.policy = policy_class(env)
|
||||
|
||||
def __call__(self, params, contexts=None):
|
||||
params = np.atleast_2d(params)
|
||||
rewards = []
|
||||
infos = []
|
||||
for p, c in zip(params, contexts):
|
||||
reward, info = self.rollout(p, c)
|
||||
rewards.append(reward)
|
||||
infos.append(info)
|
||||
|
||||
return np.array(rewards), infos
|
||||
|
||||
def rollout(self, params, context=None, render=False):
|
||||
""" This function generates a trajectory based on a DMP and then does the usual loop over reset and step"""
|
||||
params = np.reshape(params, newshape=(self.num_basis, self.num_dof)) * self.weights_scale
|
||||
self.pmp.set_weights(self.duration, params)
|
||||
t, des_pos, des_vel, des_acc = self.pmp.compute_trajectory(1 / self.dt, 1.)
|
||||
if self.zero_start:
|
||||
des_pos += self.start_pos[None, :]
|
||||
|
||||
if self.post_traj_steps > 0:
|
||||
des_pos = np.vstack([des_pos, np.tile(des_pos[-1, :], [self.post_traj_steps, 1])])
|
||||
des_vel = np.vstack([des_vel, np.zeros(shape=(self.post_traj_steps, self.num_dof))])
|
||||
|
||||
self._trajectory = des_pos
|
||||
self._velocity = des_vel
|
||||
|
||||
rews = []
|
||||
infos = []
|
||||
|
||||
self.env.configure(context)
|
||||
self.env.reset()
|
||||
|
||||
for t, pos_vel in enumerate(zip(des_pos, des_vel)):
|
||||
ac = self.policy.get_action(pos_vel[0], pos_vel[1])
|
||||
obs, rew, done, info = self.env.step(ac)
|
||||
rews.append(rew)
|
||||
infos.append(info)
|
||||
if render:
|
||||
self.env.render(mode="human")
|
||||
if done:
|
||||
break
|
||||
|
||||
reward = np.sum(rews)
|
||||
|
||||
return reward, info
|
||||
|
@ -1,121 +0,0 @@
|
||||
from alr_envs.utils.policies import get_policy_class
|
||||
from mp_lib.phase import ExpDecayPhaseGenerator
|
||||
from mp_lib.basis import DMPBasisGenerator
|
||||
from mp_lib import dmps
|
||||
import numpy as np
|
||||
import gym
|
||||
|
||||
|
||||
class DmpEnvWrapper(gym.Wrapper):
|
||||
def __init__(self,
|
||||
env,
|
||||
num_dof,
|
||||
num_basis,
|
||||
start_pos=None,
|
||||
final_pos=None,
|
||||
duration=1,
|
||||
alpha_phase=2,
|
||||
dt=0.01,
|
||||
learn_goal=False,
|
||||
post_traj_time=0.,
|
||||
policy_type=None,
|
||||
weights_scale=1.,
|
||||
goal_scale=1.,
|
||||
):
|
||||
super(DmpEnvWrapper, self).__init__(env)
|
||||
self.num_dof = num_dof
|
||||
self.num_basis = num_basis
|
||||
self.dim = num_dof * num_basis
|
||||
if learn_goal:
|
||||
self.dim += num_dof
|
||||
self.learn_goal = learn_goal
|
||||
self.duration = duration # seconds
|
||||
time_steps = int(duration / dt)
|
||||
self.t = np.linspace(0, duration, time_steps)
|
||||
self.post_traj_steps = int(post_traj_time / dt)
|
||||
|
||||
phase_generator = ExpDecayPhaseGenerator(alpha_phase=alpha_phase, duration=duration)
|
||||
basis_generator = DMPBasisGenerator(phase_generator, duration=duration, num_basis=self.num_basis)
|
||||
|
||||
self.dmp = dmps.DMP(num_dof=num_dof,
|
||||
basis_generator=basis_generator,
|
||||
phase_generator=phase_generator,
|
||||
num_time_steps=time_steps,
|
||||
dt=dt
|
||||
)
|
||||
|
||||
self.dmp.dmp_start_pos = start_pos.reshape((1, num_dof))
|
||||
|
||||
dmp_weights = np.zeros((num_basis, num_dof))
|
||||
if learn_goal:
|
||||
dmp_goal_pos = np.zeros(num_dof)
|
||||
else:
|
||||
dmp_goal_pos = final_pos
|
||||
|
||||
self.dmp.set_weights(dmp_weights, dmp_goal_pos)
|
||||
self.weights_scale = weights_scale
|
||||
self.goal_scale = goal_scale
|
||||
|
||||
policy_class = get_policy_class(policy_type)
|
||||
self.policy = policy_class(env)
|
||||
|
||||
def __call__(self, params, contexts=None):
|
||||
params = np.atleast_2d(params)
|
||||
rewards = []
|
||||
infos = []
|
||||
for p, c in zip(params, contexts):
|
||||
reward, info = self.rollout(p, c)
|
||||
rewards.append(reward)
|
||||
infos.append(info)
|
||||
|
||||
return np.array(rewards), infos
|
||||
|
||||
def goal_and_weights(self, params):
|
||||
if len(params.shape) > 1:
|
||||
assert params.shape[1] == self.dim
|
||||
else:
|
||||
assert len(params) == self.dim
|
||||
params = np.reshape(params, [1, self.dim])
|
||||
|
||||
if self.learn_goal:
|
||||
goal_pos = params[0, -self.num_dof:]
|
||||
weight_matrix = np.reshape(params[:, :-self.num_dof], [self.num_basis, self.num_dof])
|
||||
else:
|
||||
goal_pos = self.dmp.dmp_goal_pos.flatten()
|
||||
assert goal_pos is not None
|
||||
weight_matrix = np.reshape(params, [self.num_basis, self.num_dof])
|
||||
|
||||
return goal_pos * self.goal_scale, weight_matrix * self.weights_scale
|
||||
|
||||
def rollout(self, params, context=None, render=False):
|
||||
""" This function generates a trajectory based on a DMP and then does the usual loop over reset and step"""
|
||||
goal_pos, weight_matrix = self.goal_and_weights(params)
|
||||
self.dmp.set_weights(weight_matrix, goal_pos)
|
||||
trajectory, velocity = self.dmp.reference_trajectory(self.t)
|
||||
|
||||
if self.post_traj_steps > 0:
|
||||
trajectory = np.vstack([trajectory, np.tile(trajectory[-1, :], [self.post_traj_steps, 1])])
|
||||
velocity = np.vstack([velocity, np.zeros(shape=(self.post_traj_steps, self.num_dof))])
|
||||
|
||||
self._trajectory = trajectory
|
||||
self._velocity = velocity
|
||||
|
||||
rews = []
|
||||
infos = []
|
||||
|
||||
self.env.configure(context)
|
||||
self.env.reset()
|
||||
|
||||
for t, pos_vel in enumerate(zip(trajectory, velocity)):
|
||||
ac = self.policy.get_action(pos_vel[0], pos_vel[1])
|
||||
obs, rew, done, info = self.env.step(ac)
|
||||
rews.append(rew)
|
||||
infos.append(info)
|
||||
if render:
|
||||
self.env.render(mode="human")
|
||||
if done:
|
||||
break
|
||||
|
||||
reward = np.sum(rews)
|
||||
|
||||
return reward, info
|
@ -1,8 +1,10 @@
|
||||
from gym import Env
|
||||
|
||||
from alr_envs.mujoco.alr_mujoco_env import AlrMujocoEnv
|
||||
|
||||
|
||||
class BaseController:
|
||||
def __init__(self, env: AlrMujocoEnv):
|
||||
def __init__(self, env: Env):
|
||||
self.env = env
|
||||
|
||||
def get_action(self, des_pos, des_vel):
|
||||
@ -20,7 +22,7 @@ class VelController(BaseController):
|
||||
|
||||
|
||||
class PDController(BaseController):
|
||||
def __init__(self, env):
|
||||
def __init__(self, env: AlrMujocoEnv):
|
||||
self.p_gains = env.p_gains
|
||||
self.d_gains = env.d_gains
|
||||
super(PDController, self).__init__(env)
|
||||
|
@ -18,3 +18,31 @@ def angle_normalize(x, type="deg"):
|
||||
return x - two_pi * np.floor((x + np.pi) / two_pi)
|
||||
else:
|
||||
raise ValueError(f"Invalid type {type}. Choose on of 'deg' or 'rad'.")
|
||||
|
||||
|
||||
def ccw(A, B, C):
|
||||
return (C[1] - A[1]) * (B[0] - A[0]) - (B[1] - A[1]) * (C[0] - A[0]) > 1e-12
|
||||
|
||||
|
||||
def intersect(A, B, C, D):
|
||||
"""
|
||||
Return true if line segments AB and CD intersect
|
||||
Args:
|
||||
A: start point line one
|
||||
B: end point line one
|
||||
C: start point line two
|
||||
D: end point line two
|
||||
|
||||
Returns:
|
||||
|
||||
"""
|
||||
return ccw(A, C, D) != ccw(B, C, D) and ccw(A, B, C) != ccw(A, B, D)
|
||||
|
||||
|
||||
def check_self_collision(line_points):
|
||||
for i, line1 in enumerate(line_points):
|
||||
for line2 in line_points[i + 2:, :, :]:
|
||||
# if line1 != line2:
|
||||
if intersect(line1[0], line1[-1], line2[0], line2[-1]):
|
||||
return True
|
||||
return False
|
||||
|
0 alr_envs/utils/wrapper/__init__.py Normal file
40 alr_envs/utils/wrapper/detpmp_wrapper.py Normal file
@ -0,0 +1,40 @@
|
||||
import gym
|
||||
import numpy as np
|
||||
from mp_lib import det_promp
|
||||
|
||||
from alr_envs.utils.wrapper.mp_wrapper import MPWrapper
|
||||
|
||||
|
||||
class DetPMPWrapper(MPWrapper):
|
||||
def __init__(self, env, num_dof, num_basis, width, start_pos=None, duration=1, dt=0.01, post_traj_time=0.,
|
||||
policy_type=None, weights_scale=1, zero_start=False, zero_goal=False, **mp_kwargs):
|
||||
# self.duration = duration # seconds
|
||||
|
||||
super().__init__(env, num_dof, duration, dt, post_traj_time, policy_type, weights_scale,
|
||||
num_basis=num_basis, width=width, start_pos=start_pos, zero_start=zero_start,
|
||||
zero_goal=zero_goal)
|
||||
|
||||
action_bounds = np.inf * np.ones((self.mp.n_basis * self.mp.n_dof))
|
||||
self.action_space = gym.spaces.Box(low=-action_bounds, high=action_bounds, dtype=np.float32)
|
||||
|
||||
self.start_pos = start_pos
|
||||
self.dt = dt
|
||||
|
||||
def initialize_mp(self, num_dof: int, duration: int, dt: float, num_basis: int = 5, width: float = None,
|
||||
start_pos: np.ndarray = None, zero_start: bool = False, zero_goal: bool = False):
|
||||
pmp = det_promp.DeterministicProMP(n_basis=num_basis, n_dof=num_dof, width=width, off=0.01,
|
||||
zero_start=zero_start, zero_goal=zero_goal)
|
||||
|
||||
weights = np.zeros(shape=(num_basis, num_dof))
|
||||
pmp.set_weights(duration, weights)
|
||||
|
||||
return pmp
|
||||
|
||||
def mp_rollout(self, action):
|
||||
params = np.reshape(action, (self.mp.n_basis, self.mp.n_dof)) * self.weights_scale
|
||||
self.mp.set_weights(self.duration, params)
|
||||
_, des_pos, des_vel, _ = self.mp.compute_trajectory(1 / self.dt, 1.)
|
||||
if self.mp.zero_start:
|
||||
des_pos += self.start_pos[None, :]
|
||||
|
||||
return des_pos, des_vel
|
81 alr_envs/utils/wrapper/dmp_wrapper.py Normal file
@ -0,0 +1,81 @@
|
||||
from alr_envs.utils.policies import get_policy_class
|
||||
from mp_lib.phase import ExpDecayPhaseGenerator
|
||||
from mp_lib.basis import DMPBasisGenerator
|
||||
from mp_lib import dmps
|
||||
import numpy as np
|
||||
import gym
|
||||
|
||||
from alr_envs.utils.wrapper.mp_wrapper import MPWrapper
|
||||
|
||||
|
||||
class DmpWrapper(MPWrapper):
|
||||
|
||||
def __init__(self, env: gym.Env, num_dof: int, num_basis: int, start_pos: np.ndarray = None,
|
||||
final_pos: np.ndarray = None, duration: int = 1, alpha_phase: float = 2., dt: float = 0.01,
|
||||
learn_goal: bool = False, post_traj_time: float = 0., policy_type: str = None,
|
||||
weights_scale: float = 1., goal_scale: float = 1.):
|
||||
|
||||
"""
|
||||
This Wrapper generates a trajectory based on a DMP and will only return episodic performances.
|
||||
Args:
|
||||
env:
|
||||
num_dof:
|
||||
num_basis:
|
||||
start_pos:
|
||||
final_pos:
|
||||
duration:
|
||||
alpha_phase:
|
||||
dt:
|
||||
learn_goal:
|
||||
post_traj_time:
|
||||
policy_type:
|
||||
weights_scale:
|
||||
goal_scale:
|
||||
"""
|
||||
self.learn_goal = learn_goal
|
||||
self.t = np.linspace(0, duration, int(duration / dt))
|
||||
self.goal_scale = goal_scale
|
||||
|
||||
super().__init__(env, num_dof, duration, dt, post_traj_time, policy_type, weights_scale,
|
||||
num_basis=num_basis, start_pos=start_pos, final_pos=final_pos, alpha_phase=alpha_phase)
|
||||
|
||||
action_bounds = np.inf * np.ones((np.prod(self.mp.dmp_weights.shape) + (num_dof if learn_goal else 0)))
|
||||
self.action_space = gym.spaces.Box(low=-action_bounds, high=action_bounds, dtype=np.float32)
|
||||
|
||||
def initialize_mp(self, num_dof: int, duration: int, dt: float, num_basis: int = 5, start_pos: np.ndarray = None,
|
||||
final_pos: np.ndarray = None, alpha_phase: float = 2.):
|
||||
|
||||
phase_generator = ExpDecayPhaseGenerator(alpha_phase=alpha_phase, duration=duration)
|
||||
basis_generator = DMPBasisGenerator(phase_generator, duration=duration, num_basis=num_basis)
|
||||
|
||||
dmp = dmps.DMP(num_dof=num_dof, basis_generator=basis_generator, phase_generator=phase_generator,
|
||||
num_time_steps=int(duration / dt), dt=dt)
|
||||
|
||||
dmp.dmp_start_pos = start_pos.reshape((1, num_dof))
|
||||
|
||||
weights = np.zeros((num_basis, num_dof))
|
||||
goal_pos = np.zeros(num_dof) if self.learn_goal else final_pos
|
||||
|
||||
dmp.set_weights(weights, goal_pos)
|
||||
return dmp
|
||||
|
||||
def goal_and_weights(self, params):
|
||||
assert params.shape[-1] == self.action_space.shape[0]
|
||||
params = np.atleast_2d(params)
|
||||
|
||||
if self.learn_goal:
|
||||
goal_pos = params[0, -self.mp.num_dimensions:] # [num_dof]
|
||||
params = params[:, :-self.mp.num_dimensions] # [1,num_dof]
|
||||
# weight_matrix = np.reshape(params[:, :-self.num_dof], [self.num_basis, self.num_dof])
|
||||
else:
|
||||
goal_pos = self.mp.dmp_goal_pos.flatten()
|
||||
assert goal_pos is not None
|
||||
# weight_matrix = np.reshape(params, [self.num_basis, self.num_dof])
|
||||
|
||||
weight_matrix = np.reshape(params, self.mp.dmp_weights.shape)
|
||||
return goal_pos * self.goal_scale, weight_matrix * self.weights_scale
|
||||
|
||||
def mp_rollout(self, action):
|
||||
goal_pos, weight_matrix = self.goal_and_weights(action)
|
||||
self.mp.set_weights(weight_matrix, goal_pos)
|
||||
return self.mp.reference_trajectory(self.t)
|
106 alr_envs/utils/wrapper/mp_wrapper.py Normal file
@ -0,0 +1,106 @@
|
||||
from abc import ABC, abstractmethod
|
||||
|
||||
import gym
|
||||
import numpy as np
|
||||
|
||||
from alr_envs.utils.policies import get_policy_class
|
||||
|
||||
|
||||
class MPWrapper(gym.Wrapper, ABC):
|
||||
|
||||
def __init__(self,
|
||||
env: gym.Env,
|
||||
num_dof: int,
|
||||
duration: int = 1,
|
||||
dt: float = 0.01,
|
||||
# learn_goal: bool = False,
|
||||
post_traj_time: float = 0.,
|
||||
policy_type: str = None,
|
||||
weights_scale: float = 1.,
|
||||
**mp_kwargs
|
||||
|
||||
):
|
||||
super().__init__(env)
|
||||
|
||||
# self.num_dof = num_dof
|
||||
# self.num_basis = num_basis
|
||||
# self.duration = duration # seconds
|
||||
self.post_traj_steps = int(post_traj_time / dt)
|
||||
|
||||
self.mp = self.initialize_mp(num_dof, duration, dt, **mp_kwargs)
|
||||
self.weights_scale = weights_scale
|
||||
|
||||
policy_class = get_policy_class(policy_type)
|
||||
self.policy = policy_class(env)
|
||||
|
||||
# rendering
|
||||
self.render_mode = None
|
||||
self.render_kwargs = None
|
||||
|
||||
def step(self, action: np.ndarray):
|
||||
""" This function generates a trajectory based on a DMP and then does the usual loop over reset and step"""
|
||||
trajectory, velocity = self.mp_rollout(action)
|
||||
|
||||
if self.post_traj_steps > 0:
|
||||
trajectory = np.vstack([trajectory, np.tile(trajectory[-1, :], [self.post_traj_steps, 1])])
|
||||
velocity = np.vstack([velocity, np.zeros(shape=(self.post_traj_steps, self.mp.num_dimensions))])
|
||||
|
||||
# self._trajectory = trajectory
|
||||
# self._velocity = velocity
|
||||
|
||||
rewards = 0
|
||||
infos = []
|
||||
|
||||
# TODO: @Max Why do we need this configure, states should be part of the model
|
||||
# self.env.configure(context)
|
||||
obs = self.env.reset()
|
||||
|
||||
for t, pos_vel in enumerate(zip(trajectory, velocity)):
|
||||
ac = self.policy.get_action(pos_vel[0], pos_vel[1])
|
||||
obs, rew, done, info = self.env.step(ac)
|
||||
rewards += rew
|
||||
infos.append(info)
|
||||
if self.render_mode:
|
||||
self.env.render(mode=self.render_mode, **self.render_kwargs)
|
||||
if done:
|
||||
break
|
||||
|
||||
done = True
|
||||
return obs, rewards, done, infos
|
||||
|
||||
def render(self, mode='human', **kwargs):
|
||||
"""Only set render options here, such that they can be used during the rollout.
|
||||
This only needs to be called once"""
|
||||
self.render_mode = mode
|
||||
self.render_kwargs = kwargs
|
||||
|
||||
def __call__(self, actions):
|
||||
return self.step(actions)
|
||||
# params = np.atleast_2d(params)
|
||||
# rewards = []
|
||||
# infos = []
|
||||
# for p, c in zip(params, contexts):
|
||||
# reward, info = self.rollout(p, c)
|
||||
# rewards.append(reward)
|
||||
# infos.append(info)
|
||||
#
|
||||
# return np.array(rewards), infos
|
||||
|
||||
@abstractmethod
|
||||
def mp_rollout(self, action):
|
||||
"""
|
||||
Generate trajectory and velocity based on the MP
|
||||
Returns:
|
||||
trajectory/positions, velocity
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
@abstractmethod
|
||||
def initialize_mp(self, num_dof: int, duration: int, dt: float, **kwargs):
|
||||
"""
|
||||
Create respective instance of MP
|
||||
Returns:
|
||||
MP instance
|
||||
"""
|
||||
|
||||
raise NotImplementedError
|
@ -3,9 +3,7 @@ from alr_envs.classic_control.utils import make_holereacher_env, make_holereache
|
||||
from alr_envs.utils.dmp_async_vec_env import DmpAsyncVectorEnv
|
||||
import numpy as np
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
n_samples = 1
|
||||
n_cpus = 4
|
||||
dim = 30
|
||||
|
48 example.py
@ -1,20 +1,48 @@
|
||||
import time
|
||||
|
||||
import gym
|
||||
|
||||
if __name__ == '__main__':
|
||||
|
||||
def example_mujoco():
|
||||
env = gym.make('alr_envs:ALRReacher-v0')
|
||||
# env = gym.make('alr_envs:SimpleReacher-v0')
|
||||
# env = gym.make('alr_envs:ALRReacher7-v0')
|
||||
state = env.reset()
|
||||
rewards = 0
|
||||
obs = env.reset()
|
||||
|
||||
# number of environment steps
|
||||
for i in range(10000):
|
||||
state, reward, done, info = env.step(env.action_space.sample())
|
||||
obs, reward, done, info = env.step(env.action_space.sample())
|
||||
rewards += reward
|
||||
|
||||
if i % 1 == 0:
|
||||
env.render()
|
||||
|
||||
# if done:
|
||||
state = env.reset()
|
||||
if done:
|
||||
print(rewards)
|
||||
rewards = 0
|
||||
obs = env.reset()
|
||||
|
||||
time.sleep(0.5)
|
||||
|
||||
def example_dmp():
|
||||
# env = gym.make("alr_envs:ViaPointReacherDMP-v0")
|
||||
env = gym.make("alr_envs:HoleReacherDMP-v0")
|
||||
rewards = 0
|
||||
# env.render(mode=None)
|
||||
obs = env.reset()
|
||||
|
||||
# number of samples/full trajectories (multiple environment steps)
|
||||
for i in range(10):
|
||||
obs, reward, done, info = env.step(env.action_space.sample())
|
||||
rewards += reward
|
||||
|
||||
if i % 1 == 0:
|
||||
# render full DMP trajectory
|
||||
# render can also be called once at the beginning, which would render every trajectory
|
||||
# Calling it after every trajectory allows modifying the mode; mode=None disables rendering.
|
||||
env.render(mode="partial")
|
||||
|
||||
if done:
|
||||
print(rewards)
|
||||
rewards = 0
|
||||
obs = env.reset()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
example_dmp()
|
||||
|
7 setup.py
@ -2,6 +2,9 @@ from setuptools import setup
|
||||
|
||||
setup(name='alr_envs',
|
||||
version='0.0.1',
|
||||
install_requires=['gym', 'PyQt5', 'matplotlib',
|
||||
'mp_lib @ git+https://git@github.com/maxhuettenrauch/mp_lib@master#egg=mp_lib',], # And any other dependencies foo needs
|
||||
install_requires=['gym',
|
||||
'PyQt5',
|
||||
'matplotlib',
|
||||
'mp_lib @ git+https://git@github.com/maxhuettenrauch/mp_lib@master#egg=mp_lib',
|
||||
'mujoco_py'], # And any other dependencies foo needs
|
||||
)
|
||||
|