merge branches

Maximilian Huettenrauch 2021-04-19 11:53:30 +02:00
commit f3d75b9a60
23 changed files with 796 additions and 622 deletions

View File

@ -1,34 +1,50 @@
## ALR Environments
This repository collects custom robotics environments not included in benchmark suites like OpenAI gym, rllab, etc.
Creating a custom (Mujoco) gym environment can be done according to [this guide](https://github.com/openai/gym/blob/master/docs/creating-environments.md).
For stochastic search problems with a gym interface use the `Rosenbrock-v0` reference implementation.
We also support solving environments with DMPs. When adding new DMP tasks, check the `ViaPointReacherDMP-v0` reference implementation.
When simply using the tasks, you can also leverage the wrapper class `DmpWrapper` to turn normal gym environments into DMP tasks.
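For example, the step-based `HoleReacher-v0` task is turned into `HoleReacherDMP-v0` roughly as follows (a sketch mirroring `holereacher_dmp()` in `alr_envs/classic_control/hole_reacher.py`; see that file for the exact factory functions):

```python
import gym
from alr_envs import DmpWrapper

# Sketch: wrap a step-based gym environment into a DMP task.
# Parameter values mirror holereacher_dmp() in hole_reacher.py.
base_env = gym.make("alr_envs:HoleReacher-v0")
env = DmpWrapper(base_env, num_dof=5, num_basis=5, duration=2, dt=base_env.dt,
                 learn_goal=True, alpha_phase=3.5, start_pos=base_env.start_pos,
                 policy_type="velocity", weights_scale=100, goal_scale=0.1)
```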
## Environments
Currently we have the following environments:

### Mujoco

|Name|Description|Horizon|Action Dimension|Observation Dimension|
|---|---|---|---|---|
|`ALRReacher-v0`|Modified (5 links) Mujoco gym's `Reacher-v2` (2 links)| 200 | 5 | 21 |
|`ALRReacherSparse-v0`|Same as `ALRReacher-v0`, but the distance penalty is only provided in the last time step.| 200 | 5 | 21 |
|`ALRReacherSparseBalanced-v0`|Same as `ALRReacherSparse-v0`, but the end-effector has to remain upright.| 200 | 5 | 21 |
|`ALRLongReacher-v0`|Modified (7 links) Mujoco gym's `Reacher-v2` (2 links)| 200 | 7 | 27 |
|`ALRLongReacherSparse-v0`|Same as `ALRLongReacher-v0`, but the distance penalty is only provided in the last time step.| 200 | 7 | 27 |
|`ALRLongReacherSparseBalanced-v0`|Same as `ALRLongReacherSparse-v0`, but the end-effector has to remain upright.| 200 | 7 | 27 |
### Classic Control

|Name|Description|Horizon|Action Dimension|Observation Dimension|
|---|---|---|---|---|
|`SimpleReacher-v0`| Simple reaching task (2 links) without any physics simulation. Provides no reward until 150 time steps. This allows the agent to explore the space, but requires precise actions towards the end of the trajectory.| 200 | 2 | 9 |
|`LongSimpleReacher-v0`| Simple reaching task (5 links) without any physics simulation. Provides no reward until 150 time steps. This allows the agent to explore the space, but requires precise actions towards the end of the trajectory.| 200 | 5 | 18 |
|`ViaPointReacher-v0`| Simple reaching task leveraging a via point, which supports self-collision detection. Provides a reward only at time steps 100 and 199 for reaching the via point and goal point, respectively.| 200 | | |
|`HoleReacher-v0`| | | | |
### DMP Environments
These environments are closer to stochastic search. They always execute a full trajectory, which is computed by a DMP and executed by a controller, e.g. a PD controller (see the usage sketch below the table).
The goal is to learn the parameters of this DMP to generate a suitable trajectory.
All environments provide the full episode reward and additional information about early terminations, e.g. due to collisions.

|Name|Description|Horizon|Action Dimension|Observation Dimension|
|---|---|---|---|---|
|`ViaPointReacherDMP-v0`| Simple reaching task leveraging a via point, which supports self-collision detection. Provides a reward only at time steps 100 and 199 for reaching the via point and goal point, respectively.| 200 | | |
|`HoleReacherDMP-v0`| | | | |
|`HoleReacherFixedGoalDMP-v0`| | | | |
|`HoleReacherDetPMP-v0`| | | | |
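A rough usage sketch (mirroring `example_dmp()` in `example.py`): a single `step` call executes one full trajectory generated from the passed DMP parameters.

```python
import gym

# Sketch: DMP environments are episodic; one step() call runs a whole rollout.
env = gym.make("alr_envs:HoleReacherDMP-v0")
env.reset()
params = env.action_space.sample()  # one parameter vector = one trajectory
obs, episode_reward, done, info = env.step(params)
```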
### Stochastic Search

|Name|Description|Horizon|Action Dimension|Observation Dimension|
|---|---|---|---|---|
|`Rosenbrock{dim}-v0`| Gym interface for Rosenbrock function. `{dim}` is one of 5, 10, 25, 50 or 100. | 1 | `{dim}` | 0 |
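A minimal sketch of the gym interface for these functions (the sampled vector below is just an arbitrary candidate solution):

```python
import gym
import numpy as np

# Sketch: the horizon is 1, so a single step() evaluates one candidate solution.
env = gym.make("alr_envs:Rosenbrock10-v0")
env.reset()
x = np.random.randn(10)  # candidate for the 10-dimensional Rosenbrock function
obs, reward, done, info = env.step(x)
```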
## Install

View File

@ -1,6 +1,9 @@
from gym.envs.registration import register
from alr_envs.stochastic_search.functions.f_rosenbrock import Rosenbrock
from alr_envs.utils.wrapper.dmp_wrapper import DmpWrapper
# Mujoco
register(
id='ALRReacher-v0',
@ -9,6 +12,7 @@ register(
kwargs={
"steps_before_reward": 0,
"n_links": 5,
"balance": False,
}
)
@ -19,6 +23,7 @@ register(
kwargs={
"steps_before_reward": 200,
"n_links": 5,
"balance": False,
}
)
@ -34,62 +39,46 @@ register(
)
register(
id='ALRLongReacher-v0',
entry_point='alr_envs.mujoco:ALRReacherEnv',
max_episode_steps=200,
kwargs={
"steps_before_reward": 0,
"n_links": 7,
"balance": False,
}
)
register(
id='ALRLongReacherSparse-v0',
entry_point='alr_envs.mujoco:ALRReacherEnv',
max_episode_steps=200,
kwargs={
"steps_before_reward": 200,
"n_links": 7,
"balance": False,
}
)
register(
id='ALRLongReacherSparseBalanced-v0',
entry_point='alr_envs.mujoco:ALRReacherEnv',
max_episode_steps=200,
kwargs={
"steps_before_reward": 200,
"n_links": 7,
"balance": True,
}
)
# Classic control
register(
id='Balancing-v0',
entry_point='alr_envs.mujoco:BalancingEnv',
max_episode_steps=200,
kwargs={
"n_links": 5,
}
)
@ -103,7 +92,7 @@ register(
)
register(
id='LongSimpleReacher-v0',
entry_point='alr_envs.classic_control:SimpleReacherEnv',
max_episode_steps=200,
kwargs={
@ -111,12 +100,66 @@ register(
}
)
register(
id='ViaPointReacher-v0',
entry_point='alr_envs.classic_control.viapoint_reacher:ViaPointReacher',
max_episode_steps=200,
kwargs={
"n_links": 5,
"allow_self_collision": False,
"collision_penalty": 1000
}
)
register(
id='HoleReacher-v0',
entry_point='alr_envs.classic_control.hole_reacher:HoleReacher',
max_episode_steps=200,
kwargs={
"n_links": 5,
"allow_self_collision": False,
"allow_wall_collision": False,
"hole_width": 0.15,
"hole_depth": 1,
"hole_x": 1,
"collision_penalty": 100,
}
)
# DMP environments
register(
id='ViaPointReacherDMP-v0',
entry_point='alr_envs.classic_control.viapoint_reacher:viapoint_dmp',
# max_episode_steps=1,
)
register(
id='HoleReacherDMP-v0',
entry_point='alr_envs.classic_control.hole_reacher:holereacher_dmp',
# max_episode_steps=1,
)
register(
id='HoleReacherFixedGoalDMP-v0',
entry_point='alr_envs.classic_control.hole_reacher:holereacher_fix_goal_dmp',
# max_episode_steps=1,
)
register(
id='HoleReacherDetPMP-v0',
entry_point='alr_envs.classic_control.hole_reacher:holereacher_detpmp',
# max_episode_steps=1,
)
# BBO functions
for dim in [5, 10, 25, 50, 100]:
register(
id=f'Rosenbrock{dim}-v0',
entry_point='alr_envs.stochastic_search:StochasticSearchEnv',
max_episode_steps=1,
kwargs={
"cost_f": Rosenbrock(dim),
}
)

View File

@ -1 +1,3 @@
from alr_envs.classic_control.simple_reacher import SimpleReacherEnv
from alr_envs.classic_control.viapoint_reacher import ViaPointReacher
from alr_envs.classic_control.hole_reacher import HoleReacher

View File

@ -3,9 +3,12 @@ import numpy as np
import matplotlib.pyplot as plt
from matplotlib import patches
from alr_envs import DmpWrapper
from alr_envs.utils.wrapper.detpmp_wrapper import DetPMPWrapper
def ccw(A, B, C):
return (C[1] - A[1]) * (B[0] - A[0]) - (B[1] - A[1]) * (C[0] - A[0]) > 1e-12
# Return true if line segments AB and CD intersect
@ -13,37 +16,66 @@ def intersect(A, B, C, D):
return ccw(A, C, D) != ccw(B, C, D) and ccw(A, B, C) != ccw(A, B, D)
def holereacher_dmp(**kwargs):
_env = gym.make("alr_envs:HoleReacher-v0")
# _env = HoleReacher(**kwargs)
return DmpWrapper(_env, num_dof=5, num_basis=5, duration=2, dt=_env.dt, learn_goal=True, alpha_phase=3.5,
start_pos=_env.start_pos, policy_type="velocity", weights_scale=100, goal_scale=0.1)
def holereacher_fix_goal_dmp(**kwargs):
_env = gym.make("alr_envs:HoleReacher-v0")
# _env = HoleReacher(**kwargs)
return DmpWrapper(_env, num_dof=5, num_basis=5, duration=2, dt=_env.dt, learn_goal=False, alpha_phase=3.5,
start_pos=_env.start_pos, policy_type="velocity", weights_scale=50, goal_scale=1,
final_pos=np.array([2.02669572, -1.25966385, -1.51618198, -0.80946476, 0.02012344]))
def holereacher_detpmp(**kwargs):
_env = gym.make("alr_envs:HoleReacher-v0")
# _env = HoleReacher(**kwargs)
return DetPMPWrapper(_env, num_dof=5, num_basis=5, width=0.005, policy_type="velocity", start_pos=_env.start_pos,
duration=2, post_traj_time=0, dt=_env.dt, weights_scale=0.25, zero_start=True, zero_goal=False)
class HoleReacher(gym.Env):
def __init__(self, n_links, hole_x, hole_width, hole_depth, allow_self_collision=False,
allow_wall_collision=False, collision_penalty=1000):
self.n_links = n_links
self.link_lengths = np.ones((n_links, 1))
# task
self.hole_x = hole_x # x-position of center of hole
self.hole_width = hole_width # width of hole
self.hole_depth = hole_depth # depth of hole
self.bottom_center_of_hole = np.hstack([hole_x, -hole_depth])
self.top_center_of_hole = np.hstack([hole_x, 0])
self.left_wall_edge = np.hstack([hole_x - self.hole_width / 2, 0])
self.right_wall_edge = np.hstack([hole_x + self.hole_width / 2, 0])
# collision
self.allow_self_collision = allow_self_collision
self.allow_wall_collision = allow_wall_collision
self.collision_penalty = collision_penalty
# state
self._joints = None
self._joint_angles = None
self._angle_velocity = None
self.start_pos = np.hstack([[np.pi / 2], np.zeros(self.n_links - 1)])
self.start_vel = np.zeros(self.n_links)
self.dt = 0.01
# self.time_limit = 2
action_bound = np.pi * np.ones((self.n_links,))
state_bound = np.hstack([
[np.pi] * self.n_links, # cos
[np.pi] * self.n_links, # sin
[np.inf] * self.n_links, # velocity
[np.inf] * 2, # x-y coordinates of target distance
[np.inf] # env steps, because reward start after n steps TODO: Maybe
])
@ -51,11 +83,11 @@ class HoleReacher(gym.Env):
self.observation_space = gym.spaces.Box(low=-state_bound, high=state_bound, shape=state_bound.shape)
self.fig = None
rect_1 = patches.Rectangle((-self.n_links, -1),
self.n_links + self.hole_x - self.hole_width / 2, 1,
fill=True, edgecolor='k', facecolor='k')
rect_2 = patches.Rectangle((self.hole_x + self.hole_width / 2, -1),
self.n_links - self.hole_x + self.hole_width / 2, 1,
fill=True, edgecolor='k', facecolor='k')
rect_3 = patches.Rectangle((self.hole_x - self.hole_width / 2, -1), self.hole_width,
1 - self.hole_depth,
@ -65,7 +97,7 @@ class HoleReacher(gym.Env):
@property
def end_effector(self):
return self._joints[self.n_links].T
def configure(self, context):
pass
@ -73,13 +105,13 @@ class HoleReacher(gym.Env):
def reset(self):
self._joint_angles = self.start_pos
self._angle_velocity = self.start_vel
self._joints = np.zeros((self.n_links + 1, 2))
self._update_joints()
self._steps = 0
return self._get_obs().copy()
def step(self, action: np.ndarray):
"""
a single step with an action in joint velocity space
"""
@ -114,7 +146,8 @@ class HoleReacher(gym.Env):
self._steps += 1
# done = self._steps * self.dt > self.time_limit or self._is_collided
done = self._is_collided
return self._get_obs().copy(), reward, done, info
@ -152,18 +185,6 @@ class HoleReacher(gym.Env):
self._steps
])
def get_forward_kinematics(self, num_points_per_link=1):
theta = self._joint_angles[:, None]
@ -174,7 +195,7 @@ class HoleReacher(gym.Env):
accumulated_theta = np.cumsum(theta, axis=0)
endeffector = np.zeros(shape=(self.n_links, num_points_per_link, 2))
x = np.cos(accumulated_theta) * self.link_lengths * intermediate_points
y = np.sin(accumulated_theta) * self.link_lengths * intermediate_points
@ -182,7 +203,7 @@ class HoleReacher(gym.Env):
endeffector[0, :, 0] = x[0, :]
endeffector[0, :, 1] = y[0, :]
for i in range(1, self.n_links):
endeffector[i, :, 0] = x[i, :] + endeffector[i - 1, -1, 0]
endeffector[i, :, 1] = y[i, :] + endeffector[i - 1, -1, 1]
@ -190,7 +211,7 @@ class HoleReacher(gym.Env):
def check_self_collision(self, line_points):
for i, line1 in enumerate(line_points):
for line2 in line_points[i + 2:, :, :]:
# if line1 != line2:
if intersect(line1[0], line1[-1], line2[0], line2[-1]):
return True
@ -280,7 +301,7 @@ class HoleReacher(gym.Env):
# Add the patch to the Axes
[plt.gca().add_patch(rect) for rect in self.patches]
plt.xlim(-self.n_links, self.n_links), plt.ylim(-1, self.n_links)
# Arm
plt.plot(self._joints[:, 0], self._joints[:, 1], 'ro-', markerfacecolor='k')
@ -294,7 +315,8 @@ class HoleReacher(gym.Env):
if __name__ == '__main__':
nl = 5
render_mode = "human" # "human" or "partial" or "final"
env = HoleReacher(n_links=nl, allow_self_collision=False, allow_wall_collision=False, hole_width=0.15,
hole_depth=1, hole_x=1)
env.reset()
# env.render(mode=render_mode)

View File

@ -1,11 +1,9 @@
import gym
import matplotlib.pyplot as plt
import numpy as np
from gym import spaces
from gym.utils import seeding
from alr_envs.utils.utils import angle_normalize
@ -33,7 +31,7 @@ class SimpleReacherEnv(gym.Env):
self._angle_velocity = None
self.max_torque = 1 # 10
self.steps_before_reward = 199
action_bound = np.ones((self.n_links,))
state_bound = np.hstack([
@ -92,7 +90,7 @@ class SimpleReacherEnv(gym.Env):
def _update_joints(self):
"""
update joints to get new end-effector position. The other links are only required for rendering.
Returns:
"""
@ -106,7 +104,7 @@ class SimpleReacherEnv(gym.Env):
# TODO: Is this the best option
if self._steps >= self.steps_before_reward:
reward_dist -= np.linalg.norm(diff)
# reward_dist = np.exp(-0.1 * diff ** 2).mean()
# reward_dist = - (diff ** 2).mean()

View File

@ -1,7 +1,7 @@
from alr_envs.classic_control.hole_reacher import HoleReacher
from alr_envs.classic_control.viapoint_reacher import ViaPointReacher
from alr_envs.utils.wrapper.dmp_wrapper import DmpWrapper
from alr_envs.utils.wrapper.detpmp_wrapper import DetPMPWrapper
import numpy as np
@ -17,11 +17,11 @@ def make_viapointreacher_env(rank, seed=0):
""" """
def _init(): def _init():
_env = ViaPointReacher(num_links=5, _env = ViaPointReacher(n_links=5,
allow_self_collision=False, allow_self_collision=False,
collision_penalty=1000) collision_penalty=1000)
_env = DmpEnvWrapper(_env, _env = DmpWrapper(_env,
num_dof=5, num_dof=5,
num_basis=5, num_basis=5,
duration=2, duration=2,
@ -49,7 +49,7 @@ def make_holereacher_env(rank, seed=0):
""" """
def _init(): def _init():
_env = HoleReacher(num_links=5, _env = HoleReacher(n_links=5,
allow_self_collision=False, allow_self_collision=False,
allow_wall_collision=False, allow_wall_collision=False,
hole_width=0.25, hole_width=0.25,
@ -57,11 +57,10 @@ def make_holereacher_env(rank, seed=0):
hole_x=2,
collision_penalty=100)
_env = DmpWrapper(_env,
num_dof=5,
num_basis=5,
duration=2,
dt=_env.dt,
learn_goal=True,
alpha_phase=2,
@ -89,7 +88,7 @@ def make_holereacher_fix_goal_env(rank, seed=0):
""" """
def _init(): def _init():
_env = HoleReacher(num_links=5, _env = HoleReacher(n_links=5,
allow_self_collision=False, allow_self_collision=False,
allow_wall_collision=False, allow_wall_collision=False,
hole_width=0.15, hole_width=0.15,
@ -97,14 +96,14 @@ def make_holereacher_fix_goal_env(rank, seed=0):
hole_x=1,
collision_penalty=100)
_env = DmpWrapper(_env,
num_dof=5,
num_basis=5,
duration=2,
dt=_env.dt,
learn_goal=False,
final_pos=np.array([2.02669572, -1.25966385, -1.51618198, -0.80946476, 0.02012344]),
alpha_phase=2,
start_pos=_env.start_pos,
policy_type="velocity",
weights_scale=50,
@ -129,26 +128,25 @@ def make_holereacher_env_pmp(rank, seed=0):
""" """
def _init(): def _init():
_env = HoleReacher(num_links=5, _env = HoleReacher(n_links=5,
allow_self_collision=False, allow_self_collision=False,
allow_wall_collision=False, allow_wall_collision=False,
hole_width=0.15, hole_width=0.15,
hole_depth=1, hole_depth=1,
hole_x=1, hole_x=1,
collision_penalty=100) collision_penalty=1000)
_env = DetPMPEnvWrapper(_env, _env = DetPMPWrapper(_env,
num_dof=5, num_dof=5,
num_basis=5, num_basis=5,
width=0.02, width=0.02,
off=0.,
policy_type="velocity", policy_type="velocity",
start_pos=_env.start_pos, start_pos=_env.start_pos,
duration=2, duration=2,
post_traj_time=0, post_traj_time=0,
dt=_env.dt, dt=_env.dt,
weights_scale=0.2, weights_scale=0.2,
zero_start=False, zero_start=True,
zero_goal=False zero_goal=False
) )
_env.seed(seed + rank) _env.seed(seed + rank)

View File

@ -1,39 +1,43 @@
import gym
import matplotlib.pyplot as plt
import numpy as np
from alr_envs import DmpWrapper
from alr_envs.utils.utils import check_self_collision
def viapoint_dmp(**kwargs):
_env = gym.make("alr_envs:ViaPointReacher-v0")
# _env = ViaPointReacher(**kwargs)
return DmpWrapper(_env, num_dof=5, num_basis=5, duration=2, alpha_phase=2.5, dt=_env.dt,
start_pos=_env.start_pos, learn_goal=False, policy_type="velocity", weights_scale=50)
class ViaPointReacher(gym.Env):
def __init__(self, n_links, allow_self_collision=False, collision_penalty=1000):
self.num_links = n_links
self.link_lengths = np.ones((n_links, 1))
# task
self.via_point = np.ones(2)
self.goal_point = np.array((n_links, 0))
# collision
self.allow_self_collision = allow_self_collision
self.collision_penalty = collision_penalty
# state
self._joints = None
self._joint_angles = None
self._angle_velocity = None
self.start_pos = np.hstack([[np.pi / 2], np.zeros(self.num_links - 1)])
self.start_vel = np.zeros(self.num_links)
self.weight_matrix_scale = 1
self._steps = 0
self.dt = 0.01
# self.time_limit = 2
action_bound = np.pi * np.ones((self.num_links,))
state_bound = np.hstack([
@ -64,7 +68,7 @@ class ViaPointReacher(gym.Env):
return self._get_obs().copy()
def step(self, action: np.ndarray):
"""
a single step with an action in joint velocity space
"""
@ -75,24 +79,18 @@ class ViaPointReacher(gym.Env):
self._update_joints()
# compute reward directly in step function
dist_reward = 0
if not self._is_collided:
if self._steps == 100:
dist_reward = np.linalg.norm(self.end_effector - self.via_point)
elif self._steps == 199:
dist_reward = np.linalg.norm(self.end_effector - self.goal_point)
reward = - dist_reward ** 2
reward -= 5e-8 * np.sum(acc**2)
if self._is_collided:
reward -= self.collision_penalty
@ -100,7 +98,8 @@ class ViaPointReacher(gym.Env):
self._steps += 1
# done = self._steps * self.dt > self.time_limit or self._is_collided
done = self._is_collided
return self._get_obs().copy(), reward, done, info
@ -118,8 +117,8 @@ class ViaPointReacher(gym.Env):
self_collision = False
if not self.allow_self_collision:
self_collision = check_self_collision(line_points_in_taskspace)
if np.any(np.abs(self._joint_angles) > np.pi):
self_collision = True
self._is_collided = self_collision
@ -135,25 +134,10 @@ class ViaPointReacher(gym.Env):
self._steps
])
def get_forward_kinematics(self, num_points_per_link=1):
theta = self._joint_angles[:, None]
intermediate_points = np.linspace(0, 1, num_points_per_link) if num_points_per_link > 1 else 1
accumulated_theta = np.cumsum(theta, axis=0)
@ -171,46 +155,6 @@ class ViaPointReacher(gym.Env):
return np.squeeze(endeffector + self._joints[0, :])
def render(self, mode='human'):
if self.fig is None:
self.fig = plt.figure()
@ -274,7 +218,7 @@ class ViaPointReacher(gym.Env):
if __name__ == '__main__':
nl = 5
render_mode = "human" # "human" or "partial" or "final"
env = ViaPointReacher(n_links=nl, allow_self_collision=False)
env.reset()
env.render(mode=render_mode)

View File

@ -1 +1,2 @@
from alr_envs.mujoco.reacher.alr_reacher import ALRReacherEnv
from alr_envs.mujoco.balancing import BalancingEnv

View File

@ -0,0 +1,53 @@
import os
import numpy as np
from gym import utils
from gym.envs.mujoco import mujoco_env
from alr_envs.utils.utils import angle_normalize
class BalancingEnv(mujoco_env.MujocoEnv, utils.EzPickle):
def __init__(self, n_links=5):
utils.EzPickle.__init__(**locals())
self.n_links = n_links
if n_links == 5:
file_name = 'reacher_5links.xml'
elif n_links == 7:
file_name = 'reacher_7links.xml'
else:
raise ValueError(f"Invalid number of links {n_links}, only 5 or 7 allowed.")
mujoco_env.MujocoEnv.__init__(self, os.path.join(os.path.dirname(__file__), "assets", file_name), 2)
def step(self, a):
angle = angle_normalize(np.sum(self.sim.data.qpos.flat[:self.n_links]), type="rad")
reward = - np.abs(angle)
self.do_simulation(a, self.frame_skip)
ob = self._get_obs()
done = False
return ob, reward, done, dict(angle=angle, end_effector=self.get_body_com("fingertip").copy())
def viewer_setup(self):
self.viewer.cam.trackbodyid = 1
def reset_model(self):
# This also generates a goal; however, we do not need/use it
qpos = self.np_random.uniform(low=-0.1, high=0.1, size=self.model.nq) + self.init_qpos
qpos[-2:] = 0
qvel = self.init_qvel + self.np_random.uniform(low=-.005, high=.005, size=self.model.nv)
qvel[-2:] = 0
self.set_state(qpos, qvel)
return self._get_obs()
def _get_obs(self):
theta = self.sim.data.qpos.flat[:self.n_links]
return np.concatenate([
np.cos(theta),
np.sin(theta),
self.sim.data.qvel.flat[:self.n_links], # this is angular velocity
])

View File

@ -1,5 +1,5 @@
from alr_envs.utils.wrapper.detpmp_wrapper import DetPMPWrapper
from alr_envs.utils.wrapper.dmp_wrapper import DmpWrapper
from alr_envs.mujoco.ball_in_a_cup.ball_in_a_cup import ALRBallInACupEnv
from alr_envs.mujoco.ball_in_a_cup.ball_in_a_cup_simple import ALRBallInACupEnv as ALRBallInACupEnvSimple
@ -18,7 +18,7 @@ def make_contextual_env(rank, seed=0):
def _init():
env = ALRBallInACupEnv()
env = DetPMPWrapper(env,
num_dof=7,
num_basis=5,
width=0.005,
@ -52,7 +52,7 @@ def make_env(rank, seed=0):
def _init():
env = ALRBallInACupEnvSimple()
env = DetPMPWrapper(env,
num_dof=7,
num_basis=5,
width=0.005,
@ -86,7 +86,7 @@ def make_simple_env(rank, seed=0):
def _init():
env = ALRBallInACupEnvSimple()
env = DetPMPWrapper(env,
num_dof=3,
num_basis=5,
width=0.005,
@ -121,7 +121,7 @@ def make_simple_dmp_env(rank, seed=0):
def _init():
_env = ALRBallInACupEnvSimple()
_env = DmpWrapper(_env,
num_dof=3,
num_basis=5,
duration=3.5,

View File

@ -1,4 +1,4 @@
from alr_envs.utils.wrapper.detpmp_wrapper import DetPMPWrapper
from alr_envs.mujoco.beerpong.beerpong import ALRBeerpongEnv
from alr_envs.mujoco.beerpong.beerpong_simple import ALRBeerpongEnv as ALRBeerpongEnvSimple
@ -17,7 +17,7 @@ def make_contextual_env(rank, seed=0):
def _init():
env = ALRBeerpongEnv()
env = DetPMPWrapper(env,
num_dof=7,
num_basis=5,
width=0.005,
@ -51,7 +51,7 @@ def make_env(rank, seed=0):
def _init():
env = ALRBeerpongEnvSimple()
env = DetPMPWrapper(env,
num_dof=7,
num_basis=5,
width=0.005,
@ -85,7 +85,7 @@ def make_simple_env(rank, seed=0):
def _init():
env = ALRBeerpongEnvSimple()
env = DetPMPWrapper(env,
num_dof=3,
num_basis=5,
width=0.005,

View File

@ -9,6 +9,8 @@ from alr_envs.utils.utils import angle_normalize
class ALRReacherEnv(mujoco_env.MujocoEnv, utils.EzPickle):
def __init__(self, steps_before_reward=200, n_links=5, balance=False):
utils.EzPickle.__init__(**locals())
self._steps = 0
self.steps_before_reward = steps_before_reward
self.n_links = n_links
@ -29,65 +31,48 @@ class ALRReacherEnv(mujoco_env.MujocoEnv, utils.EzPickle):
else:
raise ValueError(f"Invalid number of links {n_links}, only 5 or 7 allowed.")
mujoco_env.MujocoEnv.__init__(self, os.path.join(os.path.dirname(__file__), "assets", file_name), 2)
def step(self, a):
self._steps += 1
reward_dist = 0.0
angular_vel = 0.0
reward_balance = 0.0
if self._steps >= self.steps_before_reward:
vec = self.get_body_com("fingertip") - self.get_body_com("target")
reward_dist -= self.reward_weight * np.linalg.norm(vec)
angular_vel -= np.linalg.norm(self.sim.data.qvel.flat[:self.n_links])
reward_ctrl = - np.square(a).sum()
if self.balance:
reward_balance -= self.balance_weight * np.abs(
angle_normalize(np.sum(self.sim.data.qpos.flat[:self.n_links]), type="rad"))
reward = reward_dist + reward_ctrl + angular_vel + reward_balance
self.do_simulation(a, self.frame_skip)
ob = self._get_obs()
done = False
return ob, reward, done, dict(reward_dist=reward_dist, reward_ctrl=reward_ctrl,
velocity=angular_vel, reward_balance=reward_balance,
end_effector=self.get_body_com("fingertip").copy(),
goal=self.goal if hasattr(self, "goal") else None)
def viewer_setup(self):
self.viewer.cam.trackbodyid = 0
def reset_model(self):
qpos = self.np_random.uniform(low=-0.1, high=0.1, size=self.model.nq) + self.init_qpos
while True:
self.goal = self.np_random.uniform(low=-self.n_links / 10, high=self.n_links / 10, size=2)
if np.linalg.norm(self.goal) < self.n_links / 10:
break
qpos[-2:] = self.goal
qvel = self.init_qvel + self.np_random.uniform(low=-.005, high=.005, size=self.model.nv)
qvel[-2:] = 0
self.set_state(qpos, qvel)
self._steps = 0
return self._get_obs()
def _get_obs(self):

View File

@ -1,88 +0,0 @@
from alr_envs.utils.policies import get_policy_class
from mp_lib import det_promp
import numpy as np
import gym
class DetPMPEnvWrapper(gym.Wrapper):
def __init__(self,
env,
num_dof,
num_basis,
width,
off=0.01,
start_pos=None,
duration=1,
dt=0.01,
post_traj_time=0.,
policy_type=None,
weights_scale=1,
zero_start=False,
zero_goal=False,
):
super(DetPMPEnvWrapper, self).__init__(env)
self.num_dof = num_dof
self.num_basis = num_basis
self.dim = num_dof * num_basis
self.pmp = det_promp.DeterministicProMP(n_basis=num_basis, n_dof=num_dof, width=width, off=off,
zero_start=zero_start, zero_goal=zero_goal)
weights = np.zeros(shape=(num_basis, num_dof))
self.pmp.set_weights(duration, weights)
self.weights_scale = weights_scale
self.duration = duration
self.dt = dt
self.post_traj_steps = int(post_traj_time / dt)
self.start_pos = start_pos
self.zero_start = zero_start
policy_class = get_policy_class(policy_type)
self.policy = policy_class(env)
def __call__(self, params, contexts=None):
params = np.atleast_2d(params)
rewards = []
infos = []
for p, c in zip(params, contexts):
reward, info = self.rollout(p, c)
rewards.append(reward)
infos.append(info)
return np.array(rewards), infos
def rollout(self, params, context=None, render=False):
""" This function generates a trajectory based on a DMP and then does the usual loop over reset and step"""
params = np.reshape(params, newshape=(self.num_basis, self.num_dof)) * self.weights_scale
self.pmp.set_weights(self.duration, params)
t, des_pos, des_vel, des_acc = self.pmp.compute_trajectory(1 / self.dt, 1.)
if self.zero_start:
des_pos += self.start_pos[None, :]
if self.post_traj_steps > 0:
des_pos = np.vstack([des_pos, np.tile(des_pos[-1, :], [self.post_traj_steps, 1])])
des_vel = np.vstack([des_vel, np.zeros(shape=(self.post_traj_steps, self.num_dof))])
self._trajectory = des_pos
self._velocity = des_vel
rews = []
infos = []
self.env.configure(context)
self.env.reset()
for t, pos_vel in enumerate(zip(des_pos, des_vel)):
ac = self.policy.get_action(pos_vel[0], pos_vel[1])
obs, rew, done, info = self.env.step(ac)
rews.append(rew)
infos.append(info)
if render:
self.env.render(mode="human")
if done:
break
reward = np.sum(rews)
return reward, info

View File

@ -1,125 +0,0 @@
from alr_envs.utils.policies import get_policy_class
from mp_lib.phase import ExpDecayPhaseGenerator
from mp_lib.basis import DMPBasisGenerator
from mp_lib import dmps
import numpy as np
import gym
class DmpEnvWrapper(gym.Wrapper):
def __init__(self,
env,
num_dof,
num_basis,
start_pos=None,
final_pos=None,
duration=1,
dt=0.01,
alpha_phase=2,
bandwidth_factor=3,
learn_goal=False,
post_traj_time=0.,
policy_type=None,
weights_scale=1.,
goal_scale=1.,
):
super(DmpEnvWrapper, self).__init__(env)
self.num_dof = num_dof
self.num_basis = num_basis
self.dim = num_dof * num_basis
if learn_goal:
self.dim += num_dof
self.learn_goal = learn_goal
self.duration = duration # seconds
time_steps = int(duration / dt)
self.t = np.linspace(0, duration, time_steps)
self.post_traj_steps = int(post_traj_time / dt)
phase_generator = ExpDecayPhaseGenerator(alpha_phase=alpha_phase, duration=duration)
basis_generator = DMPBasisGenerator(phase_generator,
duration=duration,
num_basis=self.num_basis,
basis_bandwidth_factor=bandwidth_factor)
self.dmp = dmps.DMP(num_dof=num_dof,
basis_generator=basis_generator,
phase_generator=phase_generator,
num_time_steps=time_steps,
dt=dt
)
self.dmp.dmp_start_pos = start_pos.reshape((1, num_dof))
dmp_weights = np.zeros((num_basis, num_dof))
if learn_goal:
dmp_goal_pos = np.zeros(num_dof)
else:
dmp_goal_pos = final_pos
self.dmp.set_weights(dmp_weights, dmp_goal_pos)
self.weights_scale = weights_scale
self.goal_scale = goal_scale
policy_class = get_policy_class(policy_type)
self.policy = policy_class(env)
def __call__(self, params, contexts=None):
params = np.atleast_2d(params)
rewards = []
infos = []
for p, c in zip(params, contexts):
reward, info = self.rollout(p, c)
rewards.append(reward)
infos.append(info)
return np.array(rewards), infos
def goal_and_weights(self, params):
if len(params.shape) > 1:
assert params.shape[1] == self.dim
else:
assert len(params) == self.dim
params = np.reshape(params, [1, self.dim])
if self.learn_goal:
goal_pos = params[0, -self.num_dof:]
weight_matrix = np.reshape(params[:, :-self.num_dof], [self.num_basis, self.num_dof])
else:
goal_pos = self.dmp.dmp_goal_pos.flatten()
assert goal_pos is not None
weight_matrix = np.reshape(params, [self.num_basis, self.num_dof])
return goal_pos * self.goal_scale, weight_matrix * self.weights_scale
def rollout(self, params, context=None, render=False):
""" This function generates a trajectory based on a DMP and then does the usual loop over reset and step"""
goal_pos, weight_matrix = self.goal_and_weights(params)
self.dmp.set_weights(weight_matrix, goal_pos)
trajectory, velocity = self.dmp.reference_trajectory(self.t)
if self.post_traj_steps > 0:
trajectory = np.vstack([trajectory, np.tile(trajectory[-1, :], [self.post_traj_steps, 1])])
velocity = np.vstack([velocity, np.zeros(shape=(self.post_traj_steps, self.num_dof))])
self._trajectory = trajectory
self._velocity = velocity
rews = []
infos = []
self.env.configure(context)
self.env.reset()
for t, pos_vel in enumerate(zip(trajectory, velocity)):
ac = self.policy.get_action(pos_vel[0], pos_vel[1])
obs, rew, done, info = self.env.step(ac)
rews.append(rew)
infos.append(info)
if render:
self.env.render(mode="human")
if done:
break
reward = np.sum(rews)
return reward, info

View File

@ -1,8 +1,10 @@
from gym import Env
from alr_envs.mujoco.alr_mujoco_env import AlrMujocoEnv
class BaseController:
def __init__(self, env: Env):
self.env = env
def get_action(self, des_pos, des_vel):
@ -20,7 +22,7 @@ class VelController(BaseController):
class PDController(BaseController):
def __init__(self, env: AlrMujocoEnv):
self.p_gains = env.p_gains
self.d_gains = env.d_gains
super(PDController, self).__init__(env)

View File

@ -11,10 +11,39 @@ def angle_normalize(x, type="deg"):
Returns:
"""
if type not in ["deg", "rad"]: raise ValueError(f"Invalid type {type}. Choose one of 'deg' or 'rad'.")
if type == "deg":
x = np.deg2rad(x) # x * pi / 180
two_pi = 2 * np.pi
return x - two_pi * np.floor((x + np.pi) / two_pi)
def ccw(A, B, C):
return (C[1] - A[1]) * (B[0] - A[0]) - (B[1] - A[1]) * (C[0] - A[0]) > 1e-12
def intersect(A, B, C, D):
"""
Return true if line segments AB and CD intersect
Args:
A: start point line one
B: end point line one
C: start point line two
D: end point line two
Returns:
"""
return ccw(A, C, D) != ccw(B, C, D) and ccw(A, B, C) != ccw(A, B, D)
def check_self_collision(line_points):
for i, line1 in enumerate(line_points):
for line2 in line_points[i + 2:, :, :]:
# if line1 != line2:
if intersect(line1[0], line1[-1], line2[0], line2[-1]):
return True
return False

View File

View File

@ -0,0 +1,40 @@
import gym
import numpy as np
from mp_lib import det_promp
from alr_envs.utils.wrapper.mp_wrapper import MPWrapper
class DetPMPWrapper(MPWrapper):
def __init__(self, env, num_dof, num_basis, width, start_pos=None, duration=1, dt=0.01, post_traj_time=0.,
policy_type=None, weights_scale=1, zero_start=False, zero_goal=False, **mp_kwargs):
# self.duration = duration # seconds
super().__init__(env, num_dof, duration, dt, post_traj_time, policy_type, weights_scale,
num_basis=num_basis, width=width, start_pos=start_pos, zero_start=zero_start,
zero_goal=zero_goal)
action_bounds = np.inf * np.ones((self.mp.n_basis * self.mp.n_dof))
self.action_space = gym.spaces.Box(low=-action_bounds, high=action_bounds, dtype=np.float32)
self.start_pos = start_pos
self.dt = dt
def initialize_mp(self, num_dof: int, duration: int, dt: float, num_basis: int = 5, width: float = None,
start_pos: np.ndarray = None, zero_start: bool = False, zero_goal: bool = False):
pmp = det_promp.DeterministicProMP(n_basis=num_basis, n_dof=num_dof, width=width, off=0.01,
zero_start=zero_start, zero_goal=zero_goal)
weights = np.zeros(shape=(num_basis, num_dof))
pmp.set_weights(duration, weights)
return pmp
def mp_rollout(self, action):
params = np.reshape(action, (self.mp.n_basis, self.mp.n_dof)) * self.weights_scale
self.mp.set_weights(self.duration, params)
_, des_pos, des_vel, _ = self.mp.compute_trajectory(1 / self.dt, 1.)
if self.mp.zero_start:
des_pos += self.start_pos[None, :]
return des_pos, des_vel

View File

@ -0,0 +1,81 @@
from alr_envs.utils.policies import get_policy_class
from mp_lib.phase import ExpDecayPhaseGenerator
from mp_lib.basis import DMPBasisGenerator
from mp_lib import dmps
import numpy as np
import gym
from alr_envs.utils.wrapper.mp_wrapper import MPWrapper
class DmpWrapper(MPWrapper):
def __init__(self, env: gym.Env, num_dof: int, num_basis: int, start_pos: np.ndarray = None,
final_pos: np.ndarray = None, duration: int = 1, alpha_phase: float = 2., dt: float = 0.01,
learn_goal: bool = False, post_traj_time: float = 0., policy_type: str = None,
weights_scale: float = 1., goal_scale: float = 1.):
"""
This Wrapper generates a trajectory based on a DMP and will only return episodic performances.
Args:
env:
num_dof:
num_basis:
start_pos:
final_pos:
duration:
alpha_phase:
dt:
learn_goal:
post_traj_time:
policy_type:
weights_scale:
goal_scale:
"""
self.learn_goal = learn_goal
self.t = np.linspace(0, duration, int(duration / dt))
self.goal_scale = goal_scale
super().__init__(env, num_dof, duration, dt, post_traj_time, policy_type, weights_scale,
num_basis=num_basis, start_pos=start_pos, final_pos=final_pos, alpha_phase=alpha_phase)
action_bounds = np.inf * np.ones((np.prod(self.mp.dmp_weights.shape) + (num_dof if learn_goal else 0)))
self.action_space = gym.spaces.Box(low=-action_bounds, high=action_bounds, dtype=np.float32)
def initialize_mp(self, num_dof: int, duration: int, dt: float, num_basis: int = 5, start_pos: np.ndarray = None,
final_pos: np.ndarray = None, alpha_phase: float = 2.):
phase_generator = ExpDecayPhaseGenerator(alpha_phase=alpha_phase, duration=duration)
basis_generator = DMPBasisGenerator(phase_generator, duration=duration, num_basis=num_basis)
dmp = dmps.DMP(num_dof=num_dof, basis_generator=basis_generator, phase_generator=phase_generator,
num_time_steps=int(duration / dt), dt=dt)
dmp.dmp_start_pos = start_pos.reshape((1, num_dof))
weights = np.zeros((num_basis, num_dof))
goal_pos = np.zeros(num_dof) if self.learn_goal else final_pos
dmp.set_weights(weights, goal_pos)
return dmp
def goal_and_weights(self, params):
assert params.shape[-1] == self.action_space.shape[0]
params = np.atleast_2d(params)
if self.learn_goal:
goal_pos = params[0, -self.mp.num_dimensions:] # [num_dof]
params = params[:, :-self.mp.num_dimensions] # [1,num_dof]
# weight_matrix = np.reshape(params[:, :-self.num_dof], [self.num_basis, self.num_dof])
else:
goal_pos = self.mp.dmp_goal_pos.flatten()
assert goal_pos is not None
# weight_matrix = np.reshape(params, [self.num_basis, self.num_dof])
weight_matrix = np.reshape(params, self.mp.dmp_weights.shape)
return goal_pos * self.goal_scale, weight_matrix * self.weights_scale
def mp_rollout(self, action):
goal_pos, weight_matrix = self.goal_and_weights(action)
self.mp.set_weights(weight_matrix, goal_pos)
return self.mp.reference_trajectory(self.t)

View File

@ -0,0 +1,109 @@
from abc import ABC, abstractmethod
from collections import defaultdict
import gym
import numpy as np
from alr_envs.utils.policies import get_policy_class
class MPWrapper(gym.Wrapper, ABC):
def __init__(self,
env: gym.Env,
num_dof: int,
duration: int = 1,
dt: float = 0.01,
# learn_goal: bool = False,
post_traj_time: float = 0.,
policy_type: str = None,
weights_scale: float = 1.,
**mp_kwargs
):
super().__init__(env)
# self.num_dof = num_dof
# self.num_basis = num_basis
# self.duration = duration # seconds
self.post_traj_steps = int(post_traj_time / dt)
self.mp = self.initialize_mp(num_dof, duration, dt, **mp_kwargs)
self.weights_scale = weights_scale
policy_class = get_policy_class(policy_type)
self.policy = policy_class(env)
# rendering
self.render_mode = None
self.render_kwargs = None
def step(self, action: np.ndarray):
""" This function generates a trajectory based on a DMP and then does the usual loop over reset and step"""
trajectory, velocity = self.mp_rollout(action)
if self.post_traj_steps > 0:
trajectory = np.vstack([trajectory, np.tile(trajectory[-1, :], [self.post_traj_steps, 1])])
velocity = np.vstack([velocity, np.zeros(shape=(self.post_traj_steps, self.dmp.num_dimensions))])
# self._trajectory = trajectory
# self._velocity = velocity
rewards = 0
# infos = defaultdict(list)
# TODO: @Max Why do we need this configure, states should be part of the model
# self.env.configure(context)
obs = self.env.reset()
info = {}
for t, pos_vel in enumerate(zip(trajectory, velocity)):
ac = self.policy.get_action(pos_vel[0], pos_vel[1])
obs, rew, done, info = self.env.step(ac)
rewards += rew
# TODO return all dicts?
# [infos[k].append(v) for k, v in info.items()]
if self.render_mode:
self.env.render(mode=self.render_mode, **self.render_kwargs)
if done:
break
done = True
return obs, rewards, done, info
def render(self, mode='human', **kwargs):
"""Only set render options here, such that they can be used during the rollout.
This only needs to be called once"""
self.render_mode = mode
self.render_kwargs = kwargs
def __call__(self, actions):
return self.step(actions)
# params = np.atleast_2d(params)
# rewards = []
# infos = []
# for p, c in zip(params, contexts):
# reward, info = self.rollout(p, c)
# rewards.append(reward)
# infos.append(info)
#
# return np.array(rewards), infos
@abstractmethod
def mp_rollout(self, action):
"""
Generate trajectory and velocity based on the MP
Returns:
trajectory/positions, velocity
"""
raise NotImplementedError()
@abstractmethod
def initialize_mp(self, num_dof: int, duration: int, dt: float, **kwargs):
"""
Create respective instance of MP
Returns:
MP instance
"""
raise NotImplementedError

View File

@ -1,11 +1,9 @@
from alr_envs.classic_control.utils import make_viapointreacher_env
from alr_envs.classic_control.utils import make_holereacher_env, make_holereacher_fix_goal_env
from alr_envs.utils.dmp_async_vec_env import DmpAsyncVectorEnv
import numpy as np
if __name__ == "__main__":
n_samples = 1
n_cpus = 4
dim = 30
@ -15,16 +13,13 @@ if __name__ == "__main__":
test_env = make_holereacher_env(0)()
# params = np.random.randn(n_samples, dim)
params = np.array([[1.386102, -3.29980525, 4.70402733, 1.3966668, 0.73774902,
3.14676681, -4.98644416, 6.20303193, 1.30502127, -0.09330522,
7.62656797, -5.76893033, 3.4706711, -0.6944142, -3.33442788,
12.31421548, -0.72760271, -6.9090723, 7.02903814, -8.7236836,
1.4805914, 0.53185824, -5.46626893, 0.69692163, 13.58472666,
0.77199316, 2.02906724, -3.0203244, -1.00533159, -0.57417351]])
# params = np.hstack([50 * np.random.randn(n_samples, 25), np.tile(np.array([np.pi/2, -np.pi/4, -np.pi/4, -np.pi/4, -np.pi/4]), [n_samples, 1])])

View File

@ -1,20 +1,86 @@
from collections import defaultdict
import gym
import numpy as np
def example_mujoco():
env = gym.make('alr_envs:ALRReacher-v0')
rewards = 0
obs = env.reset()
# number of environment steps
for i in range(10000):
obs, reward, done, info = env.step(env.action_space.sample())
rewards += reward
if i % 1 == 0:
env.render()
if done:
print(rewards)
rewards = 0
obs = env.reset()
def example_dmp():
# env = gym.make("alr_envs:ViaPointReacherDMP-v0")
env = gym.make("alr_envs:HoleReacherDMP-v0")
rewards = 0
# env.render(mode=None)
obs = env.reset()
# number of samples/full trajectories (multiple environment steps)
for i in range(10):
obs, reward, done, info = env.step(env.action_space.sample())
rewards += reward
if i % 1 == 0:
# render the full DMP trajectory
# render() could also be called once at the beginning; that would render every trajectory.
# Calling it after every trajectory allows modifying the mode; mode=None disables rendering.
env.render(mode="human")
if done:
print(rewards)
rewards = 0
obs = env.reset()
def example_async(n_cpu=4, seed=int('533D', 16)):
def make_env(env_id, seed, rank):
env = gym.make(env_id)
env.seed(seed + rank)
return lambda: env
def sample(env: gym.vector.VectorEnv, n_samples=100):
# for plotting
rewards = np.zeros(n_cpu)
# this would generate more samples than requested if n_samples % num_envs != 0
repeat = int(np.ceil(n_samples / env.num_envs))
vals = defaultdict(list)
for i in range(repeat):
obs, reward, done, info = envs.step(envs.action_space.sample())
vals['obs'].append(obs)
vals['reward'].append(reward)
vals['done'].append(done)
vals['info'].append(info)
rewards += reward
if np.any(done):
print(rewards[done])
rewards[done] = 0
# do not return values above threshold
return (*map(lambda v: np.stack(v)[:n_samples], vals.values()),)
envs = gym.vector.AsyncVectorEnv([make_env("alr_envs:HoleReacherDMP-v0", seed, i) for i in range(n_cpu)])
obs = envs.reset()
print(sample(envs, 16))
if __name__ == '__main__':
# example_mujoco()
# example_dmp()
example_async()

View File

@ -2,6 +2,9 @@ from setuptools import setup
setup(name='alr_envs',
version='0.0.1',
install_requires=['gym',
'PyQt5',
'matplotlib',
'mp_lib @ git+https://git@github.com/maxhuettenrauch/mp_lib@master#egg=mp_lib',
'mujoco_py'], # And any other dependencies foo needs
)