This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

Robosuite exploration (#478)
* Add Robosuite parameters for all env types + initialize env flow

* Init flow done

* Rest of Environment API complete for RobosuiteEnvironment

* RobosuiteEnvironment changes

* Observation stacking filter
* Add proper frame_skip in addition to control_freq
* Hardcode Coach rendering to 'frontview' camera

* Robosuite_Lift_DDPG preset + Robosuite env updates

* Move observation stacking filter from env to preset
* Pre-process observation - concatenate depth map (if exists)
  to image and object state (if exists) to robot state
* Preset parameters based on Surreal DDPG parameters, taken from:
  https://github.com/SurrealAI/surreal/blob/master/surreal/main/ddpg_configs.py

* RobosuiteEnvironment fixes - working now with PyGame rendering

* Preset minor modifications

* ObservationStackingFilter - option to concat non-vector observations

* Consider frame skip when setting horizon in robosuite env

* Robosuite lift preset - update heatup length and training interval

* Robosuite env - change control_freq to 10 to match Surreal usage

* Robosuite clipped PPO preset

* Distribute multiple workers (-n #) over multiple GPUs

* Clipped PPO memory optimization from @shadiendrawis

* Fixes to evaluation only workers

* RoboSuite_ClippedPPO: Update training interval

* Undo last commit (update training interval)

* Fix "doube-negative" if conditions

* multi-agent single-trainer clipped ppo training with cartpole

* cleanups (not done yet) + roughly tuned hyper-params for MAST

* Switch to Robosuite v1 APIs

* Change presets to IK controller

* more cleanups + enabling evaluation worker + better logging

* RoboSuite_Lift_ClippedPPO updates

* Fix major bug in obs normalization filter setup

* Reduce coupling between Robosuite API and Coach environment

* Now only non task-specific parameters are explicitly defined
  in Coach
* Removed a bunch of enums of Robosuite elements, using simple
  strings instead
* With this change new environments/robots/controllers in Robosuite
  can be used immediately in Coach

* MAST: better logging of actor-trainer interaction + bug fixes + performance improvements.

Still missing: fixed pubsub for obs normalization running stats + logging for trainer signals

* lstm support for ppo

* setting JOINT VELOCITY action space by default + fix for EveryNEpisodes video dump filter + new TaskIDDumpFilter + allowing OR between video dump filters

* Separate Robosuite clipped PPO preset for the non-MAST case

* Add flatten layer to architectures and use it in Robosuite presets

This is required for embedders that mix conv and dense

TODO: Add MXNet implementation

* publishing running_stats together with the published policy + hyper-param for when to publish a policy + cleanups

* bug-fix for memory leak in MAST

* Bugfix: Return value in TF BatchnormActivationDropout.to_tf_instance

* Explicit activations in embedder scheme so there's no ReLU after flatten

* Add clipped PPO heads with configurable dense layers at the beginning

* This is a workaround needed to mimic Surreal-PPO, where the CNN and
  LSTM are shared between actor and critic but the FC layers are not
  shared
* Added a "SchemeBuilder" class, currently only used for the new heads
  but we can change Middleware and Embedder implementations to use it
  as well

* Video dump setting fix in basic preset

* logging screen output to file

* coach to start the redis-server for a MAST run

* trainer drops off-policy data + old policy in ClippedPPO updates only after policy was published + logging free memory stats + actors check for a new policy only at the beginning of a new episode + fixed a bug where the trainer was logging "Training Reward = 0", causing the dashboard to incorrectly display the signal

* Add missing set_internal_state function in TFSharedRunningStats

* Robosuite preset - use SingleLevelSelect instead of hard-coded level

* policy ID published directly on Redis

* Small fix when writing to log file

* Major bugfix in Robosuite presets - pass dense sizes to heads

* RoboSuite_Lift_ClippedPPO hyper-params update

* add horizon and value bootstrap to GAE calculation, fix A3C with LSTM

* adam hyper-params from mujoco

* updated MAST preset with IK_POSE_POS controller

* configurable initialization for policy stdev + custom extra noise per actor + logging of policy stdev to dashboard

* values loss weighting of 0.5

* minor fixes + presets

* bug-fix for MAST where the old policy in the trainer kept updating every training iteration when it should only update after every policy publish

* bug-fix: reset_internal_state was not called by the trainer

* bug-fixes in the lstm flow + some hyper-param adjustments for CartPole_ClippedPPO_LSTM -> now trains and sometimes reaches 200

* adding back the horizon hyper-param - a messy commit

* another bug-fix missing from prev commit

* set control_freq=2 to match action_scale 0.125

* ClippedPPO with MAST cleanups and some preps for TD3 with MAST

* TD3 presets. RoboSuite_Lift_TD3 seems to work well with multi-process runs (-n 8)

* setting termination on collision to be on by default

* bug-fix following prev-prev commit

* initial commit of cube exploration environment with TD3

* bug fix + minor refactoring

* several parameter changes and RND debugging

* Robosuite Gym wrapper + Rename TD3_Random* -> Random*

* algorithm update

* Add RoboSuite v1 env + presets (to eventually replace non-v1 ones)

* Remove grasping presets, keep only V1 exp. presets (w/o V1 tag)

* Keep just robosuite V1 env as the 'robosuite_environment' module

* Exclude Robosuite and MAST presets from integration tests

* Exclude LSTM and MAST presets from golden tests

* Fix mistakenly removed import

* Revert debug changes in ReaderWriterLock

* Try another way to exclude LSTM/MAST golden tests

* Remove debug prints

* Remove PreDense heads, unused in the end

* Missed removing an instance of PreDense head

* Remove MAST, not required for this PR

* Undo unused concat option in ObservationStackingFilter

* Remove LSTM updates, not required in this PR

* Update README.md

* code changes for the exploration flow to work with robosuite master branch

* code cleanup + documentation

* jupyter tutorial for the goal-based exploration + scatter plot

* typo fix

* Update README.md

* separate parameter for the obs-goal observation + small fixes

* code clarity fixes

* adjustment in tutorial 5

* Update tutorial

* Update tutorial

Co-authored-by: Guy Jacob <[email protected]>
Co-authored-by: Gal Leibovich <[email protected]>
Co-authored-by: shadi.endrawis <[email protected]>
4 people authored May 31, 2021
1 parent 235a259 commit 0896f43
Showing 25 changed files with 1,905 additions and 46 deletions.
15 changes: 14 additions & 1 deletion README.md
@@ -45,6 +45,7 @@ coach -p CartPole_DQN -r
* [Distributed Multi-Node Coach](#distributed-multi-node-coach)
* [Batch Reinforcement Learning](#batch-reinforcement-learning)
- [Supported Environments](#supported-environments)
* [Note on MuJoCo version](#note-on-mujoco-version)
- [Supported Algorithms](#supported-algorithms)
- [Citation](#citation)
- [Contact](#contact)
@@ -202,7 +203,7 @@ There are [example](https://github.com/IntelLabs/coach/blob/master/rl_coach/pres

* *OpenAI Gym:*

Installed by default by Coach's installer
Installed by default by Coach's installer (see note on MuJoCo version [below](#note-on-mujoco-version)).

* *ViZDoom:*

@@ -258,6 +259,18 @@ There are [example](https://github.com/IntelLabs/coach/blob/master/rl_coach/pres
https://github.com/deepmind/dm_control
* *Robosuite:*<a name="robosuite"></a>
**__Note:__ To use Robosuite-based environments, please install Coach from the latest cloned repository. It is not yet available as part of the `rl_coach` package on PyPI.**
Follow the instructions described in the [robosuite documentation](https://robosuite.ai/docs/installation.html) (see note on MuJoCo version [below](#note-on-mujoco-version)).
### Note on MuJoCo version
OpenAI Gym supports MuJoCo only up to version 1.5 (and corresponding mujoco-py version 1.50.x.x). The Robosuite simulation framework, however, requires MuJoCo version 2.0 (and corresponding mujoco-py version 2.0.2.9, as of robosuite version 1.2). Therefore, if you wish to run both Gym-based MuJoCo environments and Robosuite environments, it's recommended to have a separate virtual environment for each.
Please note that all Gym-Based MuJoCo presets in Coach (`rl_coach/presets/Mujoco_*.py`) have been validated _**only**_ with MuJoCo 1.5 (including the reported [benchmark results](benchmarks)).
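As a quick sanity check for the note above, here is a minimal sketch (not part of the README) that reports which mujoco-py build the active virtual environment provides. It assumes the PyPI distribution is named `mujoco-py` and uses `pkg_resources`, which ships with setuptools.

```python
# Minimal sketch: report which mujoco-py build this virtualenv provides, so it's
# clear whether it matches Gym-based presets (1.50.x.x) or Robosuite (2.0.2.x).
import pkg_resources

try:
    mj_version = pkg_resources.get_distribution("mujoco-py").version
except pkg_resources.DistributionNotFound:
    mj_version = None

if mj_version is None:
    print("mujoco-py is not installed in this environment")
elif mj_version.startswith("1.50"):
    print("mujoco-py %s: matches the Gym-based MuJoCo presets" % mj_version)
elif mj_version.startswith("2.0"):
    print("mujoco-py %s: matches the Robosuite requirement" % mj_version)
else:
    print("mujoco-py %s: check compatibility manually" % mj_version)
```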
## Supported Algorithms
1 change: 1 addition & 0 deletions requirements.txt
@@ -14,3 +14,4 @@ redis>=2.10.6
minio>=4.0.5
pytest>=3.8.2
psutil>=5.5.0
joblib>=0.17.0
25 changes: 13 additions & 12 deletions rl_coach/agents/agent.py
@@ -257,7 +257,6 @@ def initialize_session_dependent_components(self):
:return: None
"""

# Loading a memory from a CSV file, requires an input filter to filter through the data.
# The filter needs a session before it can be used.
if self.ap.memory.load_memory_from_file_path:
@@ -418,10 +417,11 @@ def reset_evaluation_state(self, val: RunPhase) -> None:
self.num_successes_across_evaluation_episodes = 0
self.num_evaluation_episodes_completed = 0

# TODO verbosity was mistakenly removed from task_parameters on release 0.11.0, need to bring it back
# if self.ap.is_a_highest_level_agent or self.ap.task_parameters.verbosity == "high":
if self.ap.is_a_highest_level_agent:
screen.log_title("{}: Starting evaluation phase".format(self.name))
if self.ap.task_parameters.evaluate_only is None:
# TODO verbosity was mistakenly removed from task_parameters on release 0.11.0, need to bring it back
# if self.ap.is_a_highest_level_agent or self.ap.task_parameters.verbosity == "high":
if self.ap.is_a_highest_level_agent:
screen.log_title("{}: Starting evaluation phase".format(self.name))

elif ending_evaluation:
# we write to the next episode, because it could be that the current episode was already written
@@ -439,11 +439,12 @@ def reset_evaluation_state(self, val: RunPhase) -> None:
"Success Rate",
success_rate)

# TODO verbosity was mistakenly removed from task_parameters on release 0.11.0, need to bring it back
# if self.ap.is_a_highest_level_agent or self.ap.task_parameters.verbosity == "high":
if self.ap.is_a_highest_level_agent:
screen.log_title("{}: Finished evaluation phase. Success rate = {}, Avg Total Reward = {}"
.format(self.name, np.round(success_rate, 2), np.round(evaluation_reward, 2)))
if self.ap.task_parameters.evaluate_only is None:
# TODO verbosity was mistakenly removed from task_parameters on release 0.11.0, need to bring it back
# if self.ap.is_a_highest_level_agent or self.ap.task_parameters.verbosity == "high":
if self.ap.is_a_highest_level_agent:
screen.log_title("{}: Finished evaluation phase. Success rate = {}, Avg Total Reward = {}"
.format(self.name, np.round(success_rate, 2), np.round(evaluation_reward, 2)))

def call_memory(self, func, args=()):
"""
@@ -568,7 +569,7 @@ def handle_episode_ended(self) -> None:
for transition in self.current_episode_buffer.transitions:
self.discounted_return.add_sample(transition.n_step_discounted_rewards)

if self.phase != RunPhase.TEST or self.ap.task_parameters.evaluate_only:
if self.phase != RunPhase.TEST or self.ap.task_parameters.evaluate_only is not None:
self.current_episode += 1

if self.phase != RunPhase.TEST:
@@ -828,7 +829,7 @@ def act(self, action: Union[None, ActionType]=None) -> ActionInfo:
return None

# count steps (only when training or if we are in the evaluation worker)
if self.phase != RunPhase.TEST or self.ap.task_parameters.evaluate_only:
if self.phase != RunPhase.TEST or self.ap.task_parameters.evaluate_only is not None:
self.total_steps_counter += 1
self.current_episode_steps_counter += 1

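The hunks above replace truthiness tests on `task_parameters.evaluate_only` with explicit `None` checks, presumably because `evaluate_only` may legitimately be 0 (an evaluation-only run with no step limit), which a plain `if` would treat the same as `None`. A minimal, self-contained illustration of the difference (not Coach code):

```python
# Minimal illustration: why `is not None` differs from a truthiness check
# when a parameter may legitimately be 0.
def is_evaluation_worker(evaluate_only):
    # Truthiness check: wrongly treats 0 ("evaluate indefinitely") like None
    # ("not an evaluation-only run").
    truthy = bool(evaluate_only)
    # Explicit check: only None means "not an evaluation-only run".
    explicit = evaluate_only is not None
    return truthy, explicit

print(is_evaluation_worker(None))  # (False, False) - both agree
print(is_evaluation_worker(100))   # (True, True)   - both agree
print(is_evaluation_worker(0))     # (False, True)  - only the explicit check is right
```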
18 changes: 13 additions & 5 deletions rl_coach/agents/clipped_ppo_agent.py
@@ -15,6 +15,7 @@
#

import copy
import math
from collections import OrderedDict
from random import shuffle
from typing import Union
@@ -156,8 +157,17 @@ def set_session(self, sess):
def fill_advantages(self, batch):
network_keys = self.ap.network_wrappers['main'].input_embedders_parameters.keys()

current_state_values = self.networks['main'].online_network.predict(batch.states(network_keys))[0]
current_state_values = current_state_values.squeeze()
state_values = []
for i in range(int(batch.size / self.ap.network_wrappers['main'].batch_size) + 1):
start = i * self.ap.network_wrappers['main'].batch_size
end = (i + 1) * self.ap.network_wrappers['main'].batch_size
if start == batch.size:
break

state_values.append(self.networks['main'].online_network.predict(
{k: v[start:end] for k, v in batch.states(network_keys).items()})[0])

current_state_values = np.concatenate(state_values)
self.state_values.add_sample(current_state_values)

# calculate advantages
@@ -213,9 +223,7 @@ def train_network(self, batch, epochs):
self.networks['main'].online_network.output_heads[1].likelihood_ratio,
self.networks['main'].online_network.output_heads[1].clipped_likelihood_ratio]

# TODO-fixme if batch.size / self.ap.network_wrappers['main'].batch_size is not an integer, we do not train on
# some of the data
for i in range(int(batch.size / self.ap.network_wrappers['main'].batch_size)):
for i in range(math.ceil(batch.size / self.ap.network_wrappers['main'].batch_size)):
start = i * self.ap.network_wrappers['main'].batch_size
end = (i + 1) * self.ap.network_wrappers['main'].batch_size

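Both hunks above slice the rollout into fixed-size minibatches: `fill_advantages` now predicts state values chunk by chunk instead of in one large forward pass, and `train_network` uses `math.ceil` so the final partial minibatch is no longer dropped. Below is a standalone sketch of the slicing pattern, with illustrative names rather than Coach's API:

```python
# Standalone sketch of the minibatch slicing pattern used above (illustrative
# names, not Coach's API): math.ceil covers the final partial chunk that an
# int() division would drop.
import math
import numpy as np

def iter_minibatches(data, minibatch_size):
    """Yield consecutive slices of `data`, including a final partial slice."""
    num_batches = math.ceil(len(data) / minibatch_size)
    for i in range(num_batches):
        start = i * minibatch_size
        end = (i + 1) * minibatch_size
        yield data[start:end]

batch = np.arange(10)  # pretend this is a rollout of 10 transitions
chunks = list(iter_minibatches(batch, 4))
print([c.tolist() for c in chunks])  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(int(10 / 4), "full batches with int(); the last 2 samples would be skipped")
```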