Unsupervised skill learning methods are a form of unsupervised pre-training for reinforcement learning (RL) with the potential to improve the sample efficiency of solving downstream tasks. Prior work has proposed several methods for unsupervised skill discovery based on mutual information (MI) objectives, with different methods varying in how this mutual information is estimated and optimized. This paper studies how different design decisions in skill learning algorithms affect the sample efficiency of solving downstream tasks. Our key findings are that (1) downstream adaptation is more sample-efficient with off-policy backbones than with their on-policy counterparts, while on-policy backbones yield better state coverage; (2) regularizing the discriminator improves downstream results; (3) careful choice of the mutual information lower bound and the discriminator architecture yields significant improvements in downstream returns; and (4) the representations learned during pre-training correspond to the controllable aspects of the environment.
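As a quick pointer to the underlying objective (a sketch for orientation, not part of the repo's interface): discriminator-based skill discovery methods maximize a variational lower bound on the mutual information between states $S$ and skills $Z$, e.g. the Barber-Agakov bound (presumably the "ba" option of the `lb` argument documented below):

$$
I(S;Z) = \mathcal{H}(Z) - \mathcal{H}(Z \mid S) \ge \mathcal{H}(Z) + \mathbb{E}_{z \sim p(z),\, s \sim \pi_z}\big[\log q_\phi(z \mid s)\big],
$$

where $p(z)$ is a fixed skill prior and $q_\phi(z \mid s)$ is the learned discriminator, whose architecture the `pm` argument selects.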
git clone https://github.com/FaisalAhmed0/SLUSD
cd SLUSD
conda create -n slusd python=3.8
conda activate slusd
cd ..
git clone https://github.com/FaisalAhmed0/stable-baselines3.git
cd stable-baselines3
pip install -e .
cd ..
git clone https://github.com/facebookresearch/mbrl-lib.git
cd mbrl-lib
pip install -e .
Make sure that MuJoCo is installed by following the instructions at https://github.com/openai/mujoco-py
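If MuJoCo is not installed yet, the steps from the mujoco-py README look roughly like this (a sketch assuming MuJoCo 2.1 on Linux with the binaries under ~/.mujoco; adjust the paths for your version):

```
mkdir -p ~/.mujoco
wget https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz
tar -xzf mujoco210-linux-x86_64.tar.gz -C ~/.mujoco
# mujoco-py loads the native libraries from this path at import time
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin
pip install mujoco-py
```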
cd ../SLUSD
pip install -r requirements.txt
pip install -e .
python src/finetune.py --run_all True
When run_all is True, all random seeds run in parallel.
Arg | Description | Supported values | Default value |
---|---|---|---|
env | Environment name | Any OpenAI Gym environment with continuous actions and vector states | "MountainCarContinuous-v0" |
alg | Deep RL algorithm | "sac" for Soft Actor-Critic, "ppo" for Proximal Policy Optimization | "ppo" |
skills | Number of skills to learn | Positive integers | 6 |
presteps | Number of pretraining steps | Positive integers | 1000000 |
lb | Mutual information lower bound | "ba" for the Barber-Agakov lower bound | "ba" |
pm | Discriminator parameterization | "MLP" for a feed-forward neural network, "Separable" for the separable architecture, "Concat" for the concatenation architecture, "linear" for the linear parameterization | "MLP" |
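For example, a non-default run might look like the following (assuming finetune.py exposes the arguments above as command-line flags, the same way record.py does):

```
python src/finetune.py --env MountainCarContinuous-v0 --alg sac --skills 10 --presteps 500000 --lb ba --pm Separable
```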
python src/experiments/scalability_exper.py --run_all True
python src/experiments/regularization_exper.py --run_all True
tensorboard --logdir ./logs_finetune
python record.py --env <env_name> --stamp <timestamp> --skills <no. skills> --cls <pm> --lb <mi lower bound>
Here stamp is the experiment's timestamp; you can copy it from the experiment's folder name. If you are running this code on a headless server, make sure xvfb is installed:
sudo apt-get install xvfb
Then run the recording script through xvfb-run:
xvfb-run -a python record.py --env <env_name> --stamp <timestamp> --skills <no. skills> --cls <pm> --lb <mi lower bound>