RLlib kestrel update #597

Open · wants to merge 17 commits into base: master
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,2 +1,4 @@
.*
!/.gitignore
**.pyc
**/__pycache__
110 changes: 110 additions & 0 deletions languages/python/RLlib/README.md
@@ -0,0 +1,110 @@
## Use this tutorial

This tutorial provides examples of how to use [RLlib](https://docs.ray.io/en/master/rllib/) for reinforcement learning, with an emphasis on building customized environments for your own optimal control problems. The tutorial is written for the NREL HPC system Kestrel, but it can easily be adapted to run on a local computer.

We suggest working through this tutorial in the following order:

1. Understand how to build a custom environment for your problem. Detailed guidelines are provided [here](custom_gym_env/README.md).

2. Train the RL agent/policy/controller by following [this guideline](train/README.md).

3. Test the trained RL agent as explained [here](test/README.md).

Before starting, please follow the instructions below to set up a Python conda environment.

## Create Anaconda environment

Follow the steps below to create an Anaconda environment for this tutorial:

### 1st step: Log in to Kestrel (skip this step if working on a local computer)

Log in to Kestrel with:
```
ssh kestrel
```
if you have the hostname configured in your SSH config, or
```
ssh <username>@kestrel.hpc.nrel.gov
```

### 2nd step: Set up Anaconda environment

To use `conda` on Kestrel (unlike on Eagle), the Anaconda module needs to be loaded first.
```
module purge
module load anaconda3
```

We suggest creating a conda environment under `/projects` rather than `/home` or `/scratch`.

***Example:***

Use the following script to create a conda environment:
```
mkdir -p /projects/$HPC_HANDLE/$USER/conda_envs
conda create --prefix=/projects/$HPC_HANDLE/$USER/conda_envs/rl_hpc python=3.10
```

Here, `$HPC_HANDLE` is the project handle and `$USER` is your HPC user name.

Activate the conda environment and install packages:

```
conda activate /projects/$HPC_HANDLE/$USER/conda_envs/rl_hpc

pip install -r requirements.txt
```
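
For reference, the snippet below sketches the kind of dependencies `requirements.txt` is expected to pull in for this tutorial. It is illustrative only (unpinned and possibly incomplete); the `requirements.txt` file in the repository remains the authoritative list.
```
# Illustrative only -- install from the repository's requirements.txt instead.
ray[rllib]
torch
gymnasium
```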

### 3rd step: Test OpenAI Gym API

After the installation completes, make sure everything works correctly by running a small example with one of the standard Gym environments (e.g., `CartPole-v1`).

Activate the environment and start a Python session:
```
module purge
module load anaconda3
conda activate /projects/$HPC_HANDLE/$USER/conda_envs/rl_hpc
python
```
On Kestrel, it is best to do this inside an interactive session rather than on a login node. In the Python session, run the following:
```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()

done = False

while not done:
action = env.action_space.sample()
obs, rew, terminated, truncated, info = env.step(action)
done = (terminated or truncated)
print(action, obs, rew, done)
```
If everything works correctly, you will see an output similar to:
```
0 [-0.04506794 -0.22440939 -0.00831435 0.26149667] 1.0 False
1 [-0.04955613 -0.02916975 -0.00308441 -0.03379707] 1.0 False
0 [-0.05013952 -0.22424733 -0.00376036 0.2579111 ] 1.0 False
0 [-0.05462447 -0.4193154 0.00139787 0.54940559] 1.0 False
0 [-0.06301078 -0.61445696 0.01238598 0.84252861] 1.0 False
1 [-0.07529992 -0.41950623 0.02923655 0.55376634] 1.0 False
0 [-0.08369004 -0.61502627 0.04031188 0.85551538] 1.0 False
0 [-0.09599057 -0.8106737 0.05742218 1.16059658] 1.0 False
0 [-0.11220404 -1.00649474 0.08063412 1.47071687] 1.0 False
1 [-0.13233393 -0.81244634 0.11004845 1.20427076] 1.0 False
1 [-0.14858286 -0.61890536 0.13413387 0.94800442] 1.0 False
0 [-0.16096097 -0.8155534 0.15309396 1.27964413] 1.0 False
1 [-0.17727204 -0.62267747 0.17868684 1.03854806] 1.0 False
0 [-0.18972559 -0.81966549 0.1994578 1.38158021] 1.0 False
0 [-0.2061189 -1.0166379 0.22708941 1.72943365] 1.0 True
```

### 4th step: Test other libraries
The following libraries should also import without errors.

```python
import ray
import torch
```
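
As an additional sanity check, you can print the installed versions (the exact numbers depend on what `requirements.txt` installed) and see whether PyTorch detects a GPU:
```python
import ray
import torch

# Print installed versions as a quick sanity check.
print("ray:", ray.__version__)
print("torch:", torch.__version__)

# Reports True only on a GPU node with a CUDA-enabled PyTorch build.
print("CUDA available:", torch.cuda.is_available())
```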

100 changes: 100 additions & 0 deletions languages/python/RLlib/custom_gym_env/README.md
@@ -0,0 +1,100 @@
# Create a customized Gym environment

This section demonstrates how to create your own Gym environment, carefully tailored to your needs. At NREL, this could be an optimal control problem in grid operation, building energy management, or traffic control.

## High-level overview

To facilitate deep RL implementations and tests of new algorithms, OpenAI Gym has become the standard interface connecting RL agents/algorithms with problems. Given such a standard interface, RL training and experiments can be done in a plug-and-play manner, as shown in the figure below. We can use any RL agent implementation (e.g., RLlib or Stable-Baselines3) with different RL algorithms (e.g., PPO, SAC, A3C, and DDPG) to learn a policy for different problems (e.g., cart-pole, lunar landing, or the problem of your interest) via the standard interface.

<p align="center">
<img src="../tutorial_img/gym_and_agent.png" alt="Interaction between the Gym environment and the RL agent" width="70%"><br>
<em>The interaction between Gym environment and RL agent.</em>
</p>

To use an RL training framework such as RLlib (as in this tutorial) to train an optimal policy for the problem of interest, the customized environment should follow the Gym API standard.

## Gym environment API structure

After the latest release (0.26.2) of the [OpenAI Gym repo](https://github.com/openai/gym), it was [announced](https://github.com/openai/gym?tab=readme-ov-file#important-notice) that all future maintenance and development has moved to [Gymnasium](https://github.com/Farama-Foundation/Gymnasium). This tutorial therefore follows the Gymnasium API guidelines, but we will still refer to the environment as a "gym" environment for simplicity.

This tutorial also focuses on __episodic__ environments, meaning the optimal control is implemented over a finite control horizon. The episode ends either when a fixed number of steps is reached or when a terminal state is reached.

As the first step, import gymnasium
```python
import gymnasium as gym
```

The custom-made gym environment should follow the structure below, with three core functions:

```python
class CustomEnv(gym.Env):

    def __init__(self):
        # Initialize the environment.
        # Called only once, when this class is instantiated.
        ...

    def reset(self, seed=None, options={}):
        # Reset the environment to the beginning of a control episode.
        # Called at the start of every episode, i.e., once the previous episode is complete.
        ...
        return obs, info

    def step(self, action):
        # Implement the control using the provided action.
        # Called at each step/control interval within the episode.
        ...
        return obs, reward, terminated, truncated, info
```

Next, we explain each function in more detail; a minimal end-to-end sketch follows this list.

Typically, the environment should inherit from the `gym.Env` class by defining `class CustomEnv(gym.Env)`. The three core gym functions are:

* `def __init__(self)`: Initializes the environment. More specifically, it generally involves the following three tasks:
- Defining necessary variables/hyperparameters.
- Defining the dimensionality of the observation and action spaces for the problem, specified via the attributes `self.observation_space` and `self.action_space`, respectively. Depending on their nature, they can take discrete values, continuous values, or a combination of both. See [this link](https://gymnasium.farama.org/api/spaces/) for more details.
- [Optional] Initializing the external simulator if you need one.
* `def reset(self, seed=None, options={})`: RL training requires repeated trial and error. Once an episode ends, whether terminated or truncated, the `reset` function allows the agent to reset the environment back to the initial state, start over, and try again.
Specific tasks could include:
- Resetting the state of the environment
- Resetting other utilities, e.g., setting the step counter back to zero: `self.step_count = 0`.
- Resetting the simulator if you have one.

The argument `seed` is used to set the random seed, and `options` can pass in a desired configuration for resetting, e.g., resetting to a specific initial state that would otherwise be generated randomly.

The outputs of the reset function are `obs` and `info`: `obs` is the observation that the RL agent uses as the policy input, while `info` is a dictionary of auxiliary information (not necessary for learning, but useful for other purposes such as debugging).

* `def step(self, action)`: Defines the inner mechanics of the environment and moves the simulation one step ahead using the provided `action`. Specific tasks inside this function fall into the following four categories:
- Advance the system by one step through the state transition function $s_{t+1} = f(s_t, a_t)$ defined by the Markov decision process. If you have an external simulator, take one step in the simulator as well. Obtain the new observation, and remember to increment the step counter if necessary: `self.step_count += 1`.
- Calculate the reward based on this step's control.
- Determine whether the current episode should end, either terminated or truncated. For example, if `self.step_count >= MAX_STEP_ALLOWED`, set `truncated=True`.
- [Optional] Prepare additional information and put it in the `info` dictionary.

Outputs of this function are:
- `obs`: Usually a NumPy array containing the new observation after this step's control.
- `reward`: A scalar reward reflecting the performance of this step's control.
- `terminated`: A Boolean reflecting whether a terminal state has been reached, e.g., the goal area is reached.
- `truncated`: A Boolean reflecting whether the time limit has been reached or another condition requires stopping the simulation.
- `info`: A Python dictionary containing auxiliary information. If none, use `info={}`.
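
For concreteness, the following is a minimal end-to-end sketch of an environment that follows this structure. It is not the CarPass environment discussed below; the dynamics, spaces, and reward (a 1-D point driven toward a target position) are invented purely for illustration.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class PointTargetEnv(gym.Env):
    """Minimal illustrative environment: drive a 1-D point to a target position."""

    MAX_STEPS = 50

    def __init__(self):
        # Observation: current position in [-10, 10]; action: velocity in [-1, 1].
        self.observation_space = spaces.Box(low=-10.0, high=10.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self.target = 5.0
        self.position = None
        self.step_count = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random
        # Start from a random position unless options specify one.
        if options and "start" in options:
            self.position = float(options["start"])
        else:
            self.position = float(self.np_random.uniform(-10.0, 10.0))
        self.step_count = 0
        obs = np.array([self.position], dtype=np.float32)
        return obs, {}

    def step(self, action):
        # 1. Advance the system one step: s_{t+1} = s_t + a_t.
        self.position = float(np.clip(self.position + float(action[0]), -10.0, 10.0))
        self.step_count += 1
        # 2. Reward: negative distance to the target (closer is better).
        reward = -abs(self.position - self.target)
        # 3. Episode-end conditions.
        terminated = abs(self.position - self.target) < 0.1  # goal reached
        truncated = self.step_count >= self.MAX_STEPS         # time limit hit
        # 4. Auxiliary information for debugging.
        info = {"step": self.step_count}
        obs = np.array([self.position], dtype=np.float32)
        return obs, reward, terminated, truncated, info
```

An environment like this can be exercised with the same kind of random-action loop used to test `CartPole-v1` in the main README of this tutorial.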

## Example: create a customized environment

This tutorial provides an example of a customized environment called "CarPass", in which an RL agent learns to control a moving car to reach a target waypoint as fast as possible while avoiding collisions with a parked (stationary) car and staying within bounds. The figure below shows an illustration.

<p align="center">
<img src="../tutorial_img/custom_env_illustration.png" alt="Illustration of the CarPass environment" width="40%"><br>
<em>Illustration of the car pass environment.</em>
</p>

See the [full implementation](custom_env.py) for details; the comments and docstrings provide further explanation.

Note that the customized environment also includes a `render()` function, which renders the 2D image shown above for visualization purposes. It is optional, so there is no need to implement `render()` unless you want visualization.
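
To quickly smoke-test the customized environment outside of RLlib, a random-action rollout like the one below can be used. Note that the module and class names (`custom_env`, `CarPassEnv`) are assumptions here; check [custom_env.py](custom_env.py) for the names actually used in the implementation.

```python
# Hypothetical smoke test -- the module/class names are assumptions;
# adjust them to match what custom_env.py actually defines.
from custom_env import CarPassEnv  # assumed class name

env = CarPassEnv()
obs, info = env.reset()

done = False
while not done:
    action = env.action_space.sample()  # random action, just to exercise the API
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

# env.render()  # optional: only if render() is implemented
```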

## More examples

Grid:
- Distribution System critical load restoration environment, see [this repo](https://github.com/NREL/rlc4clr/tree/main/rlc4clr/clr_envs/envs).

Building Control:
- Five-zone building HVAC control; see [this file](https://github.com/NREL/learning-building-control/blob/main/lbc/building_env.py). Note that this example still uses the OpenAI Gym API rather than the Gymnasium API.