
cartpole_rl

This project tackles the cartpole problem from OpenAI Gym using reinforcement learning techniques.

Dependencies

All you need (for now):

  • Python 2/3
  • Numpy
  • OpenAI Gym
  • Keras

Description

After going through the first six RL lectures by David Silver and the first ten chapters of "Reinforcement Learning: An Introduction" (Sutton and Barto), I implemented the following three methods:

  • SARSA and Q-learning algorithms with eligibility traces (adjustable lambda) for the tabular case.
  • SARSA and Q-learning algorithms with linear value function approximation (LVFA), without eligibility traces.
  • Q-learning algorithm with a neural network as the action-value function approximator (DQN), with experience replay.

Details of Each Method

  • Tabular Method: Since the cartpole problem has continuous state-space features, for this method I discretize them using the create_bins method. The bins have been chosen arbitrarily; the only property I wanted to maintain was keeping them centered around zero. A sketch of this discretization follows this list.

  • LVFA Method: I create a matrix of size (num_of_actions, size_of_state_signal), initialized randomly. Taking the dot product of this matrix with the state signal gives the value of each action. Updating is done with stochastic gradient descent, and, as per the theory, the learning rate is decayed to a minimum value. In the coming days, I may try to implement the TD(lambda) version of each algorithm. A sketch of the update appears after this list.

  • DQN Method: Here, I create a small 2-layer neural network, using Keras, with the Adam optimizer and a mean-squared-error loss function. I keep the 100 most recent timesteps of experience and train the agent every 2 timesteps on 20 randomly sampled transitions. Of course, these values can (and should) be experimented with. Otherwise, this follows the Q-learning algorithm for action selection. A sketch of this setup also appears after this list.
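
A minimal sketch of what the discretization and a tabular SARSA(lambda) update could look like; the bin edges and counts, the gamma value, and the discretize helper are my own illustrative assumptions rather than the repo's exact create_bins code.

```python
import numpy as np
from collections import defaultdict

def create_bins():
    # Bin edges for the four CartPole features (cart position, cart velocity,
    # pole angle, pole tip velocity), arbitrarily chosen but centered on zero.
    return [
        np.linspace(-2.4, 2.4, 9),
        np.linspace(-2.0, 2.0, 9),
        np.linspace(-0.21, 0.21, 9),
        np.linspace(-2.0, 2.0, 9),
    ]

def discretize(obs, bins):
    # Map each continuous feature to the index of the bin it falls into,
    # producing a hashable tuple usable as a table key.
    return tuple(int(np.digitize(x, b)) for x, b in zip(obs, bins))

n_actions = 2
Q = defaultdict(lambda: np.zeros(n_actions))  # tabular action values
E = defaultdict(lambda: np.zeros(n_actions))  # eligibility traces

def sarsa_lambda_update(s, a, r, s_next, a_next, alpha=0.45, gamma=1.0, lam=0.37):
    # One SARSA(lambda) backup with accumulating traces; alpha and lam here
    # mirror the table below, gamma is an assumption.
    td_error = r + gamma * Q[s_next][a_next] - Q[s][a]
    E[s][a] += 1.0
    for state in list(E.keys()):
        Q[state] += alpha * td_error * E[state]
        E[state] *= gamma * lam
```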
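
A minimal sketch of the LVFA idea as described above; the initialization scale, gamma, and the plain Q-learning flavour of the SGD step are assumptions, not the repo's exact implementation.

```python
import numpy as np

n_actions, state_dim = 2, 4
W = np.random.randn(n_actions, state_dim) * 0.01  # one weight row per action

def q_values(state):
    # The dot product of the weight matrix with the state signal gives
    # one value per action.
    return W.dot(state)

def q_learning_step(state, action, reward, next_state, done, alpha, gamma=1.0):
    # Stochastic gradient descent on the squared TD error; since
    # Q(s, a) = W[a] . s, the gradient with respect to W[a] is just s.
    target = reward if done else reward + gamma * np.max(q_values(next_state))
    td_error = target - q_values(state)[action]
    W[action] += alpha * td_error * state
```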
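
A minimal sketch of the kind of Keras model and replay step described above; the buffer size (100) and minibatch size (20) follow the prose, while the layer sizes, gamma, and helper names are assumptions.

```python
import random
from collections import deque

import numpy as np
from keras.models import Sequential      # with TF2, tensorflow.keras works the same
from keras.layers import Dense
from keras.optimizers import Adam

state_dim, n_actions = 4, 2

# Small 2-layer network mapping a state to one Q-value per action,
# trained with Adam and a mean-squared-error loss.
model = Sequential([
    Dense(24, activation='relu', input_shape=(state_dim,)),
    Dense(n_actions, activation='linear'),
])
model.compile(optimizer=Adam(0.001), loss='mse')

memory = deque(maxlen=100)  # keep only the 100 latest transitions

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def replay(batch_size=20, gamma=0.99):
    # Experience replay: fit the network on a random minibatch of stored
    # transitions using the one-step Q-learning target.
    batch = random.sample(memory, min(batch_size, len(memory)))
    for state, action, reward, next_state, done in batch:
        target = model.predict(state[np.newaxis])[0]
        target[action] = reward if done else \
            reward + gamma * np.max(model.predict(next_state[np.newaxis])[0])
        model.fit(state[np.newaxis], target[np.newaxis], epochs=1, verbose=0)
```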

Playing with Hyperparameters (Empirically)

In all implemented methods, I follow the rule set by OpenAI Gym, i.e. the environment is "considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials." Therefore, when an agent achieves a mean score of at least 195.0 over 100 consecutive trials, I break out of the training loop and test the agent with the final_agent_test() function. In the testing phase, however, I require that the agent survive 500 timesteps without failing in each of 10 episodes. A sketch of this stopping check follows.
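
For concreteness, a minimal sketch of the solved check described above; the solved helper name is hypothetical, while final_agent_test() is the repo's own function.

```python
from collections import deque
import numpy as np

def solved(scores, threshold=195.0, window=100):
    # OpenAI Gym criterion: mean reward >= 195.0 over 100 consecutive trials.
    return len(scores) >= window and np.mean(list(scores)[-window:]) >= threshold

scores = deque(maxlen=100)
# In the training loop: append each episode's total reward, then
#   if solved(scores):
#       break   # and run final_agent_test() for 10 episodes of 500 timesteps
```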

After playing with all the hyperparameters, the following table shows the best hyperparameters that meet the 500-timestep goal.

| Algorithm  | Method         | Alpha                   | Lambda        | Number of episodes before success |
| ---------- | -------------- | ----------------------- | ------------- | ---------------------------------- |
| SARSA      | Tabular        | 0.45                    | Constant 0.0  | 420                                |
| Q-learning | Tabular        | 0.46                    | Constant 0.0  | 450                                |
| SARSA      | Tabular        | 0.46                    | Constant 0.37 | 360                                |
| Q-learning | Tabular        | 0.5 --> 0.1 (0.9 decay) | 0.5           | 280                                |
| SARSA      | LVFA           | 0.2 --> 0.1 (0.9 decay) | N/A           | 890                                |
| Q-learning | LVFA           | 0.5 --> 0.4 (0.99 decay)| N/A           | 2010                               |
| Q-learning | Neural Network | 0.001 (Adam)            | N/A           | 320                                |
