
# Using the Laplacian in Reinforcement Learning


This repository is based on the following two papers and attempts to explore their results:

Although the 2017 paper states that the proto-value functions used are the eigenvectors of the Laplacian corresponding to the largest eigenvalues, the 2018 paper contradicts this, stating:

"smallest eigenvectors of the Laplacian are well-known to provide succinct representation of the geometry of a weighted graph."

## Basic Concept

By using the Laplacian framework, we are interested in learning options. An option is a sequence of actions carried out to satisfy some purpose or task, and it is characterised by three components: a set of starting states where the option is available, a policy used to select an action at every timestep while the option is being followed, and a set of terminating states. Ideally, an option should be available at any state, but the challenge lies in defining the policy attached to the option and its set of terminal states.
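A minimal sketch of these three components (the names below are illustrative, not taken from this repository's code):

```python
from dataclasses import dataclass
from typing import Callable, Set

State = int    # e.g. a flattened gridworld cell index
Action = int   # e.g. up/down/left/right encoded as 0..3

@dataclass
class Option:
    """An option: where it can start, how it acts, and where it stops."""
    initiation_set: Set[State]          # states in which the option is available
    policy: Callable[[State], Action]   # maps the current state to an action
    termination_set: Set[State]         # states in which the option terminates

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set

    def is_terminal(self, s: State) -> bool:
        return s in self.termination_set
```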

The 2017 paper explains how eigenvectors of the Laplacian matrix are used to obtain eigenpurposes, which characterise intrinsic reward signals. By defining a terminal state as one whose Q value is negative no matter the action, the authors discover eigenoptions. The policy for each of these options is computed by optimising the value function under the corresponding intrinsic reward signal.
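A minimal sketch of this reward and termination rule, assuming a state feature map `phi` (e.g. a one-hot encoding of the gridworld cell) and a tabular Q for a single eigenpurpose; all names here are illustrative:

```python
import numpy as np

def intrinsic_reward(eigenpurpose: np.ndarray,
                     phi_s: np.ndarray,
                     phi_next: np.ndarray) -> float:
    """Intrinsic reward for one eigenpurpose: how far the transition moves
    the state features along the chosen eigenvector."""
    return float(eigenpurpose @ (phi_next - phi_s))

def is_terminal(q_values_for_state: np.ndarray) -> bool:
    """A state is terminal for the eigenoption when every action has a
    negative Q value under the intrinsic reward."""
    return bool(np.all(q_values_for_state < 0.0))
```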

Laplace's equation for a function $f$ is a second-order partial differential equation requiring that the sum of all second-order partial derivatives of $f$ equals zero, i.e. $\nabla^2 f = 0$. By solving this equation, we obtain functions that describe the topology of the domain. If the domain is discretised, we instead consider the Laplacian matrix and obtain vectors in place of functions. As the 2017 paper considers gridworlds, we first obtain solutions through the Laplacian matrix rather than by solving Laplace's equation for a function.
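As a small example, the combinatorial graph Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{A}$ can be built directly from the adjacency matrix $\mathbf{A}$ and degree matrix $\mathbf{D}$; the 2x2 gridworld below is purely illustrative:

```python
import numpy as np

# Adjacency matrix of a 2x2 gridworld (states 0-3, 4-connected, no walls).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)

D = np.diag(A.sum(axis=1))  # degree matrix
L = D - A                   # combinatorial graph Laplacian
```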

It is desirable to obtain solutions that are independent, so once we have constructed the Laplacian matrix $\mathbf{L}$ we perform the eigenvalue decomposition, i.e. $\mathbf{L} = \mathbf{V \Lambda V}^{-1}$, where the column vectors of $\mathbf{V}$ satisfy $\mathbf{v}_i^\mathrm{T}\mathbf{v}_j = \delta_{ij}$ and $\delta_{ij}$ is the Kronecker delta. It is these vectors that, when used as intrinsic reward functions, yield interesting policies referred to as eigenoptions.
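A sketch of how these eigenvectors can be extracted with NumPy, reusing the tiny Laplacian from the previous snippet (`eigh` applies because $\mathbf{L}$ is symmetric and returns eigenvalues in ascending order):

```python
import numpy as np

# The illustrative 2x2 gridworld Laplacian from the previous snippet.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# eigh returns eigenvalues in ascending order and orthonormal eigenvector
# columns, i.e. V.T @ V = I.
eigenvalues, V = np.linalg.eigh(L)
assert np.allclose(V.T @ V, np.eye(V.shape[0]))

# The first column corresponds to eigenvalue 0 (constant on a connected graph)
# and is usually skipped; the remaining smallest-eigenvalue eigenvectors are
# used as proto-value functions.
pvfs = V[:, 1:]
```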

## Results

The implementation in this repository takes the second (2018) paper to have the more correct definition, as the resulting geometries are more stable, with the general exception of the eigenvector whose corresponding eigenvalue is 0.

In the four-rooms environment, first shown in the 2017 paper, the first PVF is the option to either go to the closest corner of the room or go to one of the doorways, i.e. a bottleneck state.

As each eigenvector can be taken with either sign, we can follow the complete opposite of a given option to obtain the opposite result, e.g. go to the far edge of either room or go to the doorway.
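For instance (with purely illustrative numbers), negating an eigenpurpose negates the intrinsic reward of every transition, so the optimal behaviour is reversed:

```python
import numpy as np

e = np.array([0.3, -0.1, 0.5])           # one eigenpurpose (illustrative values)
delta_phi = np.array([1.0, 0.0, -1.0])   # phi(s') - phi(s) for some transition

r_forward = float(e @ delta_phi)          # reward under the eigenpurpose
r_reverse = float(-e @ delta_phi)         # reward under its negation
assert np.isclose(r_reverse, -r_forward)  # the negated option rewards opposite motion
```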

For the hard maze gridworld, the first few options describe how to get to the dead-ends.

Interestingly, the three-rooms gridworld shares the same property as the four-rooms environment, in the sense that the first PVF describes how to get to the corner of a room or to a doorway.

## Future Work

Implement the 2018 paper's approach for a continuous state space such as CartPole.
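A rough sketch of the kind of objective this would involve, written with PyTorch and simplified from the 2018 paper's graph-drawing formulation; the network architecture, dimensions, and the exact form of the repulsive penalty below are illustrative assumptions, not this repository's code:

```python
import torch
import torch.nn as nn

class LaplacianEncoder(nn.Module):
    """Maps a continuous state (e.g. a CartPole observation) to d approximate
    Laplacian eigenfunctions."""
    def __init__(self, state_dim: int = 4, d: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, d),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def graph_drawing_loss(phi_s, phi_next, phi_u, phi_v, beta: float = 1.0):
    """Simplified spectral objective: pull representations of consecutive
    states together (attractive term) while penalising deviation from
    orthonormality on independently sampled states (repulsive term)."""
    attractive = ((phi_s - phi_next) ** 2).sum(dim=1).mean()
    inner = (phi_u * phi_v).sum(dim=1)
    repulsive = (inner ** 2
                 - (phi_u ** 2).sum(dim=1)
                 - (phi_v ** 2).sum(dim=1)).mean()
    return attractive + beta * repulsive
```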