RL_Maze

Applying physical principles to the reinforcement learning task of finding a path through a maze

Inspired by Friston's active inference model, which aims to minimize the free energy function via a Markov decision process. In this repo the same function is minimized using a mechanical approach instead of Bayesian inference.

Task

Initially the reinforcement learning agent (the blue ball) is placed at the upper left corner of the maze. We want the agent to learn the shortest path to the destination, which is the red square at the bottom right corner.

At every step the agent chooses one of 4 actions: go north, south, east or west. After each step the environment responds with the agent's position and whether the task is completed.

Simple 2D maze environment

Environment

We use gym to emulate the maze environment - this framework provides a GUI for training visualization, maze sample generation and a q-learning baseline. The environment is completely independent of the decision-making algorithm.
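For orientation, a minimal interaction loop with a gym-style maze environment might look like the sketch below. The gym_maze import and the maze-sample-5x5-v0 environment id are assumptions about the underlying package, not necessarily how the scripts in this repo are organized:

```python
import gym
import gym_maze  # assumed package that registers the maze environments

env = gym.make("maze-sample-5x5-v0")    # hypothetical 5x5 sample maze id
state = env.reset()                     # agent starts in the upper left corner

done = False
while not done:
    env.render()                        # GUI visualization of the maze
    action = env.action_space.sample()  # one of the four moves: north, south, east or west
    # the environment responds with the new position and a completion flag
    state, reward, done, info = env.step(action)

env.close()
```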

Theory

  • What is the free energy F in physics?

In statistical physics and thermodynamics, complicated systems are often described by a variety of thermodynamic potentials, such as entropy, enthalpy or the free energy function. The important feature of these potentials is that all of them reach an extremum at the equilibrium state of the system. In our case the equilibrium state is defined as the agent reaching the bottom right corner of the maze, because at this point the agent does not have to do anything and can stay there forever, which is an equilibrium by definition.

Statistically, the free energy F can be expressed using the statistical sum Z:

$$Z = \sum_i e^{-E_i / kT},$$

where E_i denotes the energy of state i. Then, the free energy is:

$$F = -kT \ln Z,$$

where k is the Boltzmann constant and T is the temperature.

This gives the statistical description.
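As a quick toy check of these two formulas (not related to the maze code), one can evaluate Z and F for a handful of states with known energies:

```python
import math

# hypothetical energies of three states (arbitrary units) and kT set to 1
energies = [0.0, 1.0, 2.0]
kT = 1.0

Z = sum(math.exp(-E / kT) for E in energies)  # statistical sum Z
F = -kT * math.log(Z)                         # free energy F = -kT * ln Z

print(Z, F)  # Z ≈ 1.50, F ≈ -0.41
```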

The thermodynamical or statistical description of a physical system is a powerful tool, suitable for describing complicated many-particle systems. However, in a one-particle system such as ours it is much more convenient to use a mechanical approach.

  • The mechanical equivalent of minimizing the free energy

It is shown in the general course of physics that the state with minimal free energy is fully equivalent to the state that minimizes the action functional J:

$$J = \int L(q, \dot{q}, t)\,dt,$$

where L is the Lagrange function of the physical system, defined as the difference between the kinetic and the potential energy. It depends on the velocities $\dot{q}$, the coordinates q and the time t:

$$L = \frac{m\dot{q}^2}{2} - U(q),$$

where U(q) denotes the potential - an important function, which will be explained further. m denotes the mass of the agent, which will be set to 1.

The conventional way to minimize any functional, including J, is to use the Euler–Lagrange equation, which leads to:

$$\frac{d}{dt}\frac{\partial L}{\partial \dot{q}} - \frac{\partial L}{\partial q} = 0.$$

Substituting the Lagrangian, we finally derive the simple Newton equation:

$$m\ddot{q} = -\frac{\partial U(q)}{\partial q}.$$

Let's provide some insight into the last equation. $\ddot{q}$ is the second derivative of the coordinate, or, taking into account m = 1, the force. The force represents the action that the agent takes at each step - whether to go north, south, east or west.

Where should the force be directed? It should be directed so as to minimize the potential U(q) at each point q. In contrast to the initial free energy F, the potential U(q) is a mechanical term: the free energy is a single value for the whole system in each state, while the potential is a function of the coordinates, fixed for any state of the system.
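In formulas, this is just the statement that the force is the negative gradient of the potential:

$$\vec{F} = -\nabla U(q),$$

so on a discrete maze grid the agent should step into the neighbouring cell where U is lowest.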

What is the physical equivalent of this task? Let's imagine a slide at a waterpark. The potential U(q) is exactly the shape of that slide. At each moment a person on the slide moves along the direction of steepest descent of U(q), guided by the principle of least action, or, equivalently, by the principle of minimal free energy. It can be shown that this trajectory is the fastest way to reach the final point.

  • What does the shortest-path problem in the maze turn into in this notation?

We figured out that the shortest path for the agent to reach its goal is to follow the descent of the potential U(q). We basically want it to roll down the slide in a waterpark. The only problem is that the agent does not know the environment and, hence, does not know the potential. We are going to make the agent learn it by interacting with the environment. In order to learn U(q) efficiently, we represent it as the sum of three different potentials:

  1. The wall potential is needed to avoid the walls and borders of the maze. Each time the agent tries to move from one cell to another and fails, we interpret it as an infinite potential wall between these cells. This automatically prevents the agent from going there the next time.

  2. The learned potential represents the agent's approximation of the true potential.

  3. The visited-cell potential is set to 1 at each point where the agent has already been during the current iteration. This part saves us from randomly walking back and forth at early stages and speeds up the training.

The inference is simple: at each step the agent should move towards the minimal total potential in its current position's neighbourhood. In order to increase the exploration ability, the agent sometimes goes in a random direction instead of the known optimal one. After the goal is reached, the potential is updated.
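Below is a minimal sketch of this decision rule. The array names U_wall, U_learned and U_visited stand for the three potentials from the list above and are illustrative, not the identifiers used in this repository's code:

```python
import numpy as np

rng = np.random.default_rng(0)
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def choose_action(pos, U_wall, U_learned, U_visited, explore_rate):
    """Step towards the lowest total potential in the 4-neighbourhood of pos."""
    if rng.random() < explore_rate:           # occasional random move for exploration
        return rng.choice(list(MOVES))
    best_action, best_U = None, np.inf
    for action, (dr, dc) in MOVES.items():
        r, c = pos[0] + dr, pos[1] + dc
        if not (0 <= r < U_wall.shape[0] and 0 <= c < U_wall.shape[1]):
            continue                          # maze border: treated as an infinite wall
        # the total potential U(q) is the sum of the three components
        U = U_wall[r, c] + U_learned[r, c] + U_visited[r, c]
        if U < best_U:
            best_action, best_U = action, U
    return best_action
```

After an episode ends, the visited-cell potential is reset and the learned potential is updated from the finished trajectory; the exact update rule lives in the repo's code.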

Results

Here is an example of the learned potential (left) obtained after 150 iterations on the maze depicted on the right:

Black spots mean that the potential is low, hence the agent seeks to go in that direction. Red lines denote the walls, i.e. the infinite wall potential. Check out the whitest spot in the picture - it is the incorrect turn closest to the agent. It accumulates every further mistake, and that is why the potential there is much higher.

Try to manually follow the gray path from the upper left corner - it will lead you to the lower right corner.

Our approach, called 'Mechanical' (blue line), is implemented and compared with the q-learning baseline provided with the gym environment (orange line). We also compare both algorithms with random walking (green line).

The next graph is a learning curve averaged over 500 different random initializations on the same environment. In this setup our approach converges much faster at first, while finding a near-optimal solution takes approximately as many steps for us as for the q-learning baseline. Both algorithms are much better than random walking.

  • How to interpret the free energy from these graphs?

The initial idea is to use the free energy minimization principle, so it would be nice to check how it changes during training. Let's revisit the free energy definition from thermodynamics:

$$F = E - TS,$$

where F is the free energy, E is the inner energy, T is the temperature and S is the entropy.

The energy term is easy to interpret: the change of the agent's inner energy is equal to the drop of the potential, and it is proportional to the number of steps up to a technical constant.

Let's talk about the second part of the free energy definition. We interpret F here as in a mechanical system, but temperature is nevertheless stitched into the algorithm. Nonzero temperature of a system means that at any moment the agent may do something random, which is a popular method to increase RL-model generalization. In physics it means that, while moving towards the global optimum of the free energy, any thermodynamical particle may fluctuate in an intractable way. In this code the probability of a random choice is set as explore_rate in both the q-learning baseline and the mechanical approach. Using probabilities is more of a quantum way to describe a physical system, and we could resolve this issue by coming up with a way to connect the probability of a fluctuation with its energy (temperature). It is also possible to derive the entropy from the statistical sum over all states (see the beginning of the Theory part).

However, luckily for us, nonzero temperature plays a very small role during training. In the code explore_rate is expressed as max(MIN_EXPLORE_RATE, min(0.8, 1.0 - math.log10((epoch+1)/decay_factor))), which approaches zero at epoch=25 for a 5x5 maze (it is possible that tuning the explore rate may improve the result, but we decided to keep the same rule as in the q-learning baseline to ensure an honest competition).
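For illustration, this schedule can be tabulated directly from the expression above. The MIN_EXPLORE_RATE and decay_factor values below are placeholders chosen so that the rate hits its floor around epoch 25 for a 5x5 maze; they are not necessarily the repo's exact settings:

```python
import math

MIN_EXPLORE_RATE = 0.001  # placeholder floor value
decay_factor = 2.5        # placeholder; chosen here to match a 5x5 maze

def explore_rate(epoch):
    # same decay rule as quoted above
    return max(MIN_EXPLORE_RATE, min(0.8, 1.0 - math.log10((epoch + 1) / decay_factor)))

for epoch in (0, 5, 10, 20, 25):
    print(epoch, round(explore_rate(epoch), 3))
# the rate starts at 0.8 and decays to MIN_EXPLORE_RATE around epoch 25
```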

To conclude the answer to the last question: after epoch 25 (approx. 2000 iterations) the free energy at each step is proportional to the number of steps with very high precision and can be observed in the last graph. For iterations before 2000 the free energy is (number of steps) + (a logarithmically decreasing uncertainty term).
