A glance at Reinforcement Learning theory

Basics:

Otmane El aloi
Oct 11, 2021

In this section, I explain some of the fundamentals of RL theory. But if you are already familiar with RL, just skip this section.

Reinforcement learning is a subfield of AI that consists of training an agent to perform actions that yield high reward in a given environment.

The process can be depicted as the classic agent–environment loop: at each time step t, the agent observes a state S_{t}, performs an action A_{t}, and the environment returns a reward R_{t+1} together with the next state S_{t+1}.

You might have noticed the time dependency of the reinforcement learning process: S_{t+1} depends on A_{t}, so performing different actions at time t yields different observations. In other words, we only ever observe the consequences of the actions we actually took, so in RL we deal with censored observations.

Another crucial assumption in RL is the Markov property: the next state depends only on the current state and action, not on the whole history (a one-step dependency in time). Since states, actions and rewards evolve together over time, a convenient mathematical object to work with is the Markov Decision Process (MDP).

In an MDP, the agent is typically represented by:

  • A policy π, which is basically a probability distribution over actions given states: π(a|s) = P(A_t = a | S_t = s).

We often choose deterministic policies, hence π maps each state to a single action: π(s) = a.

  • V and Q, the state-value and action-value functions:

V_π(s) = E_π[G_t | S_t = s]   and   Q_π(s, a) = E_π[G_t | S_t = s, A_t = a],

where:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k≥0} γ^k R_{t+k+1}

It represents the discounted reward from time t; choosing a small gamma leads to short-sighted learning, and vice versa (a small numerical example is sketched right after this list).

  • Optionally, a model of the environment, i.e. the agent's internal representation of how the environment responds to its actions (the next state and reward).
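As a quick illustration (my own toy example, not from the article), here is how the discounted return behaves for different values of gamma; the reward sequence is made up:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the last reward backwards
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.0, 10.0]                 # toy reward sequence
print(discounted_return(rewards, gamma=0.99))   # far-sighted: the final +10 dominates
print(discounted_return(rewards, gamma=0.1))    # short-sighted: mostly the immediate +1
```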

So training an agent formally means finding the best policy π* that maximizes the expected sum of discounted rewards.

If you are confused about how to define the "best" policy when we can start from different initial states… you are doing well! 😉 This is the reason behind the expectation that hangs out everywhere, and it leads to the definitions of the state-value and action-value functions above.

How can we find the optimal policy?

Bellman equations:

The Bellman equations express the value functions recursively:

V_π(s) = Σ_a π(a|s) Σ_{s'} p(s'|s, a) [r(s, a, s') + γ V_π(s')]

Q_π(s, a) = Σ_{s'} p(s'|s, a) [r(s, a, s') + γ Σ_{a'} π(a'|s') Q_π(s', a')]

These formulations of the state-value and action-value functions, combined with the fixed-point theorem, guarantee that a finite MDP admits at least one optimal policy and unique optimal state-value and action-value functions.

Bellman optimality equations:

V*(s) = max_a Σ_{s'} p(s'|s, a) [r(s, a, s') + γ V*(s')]

Q*(s, a) = Σ_{s'} p(s'|s, a) [r(s, a, s') + γ max_{a'} Q*(s', a')]

So we have two choices to get an optimal policy:

either using the optimal state-value function:

π*(s) = argmax_a Σ_{s'} p(s'|s, a) [r(s, a, s') + γ V*(s')]

or the optimal action-value function:

π*(s) = argmax_a Q*(s, a)

For this article we will adopt the second option, since it is model-free: we don't need to know the transition kernel, we just need a good estimate of Q*.
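To make the model-free point concrete, here is a minimal sketch (my own illustration, with a made-up tabular estimate `q_star`) of extracting a greedy policy from Q*: only an argmax over actions is needed, no transition kernel.

```python
import numpy as np

# Hypothetical tabular estimate of Q*: 3 states x 2 actions
q_star = np.array([[0.5, 1.2],
                   [2.0, 0.3],
                   [0.1, 0.1]])

# Greedy policy: for each state, pick the action with the highest Q*-value
greedy_policy = q_star.argmax(axis=1)
print(greedy_policy)  # -> [1 0 0]
```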

Estimating the Q function with a neural network:

The deep Q-learning approach, as its name says, estimates the optimal action-value function with a deep neural network. It basically optimizes the following loss:

L(θ) = E_{(s,a) ~ (S,A)} [ (Q(s, a; θ) − Q*(s, a))² ]

But here again our target value Q* is unknown. Several approaches have been tried over time; we might be tempted to estimate it with classic approaches such as Monte Carlo returns, but they tend to perform worse than the target/main network approach, which simply estimates the target with a separate copy of the network.

With this approach the target values stay fixed during training and are only refreshed after a predefined number of steps, which makes the learning process more stable.
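Here is a minimal PyTorch-style sketch of this loss (my own illustration; `online_net` and `target_net` are assumed to be two copies of the same Q-network, and the batch tensors are assumed to come from a replay memory):

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a; theta) for the actions actually taken in the batch
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target: r + gamma * max_a' Q(s', a'; theta^-), computed with the frozen target network
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1 - dones)

    return F.mse_loss(q_values, targets)

# Every C steps, copy the online weights into the target network:
# target_net.load_state_dict(online_net.state_dict())
```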

The dueling networks:


This architecture came as an improvement over the classic DQN. The main difference is that the dueling network separates the learning of Q into the learning of an advantage function A and a value function V, where:

Q(s, a) = V(s) + A(s, a)

To understand the intuition behind this separation, let's look at what each function in the last equation represents:

  • V: the state-value function estimates how good it is for the agent to be in state s.
  • Q: the action-value function estimates how good it is to perform a given action in a given state.
  • A = Q − V: the advantage function estimates how much better a given action is compared to the average action in that state.

So the main intuition behind this separation is to allow the network to learn which states are valuable without having to learn the effect of each action in each state.

From a practical point of view, two streams of fully-connected layers are used: one computes an estimate V(s; θ, β) of the value function and the other an estimate A(s, a; θ, α) of the advantage function, where θ denotes the shared weights and β and α the weights of the two streams. An aggregation layer then combines the two estimates to compute the Q function.

Nota bene: the aggregation layer is not a simple addition of V to each element of the vector A; it is a bit more subtle, because the pair (V, A) that forms a given Q is not unique. The original paper makes the decomposition identifiable by subtracting the mean advantage:

Q(s, a) = V(s) + (A(s, a) − (1/|A|) Σ_{a'} A(s, a'))

(more details can be found in the original paper).
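Here is a minimal sketch of a dueling Q-network in PyTorch (my own illustration; the layer sizes are arbitrary), using the mean-subtraction aggregation mentioned above:

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # Two streams of fully-connected layers
        self.value_stream = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage_stream = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, state):
        x = self.features(state)
        v = self.value_stream(x)      # V(s): shape (batch, 1)
        a = self.advantage_stream(x)  # A(s, a): shape (batch, n_actions)
        # Aggregation: Q = V + (A - mean_a A), which makes the (V, A) decomposition identifiable
        return v + a - a.mean(dim=1, keepdim=True)
```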

The double deep Q network:

The max operator in standard Q-learning and DQN, uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation. This is the idea behind Double Q-learning (van Hasselt, 2010).

So the estimated target would look like:

Y_t = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ); θ⁻)

where θ are the online network parameters (used to select the action) and θ⁻ are the target network parameters (used to evaluate it).
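A sketch of the double DQN target in PyTorch (my own illustration, reusing the hypothetical `online_net`/`target_net` pair from the earlier snippet):

```python
import torch

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Selection: argmax over actions with the online network (theta)
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Evaluation: value of the selected action with the target network (theta^-)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * next_q * (1 - dones)
```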

Replay memory:

It is a training technique usually employed when training a network for Q-learning. It consists of storing experience tuples e_{t} = (s_{t}, a_{t}, r_{t+1}, s_{t+1}) in a replay memory. Practically speaking, there are a couple of ways to store these experiences; one of the most used is a double-ended queue, which has the nice property of automatically popping the oldest experiences and appending the newest ones, keeping a fixed size throughout training.

Once the experiences are stored in the replay memory, we sample training mini-batches from it uniformly at random, and we keep adding new experiences to it little by little as they arrive.
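A minimal replay memory along those lines (my own sketch, based on the deque mentioned above):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        # deque with maxlen automatically discards the oldest experience when full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random mini-batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```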

Another approach uses prioritized sampling rather than uniform random sampling. The intuition behind it, as stated in the original paper:

… is that an RL agent can learn more effectively from some transitions than from others. Transitions may be more or less surprising, redundant, or task-relevant. Some transitions may not be immediately useful to the agent, but might become so when the agent competence increases…

The main idea of prioritized experience replay is to assign a sampling probability to each experience in the memory, such that transitions that lead to large errors get a high probability. But linking the error directly to the probability has some limitations (details in the original paper); instead, the authors propose a stochastic sampling method that interpolates between pure greedy prioritization and uniform random sampling.

Formally:

P(i) = p_i^α / Σ_k p_k^α,   with p_i = |δ_i| + ε

where:

  • the exponent α determines how much prioritization is used (α = 0 corresponds to the uniform case).
  • the ε term is added to make sure transitions with zero error still get visited.
  • δ_i is the temporal-difference error of transition i (aka the TD error); a small numerical sketch of P(i) follows this list.
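For example, the sampling probabilities can be computed from the TD errors like this (a toy numpy sketch with made-up numbers, ignoring the sum-tree structure the paper uses for efficiency):

```python
import numpy as np

td_errors = np.array([0.5, 0.0, 2.0, 0.1])        # toy TD errors of four stored transitions
alpha, eps = 0.6, 1e-6

priorities = (np.abs(td_errors) + eps) ** alpha   # p_i^alpha with p_i = |delta_i| + eps
probs = priorities / priorities.sum()             # P(i) = p_i^alpha / sum_k p_k^alpha
print(probs)  # the transition with the largest error gets the highest probability
```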

But optimizing our loss with non-uniformly sampled data induces some bias. To tackle this, we can use importance sampling when updating the weights of the network, simply multiplying each sample's gradient by a coefficient w_i. The original paper uses a weighted version of importance sampling, as follows:

w_i = (1/N · 1/P(i))^β

normalized by max_i w_i for stability, where N is the size of the memory and β controls how strongly the correction is applied.

So one training step would look like the rough sketch below:
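Here is a rough sketch of one prioritized training step (my own illustration, continuing the numpy notation above and leaving the actual network update abstract):

```python
import numpy as np

def prioritized_step(buffer, priorities, batch_size, alpha=0.6, beta=0.4):
    # Sample indices according to P(i) = p_i^alpha / sum_k p_k^alpha
    scaled = priorities ** alpha
    probs = scaled / scaled.sum()
    idx = np.random.choice(len(buffer), size=batch_size, p=probs)

    # Importance-sampling weights, normalized by their max for stability
    n = len(buffer)
    weights = (n * probs[idx]) ** (-beta)
    weights /= weights.max()

    batch = [buffer[i] for i in idx]
    # ... compute TD errors for `batch`, scale each sample's loss/gradient by `weights`,
    # update the network, then refresh the priorities of the sampled transitions:
    # priorities[idx] = abs(new_td_errors) + a small epsilon
    return batch, weights, idx
```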

Doing so, we break the correlation between consecutive samples and ensure that our network learns from a variety of situations.

Losses often used for training are the mean squared error and the Huber (smooth L1) loss applied to the TD error; the latter is less sensitive to large errors.
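In PyTorch, for instance, the Huber loss is available as SmoothL1Loss (this is a common choice I am suggesting, not necessarily the exact loss shown in the original figure):

```python
import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss()   # Huber loss: quadratic near zero, linear for large errors
q_values = torch.tensor([1.0, 2.5, 0.3])
targets = torch.tensor([1.2, 0.5, 0.3])
print(criterion(q_values, targets).item())
```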

Sources (the original papers referenced in the text):

  • Mnih et al., "Human-level control through deep reinforcement learning", Nature, 2015.
  • van Hasselt, "Double Q-learning", NeurIPS, 2010.
  • van Hasselt, Guez, Silver, "Deep Reinforcement Learning with Double Q-learning", AAAI, 2016.
  • Wang et al., "Dueling Network Architectures for Deep Reinforcement Learning", ICML, 2016.
  • Schaul et al., "Prioritized Experience Replay", ICLR, 2016.
