Understanding the basic concepts behind Reinforcement Learning


Reinforcement Learning (RL) is a Machine Learning field which gained much more attention since 2015 after Google’s Deep Mind team demonstrated self taught DQN agents learning to walk, mastering Atari games (like breakout in the bellow) and beating pro-human players in the game Go.

RL is the science of self-teaching software agents to master a previously unknown world, and excel in that world. It is incredible how you can introduce an agent to a hostile and utterly unknown environment, where the agent only knows what available actions it can perform, but nothing else, and watch through trial and error to come up with policies that improve with each consecutive move, reaching a super-human performance.

Are we not all agents exploring our world since birth? Aren’t babies agents that are completely helpless at birth with no understanding of their surroundings, and use their brain to process memories and experiences to increase their success (reward) in the world and progressively master specific tasks in life?

This article will attempt to provide some basic understanding (leaving out the math), so you can start your journey towards the science of Reinforcement Learning and explore this fantastic continuously-evolving science field.

Biological Analogy to Reinforcement Learning

Reinforcement Learning is based on trial and error by evaluating different actions from those we already think that work well.

Over time, after thousands of failed attempts to produce beneficial variations to the agent’s state/action policy, we might discover one that is rewarding (literally) to the agent’s performance on the environment.

In biology, “this policy variation” is known as… “mutation”.

Life on earth is the living proof of what you can achieve with mutations when you have billions of years available to play with your genes.

Most mutations are harmful (even fatal), but now and then some mutation can turn to be beneficial and lead to the improvement/evolvement of a species.

It is thanks to such mutations; our DNA formed new branches that distinguished us from prior ancestor species.

Occasional “exploration” in our genome (DNA mutations) is the building block of species evolution

Similarly, in RL, the “evolution analogy” is the agent’s improvement over the choices of the actions to take for each situation (state).

Us, together with all current and past organisms are nothing but evolutionary mutational branches of the very first organism. Our universal common ancestor… a single ancient procaryotic cell.

Just 1.5% of our genes make us human!

We share 98.5% of our DNA with chimpanzees, 3/4 with dogs, 1/2 with fruit flies and 1/3 with daffodils, and yet we are so different in every aspect!

If in biology DNA mutation dictates alteration on gene expression, equivalently in RL, Q-Table modifications dictate alteration on the actions sequence for the environment’s conditions (states)

Reinforcement Learning — Basic Understanding

Scenario: Let’s assume you wake up at some foreign city, you exit your hotel, nobody speaks English, your cellphone just ran out of battery (so you cannot use maps), and you are starving. You need to get to some restaurant as soon as possible!

Environment: All city blocks are identical (grid world). The environment is what you are going to interact with. In a nutshell, you will start at some state of the environment, perform actions on the environment’s state, and the environment will respond with a new state and with some reward that measures how beneficial (or not) your action was for the previous state.

Actions: In this grid world the entire set of actions (Action Set) you can perform is moving [North, South, East, West]. Initially, you have no reason to favor any of action over another. You don’t know for your given state (the block you stand) which action (eg, going east or west) can get you closer to your goal (restaurant).

You live in an unknown world. You are only given the available actions. Its time to explore!

So you inevitably start exploring. You pick random actions (equiprobable policy) and wander around the city blocks until you end up at some restaurant.

Terminal State: So after visiting several blocks, you ended up at a restaurant. When you visit such a terminal state the episode terminates, and you can now sit back and evaluate how you can improve your strategy (policy) so that in some future episode you can get more effectively to your goal.

Episode: An episode is your journey every time you are reborn into that world until you reach a terminal state. You might need thousands of episodes to improve your strategy (policy) and come up with the best one.

Reward: While wandering in the grid world, you get a negative reward of -1 for every minute you spent walking. Normal roadblocks gave you a negative reward of -1 and crowded ones gave you a negative reward of -3, but reaching the restaurant you got a +10 reward.

Your cumulative reward sums up all positive and negative rewards together for each episode.

The negative reward in non-terminal states is necessary because if you only got a positive reward at the end, your agent wouldn’t care to get as fast as possible to the terminal state since it would get the same total reward no matter how long it wandered in the city. Without motivation, it could wonder in the city for hours.

So providing a negative reward for as long as we have not “reached the terminal state,” we motivate the agent to find the fastest way out.

Thus, increasing the Cumulative Reward should correspond to solving the Reinforcement Learning problem!

This logic, by which we assume that by increasing the total reward leads to the solution of RL, we call “Reward Hypothesis.

Exploring an unknown world

The State: Every time you move to a new block, you explore what the environment (the world) allows you to see. Most of the times that “view” of the world is quite limited. Usually, you cannot know what lies ten blocks ahead, so the state is the information that the environment provides you so you can estimate what your next action should be.

The state can be thought of as your observation of the environment at a given world location. But the state is not only information about the world, but also about yourself.

For example, if you are “learning to walk,” the state would be your location in the world, ground info etc. But the same state will also contain the directions about your joints, your velocity, orientation, and more.

So the State is your “senses” at a given time.

The state is the “view” that the environment allows you to perceive

Q-Table: To learn from these experiences, we need to keep track of some log about how rewarding (or not) each state/action pair appears.

The Q-Table can be visualized as a two-dimensional matrix, where rows reflect states, and columns reflect actions. The scores in this matrix represent how positively or negatively each action influences your goal for all possible states.

A simple Q-Table where two possible actions are graded for 4 possible states

So the next time you are at the same state, it might be useful to look at this log as an indicator of what action seems to work best for that state. That way, you will build your strategy (policy) over choosing actions depending on your current state.

This log contains the average of your expected total reward from all your previous experiences from that very state and on.

Should we trust our Q-Table from the very beginning? No! Initially, the scores at your Q-Table don’t reflect good choices! This is because you started with the equiprobable policy (exploring by taking random actions and log their performance to the Q-Table), and obviously, these reward scores should not be considered accurate just yet.

So initially we should explore more and keep populating the Q-Table. The more the episodes, the more we should trust our Q-Table and progressively lower the randomness of choices while trusting more our q-table (exploitation).

However, even after several thousand episodes, we should still take some random actions now and then to keep exploring and perhaps discover some action that might lean to be more beneficial for a given state.

By gradually lowering the exploration toward our states (but never zeroing it) after each episode, our q-table will adjust progressively letting us come up with an optimal policy (π*) that leads us to the restaurant at the shortest possible time.

This is the epsilon-greedy-policy. So the epsilon-greedy-policy says that by most part, we will follow our q-table proposal for selecting the highest-rated action for each state, but always allow for a small chance to pick a random action instead. After we select such a random choice (instead of the q-table proposal) we should then stick to the policy for all the remaining steps to check if the total reward improves at any degree by that change.

If this “one action” that we changed provided some higher overall reward by the end of the episode, then since we followed all the remaining steps using the policy, it may mean that this “one action” that we changed seems to improve our policy. So we log this change by slightly increasing this action’s score for our state at the q-table. Equivalently if this “one action” seemed to worsen our overall reward, we decrease the overall score for that state/action at our q-table.

The greedy-policy is always following the directions of the q-table blindly, while epsilon-greedy-policy follows mostly the q-table, but allows for some “random choice” now and then to see how it affects the agent’s score.

That means that the greedy-policy (following the q-table without exploring) is not beneficial as we will never explore any further (thus stop learning), while the epsilon-greedy-policy will follow most of the time our policy, but occasionally will pick a random action to check if it favors our agent to its goal.

A “greedy-policy” is like blind-walking — You need “epsilon-greedy policy” instead to allow for some exploration!
Epsilon-Greedy-Policy does not blind-lead the agent (like Greedy-Policy does).

Epsilon-Greedy-Policy allows for random discoveries over the original policy, introducing new strategies that increase the cumulative reward and eventually lead to the optimal policy (π*).

State+Action → Reward+next State (SARS)

As we have seen so far, the projection that the environment provides us, (which includes our mechanics — like our joints or velocity) is the state that comes to our senses.

For each state, we said there is a set of available actions.

By performing some action at a given state, we progress to a Next State, while at the same time we are given some positive or negative Reward, so we can evaluate how beneficial or not our action was at the previous state.

This is the “loop” in which the agent progresses in the environment until the reaching of the terminal state.

  1. S1+A1 →R1+S2
  2. S2+A2 →R2+S3
  3. S3+A3 →R3+S4
  4. … all the way to the St (terminal state)

The sum of all rewards (R1+R2+R3…+Rt) on each episode is the Cumulative Reward of that episode, which we want it as high as possible.

Based on the current State the Agent performs “Action” — The environment returns “Next Sate” &“Reward”

State/Action/Reward: So if we learn to walk, and our left foot is at front (that’s our state) and we act by moving our right foot at front as well (our action), then we end up falling down (our new state), and we get a negative reward in order to understand that our action of moving both foots front, affects us negatively on our effort to learn how to walk.

Thus this experience for that given state (where the left foot is at the front) even if it was negative, provided some valuable (not to do) information.

So, let’s write down that bad score at our Q-Table for this state/action pair!

Episodic or Continuous?

So far what we have seen is an episodic task. Such tasks have a well-defined ending state (in our example the ending state was the visit to the restaurant — equivalently to the exit from the maze).

Termination States (ending states) might not always mean something good (like reaching our goal). In a self-driving car analogy, we could terminate our simulation by getting in our destination (positive final state reward), or by crashing (negative final state reward).

Some tasks, however, might not have an ending point, but continue forever. Those are known as continuous tasks.

Discount Rate γ

The discount rate gamma (γ), is something you will always apply to continuous tasks, but not necessarily on episodic tasks.

As we know, in our q-table, we log the expected total reward for all actions of every state.

That means that for most episodic tasks it is possible just to average the total rewards from this state/action, and log it to our q-table. But to log the total reward from that state on, all the way to the end, this automatically implies that we have reached some terminal state St, otherwise there would be no defined “total reward”.

But in continuous tasks (tasks that never end), we cannot keep such log if the future is infinite, so there is no “total reward”. Now the discount rate γ is a decaying factor, by which we take into account less and less future expected rewards until they vanish. We use that to our benefit to “see” up to some point in the future and be able to deal with continuous tasks, rather than looking at an infinite future, and thus infinitely freeze our agent from the very beginning.

It is possible to use discount rate γ in episodic tasks, but many times we can just do without it, especially for simple tasks that don’t involve too much of coinciding states.

Markov Decision Process (MDP)

The States, Actions, Rewards, their mechanics (known as One-Step Dynamics), together with the discount rate (γ) together define a Markov Decision Process (MDP).

The agent is provided only with the set of available actions and the discount rate γ. All the rest is up to the agent to discover via exploration. We can assume the agent tries to estimate the dynamics by populating its q-table.

A Finite MDP is the probabilistic (stochastic) equivalent of Finite State Automata (or Finite State Machines) — in case you come from Computer Science field.

MDP Diagram of a Recycling Robot (*Source: Udacity)

In the example above we see an MDP Diagram of a Recycling Robot. We are given two possible battery states here [high, low]. We are also given an action set of [search, wait, recharge]. In this “stochastic” model, we see also by what chance an action can be favored over another, and with what Reward.

For example, by the above MDP diagram, at high state, we see that searching has 70% (.7) chance to keep the battery at the same high state and provide a reward of 4, while it has a 30% (.3) chance to change the robot’s battery state to low with a reward of 4.

Deep Reinforcement Learning (DRL)

Is that all? Definitely not! We have only touched some of the fundamental parts of reinforcement learning (intentionally avoiding mathematics). There are plenty more to see (including math) before we can even start talking about Deep Reinforcement Learning.

However, many should wonder where do neural networks fit in RL. I mean we can populate a q-table with state/action log and use it to choose actions depending on our existing state! So where do the Neural Networks fit?

Well, usually the tasks are way more complex. This will either mean a q-table that is unrealistically large, or that our problem cannot be expressed with a classic q-table at all!

Imagine you play Tennis. The ball cannot come just on your right or your left, so it cannot be represented by a left/right state! It can come anywhere between your right and left, and top to bottom. This means that we deal with real numbers that can be any floating-point value to represent each state, turning this to an “infinite valued q-table” (which simply cannot be defined). So in “continuous spaces” things can fall apart!

One fix could be to discretize our continuous spaced world and create imaginary squares, in order to turn this to a grid-world problem that can be dealt with a q-table.

However this is not feasible always, and as we mentioned before, even if we perform discretization, this might produce unrealistically large q-tables which need an equally unrealistically large number of episodes in order to effectively fill-up the log entries for each action per each state.

So here we could approach such a problem using function-approximation that can correspond states to actions.

Deep Neural Networks are defined as Universal Function Approximators.

To see the big picture (if you come from a Deep Learning background) — with Neural Networks (NN) — in Image Classification problems we turn pixels to labels.

Well, in the case of Reinforcement Learning, the idea is to turn states to actions.

In image classification problems... 
Pixels -> [NN] -> Probabilities over Labels

In reinforcement learning... 
Environment's State -> [NN] -> Probabilities over Agent's Actions

So the main idea is to create something that will approximate a q-table over infinitely real value sets of states, rather than a discrete set of states.

In the Neural Network, we have: “Input=State” and “Output=Action”

DRL in a nutshell…

In Deep Reinforcement Learning, you represent the Q-Table with a neural network. You try to keep your estimates about how actions affect the world over your state, by altering the weights of the neural network (gradient descent).

Thus through function approximation, we try to predict the probability of the best-fitted action for a given state and thus simulate a q-table.

The big difference between classic offline classification training, where the training dataset is provided to you, is that with this online training it is YOU the one that builds the dataset with every step of your exploration, as it produces results to be evaluated by the dataset. Such a transitioning creation of a dataset can cause obvious issues to a neural network and can be dealt with various techniques.


Please enter your comment!
Please enter your name here

4 + seventeen =

This site uses Akismet to reduce spam. Learn how your comment data is processed.