Reinforcement Learning Using Rainbow with PyTorch

What Is Reinforcement Learning?

Reinforcement Learning (RL) is the ability to learn a strategy in a given environment without any supervision. An agent interacts with the environment, takes actions, and observes the results, with the goal of gaining the maximum cumulative score.

Breakout and Pacman Results

I used two Atari games, "Breakout" and "MsPacman", where the computer learned how to play without any prior knowledge and achieved excellent results:

In the Breakout game, it learned to tunnel through the bricks so the ball bounces along the top, gaining the maximum score.

In the Pacman game, the goal was to finish the game rather than to maximize the score by eating ghosts.

A (short) Theory

There are many methods for reinforcement learning. The focus here is on the Rainbow method, which combines several algorithms to improve results and performance.

The problem is defined by an agent interacting with an environment. An environment can be a game, a simulation, a task performed by a robot, etc. This example uses the Gymnasium package (the successor of OpenAI Gym) to simulate Atari games.

The agent observes the state (the pixels of the screen), takes an action (moving the game joystick), and gets a reward. It repeats this process until the game ends. The goal is to define the strategy the agent should follow to gain the maximum cumulative reward (the score of the game).

A Transition is defined as the combination of: $$ (s_t, a_t) \rightarrow (r_{t+1}, s_{t+1}) $$

That is, the agent observes the state at time t and takes action a, which leads to a reward and a new state at time t+1.
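For illustration only, a transition can be held in a small tuple; the field names below are my own choice, not something fixed by the theory:

```python
from collections import namedtuple

# One transition (s_t, a_t) -> (r_{t+1}, s_{t+1}); `done` flags the end of an episode.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

# Example: observing state 0, taking action 2, receiving reward 1.0, landing in state 1.
t = Transition(state=0, action=2, reward=1.0, next_state=1, done=False)
```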

It is assumed that the problem is framed as a Markov Decision Process (MDP), which means that the future depends only on the current state and not on the history; the current state itself summarizes the history.

The Reward function R gives the expected next reward for taking action a in state s:

$$ R(s,a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r} r \sum_{s_{t+1}} P(s_{t+1}, r \mid s, a) $$

We define the Return $G_t$ as the discounted sum of rewards from time t:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} $$

The discount factor $\gamma$ penalizes future rewards: they are uncertain, they provide no immediate benefit, and discounting keeps the infinite sum from diverging.
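As a quick sanity check on the formula, here is a tiny snippet (illustrative only) that computes the return for a finite list of rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Rewards received after time t: R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```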

A Policy $\pi$ is the algorithm for selecting an action given a state. The best action maximizes the expected value of the future rewards.

The Value function is the expected Return for state s:

$$ V_\pi(s) = \mathbb{E}_\pi[G_t|S_t = s] $$

The Q-Function is the expected Return for the state s and action a pair:

$$ Q_\pi(s,a) = \mathbb{E}_\pi[G_t|S_t = s, A_t = a] $$

In a complex environment, we use a Deep Neural Network (DNN) to approximate the Q-Function.

We define the Advantage as the difference between the Q-Function and the Value function; it represents how much better taking action a is than the average action in state s:

$$ A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s) $$

The optimal policy is the one that achieves the maximal return: $ \pi^* = \arg\max_\pi Q_\pi(s,a) $

The Bellman equation expresses the value function as the immediate reward plus the discounted value of the next state:

$$ \begin{aligned} V(s) &= \mathbb{E}[G_t \vert S_t = s] \\ &= \mathbb{E} [R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \vert S_t = s] \\ &= \mathbb{E} [R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \vert S_t = s] \\ &= \mathbb{E} [R_{t+1} + \gamma G_{t+1} \vert S_t = s] \\ &= \mathbb{E} [R_{t+1} + \gamma V(S_{t+1}) \vert S_t = s] \end{aligned} $$

Similarly for the Q-Function: $$ \begin{aligned} Q(s, a) &= \mathbb{E} [R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s, A_t = a] \\ &= \mathbb{E} [R_{t+1} + \gamma \mathbb{E}_{a\sim\pi} Q(S_{t+1}, a) \mid S_t = s, A_t = a] \end{aligned} $$

DQN Method

A DQN uses deep learning to estimate the Q-Function and take the best action. It uses Epsilon-Greedy exploration to move gradually from exploration (taking random actions) to exploitation (taking actions according to the policy). A Replay Memory stores previous experiences to break temporal correlations and biases; it mixes past and recent transitions when training the network. The Q-Function is predicted by an online network trained with gradient descent. The online network is copied periodically to a Target Network, which is used to compute the future-reward term; this stabilizes training.
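A minimal sketch of those two ingredients, epsilon-greedy action selection and the target-network TD loss; the networks, batch format, and hyperparameter values here are placeholders:

```python
import random
import torch
import torch.nn.functional as F

def select_action(online_net, state, epsilon, n_actions):
    """Epsilon-greedy: random action with probability epsilon, else argmax of Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = online_net(state.unsqueeze(0))   # shape (1, n_actions)
        return int(q_values.argmax(dim=1).item())

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    """One-step TD loss: Q(s,a) vs r + gamma * max_a' Q_target(s',a')."""
    states, actions, rewards, next_states, dones = batch
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * next_q * (1.0 - dones)
    return F.smooth_l1_loss(q_sa, target)
```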

Rainbow

The Rainbow scheme combines multiple improvements to DQN:

- Double DQN: decouples action selection from action evaluation to reduce overestimation of Q-values.
- Prioritized Experience Replay: samples transitions with a large TD error more often.
- Dueling Networks: splits the network into separate value and advantage streams.
- Multi-step Learning: uses n-step returns instead of one-step targets.
- Distributional RL: learns the distribution of returns rather than only its mean.
- Noisy Nets: adds learned noise to the linear layers, replacing epsilon-greedy exploration.

Code

Imports & Setup
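A minimal sketch of the imports and device setup, assuming PyTorch, NumPy, and Gymnasium with the Atari extras; the exact dependency list is an assumption, not a fixed requirement.

```python
# Assumed dependencies: torch, numpy, and gymnasium with the Atari extras
# (e.g. pip install torch numpy "gymnasium[atari]" ale-py) -- an assumption, not a fixed list.
import random
from collections import deque, namedtuple

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym

# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Fix seeds for (partial) reproducibility.
SEED = 0
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
```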

Parameters
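Illustrative hyperparameters for the sketches that follow; every value here is an assumption chosen for readability, not a tuned setting.

```python
# Illustrative hyperparameters only -- not the values actually used for the results above.
ENV_NAME        = "ALE/Breakout-v5"   # or "ALE/MsPacman-v5"
GAMMA           = 0.99                # discount factor
LEARNING_RATE   = 1e-4
BATCH_SIZE      = 32
BUFFER_SIZE     = 100_000             # replay memory capacity
MIN_BUFFER_SIZE = 10_000              # warm-up transitions before learning starts
TARGET_SYNC     = 10_000              # steps between target-network copies
EPS_START       = 1.0                 # epsilon-greedy schedule
EPS_END         = 0.01
EPS_DECAY_STEPS = 1_000_000
TOTAL_STEPS     = 5_000_000           # total environment steps for training
```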

Environment
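A sketch of building the Atari environment with the usual DQN preprocessing (grayscale, 84x84 frames, frame skipping, and a stack of 4 frames). The wrapper names follow Gymnasium's API; note that the frame-stacking wrapper is called FrameStackObservation in Gymnasium 1.0+.

```python
def make_env(env_name=ENV_NAME, render_mode=None):
    """Build an Atari environment with the standard DQN preprocessing:
    grayscale, 84x84 frames, frame skipping, and a stack of 4 frames."""
    # frameskip=1 here because AtariPreprocessing applies its own frame skip.
    env = gym.make(env_name, frameskip=1, render_mode=render_mode)
    env = gym.wrappers.AtariPreprocessing(env, frame_skip=4, screen_size=84,
                                          grayscale_obs=True, scale_obs=True)
    # In gymnasium >= 1.0 this wrapper is called FrameStackObservation.
    env = gym.wrappers.FrameStack(env, 4)
    return env

env = make_env()
obs, info = env.reset(seed=SEED)
print(env.action_space.n, np.asarray(obs).shape)  # e.g. 4 actions, (4, 84, 84)
```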

Replay Buffer
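A sketch of a uniform replay buffer. Rainbow proper uses prioritized experience replay; the plain uniform version is shown only to keep the example short.

```python
# The same transition tuple sketched in the theory section.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Uniform replay memory (Rainbow proper uses prioritized replay)."""

    def __init__(self, capacity=BUFFER_SIZE):
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size=BATCH_SIZE):
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        to_t = lambda x, dt: torch.as_tensor(np.asarray(x), dtype=dt, device=device)
        return (to_t(states, torch.float32),
                to_t(actions, torch.int64),
                to_t(rewards, torch.float32),
                to_t(next_states, torch.float32),
                to_t(dones, torch.float32))

    def __len__(self):
        return len(self.memory)
```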

Model
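A sketch of a dueling CNN for the stacked 84x84 frames, implementing Q(s,a) = V(s) + A(s,a) - mean(A). A full Rainbow model would also add noisy linear layers and a distributional (C51) head, which are omitted here for brevity.

```python
class DuelingDQN(nn.Module):
    """Dueling CNN for stacked 84x84 Atari frames."""

    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # 84x84 input shrinks to 7x7 after the three conv layers -> 64 * 7 * 7 features.
        self.value = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, 1))
        self.advantage = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, x):
        h = self.features(x)                    # x: (batch, 4, 84, 84), values in [0, 1]
        v = self.value(h)                       # (batch, 1)
        a = self.advantage(h)                   # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)
```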

Agent
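A sketch of an agent that acts epsilon-greedily and learns with a double-DQN target and a periodically synced target network; again, only part of the Rainbow recipe is shown.

```python
class Agent:
    """Minimal agent: epsilon-greedy acting, double-DQN learning, periodic target sync."""

    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.online = DuelingDQN(n_actions).to(device)
        self.target = DuelingDQN(n_actions).to(device)
        self.target.load_state_dict(self.online.state_dict())
        self.optimizer = torch.optim.Adam(self.online.parameters(), lr=LEARNING_RATE)
        self.buffer = ReplayBuffer()
        self.step_count = 0

    def epsilon(self):
        # Linear decay from EPS_START to EPS_END over EPS_DECAY_STEPS steps.
        frac = min(1.0, self.step_count / EPS_DECAY_STEPS)
        return EPS_START + frac * (EPS_END - EPS_START)

    def act(self, state):
        self.step_count += 1
        if random.random() < self.epsilon():
            return random.randrange(self.n_actions)
        s = torch.as_tensor(np.asarray(state), dtype=torch.float32, device=device).unsqueeze(0)
        with torch.no_grad():
            return int(self.online(s).argmax(dim=1).item())

    def learn(self):
        if len(self.buffer) < MIN_BUFFER_SIZE:
            return
        states, actions, rewards, next_states, dones = self.buffer.sample()
        q_sa = self.online(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # Double DQN: the online net picks the action, the target net evaluates it.
            best = self.online(next_states).argmax(dim=1, keepdim=True)
            next_q = self.target(next_states).gather(1, best).squeeze(1)
            target = rewards + GAMMA * next_q * (1.0 - dones)
        loss = F.smooth_l1_loss(q_sa, target)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        if self.step_count % TARGET_SYNC == 0:
            self.target.load_state_dict(self.online.state_dict())
```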

Run
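A sketch of the main training loop, tying the pieces above together: act, store the transition, learn, and reset when an episode ends (the logging is minimal on purpose).

```python
env = make_env()
agent = Agent(env.action_space.n)

obs, info = env.reset(seed=SEED)
episode_reward, episode_rewards = 0.0, []

for step in range(TOTAL_STEPS):
    action = agent.act(obs)
    next_obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    agent.buffer.push(np.asarray(obs), action, reward, np.asarray(next_obs), float(done))
    agent.learn()
    obs = next_obs
    episode_reward += reward
    if done:
        episode_rewards.append(episode_reward)
        print(f"step {step}  episode {len(episode_rewards)}  reward {episode_reward:.1f}")
        episode_reward = 0.0
        obs, info = env.reset()
```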

Play by policy
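A sketch of playing greedily with the trained online network, rendering to the screen via Gymnasium's human render mode.

```python
def play(agent, episodes=1):
    """Run the trained (greedy) policy with rendering."""
    env = make_env(render_mode="human")
    for _ in range(episodes):
        obs, info = env.reset()
        done, total = False, 0.0
        while not done:
            s = torch.as_tensor(np.asarray(obs), dtype=torch.float32, device=device).unsqueeze(0)
            with torch.no_grad():
                action = int(agent.online(s).argmax(dim=1).item())
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            total += reward
        print(f"episode reward: {total:.1f}")
    env.close()
```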

Play Random
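And a sketch of a random-action baseline for comparison:

```python
def play_random(episodes=1):
    """Baseline: take uniformly random actions."""
    env = make_env(render_mode="human")
    for _ in range(episodes):
        obs, info = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
            done = terminated or truncated
            total += reward
        print(f"random-play reward: {total:.1f}")
    env.close()
```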