Reinforcement Learning: Q-Learning with OpenAI Taxi
I wanted to start educating myself about reinforcement learning and its algorithms, so I decided to begin with the simplest (and most famous) one: Q-Learning.
Code is from this video and this article¶
In [2]:
import numpy as np
import gym
import random
The Taxi Problem¶
There are four designated locations in the grid world, indicated by R(ed), B(lue), G(reen), and Y(ellow). When an episode starts, the taxi spawns at a random square and the passenger is at a random location. The taxi must drive to the passenger's location, pick up the passenger, drive to the passenger's destination (another one of the four designated locations), and then drop off the passenger. Once the passenger is dropped off, the episode ends.
There are 500 discrete states, since there are 25 taxi positions, 5 possible passenger locations (including the case when the passenger is in the taxi), and 4 destination locations.
There are 6 discrete deterministic actions:
- 0: move south
- 1: move north
- 2: move east
- 3: move west
- 4: pickup passenger
- 5: dropoff passenger
Rendering colors:
- blue: passenger
- magenta: destination
- yellow: empty taxi
- green: full taxi
- other letters (R, G, B, Y): locations
In [3]:
env = gym.make("Taxi-v2")
env.render()
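As a quick illustrative aside (not part of the original notebook), each of the 500 states described above is just an integer, and it can be unpacked back into taxi position, passenger location, and destination. The sketch below assumes the underlying TaxiEnv exposes a decode() helper via env.unwrapped, which the gym versions that ship Taxi-v2 do; treat it as an assumption for other versions.
```python
# Illustrative only: unpack one encoded state integer back into its components.
# Assumes env.unwrapped (the raw TaxiEnv) provides decode(); check your gym version.
state = env.reset()
taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state)
print("state %d -> taxi at (%d, %d), passenger location %d, destination %d"
      % (state, taxi_row, taxi_col, passenger_loc, destination))
```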
Initialize the needed variables and create the Q-table¶
In [4]:
print("Number of actions: %d" % env.action_space.n)
print("Number of states: %d" % env.observation_space.n)
In [5]:
action_size = env.action_space.n
state_size = env.observation_space.n
In [6]:
qtable = np.zeros((state_size, action_size))
print(qtable)
In [7]:
total_episodes = 50000       # number of training episodes
total_test_episodes = 5      # number of evaluation episodes
max_steps = 99               # maximum steps per episode
learning_rate = 0.7          # alpha
discount_rate = 0.9          # also known as gamma
epsilon = 1.0                # exploration rate (start fully exploratory)
max_epsilon = 1.0            # exploration probability at the start
min_epsilon = 0.01           # minimum exploration probability
decay_rate = 0.01            # exponential decay rate for epsilon
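As a small illustration (not in the original notebook), here is how epsilon will shrink under the exponential decay schedule applied at the end of each training episode below, using the values just defined:
```python
# Illustrative only: preview the epsilon decay schedule used in the training loop.
for ep in [0, 10, 100, 500, 1000]:
    eps = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * ep)
    print("episode %5d -> epsilon %.3f" % (ep, eps))
```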
What is the discount factor?¶
The discount factor controls how much weight the value function gives to future rewards. A discount factor of γ = 0 makes the state/action values represent only the immediate reward, while a higher discount factor such as γ = 0.9 makes them represent the cumulative discounted future reward the agent expects to receive (while behaving under a given policy).
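As a toy illustration (not in the original notebook), take a hypothetical reward sequence of three -1 step penalties followed by a +20 dropoff reward, and compare the discounted return for γ = 0 and γ = 0.9:
```python
# Illustrative only: discounted return G = sum_t gamma^t * r_t for a toy reward sequence.
rewards = [-1, -1, -1, 20]   # hypothetical per-step rewards ending in a successful dropoff

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))   # -1: only the immediate reward counts
print(discounted_return(rewards, 0.9))   # ~11.87: future rewards still matter
```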
Q-Learning equation¶
\begin{equation*} Q^{new}(s_t, a_t) = (1 - \alpha) \cdot Q(s_t, a_t) + \alpha \cdot \left( r_t + \gamma \cdot \max_{a} Q(s_{t+1}, a) \right) \end{equation*}
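As a hedged sketch (not part of the original notebook), the same update can be written as a small helper; q_update is a hypothetical name, and it is exactly the rule used in the training loop below:
```python
# Illustrative helper (hypothetical name): one Q-learning update on a NumPy Q-table.
def q_update(qtable, state, action, reward, new_state, alpha, gamma):
    target = reward + gamma * np.max(qtable[new_state, :])   # r + gamma * max_a' Q(s', a')
    qtable[state, action] = (1 - alpha) * qtable[state, action] + alpha * target
    return qtable
```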
Training the Q-table¶
In [8]:
for episode in range(total_episodes):
    # Reset the environment every time a new episode begins
    state = env.reset()
    step = 0
    done = False

    for step in range(max_steps):
        # Choose an action in the current state:
        # generate a random number for the exploration/exploitation trade-off
        exp_exp_tradeoff = random.uniform(0, 1)

        # If the random number > epsilon --> exploitation
        # (select the action with the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state, :])
        # Else, make a random choice --> exploration
        else:
            action = env.action_space.sample()

        # Take the action (a) and observe the outcome state (s') and reward (R)
        new_state, reward, done, info = env.step(action)

        # Update the Q value for this state/action pair using
        # Q(s,a) = Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + discount_rate * np.max(qtable[new_state, :]) - qtable[state, action])

        state = new_state

        if done:
            break

    # Reduce epsilon (we want less exploration as training progresses)
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
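As an optional sanity check (not in the original notebook), we can roughly estimate how much of the table training has touched: states the agent never visited still have all-zero rows.
```python
# Illustrative only: rough count of states whose best Q value is no longer zero,
# i.e. states that were actually visited and updated during training.
visited = np.count_nonzero(np.max(qtable, axis=1))
print("States with a learned value: %d / %d" % (visited, state_size))
```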
Testing the performance of the Q-table¶
In [10]:
env.reset()
rewards = []

for episode in range(total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    print("******************************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        env.render()
        # Always take the action with the highest Q value (pure exploitation)
        action = np.argmax(qtable[state, :])

        new_state, reward, done, info = env.step(action)
        total_rewards += reward

        if done:
            env.render()
            rewards.append(total_rewards)
            print("Score: ", total_rewards)
            break

        state = new_state

env.close()
print("Mean score over the test episodes: " + str(sum(rewards) / total_test_episodes))