import numpy as np
import gym
import random

The Taxi Problem¶

There are four designated locations in the grid world indicated by R(ed), B(lue), G(reen), and Y(ellow). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drive to the passenger's location, pick up the passenger, drive to the passenger's destination (another one of the four specified locations), and then drop off the passenger. Once the passenger is dropped off, the episode ends. There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is the taxi), and 4 destination locations. Actions: There are 6 discrete deterministic actions:

0: move south
1: move north
2: move east
3: move west
4: pickup passenger
5: dropoff passenger

Rewards: There is a reward of -1 for each action and an additional reward of +20 for delievering the passenger. There is a reward of -10 for executing actions "pickup" and "dropoff" illegally. Rendering:

blue: passenger
magenta: destination
yellow: empty taxi
green: full taxi
other letters: locations

ENV_NAME = "Taxi-v2"
env = gym.make(ENV_NAME)
env.render()

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

print("Number of actions: %d" % env.action_space.n)
print("Number of states: %d" % env.observation_space.n)

Number of actions: 6
Number of states: 500

action_size = env.action_space.n
state_size = env.observation_space.n

np.random.seed(123)
env.seed(123)

[123]

Keras-RL and gym's discrete environments¶

Keras-RL examples does not use gym's discrete environment as examples. Being the beginner that I am to both Keras-RL and gym, I had to find another source to refer to for discrete environments. Therefore, I found and referred to this example which used gym's Frozen Lake environment, which is a discrete environment in gym, as reference.

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Embedding, Reshape
from keras.optimizers import Adam

Using TensorFlow backend.

What does an Embedding layer do and what are the parameters?¶

Embedding(input_dimensions=500, output_dimensions=6, input_length)¶

In Deep Q-Learning, the input to the neural network are possible states of the environment and the output of the neural network is the action to be taken. The input_length for a discrete environment in OpenAi's gym (e.g Taxi, Frozen Lake) is 1 because the output from env.step(env.action_space.sample())[0] (e.g. the state it will be in), is a single number.

env.reset()
env.step(env.action_space.sample())[0]

351

In the Embedding layer, the input_dimensions refers to the number of states and output_dimensions refers to the vector space we are squishing it to. This means that we have 500 possible states and we want it to be represented by 6 values.

If you do not want to add any dense layers (meaning that you only want a single layer neural network, which is the embedding layer), you will have to set the output_dimensions of the Embedding layer to be the same as the action space of the environment. This means that output_dimensions must be 6 when you are using the Taxi environment because there can only be 6 actions, which are go up, go down, go left, go right, pickup passenger and drop passenger.

model_only_embedding = Sequential()
model_only_embedding.add(Embedding(500, 6, input_length=1))
model_only_embedding.add(Reshape((6,)))
print(model_only_embedding.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 1, 6)              3000      
_________________________________________________________________
reshape_1 (Reshape)          (None, 6)                 0         
=================================================================
Total params: 3,000
Trainable params: 3,000
Non-trainable params: 0
_________________________________________________________________
None

If you want to add Dense layers after the Embedding layer, you can choose your own output_dimensions for your Embedding layer (it does not have to follow the action space size), but the final Dense layer must have the same output size as your action space size.

model = Sequential()
model.add(Embedding(500, 10, input_length=1))
model.add(Reshape((10,)))
model.add(Dense(50, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(action_size, activation='linear'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 1, 10)             5000      
_________________________________________________________________
reshape_3 (Reshape)          (None, 10)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 50)                550       
_________________________________________________________________
dense_5 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_6 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_7 (Dense)              (None, 6)                 306       
=================================================================
Total params: 10,956
Trainable params: 10,956
Non-trainable params: 0
_________________________________________________________________
None

What does the Reshape layer do?¶

In the reshape layer, we take the output from the previous layer and reshape it to a rank 1 tensor (a one-dimensional array). In this notebook, (6,) means a one dimensional array with 6 values. For example, [1, 2, 3, 4, 5, 6]

Parameters when fitting the neural network¶

I tried to set the nb_steps and nb_max_episode_steps to be the same as total_episodes and max_steps in the previous blog post, Q-learning with OpenAI Taxi.

I will be training both the neural network with only the Embedding layer (dqn_only_embedding) and the neural network with a few Dense layers (dqn)¶

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

memory = SequentialMemory(limit=50000, window_length=1)
policy = EpsGreedyQPolicy()
dqn_only_embedding = DQNAgent(model=model, nb_actions=action_size, memory=memory, nb_steps_warmup=500, target_model_update=1e-2, policy=policy)
dqn_only_embedding.compile(Adam(lr=1e-3), metrics=['mae'])
dqn_only_embedding.fit(env, nb_steps=1000000, visualize=False, verbose=1, nb_max_episode_steps=99, log_interval=100000)

Training for 1000000 steps ...
Interval 1 (0 steps performed)
100000/100000 [==============================] - 317s 3ms/step - reward: -0.0663
5872 episodes - episode_reward: -1.128 [-387.000, 15.000] - loss: 0.590 - mean_absolute_error: 8.946 - mean_q: 7.323 - prob: 1.000

Interval 2 (100000 steps performed)
100000/100000 [==============================] - 306s 3ms/step - reward: 0.2085
7019 episodes - episode_reward: 2.971 [-79.000, 15.000] - loss: 0.003 - mean_absolute_error: 7.507 - mean_q: 12.961 - prob: 1.000

Interval 3 (200000 steps performed)
100000/100000 [==============================] - 339s 3ms/step - reward: 0.2120
7055 episodes - episode_reward: 3.007 [-135.000, 15.000] - loss: 0.003 - mean_absolute_error: 7.525 - mean_q: 12.990 - prob: 1.000

Interval 4 (300000 steps performed)
100000/100000 [==============================] - 342s 3ms/step - reward: 0.2238
7076 episodes - episode_reward: 3.159 [-41.000, 15.000] - loss: 0.002 - mean_absolute_error: 7.527 - mean_q: 13.004 - prob: 1.000

Interval 5 (400000 steps performed)
100000/100000 [==============================] - 343s 3ms/step - reward: 0.2200
7053 episodes - episode_reward: 3.122 [-59.000, 15.000] - loss: 0.001 - mean_absolute_error: 7.539 - mean_q: 13.019 - prob: 1.000

Interval 6 (500000 steps performed)
100000/100000 [==============================] - 312s 3ms/step - reward: 0.2232
7087 episodes - episode_reward: 3.147 [-55.000, 15.000] - loss: 0.001 - mean_absolute_error: 7.536 - mean_q: 13.014 - prob: 1.000

Interval 7 (600000 steps performed)
100000/100000 [==============================] - 301s 3ms/step - reward: 0.2137
7056 episodes - episode_reward: 3.030 [-98.000, 15.000] - loss: 0.001 - mean_absolute_error: 7.537 - mean_q: 13.023 - prob: 1.000

Interval 8 (700000 steps performed)
100000/100000 [==============================] - 292s 3ms/step - reward: 0.2246
7078 episodes - episode_reward: 3.172 [-61.000, 15.000] - loss: 0.001 - mean_absolute_error: 7.533 - mean_q: 13.014 - prob: 1.000

Interval 9 (800000 steps performed)
100000/100000 [==============================] - 360s 4ms/step - reward: 0.2141 0s - reward: 0
7053 episodes - episode_reward: 3.036 [-63.000, 15.000] - loss: 0.001 - mean_absolute_error: 7.527 - mean_q: 13.013 - prob: 1.000

Interval 10 (900000 steps performed)
100000/100000 [==============================] - 323s 3ms/step - reward: 0.2296
done, took 3236.311 seconds

<keras.callbacks.History at 0x1c2bca40438>

dqn_only_embedding.test(env, nb_episodes=5, visualize=True, nb_max_episode_steps=99)

Testing for 5 episodes ...
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :G|
| : : : :_|
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : :_: |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : :_: : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| :_: : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R:_| : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Episode 1: reward: 10.000, steps: 11
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :G|
|_: : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
|_: : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
|_| : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Episode 2: reward: 9.000, steps: 12
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :G|
|_: : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
|_: : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
|_| : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Episode 3: reward: 14.000, steps: 7
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :G|
|_: : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| :_: : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : :_: : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: |_: :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | :_:G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Episode 4: reward: 8.000, steps: 13
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :G|
|_: : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
|_: : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| :_: : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : :_: : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : :_: |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : |_: |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Episode 5: reward: 5.000, steps: 16

<keras.callbacks.History at 0x1c2bae79908>

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

memory = SequentialMemory(limit=50000, window_length=1)
policy = EpsGreedyQPolicy()
dqn = DQNAgent(model=model, nb_actions=action_size, memory=memory, nb_steps_warmup=500, target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
dqn.fit(env, nb_steps=1000000, visualize=False, verbose=1, nb_max_episode_steps=99, log_interval=100000)

Training for 1000000 steps ...
Interval 1 (0 steps performed)
100000/100000 [==============================] - 361s 4ms/step - reward: 0.2225
7095 episodes - episode_reward: 3.136 [-47.000, 15.000] - loss: 0.001 - mean_absolute_error: 7.561 - mean_q: 13.064 - prob: 1.000

Interval 2 (100000 steps performed)
100000/100000 [==============================] - 356s 4ms/step - reward: 0.2244
7055 episodes - episode_reward: 3.182 [-108.000, 15.000] - loss: 0.001 - mean_absolute_error: 7.530 - mean_q: 13.017 - prob: 1.000

Interval 3 (200000 steps performed)
100000/100000 [==============================] - 363s 4ms/step - reward: 0.2240
7069 episodes - episode_reward: 3.171 [-105.000, 15.000] - loss: 0.002 - mean_absolute_error: 7.525 - mean_q: 13.006 - prob: 1.000

Interval 4 (300000 steps performed)
100000/100000 [==============================] - 357s 4ms/step - reward: 0.2180
7057 episodes - episode_reward: 3.090 [-53.000, 15.000] - loss: 0.001 - mean_absolute_error: 7.515 - mean_q: 13.002 - prob: 1.000

Interval 5 (400000 steps performed)
100000/100000 [==============================] - 277s 3ms/step - reward: 0.2418
7135 episodes - episode_reward: 3.386 [-49.000, 15.000] - loss: 0.001 - mean_absolute_error: 7.546 - mean_q: 13.052 - prob: 1.000

Interval 6 (500000 steps performed)
100000/100000 [==============================] - 311s 3ms/step - reward: 0.2259
7096 episodes - episode_reward: 3.185 [-97.000, 15.000] - loss: 0.001 - mean_absolute_error: 7.554 - mean_q: 13.052 - prob: 1.000

Interval 7 (600000 steps performed)
100000/100000 [==============================] - 321s 3ms/step - reward: 0.2271
7095 episodes - episode_reward: 3.200 [-76.000, 15.000] - loss: 0.001 - mean_absolute_error: 7.548 - mean_q: 13.043 - prob: 1.000

Interval 8 (700000 steps performed)
100000/100000 [==============================] - 305s 3ms/step - reward: 0.2154
7061 episodes - episode_reward: 3.051 [-117.000, 15.000] - loss: 0.003 - mean_absolute_error: 7.534 - mean_q: 13.022 - prob: 1.000

Interval 9 (800000 steps performed)
100000/100000 [==============================] - 367s 4ms/step - reward: 0.2158
7063 episodes - episode_reward: 3.057 [-93.000, 15.000] - loss: 0.001 - mean_absolute_error: 7.537 - mean_q: 13.028 - prob: 1.000

Interval 10 (900000 steps performed)
100000/100000 [==============================] - 322s 3ms/step - reward: 0.2203
done, took 3342.801 seconds

<keras.callbacks.History at 0x1c2c04b9a90>

dqn.test(env, nb_episodes=5, visualize=True, nb_max_episode_steps=99)

Testing for 5 episodes ...
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : |_: |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : :_: |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : :_: : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| :_: : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
|_: : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
|_| : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Episode 1: reward: 6.000, steps: 15
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :G|
| : : : :_|
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : :_: |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : :_: : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| :_: : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R:_| : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Episode 2: reward: 10.000, steps: 11
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :G|
| : : : :_|
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : :_: |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : :_: : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| :_: : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R:_| : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Episode 3: reward: 11.000, steps: 10
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :G|
|_: : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
|_: : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
|_| : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Episode 4: reward: 8.000, steps: 13
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :G|
| : : : :_|
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : :_|
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : :_: |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : |_: |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Episode 5: reward: 8.000, steps: 13

<keras.callbacks.History at 0x1c2bae79c88>

dqn.save_weights('dqn_{}_weights.h5f'.format("Taxi-v2"), overwrite=True)

Tiew Kee Hui's Blog

Reinforcement Learning: Deep Q-Network (DQN) with Open AI Taxi