Deep Q-Networks with Lunar Lander

2021-12-26

The Udacity RL nanodegree has an exercise coding a deep Q network (DQN). They also want you to add experience replay and a separate target network. For this page, I will keep it simple, skip those two additions, and do a ‘plain’ DQN.

Background

Installation

I had to make sure additional libraries were installed:

pip install Box2D box2d-py
pip install pyglet
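
Depending on your setup, gym itself may also need to be installed; newer gym releases also expose the Box2D dependencies as an optional extra (adjust to whatever your environment already has):

pip install gym
pip install 'gym[box2d]'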

Visualization

This is how the rendering works. It opens a new window.

import gym

env = gym.make('LunarLander-v2')
env.seed(0)
state = env.reset()
env.render()

If you want to display the output within the notebook, you can do the following.

import gym
import matplotlib.pyplot as plt
from IPython import display

env = gym.make('LunarLander-v2')
env.seed(0)
state = env.reset()
img = plt.imshow(env.render(mode='rgb_array'))

# Take a step
action = env.action_space.sample()
state, reward, done, _ = env.step(action)
img.set_data(env.render(mode='rgb_array'))
plt.axis('off')
display.display(plt.gcf())
display.clear_output(wait=True)

It will now display in the notebook, but for me it also opens a separate window.

You can close the viewer with env.close().

State and Action Space

The state space consists of 8 values. Thanks to reddit for helping me figure this detail out: https://www.reddit.com/r/reinforcementlearning/comments/g6h7x6/observation_space_of_openai_gym_continuous_lunar.

  • The first 2 are the position on the x axis and y axis (height)
  • The next 2 are the x and y velocity terms
  • Then the lander angle and angular velocity
  • The left and right leg ground contact flags (bool)

The action space consists of 4 discrete actions; you can verify both spaces with the short check after this list.

  • Do nothing
  • Fire the left orientation engine
  • Fire the main engine
  • Fire the right orientation engine
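
As a quick sanity check on the sizes above, you can print the spaces directly (a minimal sketch; the exact way the Box space prints depends on your gym version):

import gym

env = gym.make('LunarLander-v2')
print(env.observation_space)        # Box with shape (8,) - the 8 state values listed above
print(env.observation_space.shape)  # (8,)
print(env.action_space)             # Discrete(4) - the 4 actions listed above
print(env.action_space.n)           # 4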

Other relevant details from the github repo:

The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector. Reward for moving from the top of the screen to the landing pad and zero speed is about 100..140 points. If the lander moves away from the landing pad it loses reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points. Each leg with ground contact is +10 points. Firing the main engine is -0.3 points each frame. Firing the side engine is -0.03 points each frame. Solved is 200 points.
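
Since solving the environment means averaging 200 points, one simple way to track progress is a running mean over the most recent episode scores. This is a minimal sketch, assuming a window of the last 100 episodes (the window size is my assumption, not something the repo specifies):

import numpy as np
from collections import deque

scores_window = deque(maxlen=100)  # assumed window of the last 100 episode scores

def is_solved(scores_window, threshold=200.0):
    # Consider it solved once the window is full and the average score reaches the threshold
    return len(scores_window) == scores_window.maxlen and np.mean(scores_window) >= threshold

# After each episode: scores_window.append(total_reward)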

Additional Resources

I relied on the DQN chapter in the Deep Reinforcement Learning in Action book. Since the book uses PyTorch and I wanted to use Tensorflow for fun, I made use of this other nice resource: DQN from Scratch with Tensorflow 2 and the associated github page.

Random Agent

Let’s get a sense of what’s happening by watching what a random agent does.

import gym
import matplotlib.pyplot as plt
from IPython import display

env = gym.make('LunarLander-v2')
env.seed(0)

# watch an untrained agent
def simulate(env: gym.Env) -> None:
    state = env.reset()
    img = plt.imshow(env.render(mode='rgb_array'))
    done = False
    while not done:
        action = env.action_space.sample()
        img.set_data(env.render(mode='rgb_array'))
        plt.axis('off')
        display.display(plt.gcf())
        display.clear_output(wait=True)
        state, reward, done, _ = env.step(action)
    env.close()

simulate(env)

DQN

import gym
import numpy as np
import random
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

env = gym.make('LunarLander-v2')
env.seed(0)

state_size = env.observation_space.shape[0]
action_size = env.action_space.n

Model

We need a function that takes in the state and returns the q-value of each possible action. We can use a neural network to approximate this function.

q_net = Sequential()
q_net.add(Dense(64, input_dim=state_size, activation='relu', kernel_initializer='he_uniform'))
q_net.add(Dense(32, activation='relu', kernel_initializer='he_uniform'))
q_net.add(Dense(action_size, activation='linear', kernel_initializer='he_uniform'))
q_net.compile(optimizer=tf.optimizers.Adam(learning_rate=0.001), loss='mse')

We use 2 hidden layers and an output layer with one unit per possible action.
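
As a quick shape check, you can feed the freshly built network a single reset state. A minimal sketch (the printed q-values are meaningless at this point, since the weights are randomly initialized):

state = env.reset()
qvals = q_net(state[np.newaxis])  # add a batch dimension: (1, 8) in, (1, 4) out
print(qvals.numpy())              # one (untrained) q-value per action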

Hyperparameters

gamma = 0.9
epsilon = 1.0 # Kept fixed at 1.0 here, so the agent always acts randomly; ideally this would decay over episodes

One Episode

Let’s walk through what a whole episode looks like.

done = False
state = env.reset()
total_reward = 0
losses = []
while not done:
    # 1. Predicted qvalues
    qvals = q_net(state[np.newaxis])

    # 2. Here we will select the action either by chance (prob of epsilon) or greedy (prob of 1-epsilon)
    if (random.random() < epsilon):
        action = np.random.randint(0,env.action_space.n)
    else:
        action = np.argmax(qvals)

    # 3. Now let's make this move
    next_state, reward, done, info = env.step(action)
    total_reward += reward

    # 4. Calculate the target qvalue
    ## What are the qvalues for the next state
    next_qvals = q_net(next_state[np.newaxis]).numpy()
    ## Calculate by taking the reward plus the discounted max next qvalue (just the reward if the episode ended)
    target_qval = reward + gamma*np.max(next_qvals)*(1 - done)
    ## Copy over the other q-values
    target_qvals = np.copy(qvals)
    target_qvals[0,action] = target_qval

    # 5. Fit
    training_history = q_net.fit(x=state[np.newaxis], y=target_qvals, verbose=0)

    # 6. Get ready for next move
    state = next_state
    losses.extend(training_history.history['loss'])

Let’s highlight some points:

  1. What are the possible action values? We call our neural network to get the predicted q-values for the present state.
qvals = q_net(state[np.newaxis])
  2. What action should we take? We sometimes choose our action at random with probability epsilon, which promotes exploration. Ideally, we’d decrease this epsilon value over the course of many episodes. At other times, we choose the greedy option, which is the action with the maximum q-value.
if (random.random() < epsilon):
    action = np.random.randint(0,env.action_space.n)
else:
    action = np.argmax(qvals)
  3. We take the chosen action, which gives us the next state and the reward for this step.
next_state, reward, done, info = env.step(action)
  4. We calculate the target q-value based on the next state. We use our neural network to get the q-values for this next state, take the maximum one, and discount it by gamma. This is based on the Bellman equation. If the episode has ended there is no next state to bootstrap from, so the target is just the reward. We keep the other q-values the same and only update the q-value for the action taken.
## What are the qvalues for the next state
next_qvals = q_net(next_state[np.newaxis]).numpy()
## Calculate by taking the reward plus the discounted max next qvalue (just the reward if the episode ended)
target_qval = reward + gamma*np.max(next_qvals)*(1 - done)
## Copy over the other q-values
target_qvals = np.copy(qvals)
target_qvals[0,action] = target_qval
  5. We fit the model with our initial state and the target q-values. The model will generate the predicted q-values qvals and compare those with target_qvals using the mean-squared error. It will then update the weights so that, given a new state, the q-values are more accurate and lead toward the optimal policy.
training_history = q_net.fit(x=state[np.newaxis], y=target_qvals, verbose=0)
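
To train the agent properly, you would wrap the episode above in an outer loop over many episodes and decay epsilon as training goes on, as mentioned earlier. Here is a minimal sketch of that outer loop; the episode count, decay rate, and minimum epsilon are assumed values, and run_episode is a hypothetical helper that wraps the single-episode code above and returns the total reward:

n_episodes = 500       # assumed, not tuned
epsilon = 1.0
epsilon_min = 0.01     # assumed floor so some exploration always remains
epsilon_decay = 0.995  # assumed per-episode decay factor
scores = []

for episode in range(n_episodes):
    # run_episode is hypothetical: the while-loop above wrapped in a function
    total_reward = run_episode(env, q_net, epsilon)
    scores.append(total_reward)
    epsilon = max(epsilon_min, epsilon * epsilon_decay)  # explore less as the agent improves
    if (episode + 1) % 20 == 0:
        print(f'episode {episode + 1}, score {total_reward:.1f}, epsilon {epsilon:.2f}')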