# Deep Q-Networks with Lunar Lander

The Udacity RL nanodegree has an exercise on coding a deep Q-network (DQN). The exercise also asks you to add experience replay and a separate target network. For this page, I will keep it simple, ignore those two additions, and build a ‘plain’ DQN.

# Background

## Installation

I had to make sure additional libraries were installed:

```
pip install Box2D box2d-py
pip install pyglet
```

## Visualization

This is how the rendering works. It opens a new window.

```
import gym
env = gym.make('LunarLander-v2')
env.seed(0)
state = env.reset()
env.render()
```

If you want to display the output within the notebook, you can do the following.

```
import gym
import matplotlib.pyplot as plt
from IPython import display
env = gym.make('LunarLander-v2')
env.seed(0)
state = env.reset()
img = plt.imshow(env.render(mode='rgb_array'))
# Take a step
action = env.action_space.sample()
state, reward, done, _ = env.step(action)
img.set_data(env.render(mode='rgb_array'))
plt.axis('off')
display.display(plt.gcf())
display.clear_output(wait=True)
```

It will now display in the notebook, but for me it also opens a new window.

When you are done, you can close the viewer with `env.close()`.

## State and Action Space

The state space consists of 8 values. Thanks to reddit for helping me figure this detail out: https://www.reddit.com/r/reinforcementlearning/comments/g6h7x6/observation_space_of_openai_gym_continuous_lunar.

- The first 2 are the position along the x axis and y axis (height)
- The next 2 are the x and y velocity terms
- Then come the lander angle and angular velocity
- The last 2 are the left and right leg contact flags (bool)
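To keep the indices straight, the layout above can be wrapped in a small named container. This is only a sketch: `LanderState` and `label_state` are hypothetical helpers of mine, not part of gym, and the example observation is made up.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Hypothetical helper: names for the 8 observation values."""
    x: float
    y: float
    x_velocity: float
    y_velocity: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def label_state(state: Sequence[float]) -> LanderState:
    """Attach names to a raw 8-value observation, in the order listed above."""
    if len(state) != 8:
        raise ValueError(f"expected 8 values, got {len(state)}")
    # The first six values are continuous; the last two are contact flags
    *continuous, left, right = (float(v) for v in state)
    return LanderState(*continuous, bool(left), bool(right))


# Made-up observation, only for illustration
example = label_state([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 1.0])
print(example.y, example.right_leg_contact)
```

With real data you would pass in the array returned by `env.reset()` or `env.step(action)`.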

The action space consists of 4 values.

- Do nothing
- Fire the left orientation engine
- Fire the main engine
- Fire the right orientation engine
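The same trick works for actions. The sketch below assumes the integer indices follow the order of the list above (0 to 3); `LanderAction` is a hypothetical name of mine, not something gym provides.

```python
from enum import IntEnum


class LanderAction(IntEnum):
    """Hypothetical names for the 4 discrete actions, assuming list order."""
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


# Turn a raw action index into a readable name
action = LanderAction(2)
print(action.name)
```

Because `IntEnum` members are plain integers, a member can be passed wherever an action index is expected.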

Other relevant details from the GitHub repo:

The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector. Reward for moving from the top of the screen to the landing pad and zero speed is about 100..140 points. If the lander moves away from the landing pad it loses reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points. Each leg with ground contact is +10 points. Firing the main engine is -0.3 points each frame. Firing the side engine is -0.03 points each frame. Solved is 200 points.

## Additional Resources

I relied on the DQN chapter in the Deep Reinforcement Learning in Action book. Since the book uses PyTorch and I wanted to use TensorFlow for fun, I also made use of another nice resource: DQN from Scratch with Tensorflow 2 and the associated GitHub page.

# Random Agent

Let’s get a sense of what’s happening by watching a random agent’s actions.

```
import gym
import matplotlib.pyplot as plt
from IPython import display

env = gym.make('LunarLander-v2')
env.seed(0)

# watch an untrained agent
def simulate(env: gym.Env) -> None:
    state = env.reset()
    img = plt.imshow(env.render(mode='rgb_array'))
    done = False
    while not done:
        action = env.action_space.sample()
        img.set_data(env.render(mode='rgb_array'))
        plt.axis('off')
        display.display(plt.gcf())
        display.clear_output(wait=True)
        state, reward, done, _ = env.step(action)
    env.close()

simulate(env)
```

# DQN

```
import random

import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

env = gym.make('LunarLander-v2')
env.seed(0)

# observation_space.shape is the tuple (8,); Dense's input_dim needs the integer 8
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
```

## Model

We need a function that takes in the state and returns the q-value for each possible action. We can use a neural network to approximate this function.

```
q_net = Sequential()
q_net.add(Dense(64, input_dim=state_size, activation='relu', kernel_initializer='he_uniform'))
q_net.add(Dense(32, activation='relu', kernel_initializer='he_uniform'))
q_net.add(Dense(action_size, activation='linear', kernel_initializer='he_uniform'))
q_net.compile(optimizer=tf.optimizers.Adam(learning_rate=0.001), loss='mse')
```

We use 2 hidden layers and an output layer with one unit per possible action, so a single forward pass gives the q-values for all 4 actions.

## Hyperparameters

```
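gamma = 0.9    # discount factor for future rewards
epsilon = 1.0  # exploration probability; held at 1.0 here (no decay), so every action is random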
```

## One Episode

Let’s walk through what a whole episode looks like.

```
losses = []  # training loss recorded at every step
done = False
state = env.reset()
total_reward = 0
while not done:
    # 1. Predicted qvalues
    qvals = q_net(state[np.newaxis])
    # 2. Here we will select the action either by chance (prob of epsilon) or greedy (prob of 1-epsilon)
    if random.random() < epsilon:
        action = np.random.randint(0, env.action_space.n)
    else:
        action = np.argmax(qvals)
    # 3. Now let's make this move
    next_state, reward, done, info = env.step(action)
    total_reward += reward
    # 4. Calculate the target qvalue
    ## What are the qvalues for the next state
    next_qvals = q_net(next_state[np.newaxis]).numpy()
    ## Calculate by taking reward with weighted next max qvalue
    target_qval = reward + gamma*np.max(next_qvals)
    ## Copy over the other q-values
    target_qvals = np.copy(qvals)
    target_qvals[0, action] = target_qval
    # 5. Fit
    training_history = q_net.fit(x=state[np.newaxis], y=target_qvals, verbose=0)
    # 6. Get ready for next move
    state = next_state
    losses.extend(training_history.history['loss'])
```

Let’s highlight some points:

- What are the possible action values? We call our neural network to get the predicted q-values given the present state.

```
qvals = q_net(state[np.newaxis])
```

- What action should we take? We sometimes choose our action at random with probability of epsilon, which promotes exploration. Ideally, we’d decrease this epsilon value over the course of many episodes. At other times, we’ll choose the greedy option, which is the action with maximum q-value.

```
if random.random() < epsilon:
    action = np.random.randint(0, env.action_space.n)
else:
    action = np.argmax(qvals)
```
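The decay mentioned above is not part of the episode code on this page. A minimal sketch of one common option, multiplicative decay with a floor, could look like the following. The names `select_action` and `decay_epsilon`, the rate of 0.995, and the floor of 0.01 are my own assumptions, not values from the exercise.

```python
import random
from typing import Sequence


def select_action(qvals: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Epsilon-greedy: random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(qvals))
    return max(range(len(qvals)), key=lambda a: qvals[a])


def decay_epsilon(epsilon: float, rate: float = 0.995, floor: float = 0.01) -> float:
    """Multiplicative decay with a lower bound (rate and floor are assumed values)."""
    return max(floor, epsilon * rate)


rng = random.Random(0)
epsilon = 1.0
for episode in range(1000):
    # ... run one episode here, calling select_action at every step ...
    epsilon = decay_epsilon(epsilon)
```

Early episodes are then almost entirely random, while later ones mostly follow the network's greedy choice but still explore occasionally.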

- We take the given action, which gives us the next state and the reward for this step.

```
next_state, reward, done, info = env.step(action)
```

- We calculate the target q-values based on the next state. We use our neural network to calculate the q-values for this next state, take the max q-value (the greedy choice), weight it by gamma, and add the reward. This is based on the Bellman equation. We keep the other q-values the same and only update the q-value for the action taken.

```
## What are the qvalues for the next state
next_qvals = q_net(next_state[np.newaxis]).numpy()
## Calculate by taking reward with weighted next max qvalue
target_qval = reward + gamma*np.max(next_qvals)
## Copy over the other q-values
target_qvals = np.copy(qvals)
target_qvals[0,action] = target_qval
```
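One detail the snippet above does not handle is the final step of an episode: once `done` is true there is no next state to bootstrap from, so a common refinement is to use the reward alone as the target. Here is a sketch of that idea in plain Python, with made-up numbers. The helper names `td_target` and `build_target_vector` are hypothetical, and the terminal-state handling is my addition rather than part of the code on this page.

```python
from typing import List, Sequence


def td_target(reward: float, next_qvals: Sequence[float], gamma: float, done: bool) -> float:
    """One-step target: r + gamma * max Q(s', a'), or just r at a terminal state."""
    if done:
        return reward
    return reward + gamma * max(next_qvals)


def build_target_vector(qvals: Sequence[float], action: int, target: float) -> List[float]:
    """Copy the predicted q-values and overwrite only the entry for the action taken."""
    targets = list(qvals)
    targets[action] = target
    return targets


# Made-up numbers, only for illustration
qvals = [0.5, 1.0, -0.2, 0.3]
next_qvals = [0.0, 2.0, 1.0, -1.0]
target = td_target(reward=1.0, next_qvals=next_qvals, gamma=0.9, done=False)
print(build_target_vector(qvals, action=2, target=target))
```

In the episode loop this would amount to passing `done` from `env.step(action)` into the target calculation.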

- We fit the model with our initial state and the target q-values. Our model will generate the predicted q-values `qvals` and compare those with the `target_qvals` using the mean-squared error. It will then update the weights so that, given a new state, the q-values will be more accurate and lead toward the optimal policy.

```
training_history = q_net.fit(x=state[np.newaxis], y=target_qvals, verbose=0)
```
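To see why copying over the other q-values matters, here is a small sketch of the mean-squared error computed by hand on made-up numbers. Since the target equals the prediction everywhere except at the action taken, only that one entry contributes to the loss. The `mse` helper is my own stand-in for what the `'mse'` loss computes for a single sample.

```python
from typing import Sequence


def mse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean-squared error averaged over the action dimension."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Made-up numbers: action 2 was taken and its target q-value is 2.8
predicted = [0.5, 1.0, -0.2, 0.3]
target = [0.5, 1.0, 2.8, 0.3]

# Entries 0, 1, and 3 are identical, so the loss comes only from entry 2
loss = mse(predicted, target)
print(loss)
```

The gradient therefore pushes only the q-value of the chosen action toward its target, although the shared hidden layers mean the other outputs can still shift a little.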