Minecraft is the next frontier for Artificial Intelligence.

It's a huge game, with many mechanics and complex sequences of actions. It takes an entire wiki with over 8000 pages just to teach humans how to play Minecraft. So how good can artificial intelligence be?

This is the question we'll answer in this article. We'll design a bot and try to achieve one of the most difficult challenges in Minecraft: finding diamonds from scratch. To make things even worse, we will take on this challenge in randomly generated worlds so we can't learn a particular seed.

Sequence of actions to find diamonds

What we're gonna talk about is not limited to Minecraft. It can be applied to similar complex environments. More specifically, we will implement two different techniques that will become the backbone of our intelligent agent.

But before we can train an agent, we need to understand how to interact with the environment. Let's start with a scripted bot to get familiar with the syntax. We'll use MineRL, a fantastic library to build AI applications in Minecraft.

The code used in this article is available on Google Colab. It is a simplified and finetuned version of the excellent notebooks made by the organizers of the MineRL 2021 competition (MIT License).

!sudo add-apt-repository -y ppa:openjdk-r/ppa > /dev/null 2>&1
!sudo apt purge openjdk-* > /dev/null 2>&1
!sudo apt install openjdk-8-jdk xvfb xserver-xephyr vnc4server python-opengl ffmpeg > /dev/null 2>&1

# Install MineRL, the virtual display, and a video renderer
!pip install -q -U minerl pyvirtualdisplay colabgymrender

# RL environment
import gym
import minerl

# Visualization
from colabgymrender.recorder import Recorder
from pyvirtualdisplay import Display

# Others
import numpy as np
from tqdm.notebook import tqdm
import logging
logging.disable(logging.ERROR)

# Create virtual display
display = Display(visible=0, size=(400, 300))
display.start()

📜 I. Scripted bot

MineRL allows us to launch Minecraft in Python and interact with the game. This is done through the popular gym library.

env = gym.make('MineRLObtainDiamond-v0')
env = Recorder(env, './video', fps=60)
env.seed(21)
obs = env.reset()
env.release()
env.play()

We are in front of a tree. As you can see, the resolution is quite low. A low resolution means fewer pixels, which speeds things up. Fortunately for us, neural networks don't need a 4K resolution to understand what's happening on screen.

Now, we would like to interact with the game. What can our agent do? Here's the list of possible actions:

The first step to find diamonds is to get wood to make a crafting table and a wooden pickaxe.

Let's try to get closer to the tree. It means that we need to hold the "forward" button for less than a second. With MineRL, there are 20 actions processed per second: we don't need a full second, so let's repeat the action 5 times (a quarter of a second) and then wait for 40 more ticks (two seconds).

script = ['forward'] * 5 + [''] * 40

env = gym.make('MineRLObtainDiamond-v0')
env = Recorder(env, './video', fps=60)
env.seed(21)
obs = env.reset()

for action in script:
    # Start from a no-op action (dict with every action set to its neutral value)
    action_space = env.action_space.noop()

    # Activate the selected action; an empty string means "do nothing" for this tick
    if action:
        action_space[action] = 1

    # Step the environment with the chosen action
    obs, reward, done, _ = env.step(action_space)

env.release()
env.play()
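
To make the tick arithmetic above concrete, here is a minimal sketch that converts durations into numbers of ticks. It only assumes the rate of 20 actions per second mentioned above; seconds_to_ticks and ticks_to_seconds are hypothetical helper names, not part of MineRL.

```python
TICKS_PER_SECOND = 20  # rate quoted in the text: 20 actions processed per second


def seconds_to_ticks(seconds):
    """Number of actions to send in order to cover the given duration."""
    return round(seconds * TICKS_PER_SECOND)


def ticks_to_seconds(ticks):
    """In-game duration covered by a given number of actions."""
    return ticks / TICKS_PER_SECOND


# Same script as above: walk for a quarter of a second, then idle for two seconds
script = ['forward'] * seconds_to_ticks(0.25) + [''] * seconds_to_ticks(2.0)
print(len(script), ticks_to_seconds(len(script)))
```

Written this way, the script reads as "walk forward for a quarter of a second, then idle for two seconds", which is easier to tweak than raw tick counts.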

Great, let's chop this tree now. We need four actions in total:

  • Forward to go in front of the tree;
  • Attack to chop the tree;
  • Camera to look up or down;
  • Jump to get the final piece of wood.

Handling the camera can be a hassle. To simplify the syntax, we're gonna use the str_to_act function from this GitHub repository (MIT license). This is what the new script looks like:

script = []
script += [''] * 20 
script += ['forward'] * 5
script += ['attack'] * 61
script += ['camera:[-10,0]'] * 7  # Look up
script += ['attack'] * 240
script += ['jump']
script += ['forward'] * 10        # Jump forward
script += ['camera:[-10,0]'] * 2  # Look up
script += ['attack'] * 150
script += ['camera:[10,0]'] * 7   # Look down
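script += [''] * 40

import ast


def str_to_act(env, actions):
    # Start from a no-op action and switch on each action named in the string
    action_space = env.action_space.noop()
    for action in actions.split():
        if ':' in action:
            k, v = action.split(':')
            if k == 'camera':
                # Parse the "[-10,0]"-style list safely instead of calling eval()
                action_space[k] = ast.literal_eval(v)
            else:
                action_space[k] = v
        else:
            action_space[action] = 1
    return action_space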
    
env = gym.make('MineRLObtainDiamond-v0')
env = Recorder(env, './video', fps=60)
env.seed(21)
obs = env.reset()
 
for action in tqdm(script):
    obs, reward, done, _ = env.step(str_to_act(env, action))

env.release()
env.play()
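
If you want to check what a script entry turns into without launching Minecraft, the parsing logic can be tried on a plain dictionary. This is only a sketch: the noop function below is a hypothetical stand-in for env.action_space.noop() with a handful of keys, and parse_action mirrors str_to_act without needing an environment.

```python
import json


def noop():
    # Hypothetical stand-in for env.action_space.noop(): a few keys only
    return {'attack': 0, 'forward': 0, 'jump': 0, 'camera': [0, 0], 'craft': 'none'}


def parse_action(action_str):
    """Turn a string such as 'forward jump' or 'camera:[-10,0]' into an action dict."""
    act = noop()
    for token in action_str.split():
        if ':' in token:
            key, value = token.split(':', 1)
            # Camera values are lists of numbers; other values stay as strings
            act[key] = json.loads(value) if key == 'camera' else value
        else:
            act[token] = 1
    return act


print(parse_action('camera:[-10,0]'))
print(parse_action('forward jump'))
print(parse_action(''))  # an empty string leaves the no-op untouched
```

Note that several actions can be combined in one tick by separating them with spaces, and that an empty string is simply a tick where nothing happens.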

The agent efficiently chopped the entire tree. This is a good start, but we would like to do it in a more automated way...

🧠 II. Deep Learning

Our bot works well in a fixed environment, but what happens if we change the seed or its starting point?

Everything is scripted so the agent would probably try to chop a non-existent tree.

This approach is too static for our requirements: we need something that can adapt to new environments. Instead of scripting orders, we want an AI that knows how to chop trees. Naturally, reinforcement learning is a pertinent framework to train this agent. More specifically, deep RL seems to be the solution since we're processing images to select the best actions.

There are two ways of implementing it:

  • Pure deep RL: the agent is trained from scratch by interacting with the environment. It is rewarded every time it chops a tree.
  • Imitation learning: the agent learns how to chop trees from a dataset. In this case, it is a sequence of actions to chop trees made by a human.

The two approaches have the same outcome, but they're not equivalent. According to the authors of the MineRL 2021 competition, it takes 8 hours for the pure RL solution and 15 minutes for the imitation learning agent to reach the same level of performance.

We don't have that much time to spend, so we're going for the Imitation Learning solution. This technique is also called Behavior Cloning, which is the simplest form of imitation.

Note that Imitation Learning is not always more efficient than RL. If you want to know more about it, Kumar et al. wrote a great blog post about this topic.

The problem is reduced to a multi-class classification task. Our dataset consists of mp4 videos, so we'll use a Convolutional Neural Network (CNN) to translate these images into relevant actions. Our goal is also to limit the number of actions (classes) that can be taken so the CNN has fewer options, which means it'll be trained more efficiently.
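
To see why behavior cloning boils down to classification, consider a toy version with no images at all: for each state seen in the demonstrations, we simply predict the action the human chose most often. This is a sketch under strong assumptions (the state names and demonstrations are made up, and fit_tabular_bc is a hypothetical helper); the CNN plays the same role when the states are images that never repeat exactly.

```python
from collections import Counter, defaultdict


def fit_tabular_bc(demonstrations):
    """Map each state to the action most frequently taken by the demonstrator."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


# Made-up (state, action) pairs standing in for recorded human play
demos = [
    ('tree_ahead', 'attack'),
    ('tree_ahead', 'attack'),
    ('tree_ahead', 'forward'),
    ('tree_far', 'forward'),
    ('tree_far', 'forward'),
]

policy = fit_tabular_bc(demos)
print(policy)
```

The fewer distinct actions there are, the more demonstrations each class receives, which is the intuition behind limiting the action set.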

import torch
import torch.nn as nn


class CNN(nn.Module):
    def __init__(self, input_shape, output_dim):
        super().__init__()
        n_input_channels = input_shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, output_dim)
        )

    def forward(self, observations):
        return self.cnn(observations)

def dataset_action_batch_to_actions(dataset_actions, camera_margin=5):
    camera_actions = dataset_actions["camera"].squeeze()
    attack_actions = dataset_actions["attack"].squeeze()
    forward_actions = dataset_actions["forward"].squeeze()
    jump_actions = dataset_actions["jump"].squeeze()
    batch_size = len(camera_actions)
    actions = np.zeros((batch_size,), dtype=int)

    for i in range(len(camera_actions)):
        if camera_actions[i][0] < -camera_margin:
            actions[i] = 3
        elif camera_actions[i][0] > camera_margin:
            actions[i] = 4
        elif camera_actions[i][1] > camera_margin:
            actions[i] = 5
        elif camera_actions[i][1] < -camera_margin:
            actions[i] = 6
        elif forward_actions[i] == 1:
            if jump_actions[i] == 1:
                actions[i] = 2
            else:
                actions[i] = 1
        elif attack_actions[i] == 1:
            actions[i] = 0
        else:
            actions[i] = -1
    return actions

class ActionShaping(gym.ActionWrapper):
    def __init__(self, env, camera_angle=10):
        super().__init__(env)
        self.camera_angle = camera_angle
        self._actions = [
            [('attack', 1)],
            [('forward', 1)],
            [('jump', 1)],
            [('camera', [-self.camera_angle, 0])],
            [('camera', [self.camera_angle, 0])],
            [('camera', [0, self.camera_angle])],
            [('camera', [0, -self.camera_angle])],
        ]
        self.actions = []
        for actions in self._actions:
            act = self.env.action_space.noop()
            for a, v in actions:
                act[a] = v
            # Always attack, whatever the selected action is
            act['attack'] = 1
            self.actions.append(act)
        self.action_space = gym.spaces.Discrete(len(self.actions))

    def action(self, action):
        return self.actions[action]
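
The 1024 input features of the first linear layer in the CNN are not arbitrary: they follow from the size of the feature map after the three convolutions. Here is the arithmetic as a small sketch, assuming 64x64 input frames and no padding (the default for nn.Conv2d); conv_out is a hypothetical helper name.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output width/height of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


size = 64  # input frames are assumed to be 64x64 pixels
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:  # the three Conv2d layers
    size = conv_out(size, kernel, stride)
    print(size)

# The last convolution has 64 output channels
flat_features = 64 * size * size
print(flat_features)
```

If you change the input resolution or the kernel sizes, the first nn.Linear layer has to be resized accordingly.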

In this example, we manually define 7 relevant actions: attack, forward, jump, and move the camera (left, right, up, down). Another popular approach is to apply K-means in order to automatically retrieve the most relevant actions taken by humans. In any case, the objective is to discard the least useful actions to complete our objective, such as crafting in our example.
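
The labeling rule applies a priority order: camera movements win over forward, which wins over attack, and samples that match nothing get -1. The sketch below restates it for a single sample in plain Python so it can be tested in isolation; label_action is a hypothetical name, and the inputs are made-up values rather than real dataset entries.

```python
def label_action(camera, attack, forward, jump, camera_margin=5):
    """Return the discrete class (0 to 6) for one sample, or -1 if none applies."""
    first, second = camera  # same order as the camera array in the dataset
    if first < -camera_margin:
        return 3
    if first > camera_margin:
        return 4
    if second > camera_margin:
        return 5
    if second < -camera_margin:
        return 6
    if forward == 1:
        return 2 if jump == 1 else 1
    if attack == 1:
        return 0
    return -1


print(label_action((-10, 0), attack=1, forward=1, jump=0))  # camera has priority
print(label_action((0, 0), attack=1, forward=1, jump=1))    # forward + jump
print(label_action((2, -3), attack=0, forward=0, jump=0))   # below the margin
```

Small camera movements (within the margin) are treated as noise, which is why the last sample is discarded.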

Let's train our CNN on the MineRLTreechop-v0 dataset. Other datasets can be found at this address. We chose a learning rate of 0.0001 and 6 epochs with a batch size of 32.

%%time

# Get data
minerl.data.download(directory='data', environment='MineRLTreechop-v0')
data = minerl.data.make("MineRLTreechop-v0", data_dir='data', num_workers=2)

# Model
model = CNN((3, 64, 64), 7).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss()

# Training loop
step = 0
losses = []
for state, action, _, _, _ \
          in tqdm(data.batch_iter(num_epochs=6, batch_size=32, seq_len=1)):
    # Get pov observations
    obs = state['pov'].squeeze().astype(np.float32)
    # Transpose and normalize
    obs = obs.transpose(0, 3, 1, 2) / 255.0

    # Translate batch of actions for the ActionShaping wrapper
    actions = dataset_action_batch_to_actions(action)

    # Remove samples with no corresponding action
    mask = actions != -1
    obs = obs[mask]
    actions = actions[mask]

    # Update weights with backprop
    logits = model(torch.from_numpy(obs).float().cuda())
    loss = criterion(logits, torch.from_numpy(actions).long().cuda())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print loss
    step += 1
    losses.append(loss.item())
    if (step % 2000) == 0:
        mean_loss = sum(losses) / len(losses)
        tqdm.write(f'Step {step:>5} | Training loss = {mean_loss:.3f}')
        losses.clear()

torch.save(model.state_dict(), 'model.pth')
del data

Download: https://minerl.s3.amazonaws.com/v4/MineRLTreechop-v0.tar: 100%|██████████| 1511.0/1510.73792 [00:25<00:00, 58.46MB/s]
Step  2000 | Training loss = 0.901
Step  4000 | Training loss = 0.878
Step  6000 | Training loss = 0.836
Step  8000 | Training loss = 0.826
Step 10000 | Training loss = 0.828
Step 12000 | Training loss = 0.805
Step 14000 | Training loss = 0.804
Step 16000 | Training loss = 0.773
Step 18000 | Training loss = 0.791
Step 20000 | Training loss = 0.789
Step 22000 | Training loss = 0.789
Step 24000 | Training loss = 0.816
Step 26000 | Training loss = 0.785
Step 28000 | Training loss = 0.769
Step 30000 | Training loss = 0.789
Step 32000 | Training loss = 0.777
Step 34000 | Training loss = 0.763
Step 36000 | Training loss = 0.738
Step 38000 | Training loss = 0.744
Step 40000 | Training loss = 0.751
Step 42000 | Training loss = 0.763
Step 44000 | Training loss = 0.764
Step 46000 | Training loss = 0.744
Step 48000 | Training loss = 0.732
Step 50000 | Training loss = 0.740
Step 52000 | Training loss = 0.748
Step 54000 | Training loss = 0.678
Step 56000 | Training loss = 0.765
Step 58000 | Training loss = 0.727
Step 60000 | Training loss = 0.735
Step 62000 | Training loss = 0.707
Step 64000 | Training loss = 0.716
Step 66000 | Training loss = 0.718
Step 68000 | Training loss = 0.710
Step 70000 | Training loss = 0.692
Step 72000 | Training loss = 0.693
Step 74000 | Training loss = 0.687
Step 76000 | Training loss = 0.695
CPU times: user 15min 21s, sys: 55.3 s, total: 16min 16s
Wall time: 26min 46s
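
To put these loss values in perspective, it helps to compare them with a model that guesses uniformly among the 7 actions. The sketch below computes that reference value and converts a loss back into a probability. It is only a rough yardstick: human actions are not uniformly distributed, so a model that always predicts the most frequent class would already do better than the uniform guess.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy of a single prediction: minus the log-probability of the target."""
    return -math.log(probs[target])


n_actions = 7
uniform = [1 / n_actions] * n_actions
baseline = cross_entropy(uniform, 0)  # equals ln(7) whatever the target is
print(round(baseline, 3))

# A mean loss L corresponds to a geometric-mean probability exp(-L)
# assigned to the action chosen by the human
print(round(math.exp(-0.901), 3), round(math.exp(-0.695), 3))
```

In other words, going from the first logged loss (0.901) to the last one (0.695) means the model puts noticeably more probability mass on the human's choice, but it is still far from certain.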

Our model is trained. We can now instantiate an environment and see how it behaves. If the training was successful, it should frantically cut all the trees in sight.

This time, we'll use the ActionShaping wrapper to map the array of numbers created with dataset_action_batch_to_actions to discrete actions in MineRL.

Our model needs a pov observation in the correct format and outputs logits. These logits can be turned into a probability distribution over a set of 7 actions with the softmax function. We then randomly choose an action based on the probabilities. The selected action is implemented in MineRL thanks to env.step(action).

This process is repeated as many times as we want. Let's do it 1000 times and watch the result.
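
The softmax-and-sample step described above can be tried without MineRL or PyTorch. In this sketch, the logits are made-up numbers standing in for the output of the CNN, and softmax and sample_action are hypothetical helpers written with the standard library only.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Draw one action index according to the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0, -1.0, 0.0, 0.0]  # made-up scores for 7 actions
probs = softmax(logits)

counts = [0] * len(logits)
for _ in range(10000):
    counts[sample_action(logits, rng)] += 1
print(counts)
```

Sampling instead of always taking the most likely action keeps the agent from getting stuck: even if "attack" dominates, it will occasionally move or turn the camera.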

model = CNN((3, 64, 64), 7).cuda()
model.load_state_dict(torch.load('model.pth'))

env = gym.make('MineRLObtainDiamond-v0')
env1 = Recorder(env, './video', fps=60)
env = ActionShaping(env1)

action_list = np.arange(env.action_space.n)

obs = env.reset()

for step in tqdm(range(1000)):
    # Get input in the correct format
    obs = torch.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).cuda()
    # Turn logits into probabilities
    probabilities = torch.softmax(model(obs), dim=1)[0].detach().cpu().numpy()
    # Sample action according to the probabilities
    action = np.random.choice(action_list, p=probabilities)

    obs, reward, done, _ = env.step(action)
    # Stop early if the episode ends (e.g. the agent dies)
    if done:
        break

env1.release()
env1.play()

Our agent is quite chaotic but it manages to chop trees in this new, unseen environment. Now, how to find diamonds?

⛏️ III. Script + Imitation Learning

A simple yet powerful approach consists of combining scripted actions with artificial intelligence. Learn the boring stuff, script the knowledge.

In this paradigm, we'll use the CNN to get a healthy amount of wood (3000 steps). Then, we can script a sequence to craft planks, sticks, a crafting table, a wooden pickaxe, and start mining stone (it should be below our feet). This stone can then be used to craft a stone pickaxe, which can mine iron ore.
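
To get a feel for how much wood the scripted phase needs, here is some rough resource accounting. It is a sketch that assumes the standard Minecraft recipes (1 log gives 4 planks, 2 planks give 4 sticks, a crafting table costs 4 planks, a pickaxe costs 3 blocks and 2 sticks) and that MineRL's craft actions follow them; RECIPES and craft are hypothetical names, and the starting inventory is made up.

```python
# item: (ingredients consumed, quantity produced), assuming vanilla recipes
RECIPES = {
    'planks': ({'log': 1}, 4),
    'stick': ({'planks': 2}, 4),
    'crafting_table': ({'planks': 4}, 1),
    'wooden_pickaxe': ({'planks': 3, 'stick': 2}, 1),
    'stone_pickaxe': ({'cobblestone': 3, 'stick': 2}, 1),
}


def craft(inventory, item, times=1):
    """Apply a recipe in place, failing loudly if an ingredient is missing."""
    inputs, output = RECIPES[item]
    for _ in range(times):
        if any(inventory.get(k, 0) < n for k, n in inputs.items()):
            raise ValueError(f'not enough ingredients to craft {item}')
        for k, n in inputs.items():
            inventory[k] -= n
        inventory[item] = inventory.get(item, 0) + output
    return inventory


# Made-up starting point: 6 logs from the CNN phase, 3 cobblestone mined later
inventory = {'log': 6, 'cobblestone': 3}
craft(inventory, 'planks', times=6)
craft(inventory, 'stick', times=2)
craft(inventory, 'crafting_table', times=2)
craft(inventory, 'wooden_pickaxe')
craft(inventory, 'stone_pickaxe')
print(inventory)
```

Under these assumptions, 6 logs are more than enough for two crafting tables and both pickaxes, so a few thousand steps of tree chopping leaves a comfortable margin.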

This is when things get complicated: iron ore is quite rare, so we would need to run the game for a while to find a deposit. Then, we would have to craft a furnace and smelt the ore into iron ingots to craft an iron pickaxe. Finally, we would have to go even deeper and be even luckier to obtain a diamond without falling into lava.

As you can see, it's doable but the outcome is fairly random. We could train another agent to find diamonds, and even a third one to create the iron pickaxe. If you're interested in more complex approaches, you can read the results of the MineRL Diamond 2021 Competition by Kanervisto et al. It describes several solutions using different clever techniques, including end-to-end deep learning architectures. Nonetheless, it is a complex problem and no team managed to consistently find diamonds, if at all.

This is why we will limit ourselves to obtaining a stone pickaxe in the following example, but you can modify the code to go further.

script = []
script += ['craft:planks'] * 6
script += ['craft:stick'] * 2
script += ['craft:crafting_table'] * 2
script += ['camera:[10,0]'] * 18
script += ['attack'] * 20
script += [''] * 10
script += ['jump']
script += [''] * 5
script += ['place:crafting_table']
script += [''] * 10

# Craft a wooden pickaxe and equip it
script += ['camera:[-1,0]']
script += ['nearbyCraft:wooden_pickaxe']
script += ['camera:[1,0]']
script += [''] * 10
script += ['equip:wooden_pickaxe']
script += [''] * 10

# Dig stone
script += ['attack'] * 500

# Craft stone pickaxe
script += [''] * 10
script += ['jump']
script += [''] * 5
script += ['place:crafting_table']
script += [''] * 10
script += ['camera:[-1,0]']
script += ['nearbyCraft:stone_pickaxe']
script += ['camera:[1,0]']
script += [''] * 10
script += ['equip:stone_pickaxe']
script += [''] * 10

model = CNN((3, 64, 64), 7).cuda()
model.load_state_dict(torch.load('model.pth'))

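# Note the naming: env_cnn (the Recorder) takes raw MineRL actions and is used by the script,
# while env_script (the ActionShaping wrapper) takes discrete actions and is used by the CNN
env_script = gym.make('MineRLObtainDiamond-v0')
env_cnn = Recorder(env_script, './video', fps=60)
env_script = ActionShaping(env_cnn)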

action_list = np.arange(env_script.action_space.n)

for _ in range(10):
    obs = env_script.reset()
    done = False

    # 1. Get wood with the CNN
    for i in tqdm(range(3000)):
        obs = torch.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).cuda()
        probabilities = torch.softmax(model(obs), dim=1)[0].detach().cpu().numpy()
        action = np.random.choice(action_list, p=probabilities)
        obs, reward, done, _ = env_script.step(action)
        if done:
            break

    # 2. Craft stone pickaxe with scripted actions
    if not done:
        for action in tqdm(script):
            obs, reward, done, _ = env_cnn.step(str_to_act(env_cnn, action))
            if done:
                break

    print(obs["inventory"])
    env_cnn.release()
    env_cnn.play()

We can see our agent chopping wood like a madman during the first 3000 steps, then our script takes over and completes the task. It might not be obvious, but the command print(obs["inventory"]) shows a stone pickaxe. Note that this is a cherry-picked example: most of the runs don't end that well.

There are several reasons why the agent may fail: it can spawn in a hostile environment (water, lava, etc.), in an area without wood, or even fall and die. Playing with different seeds will give you a good understanding of the complexity of this problem and, hopefully, ideas to build even better agents.
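
The control flow used in this section (learned policy first, script second, stopping as soon as the episode ends) can be isolated and tested without Minecraft. This is a minimal sketch: ToyEnv is a hypothetical stand-in for a gym environment that only counts steps, and run_hybrid is a made-up helper name.

```python
class ToyEnv:
    """Hypothetical stand-in for a gym environment: the episode ends after max_steps."""

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.t = 0
        self.log = []

    def reset(self):
        self.t = 0
        self.log = []
        return {'t': 0}

    def step(self, action):
        self.t += 1
        self.log.append(action)
        done = self.t >= self.max_steps
        return {'t': self.t}, 0.0, done, {}


def run_hybrid(env, policy, policy_steps, script):
    """Run the policy for policy_steps steps, then the script, unless the episode ends."""
    obs = env.reset()
    done = False
    for _ in range(policy_steps):
        obs, _, done, _ = env.step(policy(obs))
        if done:
            return obs, done
    for action in script:
        obs, _, done, _ = env.step(action)
        if done:
            break
    return obs, done


env = ToyEnv(max_steps=100)
obs, done = run_hybrid(env, lambda obs: 'attack', 3, ['craft:planks', 'craft:stick'])
print(env.log, done)
```

Keeping this loop separate from the model and the script makes it easy to swap in another policy, or to chain a third phase, without touching the rest.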

Conclusion

I hope you enjoyed this little guide to reinforcement learning in Minecraft. Beyond its obvious popularity, Minecraft is an interesting environment to try and test RL agents. Like NetHack, it requires a thorough knowledge of its mechanics to plan precise sequences of actions in a procedurally-generated world. In this article,

  • We learned how to use MineRL;
  • We saw two approaches (script and behavior cloning) and how to combine them;
  • We visualized the agent's actions with short videos.

The main drawback of the environment is its slow processing time. Minecraft is not a lightweight game like NetHack or Pong, which is why the agents take a long time to be trained. If this is a problem for you, I would recommend lighter environments like Gym Retro.

Thank you for your attention! Feel free to follow me on Twitter if you're interested in AI applied to video games.