Create a Bot to Find Diamonds in Minecraft
Reinforcement Learning and Behavior Cloning in Python with MineRL

Minecraft is the next frontier for Artificial Intelligence.
It's a huge game, with many mechanics and complex sequences of actions. It takes an entire wiki with over 8,000 pages just to teach humans how to play Minecraft. So how good can artificial intelligence be?
This is the question we'll answer in this article. We'll design a bot and try to achieve one of the most difficult challenges in Minecraft: finding diamonds from scratch. To make things even harder, we will take on this challenge in randomly generated worlds, so we can't memorize a particular seed.

What we're gonna talk about is not limited to Minecraft: it can be applied to similarly complex environments. More specifically, we will implement two different techniques that will become the backbone of our intelligent agent.
But before we can train an agent, we need to understand how to interact with the environment. Let's start with a scripted bot to get familiar with the syntax. We'll use MineRL, a fantastic library to build AI applications in Minecraft.
The code used in this article is available on Google Colab. It is a simplified and finetuned version of the excellent notebooks made by the organizers of the MineRL 2021 competition (MIT License).
# Install OpenJDK 8 and the tools needed for the virtual display and video rendering
!sudo add-apt-repository -y ppa:openjdk-r/ppa > /dev/null 2>&1
!sudo apt purge openjdk-* > /dev/null 2>&1
!sudo apt install openjdk-8-jdk xvfb xserver-xephyr vnc4server python-opengl ffmpeg > /dev/null 2>&1
# Install MineRL, the virtual display, and a video renderer
!pip install -q -U minerl pyvirtualdisplay colabgymrender imageio==2.4.1
# RL environment
import gym
import minerl
# Visualization
from colabgymrender.recorder import Recorder
from pyvirtualdisplay import Display
# Others
import numpy as np
from tqdm.notebook import tqdm
import logging
logging.disable(logging.ERROR)
# Create virtual display
display = Display(visible=0, size=(400, 300))
display.start()
env = gym.make('MineRLObtainDiamond-v0')
env = Recorder(env, './video', fps=60)
env.seed(21)
obs = env.reset()
env.release()
env.play()

We are in front of a tree. As you can see, the resolution is quite low. A low resolution means fewer pixels, which speeds things up. Fortunately for us, neural networks don't need a 4K resolution to understand what's happening on screen.
Now, we would like to interact with the game. What can our agent do? Let's look at the list of possible actions.
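A simple way to see them is to ask the environment for a no-op action: a dictionary containing every available action set to its default value. The snippet below is just a small inspection sketch:

action_space = env.action_space.noop()
for name, value in action_space.items():
    print(f'{name}: {value}')
# The observation itself is a low-resolution RGB frame
print(obs['pov'].shape)  # (64, 64, 3)

In MineRLObtainDiamond-v0, these actions include movement (forward, back, left, right, jump, sneak, sprint), attack, the camera angles, and inventory-related actions such as craft, nearbyCraft, nearbySmelt, place, and equip.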

The first step towards finding diamonds is to gather wood to make a crafting table and a wooden pickaxe.
Let's try to get closer to the tree, which means holding the "forward" button for a short moment. MineRL processes 20 actions per second: we don't need a full second, so let's repeat the "forward" action 5 times and then wait for 40 more ticks.

script = ['forward'] * 5 + [''] * 40

env = gym.make('MineRLObtainDiamond-v0')
env = Recorder(env, './video', fps=60)
env.seed(21)
obs = env.reset()

for action in script:
    # Get the action space (dict of possible actions)
    action_space = env.action_space.noop()
    # Activate the selected action in the script
    action_space[action] = 1
    # Update the environment with the new action space
    obs, reward, done, _ = env.step(action_space)

env.release()
env.play()

Great, let's chop this tree now. We need four actions in total:
- Forward to go in front of the tree;
- Attack to chop the tree;
- Camera to look up or down;
- Jump to get the final piece of wood.

Handling the camera can be a hassle. To simplify the syntax, we're gonna use the str_to_act function from this GitHub repository (MIT license). This is what the new script looks like:
script = []
script += [''] * 20
script += ['forward'] * 5
script += ['attack'] * 61
script += ['camera:[-10,0]'] * 7  # Look up
script += ['attack'] * 240
script += ['jump']
script += ['forward'] * 10        # Jump forward
script += ['camera:[-10,0]'] * 2  # Look up
script += ['attack'] * 150
script += ['camera:[10,0]'] * 7   # Look down
script += [''] * 40

def str_to_act(env, actions):
    action_space = env.action_space.noop()
    for action in actions.split():
        if ':' in action:
            k, v = action.split(':')
            if k == 'camera':
                action_space[k] = eval(v)
            else:
                action_space[k] = v
        else:
            action_space[action] = 1
    return action_space

env = gym.make('MineRLObtainDiamond-v0')
env = Recorder(env, './video', fps=60)
env.seed(21)
obs = env.reset()

for action in tqdm(script):
    obs, reward, done, _ = env.step(str_to_act(env, action))

env.release()
env.play()
The agent efficiently chopped the entire tree. This is a good start, but we would like to do it in a more automated way...
🧠 II. Deep Learning
Our bot works well in a fixed environment, but what happens if we change the seed or its starting point?
Everything is scripted so the agent would probably try to chop a non-existent tree.
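A quick way to see the problem is to replay the exact same script in a different world. In the sketch below, the seed value is an arbitrary choice; with it, the tree is most likely not where the script expects it anymore, so the agent walks forward and swings at thin air.

# Replay the scripted sequence with a different, arbitrary seed
env = gym.make('MineRLObtainDiamond-v0')
env = Recorder(env, './video', fps=60)
env.seed(42)
obs = env.reset()

for action in tqdm(script):
    obs, reward, done, _ = env.step(str_to_act(env, action))

env.release()
env.play()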
This approach is too static for our requirements: we need something that can adapt to new environments. Instead of scripting orders, we want an AI that knows how to chop trees. Naturally, reinforcement learning is a pertinent framework to train this agent. More specifically, deep RL seems to be the solution since we're processing images to select the best actions.
There are two ways of implementing it:
- Pure deep RL: the agent is trained from scratch by interacting with the environment. It is rewarded every time it chops a tree.
- Imitation learning: the agent learns how to chop trees from a dataset. In this case, it is a sequence of actions to chop trees made by a human.
The two approaches have the same outcome, but they're not equivalent. According to the authors of the MineRL 2021 competition, it takes 8 hours for the pure RL solution and 15 minutes for the imitation learning agent to reach the same level of performance.
We don't have that much time to spend, so we're going for the Imitation Learning solution. This technique is also called Behavior Cloning, which is the simplest form of imitation.
Note that Imitation Learning is not always more efficient than RL. If you want to know more about it, Kumar et al. wrote a great blog post about this topic.

The problem is reduced to a multi-class classification task. Our dataset consists of mp4 videos, so we'll use a Convolutional Neural Network (CNN) to translate these images into relevant actions. Our goal is also to limit the number of actions (classes) that can be taken so the CNN has fewer options, which means it'll be trained more efficiently.
import torch
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, input_shape, output_dim):
        super().__init__()
        n_input_channels = input_shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, output_dim)
        )

    def forward(self, observations):
        return self.cnn(observations)
def dataset_action_batch_to_actions(dataset_actions, camera_margin=5):
    camera_actions = dataset_actions["camera"].squeeze()
    attack_actions = dataset_actions["attack"].squeeze()
    forward_actions = dataset_actions["forward"].squeeze()
    jump_actions = dataset_actions["jump"].squeeze()
    batch_size = len(camera_actions)
    actions = np.zeros((batch_size,), dtype=int)

    for i in range(len(camera_actions)):
        # Moving the camera has priority over the other actions
        if camera_actions[i][0] < -camera_margin:
            actions[i] = 3
        elif camera_actions[i][0] > camera_margin:
            actions[i] = 4
        elif camera_actions[i][1] > camera_margin:
            actions[i] = 5
        elif camera_actions[i][1] < -camera_margin:
            actions[i] = 6
        elif forward_actions[i] == 1:
            if jump_actions[i] == 1:
                actions[i] = 2
            else:
                actions[i] = 1
        elif attack_actions[i] == 1:
            actions[i] = 0
        else:
            # No relevant action in this sample, it will be discarded
            actions[i] = -1
    return actions
class ActionShaping(gym.ActionWrapper):
    def __init__(self, env, camera_angle=10):
        super().__init__(env)
        self.camera_angle = camera_angle
        self._actions = [
            [('attack', 1)],
            [('forward', 1)],
            [('jump', 1)],
            [('camera', [-self.camera_angle, 0])],
            [('camera', [self.camera_angle, 0])],
            [('camera', [0, self.camera_angle])],
            [('camera', [0, -self.camera_angle])],
        ]
        self.actions = []
        for actions in self._actions:
            act = self.env.action_space.noop()
            for a, v in actions:
                act[a] = v
            # Always attack so the agent keeps chopping while moving
            act['attack'] = 1
            self.actions.append(act)
        self.action_space = gym.spaces.Discrete(len(self.actions))

    def action(self, action):
        return self.actions[action]
In this example, we manually define 7 relevant actions: attack, forward, jump, and move the camera (left, right, up, down). Another popular approach is to apply K-means in order to automatically retrieve the most relevant actions taken by humans. In any case, the objective is to discard the least useful actions to complete our objective, such as crafting in our example.
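For reference, here is a minimal sketch of the K-means alternative (we don't use it in the rest of the article). It flattens each recorded human action into a small vector, camera angles plus three buttons, an arbitrary encoding chosen for illustration, and clusters these vectors into 7 centroids that could play the role of our hand-defined actions.

from sklearn.cluster import KMeans

def kmeans_action_candidates(data, n_clusters=7, max_samples=20000):
    # Flatten each human action into a vector: camera pitch/yaw + 3 binary buttons
    vectors = []
    n_samples = 0
    for _, action, _, _, _ in data.batch_iter(num_epochs=1, batch_size=32, seq_len=1):
        camera = action['camera'].squeeze(1)                                  # (batch, 2)
        buttons = np.stack([action[k].squeeze(1)
                            for k in ('attack', 'forward', 'jump')], axis=1)  # (batch, 3)
        vectors.append(np.concatenate([camera, buttons], axis=1).astype(np.float32))
        n_samples += len(camera)
        if n_samples >= max_samples:
            break
    vectors = np.concatenate(vectors)[:max_samples]
    # Each centroid is a candidate discrete action
    return KMeans(n_clusters=n_clusters, n_init=10).fit(vectors).cluster_centers_

In practice, each centroid would still have to be mapped back to a valid MineRL action dictionary (for instance by thresholding the button values), which is one reason we stick with the 7 hand-defined actions here.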
Let's train our CNN on the MineRLTreechop-v0 dataset. Other datasets can be found at this address. We chose a learning rate of 0.0001 and 6 epochs with a batch size of 32.
%%time

# Get data
minerl.data.download(directory='data', environment='MineRLTreechop-v0')
data = minerl.data.make("MineRLTreechop-v0", data_dir='data', num_workers=2)

# Model
model = CNN((3, 64, 64), 7).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss()

# Training loop
step = 0
losses = []
for state, action, _, _, _ \
        in tqdm(data.batch_iter(num_epochs=6, batch_size=32, seq_len=1)):
    # Get pov observations
    obs = state['pov'].squeeze().astype(np.float32)
    # Transpose and normalize
    obs = obs.transpose(0, 3, 1, 2) / 255.0

    # Translate batch of actions for the ActionShaping wrapper
    actions = dataset_action_batch_to_actions(action)

    # Remove samples with no corresponding action
    mask = actions != -1
    obs = obs[mask]
    actions = actions[mask]

    # Update weights with backprop
    logits = model(torch.from_numpy(obs).float().cuda())
    loss = criterion(logits, torch.from_numpy(actions).long().cuda())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print loss
    step += 1
    losses.append(loss.item())
    if (step % 2000) == 0:
        mean_loss = sum(losses) / len(losses)
        tqdm.write(f'Step {step:>5} | Training loss = {mean_loss:.3f}')
        losses.clear()

torch.save(model.state_dict(), 'model.pth')
del data
Our model is trained. We can now instantiate an environment and see how it behaves. If the training was successful, it should frantically cut all the trees in sight.
This time, we'll use the ActionShaping wrapper to map the array of numbers created with dataset_action_batch_to_actions to discrete actions in MineRL.
Our model needs a pov observation in the correct format and outputs logits. These logits can be turned into a probability distribution over a set of 7 actions with the softmax function. We then randomly choose an action based on the probabilities. The selected action is implemented in MineRL thanks to env.step(action).
This process is repeated as many times as we want. Let's do it 1000 times and watch the result.
model = CNN((3, 64, 64), 7).cuda()
model.load_state_dict(torch.load('model.pth'))

env = gym.make('MineRLObtainDiamond-v0')
env1 = Recorder(env, './video', fps=60)
env = ActionShaping(env1)

action_list = np.arange(env.action_space.n)
obs = env.reset()

for step in tqdm(range(1000)):
    # Get input in the correct format
    obs = torch.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).cuda()
    # Turn logits into probabilities
    probabilities = torch.softmax(model(obs), dim=1)[0].detach().cpu().numpy()
    # Sample action according to the probabilities
    action = np.random.choice(action_list, p=probabilities)
    obs, reward, _, _ = env.step(action)

env1.release()
env1.play()
Our agent is quite chaotic, but it manages to chop trees in this new, unseen environment. Now, how do we find diamonds?
⛏️ III. Script + Imitation Learning
A simple yet powerful approach consists of combining scripted actions with artificial intelligence. Learn the boring stuff, script the knowledge.
In this paradigm, we'll use the CNN to get a healthy amount of wood (3000 steps). Then, we can script a sequence to craft planks, sticks, a crafting table, a wooden pickaxe, and start mining stone (it should be below our feet). This stone can then be used to craft a stone pickaxe, which can mine iron ore.

This is when things get complicated: iron ore is quite rare, so we would need to run the game for a while to find a deposit. Then, we would have to craft a furnace and smelt the ore to get an iron pickaxe. Finally, we would have to go even deeper and be even luckier to obtain a diamond without falling into lava.
As you can see, it's doable but the outcome is fairly random. We could train another agent to find diamonds, and even a third one to create the iron pickaxe. If you're interested in more complex approaches, you can read the results of the MineRL Diamond 2021 Competition by Kanervisto et al. It describes several solutions using different clever techniques, including end-to-end deep learning architectures. Nonetheless, it is a complex problem and no team managed to consistently find diamonds, if at all.
This is why we will limit ourselves to obtaining a stone pickaxe in the following example, but you can modify the code to go further.
script = []
# Craft planks, sticks, and a crafting table from the collected wood
script += ['craft:planks'] * 6
script += ['craft:stick'] * 2
script += ['craft:crafting_table'] * 2
# Look down, dig, and place the crafting table
script += ['camera:[10,0]'] * 18
script += ['attack'] * 20
script += [''] * 10
script += ['jump']
script += [''] * 5
script += ['place:crafting_table']
script += [''] * 10
# Craft a wooden pickaxe and equip it
script += ['camera:[-1,0]']
script += ['nearbyCraft:wooden_pickaxe']
script += ['camera:[1,0]']
script += [''] * 10
script += ['equip:wooden_pickaxe']
script += [''] * 10
# Dig stone
script += ['attack'] * 500
# Craft stone pickaxe
script += [''] * 10
script += ['jump']
script += [''] * 5
script += ['place:crafting_table']
script += [''] * 10
script += ['camera:[-1,0]']
script += ['nearbyCraft:stone_pickaxe']
script += ['camera:[1,0]']
script += [''] * 10
script += ['equip:stone_pickaxe']
script += [''] * 10
model = CNN((3, 64, 64), 7).cuda()
model.load_state_dict(torch.load('model.pth'))

env_script = gym.make('MineRLObtainDiamond-v0')
env_cnn = Recorder(env_script, './video', fps=60)
env_script = ActionShaping(env_cnn)

action_list = np.arange(env_script.action_space.n)

for _ in range(10):
    obs = env_script.reset()
    done = False

    # 1. Get wood with the CNN
    for i in tqdm(range(3000)):
        obs = torch.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).cuda()
        probabilities = torch.softmax(model(obs), dim=1)[0].detach().cpu().numpy()
        action = np.random.choice(action_list, p=probabilities)
        obs, reward, done, _ = env_script.step(action)
        if done:
            break

    # 2. Craft stone pickaxe with scripted actions
    if not done:
        for action in tqdm(script):
            obs, reward, done, _ = env_cnn.step(str_to_act(env_cnn, action))
            if done:
                break

    print(obs["inventory"])

env_cnn.release()
env_cnn.play()
We can see our agent chopping wood like a madman during the first 3000 steps, then our script takes over and completes the task. It might not be obvious, but the command print(obs["inventory"]) shows a stone pickaxe. Note that this is a cherry-picked example: most of the runs don't end that well.
There are several reasons why the agent may fail: it can spawn in a hostile environment (water, lava, etc.), in an area without wood, or even fall and die. Playing with different seeds will give you a good understanding of the complexity of this problem and, hopefully, ideas to build even better agents.
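If you want to push past the stone pickaxe, as mentioned earlier, one option is to extend the script towards an iron pickaxe. The snippet below is only a hypothetical, untested sketch: the action names (nearbyCraft, nearbySmelt, place, equip) exist in MineRLObtainDiamond-v0, but the tick counts are guesses, and it assumes iron ore and fuel are already in the inventory and a crafting table is still nearby, which is the hard part.

# Hypothetical extension (untested sketch): assumes iron ore and fuel were already collected
script_iron = []
script_iron += ['nearbyCraft:furnace']       # requires cobblestone and a nearby crafting table
script_iron += [''] * 10
script_iron += ['place:furnace']
script_iron += [''] * 10
script_iron += ['nearbySmelt:iron_ingot'] * 3
script_iron += [''] * 10
script_iron += ['nearbyCraft:iron_pickaxe']
script_iron += [''] * 10
script_iron += ['equip:iron_pickaxe']
script_iron += [''] * 10

You could append these actions to the main script and add another learned policy (or a lot of scripted digging) to locate the iron ore in the first place.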
Conclusion
I hope you enjoyed this little guide to reinforcement learning in Minecraft. Beyond its obvious popularity, Minecraft is an interesting environment to try and test RL agents. Like NetHack, it requires a thorough knowledge of its mechanics to plan precise sequences of actions in a procedurally-generated world. In this article,
- We learned how to use MineRL;
- We saw two approaches (script and behavior cloning) and how to combine them;
- We visualized the agent's actions with short videos.
The main drawback of the environment is its slow processing time. Minecraft is not a lightweight game like NetHack or Pong, which is why the agents take a long time to be trained. If this is a problem for you, I would recommend lighter environments like Gym Retro.
Thank you for your attention! Feel free to follow me on Twitter if you're interested in AI applied to video games.