A Beginner’s Guide to LLM Fine-Tuning

How to fine-tune Llama and other LLMs with one tool

Large Language Models

Author: Maxime Labonne

Published: August 27, 2023


The growing interest in Large Language Models (LLMs) has led to a surge in tools and wrappers designed to streamline their training process.

Popular options include FastChat from LMSYS (used to train Vicuna) and Hugging Face’s transformers/trl libraries (used in my previous article). In addition, each big LLM project, like WizardLM, tends to have its own training script, inspired by the original Alpaca implementation.

In this article, we will use Axolotl, a tool created by the OpenAccess AI Collective. We will use it to fine-tune a Code Llama 7b model on an evol-instruct dataset comprising 1,000 samples of Python code.

🤔 Why Axolotl?

The main appeal of Axolotl is that it provides a one-stop solution, which includes numerous features, model architectures, and an active community. Here’s a quick list of my favorite things about it:

  • Configuration: All parameters used to train an LLM are neatly stored in a yaml config file. This makes it convenient for sharing and reproducing models. You can see an example for Llama 2 here.

  • Dataset Flexibility: Axolotl allows the specification of multiple datasets with varied prompt formats such as alpaca ({"instruction": "...", "input": "...", "output": "..."}), sharegpt:chat ({"conversations": [{"from": "...", "value": "..."}]}), and raw completion ({"text": "..."}) (see the example records after this list). Combining datasets is seamless, and the hassle of unifying the prompt format is eliminated.

  • Features: Axolotl is packed with SOTA techniques such as FSDP, deepspeed, LoRA, QLoRA, ReLoRA, sample packing, GPTQ, FlashAttention, xformers, and rope scaling.

  • Utilities: Numerous user-friendly utilities are integrated, such as adding or altering special tokens, or setting up a custom wandb configuration.
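To make the dataset formats mentioned above concrete, here are illustrative records in Python dict form (the contents are invented for this example, not taken from a real dataset):

# Invented example records for the three prompt formats mentioned above
alpaca_sample = {
    "instruction": "Write a function that reverses a string.",
    "input": "",
    "output": "def reverse(s):\n    return s[::-1]",
}

sharegpt_sample = {
    "conversations": [
        {"from": "human", "value": "What does the yield keyword do in Python?"},
        {"from": "gpt", "value": "It turns a function into a generator that produces values lazily."},
    ]
}

completion_sample = {
    "text": "def is_even(n):\n    return n % 2 == 0"
}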

Some well-known models trained using this tool are Manticore-13b from the OpenAccess AI Collective and Samantha-1.11-70b from Eric Hartford. Like other wrappers, it is built on top of the transformers library and uses many of its features.

⚙️ Create your own config file

Before anything, we need a configuration file. You can reuse an existing configuration from the examples folder. In our case, we will tweak the QLoRA config for Llama 2 to create our own Code Llama model. The model will be trained on a subset of 1,000 Python samples from the nickrosh/Evol-Instruct-Code-80k-v1 dataset.

First, we must change the base_model and base_model_config fields to “codellama/CodeLlama-7b-hf”. To push our trained adapter to the Hugging Face Hub, let’s add a new field hub_model_id, which corresponds to the name of our model, “EvolCodeLlama-7b”. Now, we have to update the dataset to mlabonne/Evol-Instruct-Python-1k and set type to “alpaca”.

There’s no sample bigger than 2048 tokens in this dataset, so we can reduce the sequence_len to 2048 and save some VRAM. Speaking of VRAM, we’re going to use a micro_batch_size of 10 and a gradient_accumulation_steps of 1 to maximize its use. In practice, you try different values until you use >95% of the available VRAM.

For convenience, I’m going to add the name “axolotl” to the wandb_project field so it’s easier to track on my account. I’m also setting the warmup_steps to 100 (personal preference) and the eval_steps to 0.01 so we’ll end up with 100 evaluations.
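As a quick sanity check on these numbers, here is a small back-of-the-envelope sketch in Python (it assumes the single-GPU setup used in this article):

# Samples processed per optimizer step
micro_batch_size = 10
gradient_accumulation_steps = 1
num_gpus = 1
effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 10

# An eval_steps value below 1 is treated as a fraction of the total training steps
eval_steps = 0.01
print(round(1 / eval_steps))  # ~100 evaluations over the whole run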

Here’s how the final config file should look:

base_model: codellama/CodeLlama-7b-hf
base_model_config: codellama/CodeLlama-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true
hub_model_id: EvolCodeLlama-7b

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: mlabonne/Evol-Instruct-Python-1k
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.02
output_dir: ./qlora-out

adapter: qlora
lora_model_dir:

sequence_len: 2048
sample_packing: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 10
num_epochs: 3
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
eval_steps: 0.01
save_strategy: epoch
save_steps:
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

You can also find this config file here as a GitHub gist.

Before we start training our model, I want to introduce a few parameters that are important to understand:

  • QLoRA: We’re using QLoRA for fine-tuning, which is why we’re loading the base model in 4-bit precision (NF4 format). You can check this article from Benjamin Marie to learn more about QLoRA (a rough sketch of what this 4-bit loading looks like in plain transformers follows below).

  • Gradient checkpointing: It lowers the VRAM requirements by discarding some activations during the forward pass and re-computing them on demand during the backward pass. It also slows down training by about 20%, according to Hugging Face’s documentation.

  • FlashAttention: This implements the FlashAttention mechanism, which improves the speed and memory efficiency of our model thanks to a clever fusion of GPU operations (learn more about it in this article from Aleksa Gordić).

  • Sample packing: A smart way of creating batches with as little padding as possible, by reorganizing the order of the samples (a bin packing problem). As a result, we need fewer batches to train the model on the same dataset. It was inspired by the Multipack Sampler (see my note) and Krell et al. A minimal sketch of the packing idea follows this list.
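Here is that sketch in Python. It illustrates the general idea, not Axolotl’s actual implementation: a greedy first-fit-decreasing pass that groups tokenized samples into sequences of at most sequence_len tokens.

# Minimal sketch of sample packing (not Axolotl's implementation): greedily pack
# tokenized samples into sequences of at most `sequence_len` tokens to minimize padding.
from typing import List

def pack_samples(sample_lengths: List[int], sequence_len: int = 2048) -> List[List[int]]:
    """First-fit-decreasing bin packing; returns the sample indices of each packed sequence."""
    order = sorted(range(len(sample_lengths)), key=lambda i: sample_lengths[i], reverse=True)
    bins: List[List[int]] = []   # sample indices per packed sequence
    space: List[int] = []        # remaining token capacity of each packed sequence

    for i in order:
        length = sample_lengths[i]
        for b, free in enumerate(space):
            if length <= free:
                bins[b].append(i)
                space[b] -= length
                break
        else:  # no existing sequence has room: start a new one
            bins.append([i])
            space.append(sequence_len - length)
    return bins

# Six samples of various lengths fit into three packed sequences instead of six padded ones
print(pack_samples([1800, 900, 700, 400, 300, 100]))  # [[0, 5], [1, 2, 3], [4]]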

You can find FlashAttention in some other tools, but sample packing is relatively new. As far as I know, OpenChat was the first project to use sample packing during fine-tuning. Thanks to Axolotl, we’ll use these techniques for free.
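For reference, here is roughly what the 4-bit NF4 loading mentioned in the QLoRA bullet looks like when done by hand with transformers and bitsandbytes. This is a hedged sketch of the general technique, not the code Axolotl actually runs:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization, mirroring load_in_4bit: true and bf16: true in the YAML config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

The LoRA adapter defined in the config (lora_r, lora_alpha, etc.) is then trained on top of this frozen, quantized base model.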

🦙 Fine-tune Code Llama

With the config file ready, it’s time to get our hands dirty with the actual fine-tuning. You might consider running the training on a Colab notebook. However, for those without access to a high-performance GPU, a more cost-effective option is to rent cloud-based GPU services, like AWS, Lambda Labs, Vast.ai, Banana, or RunPod.

Personally, I use RunPod, which is a popular option in the fine-tuning community. It’s not the cheapest service, but it offers a good tradeoff between price and a clean UI. You can easily replicate the following steps using your favorite service.

When your RunPod account is set up, go to Manage > Templates and click on “New Template”. Here is a simple template:

Let’s review the different fields and their corresponding values:

  • Template Name: Axolotl (you can choose whatever you want)

  • Container Image: winglian/axolotl-runpod:main-py3.10-cu118-2.0.1

  • Container Disk: 100 GB

  • Volume Disk: 0 GB

  • Volume Mount Path: /workspace

In addition, there are two handy environment variables you can include:

  • HUGGING_FACE_HUB_TOKEN: you can find your token on this page (requires an account)

  • WANDB_API_KEY: you can find your key on this page (requires an account)

Alternatively, you can simply log in from the terminal later (using huggingface-cli login and wandb login). Once you’re set up, go to Community Cloud and deploy an RTX 3090. Here, you can search for the name of your template and select it as follows:

You can click on “Continue” and RunPod will deploy your template. You can see the installation in your pod’s logs (Manage > Pods). When the option becomes available, click on “Connect”. Here, click on “Start Web Terminal” and then “Connect to Web Terminal”. You are now connected to your pod!

The following steps are the same no matter what service you choose:

  1. We install Axolotl and the PEFT library as follows:
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl

pip3 install -e .[flash-attn]
pip3 install -U git+https://github.com/huggingface/peft.git
  2. Download the config file we created:
wget https://gist.githubusercontent.com/mlabonne/8055f6335e2b85f082c8c75561321a66/raw/93915a9563fcfff8df9a81fc0cdbf63894465922/EvolCodeLlama-7b.yaml
  3. You can now start fine-tuning the model with the following command:
accelerate launch scripts/finetune.py EvolCodeLlama-7b.yaml

If everything is configured correctly, you should be able to train the model in a little more than one hour (it took me 1h 11m 44s). If you check the GPU memory used, you’ll see almost 100% with this config, which means we’re optimizing it pretty nicely. If you’re using a GPU with more VRAM (like an A100), you can increase the micro-batch size to make sure you’re fully using it.
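If you want to check VRAM usage yourself while tuning micro_batch_size, a quick way is to run nvidia-smi, or these few lines of Python in a separate shell on the pod (a sketch; it reports device-wide memory, including what the training process is using):

import torch

free, total = torch.cuda.mem_get_info()  # device-wide free and total memory, in bytes
used = total - free
print(f"VRAM used: {used / 1e9:.1f} / {total / 1e9:.1f} GB ({used / total:.0%})")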

In the meantime, feel free to close the web terminal and check your loss on Weights & Biases. We’re using tmux so the training won’t stop if you close the terminal. Here are my loss curves:

We see a steady improvement in the eval loss, which is a good sign. However, you can also spot drops in the eval loss that are not correlated with a decrease in the quality of the outputs. The best way to evaluate your model is simply by using it: you can run it in the terminal with the command accelerate launch scripts/finetune.py EvolCodeLlama-7b.yaml --inference --lora_model_dir="./qlora-out".
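If you prefer to test the adapter outside Axolotl, something along these lines works with transformers and peft. This is a sketch assuming the adapter was saved to ./qlora-out, as in our config:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./qlora-out")  # attach the QLoRA adapter
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

# For best results, wrap the prompt in the same Alpaca template used during training
prompt = "Write a Python function that checks whether a number is prime."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))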

The QLoRA adapter should already be uploaded to the Hugging Face Hub. However, you can also merge the base Code Llama model with this adapter and push the merged model there by following these steps:

  1. Download this script:
wget https://gist.githubusercontent.com/mlabonne/a3542b0519708b8871d0703c938bba9f/raw/60abc5afc07f9d843bc23d56f4e0b7ab072c4a62/merge_peft.py
  2. Execute it with this command:
python merge_peft.py --base_model=codellama/CodeLlama-7b-hf --peft_model=./qlora-out --hub_id=EvolCodeLlama-7b
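For reference, here is a rough sketch of what a merge script like this typically does with peft; it approximates the behavior, it is not the exact contents of merge_peft.py:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "./qlora-out")
merged = model.merge_and_unload()  # fold the LoRA weights into the base model

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
merged.push_to_hub("EvolCodeLlama-7b")     # requires a valid Hugging Face token
tokenizer.push_to_hub("EvolCodeLlama-7b")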

Congratulations, you should have your own EvolCodeLlama-7b on the Hugging Face Hub at this point! For reference, you can access my own model trained with this process here: mlabonne/EvolCodeLlama-7b

Considering that our EvolCodeLlama-7b is a code LLM, it would be interesting to compare its performance with other models on standard benchmarks, such as HumanEval and MBPP. For reference, you can find a leaderboard at the following address: Multilingual Code Evals.
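If you want a feel for how these benchmarks score generated code, the pass@k metric behind HumanEval can be computed with the evaluate library, as in this toy sketch (code execution must be explicitly enabled because the metric runs the generated code):

import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # opt in: code_eval executes the candidate code

import evaluate

code_eval = evaluate.load("code_eval")
test_cases = ["assert add(2, 3) == 5"]
candidates = [["def add(a, b):\n    return a + b"]]

pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1])
print(pass_at_k)  # {'pass@1': 1.0}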

If you’re happy with this model, you can quantize it with GGML for local inference with this free Google Colab notebook. You can also fine-tune bigger models (e.g., 70b parameters) thanks to deepspeed, which only requires an additional config file.

Conclusion

In this article, we’ve covered the essentials of how to efficiently fine-tune LLMs. We customized parameters to train our Code Llama model on a small Python dataset. Finally, we merged the weights and uploaded the result to Hugging Face.

I hope you found this guide useful. I recommend using Axolotl with a cloud-based GPU service to get some experience and upload a few models to Hugging Face. Build your own datasets, play with the parameters, and break stuff along the way. As with every wrapper, don’t hesitate to check the source code to get a good intuition of what it’s actually doing. It will massively help in the long run.

Thanks to the OpenAccess AI Collective and all the contributors!