Quantize Llama models with GGUF and llama.cpp

GGML vs. GPTQ vs. NF4

Large Language Models
Author

Maxime Labonne

Published

September 3, 2023


Due to the massive size of Large Language Models (LLMs), quantization has become an essential technique to run them efficiently. By reducing the precision of their weights, you can save memory and speed up inference while preserving most of the model’s performance. Recently, 8-bit and 4-bit quantization unlocked the possibility of running LLMs on consumer hardware. Coupled with the release of Llama models and parameter-efficient techniques to fine-tune them (LoRA, QLoRA), this created a rich ecosystem of local LLMs that are now competing with OpenAI’s GPT-3.5 and GPT-4.

Currently, there are three main quantization techniques: NF4, GPTQ, and GGML. NF4 is a static method used by QLoRA to load a model in 4-bit precision to perform fine-tuning. In a previous article, we explored the GPTQ method and quantized our own model to run it on a consumer GPU. In this article, we will introduce the GGML technique, see how to quantize Llama models, and provide tips and tricks to achieve the best results.

You can find the code on Google Colab and GitHub.

What is GGML?

GGML is a C library focused on machine learning. It was created by Georgi Gerganov, which is where the initials “GG” come from. This library not only provides foundational elements for machine learning, such as tensors, but also a unique binary format to distribute LLMs.

This format recently changed to GGUF. This new format is designed to be extensible, so that new features shouldn’t break compatibility with existing models. It also centralizes all the metadata in one file, such as special tokens, RoPE scaling parameters, etc. In short, it answers a few historical pain points and should be future-proof. For more information, you can read the specification at this address. In the rest of the article, we will use “GGML models” to refer to all models that use either GGUF or a previous format.
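To make the “single file with metadata” idea concrete, here is a minimal sketch that inspects a GGUF file with the gguf Python package (this assumes the package is installed via pip install gguf; the file path is a placeholder):

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("path/to/model.gguf")  # placeholder path

# All metadata (architecture, context length, special tokens, RoPE
# scaling parameters, etc.) lives in the same file as the weights.
for key in reader.fields:
    print(key)

# Each tensor also records its own quantization type.
for tensor in reader.tensors:
    print(tensor.name, tensor.tensor_type, tensor.shape)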

GGML was designed to be used in conjunction with the llama.cpp library, also created by Georgi Gerganov. The library is written in C/C++ for efficient inference of Llama models. It can load GGML models and run them on a CPU. Originally, this was the main difference with GPTQ models, which are loaded and run on a GPU. However, you can now offload some layers of your LLM to the GPU with llama.cpp. To give you an example, there are 35 layers for a 7b parameter model. This drastically speeds up inference and allows you to run LLMs that don’t fit in your VRAM.
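As a sketch of what this GPU offloading looks like from Python, here is how you could do it with the llama-cpp-python bindings (assuming they are installed with GPU support; the model path is a placeholder):

from llama_cpp import Llama  # pip install llama-cpp-python

# n_gpu_layers=35 offloads every layer of a 7b model to the GPU;
# lower this value (or set it to 0) if the model doesn't fit in your VRAM.
llm = Llama(model_path="models/llama-2-7b.Q4_K_M.gguf", n_gpu_layers=35)

output = llm("Write a one-line docstring for a Fibonacci function.", max_tokens=64)
print(output["choices"][0]["text"])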

If command-line tools aren’t your thing, llama.cpp and GGUF support have been integrated into many GUIs and libraries, like oobabooga’s text-generation-webui, koboldcpp, LM Studio, or ctransformers. You can simply load your GGML models with these tools and interact with them in a ChatGPT-like way. Fortunately, many quantized models are directly available on the Hugging Face Hub. You’ll quickly notice that most of them are quantized by TheBloke, a popular figure in the LLM community.
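If you would rather reuse one of these pre-quantized models than build your own, a minimal sketch with huggingface_hub looks like this (the filename is an example; check the repo’s file list for the exact name):

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Download a single quantized file instead of cloning the whole repo
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-13B-chat-GGML",
    filename="llama-2-13b-chat.ggmlv3.q4_K_M.bin",  # example filename
)
print(model_path)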

In the next section, we will see how to quantize our own models and run them on a consumer GPU.

How to quantize LLMs with GGML?

Let’s look at the files inside the TheBloke/Llama-2-13B-chat-GGML repo. We can see 14 different GGML models, corresponding to different types of quantization. They follow a particular naming convention: “q”, the number of bits used to store the weights (precision), and a particular variant. Here is a list of all the possible quant methods and their corresponding use cases, based on model cards made by TheBloke:

  • q2_k: Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
  • q3_k_l: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
  • q3_k_m: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
  • q3_k_s: Uses Q3_K for all tensors
  • q4_0: Original quant method, 4-bit.
  • q4_1: Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than q5 models.
  • q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
  • q4_k_s: Uses Q4_K for all tensors
  • q5_0: Higher accuracy, higher resource usage and slower inference.
  • q5_1: Even higher accuracy, higher resource usage, and slower inference.
  • q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
  • q5_k_s: Uses Q5_K for all tensors
  • q6_k: Uses Q8_K for all tensors
  • q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.

As a rule of thumb, I recommend using Q5_K_M as it preserves most of the model’s performance. Alternatively, you can use Q4_K_M if you want to save some memory. In general, K_M versions are better than K_S versions. I cannot recommend Q2_K or Q3_* versions, as they drastically decrease model performance.

Now that we know more about the quantization types available, let’s see how to use them on a real model. You can execute the following code on a free T4 GPU on Google Colab. The first step consists of compiling llama.cpp and installing the required libraries in our Python environment.

# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

Now we can download our model. We will use the model we fine-tuned in this article, mlabonne/EvolCodeLlama-7b.

MODEL_ID = "mlabonne/EvolCodeLlama-7b"

# Download model
!git lfs install
!git clone https://huggingface.co/{MODEL_ID}

This step can take a while. Once it’s done, we need to convert our weights to the GGML FP16 format.

MODEL_NAME = MODEL_ID.split('/')[-1]

# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

Finally, we can quantize the model using one or several methods. In this case, we will use the Q4_K_M and Q5_K_M methods I recommended earlier. Note that quantization itself runs on the CPU; the GPU only comes into play later, when we offload layers for inference.

QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m"]

for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}

Our two quantized models are now ready for inference. We can check the size of the resulting files to see how much we compressed them. The FP16 model takes up 13.5 GB, while the Q4_K_M model takes up 4.08 GB (3.3 times smaller) and the Q5_K_M model takes up 4.78 GB (2.8 times smaller).
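If you want to reproduce this comparison, here is a small sketch that lists the files we just produced and their sizes (it simply reuses the MODEL_NAME variable defined above):

import os

# Compare the FP16 file with the quantized GGUF files
for file in sorted(os.listdir(MODEL_NAME)):
    if file.endswith(".bin") or file.endswith(".gguf"):
        size_gb = os.path.getsize(f"{MODEL_NAME}/{file}") / 1e9
        print(f"{file}: {size_gb:.2f} GB")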

Let’s use llama.cpp to efficiently run them. Since we’re using a GPU with 16 GB of VRAM, we can offload every layer to the GPU. In this case, it represents 35 layers (7b parameter model), so we’ll use the -ngl 35 parameter. In the following code block, we’ll also input a prompt and the quantization method we want to use.

import os

model_list = [file for file in os.listdir(MODEL_NAME) if "gguf" in file]

prompt = input("Enter your prompt: ")
chosen_method = input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen method is in the list
if chosen_method not in model_list:
    print("Invalid name")
else:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Let’s ask the model “Write a Python function to print the nth Fibonacci numbers” using the Q5_K_M method. If we look at the logs, we can confirm that we successfully offloaded our layers thanks to the line “llm_load_tensors: offloaded 35/35 layers to GPU”. Here is the code the model generated:

def fib(n):
    if n == 0 or n == 1:
        return n
    return fib(n - 2) + fib(n - 1)

for i in range(1, 10):
    print(fib(i))

This wasn’t a very complex prompt, but it successfully produced a working piece of code in no time. With llama.cpp, you can use your local LLM as an assistant in a terminal using the interactive mode (-i flag). Note that this also works on MacBooks with Apple’s Metal Performance Shaders (MPS), which makes them an excellent option for running LLMs locally.
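For reference, an interactive session from a regular terminal looks something like this (a sketch: adjust the model path and the -ngl value to your own setup, and drop -ngl entirely on CPU-only machines):

# Chat with the local model in interactive mode
./llama.cpp/main -m evolcodellama-7b.Q5_K_M.gguf --color -i -ngl 35 -n 256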

Finally, we can push our quantized model to a new repo on the Hugging Face Hub with the “-GGUF” suffix. First, let’s log in and modify the following code block to match your username. You can enter your Hugging Face token (https://huggingface.co/settings/tokens) in Google Colab’s “Secrets” tab. We use the allow_patterns parameter to only upload GGUF models and not the entirety of the directory.

!pip install -q huggingface_hub
from huggingface_hub import create_repo, HfApi
from google.colab import userdata

# Defined in the secrets tab in Google Colab
hf_token = userdata.get('huggingface')

api = HfApi()
username = "mlabonne"

# Create empty repo
create_repo(
    repo_id = f"{username}/{MODEL_NAME}-GGUF",
    repo_type="model",
    exist_ok=True,
    token=hf_token
)

# Upload gguf files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    allow_patterns=f"*.gguf",
    token=hf_token
)
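To double-check that only the GGUF files were uploaded, you can list the contents of the new repo with the same HfApi client (this reuses the variables defined above):

# List the files now present in the new repo
print(api.list_repo_files(
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    token=hf_token,
))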

We have successfully quantized, run, and pushed GGML models to the Hugging Face Hub! In the next section, we will explore how GGML actually quantizes these models.

Quantization with GGML

The way GGML quantizes weights is not as sophisticated as GPTQ’s. Basically, it groups blocks of values and rounds them to a lower precision. Some techniques, like Q4_K_M and Q5_K_M, implement a higher precision for critical layers. In this case, every weight is stored in 4-bit precision, with the exception of half of the attention.wv and feed_forward.w2 tensors. Experimentally, this mixed precision proves to be a good tradeoff between accuracy and resource usage.

If we look into the ggml.c file, we can see how the blocks are defined. For example, the block_q4_0 structure is defined as:

#define QK4_0 32
typedef struct {
    ggml_fp16_t d;          // delta
    uint8_t qs[QK4_0 / 2];  // nibbles / quants
} block_q4_0;

In GGML, weights are processed in blocks, each consisting of 32 values. For each block, a scale factor (delta) is derived from the largest weight value. All weights in the block are then scaled, quantized, and packed efficiently for storage (nibbles). This approach significantly reduces the storage requirements while allowing for a relatively simple and deterministic conversion between the original and quantized weights.
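To make this concrete, here is a toy NumPy sketch of the idea. It is a simplified illustration of the scheme described above, not the exact ggml kernel (which derives the scale slightly differently), but it shows the scale/round/pack pipeline:

import numpy as np

def quantize_block_q4_0(weights):
    """Toy q4_0-style quantization of one block of 32 weights."""
    assert weights.shape == (32,)
    # Scale factor ("delta") derived from the largest-magnitude weight
    max_abs = np.abs(weights).max()
    d = max_abs / 7 if max_abs > 0 else 1.0
    # Scale, round to 4-bit signed integers, then shift to [0, 15]
    q = (np.clip(np.round(weights / d), -8, 7).astype(np.int8) + 8).astype(np.uint8)
    # Pack two 4-bit values ("nibbles") per byte: 32 weights -> 16 bytes + 1 scale
    packed = (q[0::2] & 0x0F) | ((q[1::2] & 0x0F) << 4)
    return np.float16(d), packed

def dequantize_block_q4_0(d, packed):
    """Reverse the toy scheme: unpack the nibbles and rescale."""
    q = np.empty(32, dtype=np.int8)
    q[0::2] = (packed & 0x0F).astype(np.int8) - 8
    q[1::2] = (packed >> 4).astype(np.int8) - 8
    return q.astype(np.float32) * np.float32(d)

With this layout, a block of 32 FP32 weights (128 bytes) collapses into 18 bytes: a 16-bit scale plus 16 packed bytes, which matches the block_q4_0 struct above.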

Now that we know more about the quantization process, we can compare the results with NF4 and GPTQ.

NF4 vs. GGML vs. GPTQ

Which technique is better for 4-bit quantization? To answer this question, we need to introduce the different backends that run these quantized LLMs. For GGML models, llama.cpp with Q4_K_M models is the way to go. For GPTQ models, we have two options: AutoGPTQ or ExLlama. Finally, NF4 models can be run directly in transformers by loading them in 4-bit precision (load_in_4bit=True, with the NF4 quant type set in a BitsAndBytesConfig).
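For completeness, here is a minimal sketch of loading a model in NF4 with transformers and bitsandbytes (the model ID is an example; any Llama checkpoint you have access to works):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization config, as used by QLoRA
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example model ID
    quantization_config=nf4_config,
    device_map="auto",
)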

Oobabooga ran multiple experiments in an excellent blog post that compare different models in terms of perplexity (lower is better):

Based on these results, we can say that GGML models have a slight advantage in terms of perplexity. The difference is not particularly significant, which is why it is better to focus on the generation speed in terms of tokens/second. The best technique depends on your GPU: if you have enough VRAM to fit the entire quantized model, GPTQ with ExLlama will be the fastest. If that’s not the case, you can offload some layers and use GGML models with llama.cpp to run your LLM.

Conclusion

In this article, we introduced the GGML library and the new GGUF format to efficiently store these quantized models. We used them to quantize our own Llama model into two formats (Q4_K_M and Q5_K_M). We then ran the GGML model and pushed the resulting GGUF files to the Hugging Face Hub. Finally, we delved deeper into GGML’s code to understand how it actually quantizes the weights and compared it to NF4 and GPTQ.

Quantization is a powerful lever for democratizing LLMs by lowering the cost of running them. In the future, mixed precision and other techniques will keep improving the performance we can achieve with quantized weights. Until then, I hope you enjoyed reading this article and learned something new.