1. Download
First, download the model from Hugging Face. You can download the original model or one of its variants; here I will download a Llama 3 70B variant, which is around 130 GB in total.
huggingface-cli download cognitivecomputations/dolphin-2.9.1-llama-3-70b --cache-dir ./model
If you are downloading from within China, you can use the Hugging Face mirror https://hf-mirror.com/ instead.
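For example (a minimal sketch, assuming the installed huggingface_hub version honors the HF_ENDPOINT environment variable, which recent releases do), you can point the CLI at the mirror and run the same download command:
# Point huggingface-cli at the mirror, then download as usual
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download cognitivecomputations/dolphin-2.9.1-llama-3-70b --cache-dir ./model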
2. Install llama.cpp
Download and install from GitHub: https://github.com/ggerganov/llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
The build is now complete.
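If you plan to offload layers to a GPU later, you will want a GPU-enabled build. As a rough sketch (the exact build flag depends on which llama.cpp revision you checked out):
# CUDA-enabled build; older checkouts used LLAMA_CUBLAS=1 instead
make LLAMA_CUDA=1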
3. Convert the model to GGUF format
Stay in the llama.cpp directory.
python convert.py <huggingface-model-directory> \
--outfile <output-model>.gguf \
--outtype f16 --vocab-type bpe
# Example
python convert.py ./model/models--cognitivecomputations--dolphin-2.9.1-llama-3-70b/snapshots/3f2d2fae186870be37ac83af1030d00a17766929 \
--outfile ./GGUF/dolphin-2.9.1-llama-3-70b-f16.gguf \
--outtype f16 --vocab-type bpe
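If you would rather not copy the snapshot hash by hand, a small sketch like the following (assuming the default huggingface_hub cache layout under ./model) resolves the snapshot directory automatically:
# Resolve the snapshot directory inside the Hugging Face cache, then convert
MODEL_DIR=$(ls -d ./model/models--cognitivecomputations--dolphin-2.9.1-llama-3-70b/snapshots/*/ | head -n 1)
mkdir -p ./GGUF
python convert.py "$MODEL_DIR" \
    --outfile ./GGUF/dolphin-2.9.1-llama-3-70b-f16.gguf \
    --outtype f16 --vocab-type bpe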
It may take some time to run. After it completes, you will get the dolphin-2.9.1-llama-3-70b-f16.gguf file, which is roughly the same size, about 130 GB. You could run it at this point, but it would require more than 140 GB of GPU memory, which is generally not feasible. So let's quantize the file to reduce its size at the cost of a small amount of quality.
4. Quantize the GGUF model
First, let's go over the available quantization options:
- q2_k: Specific tensors are set to higher precision, while others remain at the base level.
- q3_k_l, q3_k_m, q3_k_s: These variants use different levels of precision on different tensors to achieve a balance between performance and efficiency.
- q4_0: This is the original quantization scheme, using 4-bit precision.
- q4_1, q4_k_m, q4_k_s: These provide different trade-offs between accuracy and inference speed, suitable for scenarios that need balanced resource usage.
- q5_0, q5_1, q5_k_m, q5_k_s: These versions guarantee higher accuracy but use more resources and have slower inference speed.
- q6_k and q8_0: These provide the highest precision but may not be suitable for all users due to high resource consumption and slow speed.
We will use the Q4_K_M scheme.
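The exact set of supported types varies between llama.cpp versions; running the quantize binary with no arguments usually prints a usage message that lists the types your build supports:
# Print usage, including the quantization types supported by this build
./quantize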
Still in the llama.cpp directory, make sure the quantize executable exists (if not, build it with make and ensure it has execute permission), then run the following command:
./quantize ./GGUF/dolphin-2.9.1-llama-3-70b-f16.gguf ./GGUF/dolphin-2.9.1-llama-3-70b-Q4_K_M.gguf Q4_K_M
After quantization, the file size drops to around 40 GB, and the model can run within 48 GB of GPU memory, cutting the hardware cost by more than half.
5. Run inference
You can run inference with llama.cpp itself, or use Ollama, which has more convenient support for GGUF models. For details, see the official GitHub repository: https://github.com/ollama/ollama
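As a minimal sketch with llama.cpp (assuming a GPU-enabled build; -ngl controls how many layers are offloaded to the GPU):
# Run the quantized model, offloading all layers to the GPU
./main -m ./GGUF/dolphin-2.9.1-llama-3-70b-Q4_K_M.gguf -ngl 99 -c 4096 -p "Why is the sky blue?"
With Ollama, a common pattern is to wrap the GGUF file in a Modelfile and create a local model from it (the model name dolphin-llama3-70b below is just an illustrative choice):
# Create a Modelfile that points at the quantized GGUF, then build and run the model
echo "FROM ./GGUF/dolphin-2.9.1-llama-3-70b-Q4_K_M.gguf" > Modelfile
ollama create dolphin-llama3-70b -f Modelfile
ollama run dolphin-llama3-70b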
Summary
This concludes the first part. If you encounter any issues, feel free to discuss them in the comments.