1. Download
First, download the model from Hugging Face. You can download the original model or one of its variants; here I will download a Llama 3 70B variant, which is around 130 GB in total.
huggingface-cli download cognitivecomputations/dolphin-2.9.1-llama-3-70b --cache-dir ./model
If you are downloading from within China, you can use the Hugging Face mirror https://hf-mirror.com/ instead.
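For example (a minimal sketch, assuming the installed huggingface_hub version honors the HF_ENDPOINT environment variable, which recent releases do), you can point the CLI at the mirror and run the same download command:
# Point huggingface-cli at the mirror, then download as usual
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download cognitivecomputations/dolphin-2.9.1-llama-3-70b --cache-dir ./model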
2. Install llama.cpp
Download and install from GitHub: https://github.com/ggerganov/llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
The build is now complete.
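If you plan to offload layers to a GPU later, you will want a GPU-enabled build. As a rough sketch (the exact build flag depends on which llama.cpp revision you checked out):
# CUDA-enabled build; older checkouts used LLAMA_CUBLAS=1 instead
make LLAMA_CUDA=1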
3. Convert the model to GGUF format
Stay in the llama.cpp directory.
python convert.py <huggingface-model-directory> \
--outfile <output-model>.gguf \
--outtype f16 --vocab-type bpe
# Example
python convert.py ./model/models--cognitivecomputations--dolphin-2.9.1-llama-3-70b/snapshots/3f2d2fae186870be37ac83af1030d00a17766929 \
--outfile ./GGUF/dolphin-2.9.1-llama-3-70b-f16.gguf \
--outtype f16 --vocab-type bpe
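If you would rather not copy the snapshot hash by hand, a small sketch like the following (assuming the default huggingface_hub cache layout under ./model) resolves the snapshot directory automatically:
# Resolve the snapshot directory inside the Hugging Face cache, then convert
MODEL_DIR=$(ls -d ./model/models--cognitivecomputations--dolphin-2.9.1-llama-3-70b/snapshots/*/ | head -n 1)
mkdir -p ./GGUF
python convert.py "$MODEL_DIR" \
    --outfile ./GGUF/dolphin-2.9.1-llama-3-70b-f16.gguf \
    --outtype f16 --vocab-type bpe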
It may take some time to run. After it completes, you will get the dolphin-2.9.1-llama-3-70b-f16.gguf file, which is roughly the same size, about 130 GB. You could run it at this point, but it would require more than 140 GB of GPU memory, which is generally not feasible. So let's quantize the file to reduce its size at the cost of a small amount of quality.
4. Quantize the GGUF model
First, let's go over the available quantization options:
- q2_k: Specific tensors are set to higher precision, while others remain at the base level.
- q3_k_l, q3_k_m, q3_k_s: These variants use different levels of precision on different tensors to achieve a balance between performance and efficiency.
- q4_0: This is the original quantization scheme, using 4-bit precision.
- q4_1, q4_k_m, q4_k_s: These provide different trade-offs between accuracy and inference speed, suitable for scenarios that need balanced resource usage.
- q5_0, q5_1, q5_k_m, q5_k_s: These versions guarantee higher accuracy but use more resources and have slower inference speed.
- q6_k and q8_0: These provide the highest precision but may not be suitable for all users due to high resource consumption and slow speed.
We will use the Q4_K_M scheme.
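The exact set of supported types varies between llama.cpp versions; running the quantize binary with no arguments usually prints a usage message that lists the types your build supports:
# Print usage, including the quantization types supported by this build
./quantize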
Still in the llama.cpp directory, make sure the quantize executable exists (if not, build it with make and ensure it has execute permission), then run the following command:
./quantize ./GGUF/dolphin-2.9.1-llama-3-70b-f16.gguf ./GGUF/dolphin-2.9.1-llama-3-70b-Q4_K_M.gguf Q4_K_M
After quantization, the file size drops to around 40 GB, and the model can run within 48 GB of GPU memory, cutting the hardware cost by more than half.
5. Run inference
You can run inference with llama.cpp itself, or use Ollama, which has more convenient support for GGUF models. For details, see the official GitHub repository: https://github.com/ollama/ollama
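As a minimal sketch with llama.cpp (assuming a GPU-enabled build; -ngl controls how many layers are offloaded to the GPU):
# Run the quantized model, offloading all layers to the GPU
./main -m ./GGUF/dolphin-2.9.1-llama-3-70b-Q4_K_M.gguf -ngl 99 -c 4096 -p "Why is the sky blue?"
With Ollama, a common pattern is to wrap the GGUF file in a Modelfile and create a local model from it (the model name dolphin-llama3-70b below is just an illustrative choice):
# Create a Modelfile that points at the quantized GGUF, then build and run the model
echo "FROM ./GGUF/dolphin-2.9.1-llama-3-70b-Q4_K_M.gguf" > Modelfile
ollama create dolphin-llama3-70b -f Modelfile
ollama run dolphin-llama3-70b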
Summary
This concludes the first part. If you encounter any issues, feel free to discuss them in the comments.