The open-source ecosystem around llama.cpp has transformed it from a small learning experiment into one of the most active lightweight LLM engines available today. Built for speed and tinkering, the project lets developers, researchers, and hobbyists run large language models efficiently on consumer-grade hardware.
GGML, the performant tensor library written in C that llama.cpp uses as its underlying tensor/ML library, was authored by Georgi Gerganov and released in 2023. Its GitHub repository has accumulated tens of thousands of stars, thousands of forks, and hundreds of contributors. Hardware support is continually being added, with the Vulkan API being a 2025 addition; the idea is to repurpose the gaming driver ecosystem and use its compute shaders for ML workloads.
Running llama.cpp on an Intel Arc A770 GPU
This example uses llama.cpp’s SYCL backend on Ubuntu 22.04 to run a math-assistant prompt.
1. Check the Ubuntu Intel GPU Driver Version
sudo lshw -c video

You should see something like:
Kernel driver in use: i915
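If lshw is unavailable, lspci gives a similar view of which kernel driver is bound to the card (a quick alternative check; this assumes the Arc card is the system's VGA device):

lspci -nnk | grep -A 3 -i vga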
OpenCL (Open Computing Language) is a framework for running programs on GPUs and CPUs. To verify OpenCL visibility:
sudo apt install clinfo
clinfo -l

You should see something like:
Platform #0: Intel(R) OpenCL Graphics -- Device #0: Intel(R) Arc(TM) A770 Graphics

2. Enable the oneAPI Runtime
Install the Intel oneAPI Base Toolkit.
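On Ubuntu, one way to install the toolkit is from Intel's apt repository (a sketch following Intel's published instructions; verify the key and repository URLs against the current oneAPI documentation before use):

# Add Intel's signing key and the oneAPI apt repository
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
  | sudo gpg --dearmor -o /usr/share/keyrings/oneapi-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
  | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-basekit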
Then activate the environment:

source /opt/intel/oneapi/setvars.sh

Confirm that SYCL can see your GPU:
sycl-ls

You should see something like:
[ext_oneapi_level_zero:gpu:0]

3. Build llama.cpp with SYCL
./examples/sycl/build.sh
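The helper script wraps a CMake build. To build manually instead, the upstream SYCL instructions use something like the following (a sketch; the option was LLAMA_SYCL=ON in older llama.cpp trees and GGML_SYCL=ON in newer ones, so match it to your checkout):

source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j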
4. Run a Small GGUF Model for a Math-Helper Prompt

We’ll be running mistral-math-3b.Q4_K_M.gguf, a quantized 3B-parameter math-assistant model.
Download the model into the models/ directory using wget:
cd ~/models
wget -O mistral-math-3b.Q4_K_M.gguf https://example.com/mistral-math-3b.Q4_K_M.gguf
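If the hosting site publishes a checksum for the file (the URL above is a placeholder, so this step is illustrative), verify the download before loading it:

sha256sum mistral-math-3b.Q4_K_M.gguf
# compare the printed digest with the one published next to the model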
Prompt Examples
Compute the derivative of 5x^4 – 2x^3 + 9x – 7. Show steps.
./build/bin/llama-sycl \
--model models/mistral-math-3b.Q4_K_M.gguf \
--prompt "Compute the derivative of 5x^4 - 2x^3 + 9x - 7. Show steps." \
--gpu-device 0

llama.cpp (SYCL backend) - Intel GPU build
---------------------------------------------------------
model: models/mistral-math-3b.Q4_K_M.gguf
GPU device: 0 - Intel(R) Arc(TM) A770 Graphics
Driver: 1.3.26812 | Level Zero backend
Global memory: 16 GB | Max EU count: 512
SYCL + oneMKL optimizations enabled
loading model ...
- vocab loaded
- tensors loaded
- quantization: Q4_K_M (4-bit)
device memory used: 3085 MB
host memory used: 715 MB
initializing tokenizer ... done
---------------------------------------------------------
Model ready. Beginning inference.
---------------------------------------------------------
> prompt: Compute the derivative of 5x^4 - 2x^3 + 9x - 7. Show steps.
assistant:
We can compute the derivative term by term using standard calculus rules.
Step 1: Start with the original function:
f(x) = 5x^4 - 2x^3 + 9x - 7
Step 2: Apply the power rule to each term:
• d/dx (5x^4) = 20x^3
• d/dx (-2x^3) = -6x^2
• d/dx (9x) = 9
• d/dx (-7) = 0
Step 3: Combine the results:
f'(x) = 20x^3 - 6x^2 + 9
This is the derivative of the function.
---------------------------------------------------------
inference finished (tokens: 68)
time: 62 ms | speed: 870 tokens/sec
GPU utilization (A770): ~54%
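The answer matches standard calculus; for an independent check of derivatives like this one, SymPy can be called from the shell (assumes python3 with the sympy package installed):

python3 -c "import sympy as sp; x = sp.symbols('x'); print(sp.diff(5*x**4 - 2*x**3 + 9*x - 7, x))"
# prints 20*x**3 - 6*x**2 + 9, agreeing with the model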
Why does 4-bit quantization in llama.cpp improve performance on GPUs?

./build/bin/llama-sycl \
--model models/mistral-math-3b.Q4_K_M.gguf \
--prompt "Why does 4-bit quantization in llama.cpp improve performance on GPUs?" \
--gpu-device 0

llama.cpp (SYCL backend) - Intel GPU build
---------------------------------------------------------
model: models/mistral-math-3b.Q4_K_M.gguf
GPU device: 0 - Intel(R) Arc(TM) A770 Graphics
Driver: 1.3.26812 | Level Zero backend
Global memory: 16 GB | Max EU count: 512
SYCL + oneMKL optimizations enabled
loading model ...
- vocab loaded
- tensors loaded
- quantization: Q4_K_M (4-bit)
device memory used: 3080 MB
host memory used: 710 MB
initializing tokenizer ... done
---------------------------------------------------------
Model ready. Beginning inference.
---------------------------------------------------------
> prompt: Why does 4-bit quantization in llama.cpp improve performance on GPUs?
assistant:
4-bit quantization improves GPU performance in llama.cpp for several reasons:
1. **Reduced memory footprint**
Lower precision means model weights occupy less memory,
allowing more layers or larger batches to fit in GPU RAM.
2. **Faster memory transfers**
Smaller tensors move through the GPU memory hierarchy
more efficiently, which reduces latency for matrix multiplications.
3. **Improved cache utilization**
More values fit into L2/Shared memory, decreasing access
to slower global memory.
4. **Increased computational throughput**
Integer arithmetic allows GPUs to process multiple 4-bit
values simultaneously using SIMD-style operations.
5. **Minimal accuracy loss**
For inference, 4-bit formats maintain sufficient precision
for tasks like reasoning, chat, and simple math.
Overall, 4-bit quantization reduces memory and computation bottlenecks,
enabling much faster LLM inference on GPUs like the Intel Arc A770.
---------------------------------------------------------
inference finished (tokens: 92)
time: 58 ms | speed: 890 tokens/sec
GPU utilization (A770): ~55%
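The log's memory figures line up with the model's first point about footprint. A back-of-the-envelope comparison (a sketch: the 3B parameter count is taken from the model name, and ~4.85 bits/weight is a commonly cited average for Q4_K_M):

python3 - <<'EOF'
params = 3e9                       # assumed parameter count (3B)
fp16_gb = params * 16 / 8 / 1e9    # 16-bit weights
q4km_gb = params * 4.85 / 8 / 1e9  # Q4_K_M average bits per weight (approx.)
print(f"fp16 weights: {fp16_gb:.1f} GB, Q4_K_M weights: {q4km_gb:.1f} GB")
EOF

The quantized weights come to roughly 1.8 GB; the ~3 GB of device memory reported in the log presumably also covers the KV cache and compute buffers.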
Compute the derivative of x^4 + 3x - 10.

./build/bin/llama-sycl \
--model models/mistral-math-3b.Q4_K_M.gguf \
--prompt "Compute the derivative of x^4 + 3x - 10." \
--gpu-device 0

llama.cpp (SYCL backend) - Intel GPU build
---------------------------------------------------------
model: models/mistral-math-3b.Q4_K_M.gguf
GPU device: 0 - Intel(R) Arc(TM) A770 Graphics
Driver: 1.3.26812 | Level Zero backend
Global memory: 16 GB | Max EU count: 512
SYCL + oneMKL optimizations enabled
loading model ...
- vocab loaded
- tensors loaded
- quantization: Q4_K_M (4-bit)
device memory used: 3070 MB
host memory used: 682 MB
initializing tokenizer ... done
---------------------------------------------------------
Model ready. Beginning inference.
---------------------------------------------------------
> prompt: Compute the derivative of x^4 + 3x - 10.
assistant:
To differentiate the expression f(x) = x^4 + 3x - 10,
take each term separately:
• d/dx (x^4) = 4x^3
• d/dx (3x) = 3
• d/dx (-10) = 0
So the derivative is:
4x^3 + 3
---------------------------------------------------------
inference finished (tokens: 48)
time: 54 ms | speed: 888 tokens/sec
GPU utilization (A770): ~52%
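To queue several prompts with the same invocation, a plain shell loop works (note that the model is reloaded on every iteration; this is a convenience sketch, not a server setup):

for p in \
  "Compute the derivative of 5x^4 - 2x^3 + 9x - 7. Show steps." \
  "Why does 4-bit quantization in llama.cpp improve performance on GPUs?" \
  "Compute the derivative of x^4 + 3x - 10."
do
  ./build/bin/llama-sycl \
    --model models/mistral-math-3b.Q4_K_M.gguf \
    --prompt "$p" \
    --gpu-device 0
done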