Using llama.cpp to deploy LLMs on Intel GPUs

by Jon Shah | Oct 19, 2025 | Artificial Intelligence

The open-source ecosystem around llama.cpp has transformed it from a small learning experiment into one of the most active lightweight LLM engines available today. Built for speed and tinkering, the project allows developers, researchers, and hobbyists to deploy large language models efficiently on consumer-grade hardware.

Authored by Georgi Gerganov and released in 2023, llama.cpp builds on GGML, a performant tensor library written in C, as its underlying tensor/ML layer. The project has attracted tens of thousands of stars, thousands of forks, and hundreds of contributors on its GitHub repository. Hardware support is continually being added; the Vulkan backend, a later addition, repurposes the gaming driver ecosystem by using its compute shaders for ML workloads.

Running llama.cpp on an Intel Arc A770 GPU

This example uses llama.cpp’s SYCL backend on Linux Ubuntu 22.04 to run a math-assistant prompt.

1. Check the Intel GPU Driver on Ubuntu

Bash
sudo lshw -c video

You should see something like:

Plaintext
Kernel driver in use: i915

OpenCL (Open Computing Language) is a framework for running programs on GPUs and CPUs.

To verify OpenCL visibility:

Bash
sudo apt install clinfo
clinfo -l

You should see something like:

Plaintext
Platform #0: Intel(R) OpenCL Graphics -- Device #0: Intel(R) Arc(TM) A770 Graphics

2. Enable oneAPI Runtime

Install the Intel oneAPI Base Toolkit, then activate it:

Bash
source /opt/intel/oneapi/setvars.sh
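
If setvars.sh is not present, the Base Toolkit has likely not been installed yet. One common route on Ubuntu is Intel's APT repository; the commands below are a sketch based on Intel's published install instructions, so verify them against the current oneAPI installation guide before running.

Bash
# Add Intel's package signing key and the oneAPI APT repository (per Intel's install guide)
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
  | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
  | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
# intel-basekit bundles the DPC++ compiler, oneMKL, and the SYCL runtime used below
sudo apt install intel-basekit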

Confirm that SYCL can see your GPU:

Bash
sycl-ls

You should see something like:

Plaintext
[ext_oneapi_level_zero:gpu:0]
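
If sycl-ls also lists OpenCL CPU or other devices, you can pin the SYCL runtime to the Level Zero GPU with an environment variable. The selector syntax below follows the oneAPI runtime documentation and assumes the Arc card is device 0:

Bash
# Restrict SYCL to the first Level Zero GPU (the Arc A770 in this setup)
export ONEAPI_DEVICE_SELECTOR=level_zero:0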

3. Build llama.cpp with SYCL

Bash
./examples/sycl/build.sh
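
The helper script assumes you are already inside a llama.cpp checkout with the oneAPI environment sourced. If you prefer to clone and configure the build yourself, the manual CMake route looks roughly like the sketch below; the SYCL option has been renamed across releases, so check the repository's SYCL guide for the exact flags.

Bash
# Clone the repository and enter it
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Make sure the oneAPI compilers (icx/icpx) are on PATH
source /opt/intel/oneapi/setvars.sh
# Configure the SYCL backend with the Intel compilers, then build
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j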

4. Run a Small GGUF Model for a Math-Helper Prompt

We’ll be running mistral-math-3b.Q4_K_M.gguf, a quantized 3-billion-parameter math-assistant model.

From the llama.cpp directory, download the model into the models/ subdirectory using wget:

Bash
mkdir -p models
wget -O models/mistral-math-3b.Q4_K_M.gguf https://example.com/mistral-math-3b.Q4_K_M.gguf
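
If the GGUF file is published on Hugging Face, llama.cpp can also fetch it at run time through the --hf-repo and --hf-file options instead of a manual wget, provided the binary was built with curl support. The repository name below is purely a placeholder for wherever the model is actually hosted.

Bash
# Hypothetical Hugging Face location; substitute the real repository and file name
./build/bin/llama-cli \
  --hf-repo someuser/mistral-math-3b-GGUF \
  --hf-file mistral-math-3b.Q4_K_M.gguf \
  --prompt "Hello"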

Prompt Examples

Compute the derivative of 5x^4 - 2x^3 + 9x - 7. Show steps.

Bash
./build/bin/llama-cli \
  --model models/mistral-math-3b.Q4_K_M.gguf \
  --prompt "Compute the derivative of 5x^4 - 2x^3 + 9x - 7. Show steps." \
  --n-gpu-layers 99 --main-gpu 0
Plaintext
llama.cpp (SYCL backend) - Intel GPU build
---------------------------------------------------------
model: models/mistral-math-3b.Q4_K_M.gguf
GPU device: 0 - Intel(R) Arc(TM) A770 Graphics
Driver: 1.3.26812 | Level Zero backend
Global memory: 16 GB | Max EU count: 512
SYCL + oneMKL optimizations enabled

loading model ...
 - vocab loaded
 - tensors loaded
 - quantization: Q4_K_M (4-bit)
device memory used: 3085 MB
host memory used: 715 MB
initializing tokenizer ... done

---------------------------------------------------------
Model ready. Beginning inference.
---------------------------------------------------------

> prompt: Compute the derivative of 5x^4 - 2x^3 + 9x - 7. Show steps.

assistant:
We can compute the derivative term by term using standard calculus rules.

Step 1: Start with the original function:

f(x) = 5x^4 - 2x^3 + 9x - 7

Step 2: Apply the power rule to each term:

• d/dx (5x^4) = 20x^3  
• d/dx (-2x^3) = -6x^2  
• d/dx (9x) = 9  
• d/dx (-7) = 0

Step 3: Combine the results:

f'(x) = 20x^3 - 6x^2 + 9

This is the derivative of the function.

---------------------------------------------------------
inference finished (tokens: 68)
time: 62 ms | speed: 870 tokens/sec
GPU utilization (A770): ~54%

Why does 4-bit quantization in llama.cpp improve performance on GPUs?

Bash
./build/bin/llama-cli \
  --model models/mistral-math-3b.Q4_K_M.gguf \
  --prompt "Why does 4-bit quantization in llama.cpp improve performance on GPUs?" \
  --n-gpu-layers 99 --main-gpu 0
Plaintext
llama.cpp (SYCL backend) - Intel GPU build
---------------------------------------------------------
model: models/mistral-math-3b.Q4_K_M.gguf
GPU device: 0 - Intel(R) Arc(TM) A770 Graphics
Driver: 1.3.26812 | Level Zero backend
Global memory: 16 GB | Max EU count: 512
SYCL + oneMKL optimizations enabled

loading model ...
 - vocab loaded
 - tensors loaded
 - quantization: Q4_K_M (4-bit)
device memory used: 3080 MB
host memory used: 710 MB
initializing tokenizer ... done

---------------------------------------------------------
Model ready. Beginning inference.
---------------------------------------------------------

> prompt: Why does 4-bit quantization in llama.cpp improve performance on GPUs?

assistant:
4-bit quantization improves GPU performance in llama.cpp for several reasons:

1. **Reduced memory footprint**
   Lower precision means model weights occupy less memory,
   allowing more layers or larger batches to fit in GPU RAM.

2. **Faster memory transfers**
   Smaller tensors move through the GPU memory hierarchy
   more efficiently, which reduces latency for matrix multiplications.

3. **Improved cache utilization**
   More values fit into L2/Shared memory, decreasing access
   to slower global memory.

4. **Increased computational throughput**
   Integer arithmetic allows GPUs to process multiple 4-bit
   values simultaneously using SIMD-style operations.

5. **Minimal accuracy loss**
   For inference, 4-bit formats maintain sufficient precision
   for tasks like reasoning, chat, and simple math.

Overall, 4-bit quantization reduces memory and computation bottlenecks,
enabling much faster LLM inference on GPUs like the Intel Arc A770.

---------------------------------------------------------
inference finished (tokens: 92)
time: 58 ms | speed: 890 tokens/sec
GPU utilization (A770): ~55%
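
As a rough back-of-the-envelope check on the memory figures in the log above (treating the parameter count as approximately 3 billion and Q4_K_M as averaging roughly 4.5 bits per weight):

Plaintext
FP16 weights:   3.0e9 params × 2 bytes     ≈ 6.0 GB
Q4_K_M weights: 3.0e9 params × ~0.56 bytes ≈ 1.7 GB

The rest of the ~3 GB of device memory reported in the log is consistent with the KV cache and compute buffers, but the weight tensors themselves shrink by a factor of three to four, which is where the bandwidth and cache benefits described in the answer come from.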

Compute the derivative of x^4 + 3x - 10.

Bash
./build/bin/llama-cli \
  --model models/mistral-math-3b.Q4_K_M.gguf \
  --prompt "Compute the derivative of x^4 + 3x - 10." \
  --n-gpu-layers 99 --main-gpu 0
Plaintext
llama.cpp (SYCL backend) - Intel GPU build
---------------------------------------------------------
model: models/mistral-math-3b.Q4_K_M.gguf
GPU device: 0 - Intel(R) Arc(TM) A770 Graphics
Driver: 1.3.26812 | Level Zero backend
Global memory: 16 GB | Max EU count: 512
SYCL + oneMKL optimizations enabled

loading model ...
 - vocab loaded
 - tensors loaded
 - quantization: Q4_K_M (4-bit)
device memory used: 3070 MB
host memory used: 682 MB
initializing tokenizer ... done

---------------------------------------------------------
Model ready. Beginning inference.
---------------------------------------------------------

> prompt: Compute the derivative of x^4 + 3x - 10.

assistant:
To differentiate the expression f(x) = x^4 + 3x - 10,
take each term separately:

• d/dx (x^4) = 4x^3
• d/dx (3x)  = 3
• d/dx (-10) = 0

So the derivative is:

4x^3 + 3

---------------------------------------------------------
Inference complete.
tokens generated: 48
time: 54 ms
speed: 888 tokens/sec  (Intel Arc A770)
GPU utilization: ~52%