This document describes how to quantize your models on Deucalion and test them. If you adjust some details, you can also test this on BSC Mare Nostrum 5.
Pre-requisites
- Active account on Deucalion (you can also test most of this workflow on MN5 with some adjustments)
- Familiarity with Slurm job submission (
sbatch,srun) - Models pre-downloaded to /projects/... (accessible from compute nodes)
Steps
- Download model
- Clone & build llama.cpp
- Prepare conversion environment
- Convert the model
- Test model
Clone & build llama.cpp
For this, you download llama.cppto use only cpu and do the following steps:
1. Save as script_build_llama_cpu.sh:
#!/bin/bash
#SBATCH --account=your_account
#SBATCH --partition=dev-x86
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=00:30:00
#SBATCH --output=logs/build_%j.out
#SBATCH --error=logs/build_%j.err
#Load dependencies
module load CMake/3.31.3-GCCcore-14.2.0
#clone the repository
git clone https://github.com/ggml-org/llama.cpp
#open folder
cd llama.cpp
# Configure CPU-only build
cmake -B build_cpu \
-DGGML_CUDA=OFF \
-DLLAMA_BUILD_UI=ON \
-DCMAKE_BUILD_TYPE=Release
#build
cmake --build build_cpu --config Release -j$(nproc)
echo "llama.cpp ready"
If you want to use the gpu nodes save the following file as script_build_llama_gpu.sh:
#!/bin/bash
#SBATCH --account=account_name
#SBATCH --partition=dev-a100-40
#SBATCH --gpus=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=00:30:00
#SBATCH --output=logs/build_%j.out
#SBATCH --error=logs/build_%j.err
#load dependencies
module load CMake/3.31.3-GCCcore-14.2.0
module load CUDA/12.9.1
#clone repository
git clone https://github.com/ggml-org/llama.cpp
#open folder
cd llama.cpp
# Configure with CUDA support
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_FLAGS="-allow-unsupported-compiler" \
-DLLAMA_BUILD_UI=ON
# Build
cmake --build build --config Release -j$(nproc)
echo "llama.cpp ready"
2. Submit the job and monitor:
Note: For MN5, since you so not have internet connection you can build llama.cpp on Deucalion x86 partitions and move the folder to MN5 (Deucalion-->laptop-->MN5) or build it in your own environment and transfer it there (it is advised to use a container environment to avoid problems due to architecture differences).
Prepare conversion environment
Option A: Singularity container
For the following steps, you can build your container inside Deucalion.
1. Create the requirements.txt file
You can build a container with the following requirements.txt:
2. Create container.def:
Bootstrap: docker
From: python:3.12
Stage: build
%files
requirements.txt
%post
# Avoid interactive prompts
export DEBIAN_FRONTEND=noninteractive
pip install --upgrade pip
python -m pip install -r requirements.txt
%environment
export OMP_NUM_THREADS=1
export LC_ALL=C
3. Build on a compute node
For this, you can run an interactive session:
Note: This --fakeroot option requires prior approval. Contact our support or Deucalion support if not enabled.
Option B: Pandora Tool (x86 only)
The alternative is to use the pandora tool available at Deucalion for x86 architectures:
Convert you model to .gguf format (quantized version)
EXTRA: Merge LORA weights + Model
If you intend to use your model as it is (downloaded for example from huggingface), you can skip this section.
- In case you have fine-tuned your own model, you can use a Python script
merge_lora.pylike the following one to merge LORA weights with the base model weights.
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
import sys
base_path = "/path/to/your/base/model"
lora_path = "/path/to/your/fine-tuned/model"
output_dir = "output/path/for/your/merged/model"
print("Loading base model...")
base_model = AutoModelForCausalLM.from_pretrained(
base_path,
torch_dtype=torch.bfloat16, # Use bfloat16 for better precision
device_map="auto",
low_cpu_mem_usage=True,
trust_remote_code=True # Required for Qwen architectures
)
print("Loading LoRA adapter...")
model = PeftModel.from_pretrained(base_model, lora_path)
print("Merging LoRA with base model...")
merged_model = model.merge_and_unload()
print("Saving merged model...")
merged_model.save_pretrained(output_dir, safe_serialization=True, max_shard_size="4GB")
print("Saving tokenizer...")
AutoTokenizer.from_pretrained(base_path, trust_remote_code=True).save_pretrained(output_dir)
print(f"Merged model saved to: {output_dir}")
- To run the previous Python script save first
run-merge.sh:
#!/bin/bash
#SBATCH --account=<account_name>
#SBATCH --partition=<partition_name>
#SBATCH --nodes=1
#SBATCH --gpus=1 #if using a100 partitions
#SBATCH --cpus-per-task=8
#SBATCH --time=00:30:00
#SBATCH --output=logs/merge_%j.out
#SBATCH --error=logs/merge_%j.err
# Point to your script & container
MERGE_SCRIPT="/path/to/merge_lora.py"
CONTAINER="/path/to/container/container_name.sif"
singularity exec \
--nv \
--bind /projects:/projects \
"$CONTAINER" \
python3 "$MERGE_SCRIPT"
- And then, you can run it by typing
sbatch run-merge.sh.
Convert the model
Once you have your environment prepared with the container, you can save the following script gguf-conv.sh:
#!/bin/bash
#SBATCH --account=<account>
#SBATCH --partition=<partition> #choose x86 cpu or gpu
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=01:00:00
#SBATCH --output=logs/conv_%j.out
#SBATCH --error=logs/conv_%j.err
CONTAINER="/your/path/to/container/container_name.sif"
export OMP_NUM_THREADS=32
singularity exec \
--bind /projects:/projects \
"$CONTAINER" \
bash -c '
set -e
cd /path/to/llama.cpp
MODEL_PATH="/path/to/your/model" #either a model like QWEN3-8B or a merged model
OUTPUT_PATH="/output/path/"
# Step 1: Convert to FP16 GGUF
python convert_hf_to_gguf.py "$QWEN_PATH" \
--outfile "$OUTPUT_PATH/qwen8b-f16.gguf" \
--outtype f16
# Step 2: Quantize
./build_cpu/bin/llama-quantize \
"$OUTPUT_PATH/qwen8b-f16.gguf" \
"$OUTPUT_PATH/qwen8b-Q4_K_M.gguf" \
Q4_K_M
'
Please be aware that if you choose dev-x86or normal-x86 cpu nodes you choose use your llama.cpp built with cpu and therefore run the command ./build_cpu/bin/llama-quantize. Otherwise, if you wish to use gpu nodes you should run it like this ./build/bin/llama-quantize.
To run this script you type on your terminal:
You can keep track of the logs on your terminal like this:
# Monitor logs
tail -f logs/conv_%j.out #or logs/conv_%j.err
#press ctrl+C to exit
# Check job status
squeue --me
Test the model with llama-server
Option A (only on Deucalion)
On Deucalion, you can open an interactive session on llama.cpp of Open OnDemand platform. This is recommended for an interactive testing.
- Login in to Deucalion Open OnDemand
- Navigate to Interactive Apps → Llama.cpp
- Configure:
- Account
- Partition (you can choose arm)
- Number of cores
- Time: e.g. 1h
- Model path:
/projects/models/gguf/...(absolute path to .gguf file); if you don't have quantized models yet you can use the default path for available models at Deucalion.
And then select lauch.
Option B
You can test llama-server on your terminal.
Open an interactive session:
srun -A <account_name> --time=04:00:00 --nodes=1 -p dev-x86 --pty bash
singularity shell llama-gguf-x86.sif
cd llama.cpp
./build_cpu/bin/llama-server \
-m /path/to/model/qwen8b-Q4_K_M.gguf \
-c 4096 \
-ngl 0 \
--host 0.0.0.0 \
--port 8080
In another terminal:
ssh -L 8080:localhost:8080 <username@login.deucalion.macc.fccn.pt> #open an ssh tunnel that connects your local machine to Deucalion
Then, to use llama.cpp ui and test the model you open your browser at http://localhost:8080.
Last updated: $(01-06-2026)