Skip to content

This document describes how to quantize your models on Deucalion and test them. If you adjust some details, you can also test this on BSC Mare Nostrum 5.

Pre-requisites

  • Active account on Deucalion (you can also test most of this workflow on MN5 with some adjustments)
  • Familiarity with Slurm job submission (sbatch, srun)
  • Models pre-downloaded to /projects/... (accessible from compute nodes)

Steps

  1. Download model
  2. Clone & build llama.cpp
  3. Prepare conversion environment
  4. Convert the model
  5. Test model

Clone & build llama.cpp

For this, you download llama.cppto use only cpu and do the following steps:

1. Save as script_build_llama_cpu.sh:

#!/bin/bash
#SBATCH --account=your_account
#SBATCH --partition=dev-x86
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=00:30:00
#SBATCH --output=logs/build_%j.out
#SBATCH --error=logs/build_%j.err

#Load dependencies
module load CMake/3.31.3-GCCcore-14.2.0 

#clone the repository
git clone https://github.com/ggml-org/llama.cpp

#open folder
cd llama.cpp

# Configure CPU-only build
cmake -B build_cpu \
  -DGGML_CUDA=OFF \
  -DLLAMA_BUILD_UI=ON \
  -DCMAKE_BUILD_TYPE=Release

#build
cmake --build build_cpu --config Release -j$(nproc)

echo "llama.cpp ready"

If you want to use the gpu nodes save the following file as script_build_llama_gpu.sh:

 #!/bin/bash
#SBATCH --account=account_name
#SBATCH --partition=dev-a100-40
#SBATCH --gpus=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=00:30:00
#SBATCH --output=logs/build_%j.out
#SBATCH --error=logs/build_%j.err

#load dependencies
module load CMake/3.31.3-GCCcore-14.2.0 
module load CUDA/12.9.1 

#clone repository
git clone https://github.com/ggml-org/llama.cpp

#open folder
cd llama.cpp

# Configure with CUDA support
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_FLAGS="-allow-unsupported-compiler" \
  -DLLAMA_BUILD_UI=ON

# Build
cmake --build build --config Release -j$(nproc)

echo "llama.cpp ready"

2. Submit the job and monitor:

# Submit job
sbatch script_build_llama_cpu.sh  # or _gpu.sh

# Monitor logs
tail -f logs/build_%j.out #or logs/build_%j.err 
#press ctrl+C to exit

# Check job status
squeue --me

Note: For MN5, since you so not have internet connection you can build llama.cpp on Deucalion x86 partitions and move the folder to MN5 (Deucalion-->laptop-->MN5) or build it in your own environment and transfer it there (it is advised to use a container environment to avoid problems due to architecture differences).

Prepare conversion environment

Option A: Singularity container

For the following steps, you can build your container inside Deucalion.

1. Create the requirements.txt file

You can build a container with the following requirements.txt:

torch
transformers 
safetensors 
sentencepiece 
protobuf 
numpy
accelerate

2. Create container.def:

Bootstrap: docker
From: python:3.12
Stage: build

%files
    requirements.txt

%post
    # Avoid interactive prompts
    export DEBIAN_FRONTEND=noninteractive
    pip install --upgrade pip
    python -m pip install -r requirements.txt


%environment
    export OMP_NUM_THREADS=1
    export LC_ALL=C

3. Build on a compute node

For this, you can run an interactive session:

#interactive session
srun -A <account_name> --time=04:00:00 --nodes=1  -p dev-x86 --pty bash

#build container
singularity build --fakeroot /tmp/container.sif container.def

#Then copy the container to the wanted folder
mv container.sif wanted_folder

Note: This --fakeroot option requires prior approval. Contact our support or Deucalion support if not enabled.

Option B: Pandora Tool (x86 only)

The alternative is to use the pandora tool available at Deucalion for x86 architectures:

#build container
pandora build pip -n "containername" \
-r "requirements.txt" \
-a "x86" 

Convert you model to .gguf format (quantized version)

EXTRA: Merge LORA weights + Model

If you intend to use your model as it is (downloaded for example from huggingface), you can skip this section.

  1. In case you have fine-tuned your own model, you can use a Python script merge_lora.py like the following one to merge LORA weights with the base model weights.
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
import sys

base_path = "/path/to/your/base/model"
lora_path = "/path/to/your/fine-tuned/model"
output_dir = "output/path/for/your/merged/model"

print("Loading base model...")
base_model = AutoModelForCausalLM.from_pretrained(
        base_path,
        torch_dtype=torch.bfloat16,  # Use bfloat16 for better precision
        device_map="auto",
        low_cpu_mem_usage=True,
        trust_remote_code=True  # Required for Qwen architectures
    )

print("Loading LoRA adapter...")
model = PeftModel.from_pretrained(base_model, lora_path)

print("Merging LoRA with base model...")
merged_model = model.merge_and_unload()

print("Saving merged model...")
merged_model.save_pretrained(output_dir, safe_serialization=True, max_shard_size="4GB")

print("Saving tokenizer...")
AutoTokenizer.from_pretrained(base_path, trust_remote_code=True).save_pretrained(output_dir)

print(f"Merged model saved to: {output_dir}")
  1. To run the previous Python script save first run-merge.sh:
#!/bin/bash
#SBATCH --account=<account_name> 
#SBATCH --partition=<partition_name>        
#SBATCH --nodes=1   
#SBATCH --gpus=1   #if using a100 partitions    
#SBATCH --cpus-per-task=8
#SBATCH --time=00:30:00
#SBATCH --output=logs/merge_%j.out
#SBATCH --error=logs/merge_%j.err

# Point to your script & container
MERGE_SCRIPT="/path/to/merge_lora.py"
CONTAINER="/path/to/container/container_name.sif"

singularity exec \
    --nv \
    --bind /projects:/projects \
    "$CONTAINER" \
    python3 "$MERGE_SCRIPT"
  1. And then, you can run it by typing sbatch run-merge.sh.

Convert the model

Once you have your environment prepared with the container, you can save the following script gguf-conv.sh:

#!/bin/bash
#SBATCH --account=<account>
#SBATCH --partition=<partition> #choose x86 cpu or gpu
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=01:00:00
#SBATCH --output=logs/conv_%j.out
#SBATCH --error=logs/conv_%j.err


CONTAINER="/your/path/to/container/container_name.sif"


export OMP_NUM_THREADS=32

singularity exec \
  --bind /projects:/projects \
  "$CONTAINER" \
  bash -c '
    set -e
    cd /path/to/llama.cpp

    MODEL_PATH="/path/to/your/model" #either a model like QWEN3-8B or a merged model
    OUTPUT_PATH="/output/path/"

    # Step 1: Convert to FP16 GGUF
    python convert_hf_to_gguf.py "$QWEN_PATH" \
      --outfile "$OUTPUT_PATH/qwen8b-f16.gguf" \
      --outtype f16

    # Step 2: Quantize
    ./build_cpu/bin/llama-quantize \
      "$OUTPUT_PATH/qwen8b-f16.gguf" \
      "$OUTPUT_PATH/qwen8b-Q4_K_M.gguf" \
      Q4_K_M
  '

Please be aware that if you choose dev-x86or normal-x86 cpu nodes you choose use your llama.cpp built with cpu and therefore run the command ./build_cpu/bin/llama-quantize. Otherwise, if you wish to use gpu nodes you should run it like this ./build/bin/llama-quantize.

To run this script you type on your terminal:

sbatch gguf-conv.sh

You can keep track of the logs on your terminal like this:

# Monitor logs
tail -f logs/conv_%j.out #or logs/conv_%j.err 
#press ctrl+C to exit

# Check job status
squeue --me

Test the model with llama-server

Option A (only on Deucalion)

On Deucalion, you can open an interactive session on llama.cpp of Open OnDemand platform. This is recommended for an interactive testing.

  1. Login in to Deucalion Open OnDemand
  2. Navigate to Interactive AppsLlama.cpp
  3. Configure:
    • Account
    • Partition (you can choose arm)
    • Number of cores
    • Time: e.g. 1h
    • Model path: /projects/models/gguf/... (absolute path to .gguf file); if you don't have quantized models yet you can use the default path for available models at Deucalion.

And then select lauch.

Option B

You can test llama-server on your terminal.

Open an interactive session:

srun -A <account_name> --time=04:00:00 --nodes=1  -p dev-x86 --pty bash

singularity shell llama-gguf-x86.sif
cd llama.cpp
./build_cpu/bin/llama-server \
  -m /path/to/model/qwen8b-Q4_K_M.gguf \
  -c 4096 \
  -ngl 0 \
  --host 0.0.0.0 \
  --port 8080

In another terminal:

ssh -L 8080:localhost:8080 <username@login.deucalion.macc.fccn.pt> #open an ssh tunnel that connects your local machine to Deucalion
In the same terminal, type:

 ssh -L 8080:localhost:8080 <nodenumber> #to connect to the node ex: gnx504

Then, to use llama.cpp ui and test the model you open your browser at http://localhost:8080.


Last updated: $(01-06-2026)