# Fine Tuning Llama3.1-8b
This document describes a simple process for fine-tuning Llama3.1-8b on MareNostrum 5 (MN5) using LoRA for efficient parameter adaptation.
The goal is to create a model that has knowledge about the Deucalion supercomputer and can answer questions about its utilization. To achieve this, a custom training dataset of questions and answers is used.
This implementation is based on the following libraries:
- Transformers
- PEFT
- TRL
- Datasets
- PyTorch 2.9.0
- CUDA 12.6
## Environment Setup
### Containerization with Singularity
We are going to create a Singularity container with all the necessary libraries and transfer it to MN5.
1. First, clone this repository to your local machine
2. Navigate to the `fine_tuning_llama3.1-8b` folder, which contains the necessary files
3. Build the Singularity container
4. Transfer the content of the `fine_tuning_llama3.1-8b` folder to MN5:

    ```shell
    scp -r /path/to/baif-quickstart/example_files/fine_tuning_llama3.1-8b <username>@transfer1.bsc.es:/gpfs/projects/<project_name>/fine_tuning_llama3.1-8b
    ```

    > **Info:** This will include the container itself (the `.sif` file), as well as the necessary scripts (`.py`, `.sh`) and the training data.
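The build step above relies on a Singularity definition file shipped with the repository, so the exact recipe depends on that file. Purely as an illustration, a definition that installs the libraries listed earlier on top of a CUDA-enabled PyTorch base image could look like the following (the Docker image tag and the unpinned package list are assumptions, not the repository's actual recipe):

```
Bootstrap: docker
From: pytorch/pytorch:2.9.0-cuda12.6-cudnn9-runtime

%post
    # Install the fine-tuning stack on top of the PyTorch/CUDA base image
    pip install transformers peft trl datasets
```

A definition file like this would be built into the `simple_ft.sif` image with `singularity build simple_ft.sif <definition-file>.def` on a machine where you have build privileges.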
## Training Data
The training dataset consists of question-and-answer pairs generated from Deucalion's documentation. It was already uploaded to MN5 in the previous step.
Here is a snippet of the training data:
```json
[...]
{"messages": [{"role": "user", "content": "How do I submit a batch job to Deucalion?"}, {"role": "assistant", "content": "Submit a batch job by creating a batch job script and using the `sbatch` command."}]}
{"messages": [{"role": "user", "content": "How can I check the status of my submitted job on Deucalion?"}, {"role": "assistant", "content": "Use the `squeue` command to view the status of your submitted jobs."}]}
[...]
```
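Each JSONL line is one chat example: a `messages` list with a `user` question followed by an `assistant` answer. If you extend the dataset yourself, a quick stdlib-only format check (this helper is illustrative and not part of the repository) can catch malformed lines before training:

```python
import json

# Two example lines in the same format as the training file (taken from the snippet above)
sample_jsonl = '''\
{"messages": [{"role": "user", "content": "How do I submit a batch job to Deucalion?"}, {"role": "assistant", "content": "Submit a batch job by creating a batch job script and using the `sbatch` command."}]}
{"messages": [{"role": "user", "content": "How can I check the status of my submitted job on Deucalion?"}, {"role": "assistant", "content": "Use the `squeue` command to view the status of your submitted jobs."}]}
'''

def validate_chat_jsonl(text: str) -> int:
    """Parse JSONL chat data, check the user/assistant structure, and return the example count."""
    count = 0
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)  # raises on malformed JSON
        roles = [message["role"] for message in record["messages"]]
        assert roles == ["user", "assistant"], f"unexpected roles: {roles}"
        count += 1
    return count

print(validate_chat_jsonl(sample_jsonl))  # -> 2
```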
## Model
You can request BAIF User Support access to open-weight models such as Qwen, DeepSeek, Llama, Mistral and others.
Once your request is accepted, models will be transferred to your scratch folder (e.g. /gpfs/scratch/epor-aif005/models/llama3.1-8b/).
Alternatively, you can upload your own models to MN5 via scp.
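Before submitting the fine-tuning job, it can be worth verifying that the model directory was transferred completely. The sketch below is a stdlib-only sanity check assuming a typical Hugging Face layout; the expected file names are assumptions and should be adjusted to what your model actually ships:

```python
from pathlib import Path

def check_model_dir(model_dir: str) -> list[str]:
    """Return the names of expected Hugging Face files missing from model_dir."""
    # Typical files in a Hugging Face model directory (adjust to your model)
    expected = ["config.json", "tokenizer_config.json"]
    root = Path(model_dir)
    missing = [name for name in expected if not (root / name).exists()]
    # Weights are usually stored as one or more .safetensors shards
    if not list(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing

# Example usage (replace with your own project's scratch path):
# print(check_model_dir("/gpfs/scratch/<project_name>/models/llama3.1-8b"))
```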
## Fine Tuning
1. SSH to MN5's accelerated partition
2. Change directory to `fine_tuning_llama3.1-8b`, where the files were transferred
3. Edit the bash script `run_ft.sh`, used to define the fine-tuning job:
    - Change the `<project_name>` accordingly (run `bsc_project list` to find your project name).
    - You can use `nano` to edit the file (e.g. `nano run_ft.sh` - use `ctrl + o` to write and `ctrl + x` to exit). Enable `nano` with `module load nano`.

    **Overview of run_ft.sh**

    This is the bash script that is going to be submitted to the SLURM queue. It calls the Python code that runs the fine-tuning (`simple_fine_tuning.py`). The header of the file contains the SLURM job directives.

    ```bash
    #!/bin/bash
    #SBATCH --job-name=simple_ft_llama
    #SBATCH --qos=acc_ehpc
    #SBATCH --nodes=1
    #SBATCH --gpus=1
    #SBATCH --time=00:20:00 # Adjust time limit
    #SBATCH --output=slurm_output.out
    #SBATCH --error=slurm_error.err

    # Define project_name (used to set the correct paths to container and model)
    PROJECT_NAME="<project_name>"

    # Define paths
    CONTAINER_PATH="/gpfs/projects/$PROJECT_NAME/fine_tuning_llama3.1-8b/simple_ft.sif" # Update this to the actual path of your .sif file
    PROJECT_DIR_HOST="./" # assumes this bash script is in the same folder that contains the project files
    PROJECT_DIR_CONTAINER="/app" # Path inside the container where PROJECT_DIR_HOST will be mounted
    MODEL_DIR_HOST="/gpfs/scratch/$PROJECT_NAME/models/llama3.1-8b" # Path to the model directory on the host
    MODEL_DIR_CONTAINER="/app/model" # Where it will be mounted inside container

    singularity exec \
        --nv \
        --pwd $PROJECT_DIR_CONTAINER \
        --bind $PROJECT_DIR_HOST:$PROJECT_DIR_CONTAINER \
        --bind $MODEL_DIR_HOST:$MODEL_DIR_CONTAINER \
        $CONTAINER_PATH \
        python $PROJECT_DIR_CONTAINER/simple_fine_tuning.py
    ```
4. Submit the fine-tuning job to SLURM with `sbatch run_ft.sh`
5. (Optional) Monitor the status of your job with `squeue --me`
6. (Optional) Check for errors and logs in the `slurm_output.out` and `slurm_error.err` files using `cat` (e.g. `cat slurm_error.err`)
7. Check for fine-tuning completion

After successful completion, you should find the following message in the output logs: