
Fine Tuning Llama3.1-8b

This document describes a simple process for fine tuning Llama3.1-8b on MareNostrum 5 (MN5), using LoRA for efficient parameter adaptation.

The goal is to create a model that has knowledge about the Deucalion supercomputer and can answer questions about its usage. To achieve this, a custom training dataset of question and answer pairs is used.
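LoRA keeps the base model's weights frozen and trains only small low-rank adapter matrices, which is why an 8B model can be adapted on a single GPU. As a back-of-the-envelope illustration, the trainable-parameter count can be estimated as follows. The rank and target modules here are illustrative assumptions, not the settings used by simple_fine_tuning.py; the layer shapes are those of Llama3.1-8b (hidden size 4096, 32 layers, grouped-query attention with 8 KV heads).

```python
# Rough estimate of LoRA trainable parameters vs. full fine tuning.
# Rank r=16 and the set of target modules are illustrative assumptions.

def lora_params(shapes, r):
    """Each adapted weight W (d_out x d_in) gains A (r x d_in) and B (d_out x r)."""
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

# (d_out, d_in) per layer: q_proj, k_proj, v_proj, o_proj in Llama3.1-8b
attn_shapes = [(4096, 4096), (1024, 4096), (1024, 4096), (4096, 4096)]
n_layers = 32
trainable = n_layers * lora_params(attn_shapes, r=16)

print(f"LoRA trainable params: {trainable / 1e6:.1f}M")
```

With these assumptions this comes to roughly 13.6M trainable parameters, well under 1% of the 8B base model — hence the single-GPU job configured later in this guide.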

This implementation is based on the following libraries:

  • Transformers
  • PEFT
  • TRL
  • Datasets
  • PyTorch 2.9.0
  • CUDA 12.6

Environment Setup

Containerization with Singularity

We are going to create a Singularity container with all the necessary libraries and transfer it to MN5.

  • First, start by cloning this repository to your local machine

    git clone https://gitlab.acnca.pt/cnca/aif-pt/baif-quickstart.git
    
  • Navigate to the fine_tuning_llama3.1-8b folder, which contains the necessary files

    cd baif-quickstart/example_files/fine_tuning_llama3.1-8b
    
  • Build the Singularity container

    singularity build --fakeroot simple_ft.sif singularity.def
    
  • Transfer the content of the fine_tuning_llama3.1-8b folder to MN5

    scp -r /path/to/baif-quickstart/example_files/fine_tuning_llama3.1-8b <username>@transfer1.bsc.es:/gpfs/projects/<project_name>/fine_tuning_llama3.1-8b
    

    Info

    This will include the container itself (.sif file), as well as the necessary scripts (.py, .sh) and the training data.
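For reference, a definition file for this stack could look like the sketch below. This is only an assumption about the general shape of such a file, not the singularity.def shipped in the repository, and the base-image tag is illustrative — pick one matching the PyTorch and CUDA versions listed in the introduction.

```
Bootstrap: docker
# Illustrative tag; choose an image matching PyTorch 2.9.0 / CUDA 12.6
From: pytorch/pytorch:2.9.0-cuda12.6-cudnn9-runtime

%post
    # Libraries listed in the introduction; versions left unpinned here
    pip install --no-cache-dir transformers peft trl datasets
```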

Training Data

The training dataset consists of question and answer pairs generated from Deucalion's documentation. It was uploaded to MN5 in the previous step.

Here is a snippet of the training data:

data.jsonl
[...]
{"messages": [{"role": "user", "content": "How do I submit a batch job to Deucalion?"}, {"role": "assistant", "content": "Submit a batch job by creating a batch job script and using the `sbatch` command."}]}
{"messages": [{"role": "user", "content": "How can I check the status of my submitted job on Deucalion?"}, {"role": "assistant", "content": "Use the `squeue` command to view the status of your submitted jobs."}]}
[...]
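Each line of data.jsonl must be a standalone JSON object in the chat format shown above; a single malformed line will make the training job fail at data-loading time. A stdlib-only validation sketch (the schema is inferred from the snippet above) can catch such lines before submitting a job:

```python
import json

def invalid_lines(lines):
    """Return indices of lines that don't match the expected chat schema."""
    bad = []
    for i, line in enumerate(lines):
        try:
            msgs = json.loads(line)["messages"]
            # "messages" must be a non-empty list of role/content dicts
            assert isinstance(msgs, list) and msgs
            for m in msgs:
                assert m["role"] in {"system", "user", "assistant"}
                assert isinstance(m["content"], str)
        except (AssertionError, KeyError, TypeError, json.JSONDecodeError):
            bad.append(i)
    return bad

sample = [
    '{"messages": [{"role": "user", "content": "How do I submit a batch job?"}]}',
    '{"messages": []}',
    'not even JSON',
]
print(invalid_lines(sample))  # [1, 2]
```

In practice you would pass the lines of data.jsonl (e.g. `invalid_lines(open("data.jsonl"))`) and fix any reported line numbers before launching the job.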

Model

You can request BAIF User Support access to open-weight models such as Qwen, DeepSeek, Llama, Mistral, and others. Once your request is accepted, the models will be transferred to your scratch folder (e.g. /gpfs/scratch/epor-aif005/models/llama3.1-8b/).

Alternatively, you can upload your own models to MN5 via scp.

Fine Tuning

  • SSH to MN5's accelerated partition

    ssh <username>@alogin1.bsc.es
    
  • Change directory to fine_tuning_llama3.1-8b, where the files were transferred

    cd /gpfs/projects/<project_name>/fine_tuning_llama3.1-8b
    
  • Edit the bash script run_ft.sh used to define the fine tuning job:

    Change the <project_name> accordingly (run bsc_project list to find your project name).

    You can use nano to edit the file (e.g. nano run_ft.sh - use ctrl + o to write and ctrl + x to exit). Enable nano with module load nano.

    Overview of run_ft.sh

    This is the bash script that is going to be submitted to the SLURM queue. It calls the python code that runs the fine tuning (simple_fine_tuning.py).

    The header of the file contains the SLURM job directives.

    run_ft.sh
    #!/bin/bash
    #SBATCH --job-name=simple_ft_llama
    #SBATCH --qos=acc_ehpc
    #SBATCH --nodes=1
    #SBATCH --gpus=1
    #SBATCH --time=00:20:00   # Adjust time limit
    #SBATCH --output=slurm_output.out
    #SBATCH --error=slurm_error.err
    
    # Define project_name (used to set the correct paths to container and model)
    PROJECT_NAME="<project_name>"
    
    # Define paths
    CONTAINER_PATH="/gpfs/projects/$PROJECT_NAME/fine_tuning_llama3.1-8b/simple_ft.sif" # Update this to the actual path of your .sif file
    
    PROJECT_DIR_HOST="./" # assumes this bash script is in the same folder that contains the project files
    PROJECT_DIR_CONTAINER="/app" # Path inside the container where PROJECT_DIR_HOST will be mounted
    
    
    MODEL_DIR_HOST="/gpfs/scratch/$PROJECT_NAME/models/llama3.1-8b" # Path to the model directory on the host
    MODEL_DIR_CONTAINER="/app/model" # Where it will be mounted inside container
    
    
    singularity exec \
        --nv \
        --pwd $PROJECT_DIR_CONTAINER \
        --bind $PROJECT_DIR_HOST:$PROJECT_DIR_CONTAINER \
        --bind $MODEL_DIR_HOST:$MODEL_DIR_CONTAINER \
        $CONTAINER_PATH \
        python $PROJECT_DIR_CONTAINER/simple_fine_tuning.py
    
  • Submit the fine tuning job to SLURM

    sbatch -A <project_name> run_ft.sh
    
  • (Optional) Monitor the status of your job with squeue --me

  • (Optional) Check for errors and logs in the slurm_output.out and slurm_error.err files using cat (e.g. cat slurm_error.err)

  • Check for the fine tuning completion

    cat slurm_output.out
    

    After successful completion, you should find the following message in the output logs:

    ==================================
    Fine Tuning Completed Successfully
    ==================================
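The script simple_fine_tuning.py itself ships with the repository and is not reproduced in this document. Purely as orientation, a minimal LoRA fine-tuning script built on the libraries listed in the introduction could look like the sketch below. Every hyperparameter is an assumption, the paths merely match the container mounts defined in run_ft.sh, and the exact SFTTrainer arguments vary slightly between TRL versions — this is not the repository's actual code.

```python
# Hypothetical sketch of a LoRA fine-tuning script; NOT the actual
# simple_fine_tuning.py from the repository. Requires a GPU node and
# the container environment built earlier in this guide.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

MODEL_DIR = "/app/model"       # mounted by run_ft.sh
DATA_PATH = "/app/data.jsonl"  # the training data shown above

model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

dataset = load_dataset("json", data_files=DATA_PATH, split="train")

peft_config = LoraConfig(  # assumed hyperparameters, for illustration
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,  # TRL applies the chat template to "messages"
    peft_config=peft_config,
    args=SFTConfig(output_dir="/app/output", num_train_epochs=3),
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model("/app/output")

print("=" * 34)
print("Fine Tuning Completed Successfully")
print("=" * 34)
```

Recent TRL versions recognize the conversational `messages` format used in data.jsonl and apply the model's chat template automatically, so no manual prompt formatting is needed for a dataset shaped like the one above.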
    

Compare Original vs Fine Tuned Models