# Fine Tuning Llama3.1-8b
This document describes a simple process for fine-tuning Llama3.1-8b on MareNostrum 5 (MN5) using LoRA for efficient parameter adaptation.
The goal is to create a model that has knowledge about the Deucalion supercomputer and can answer questions about its utilization. To achieve this, a custom training dataset of questions and answers is used.
This implementation is based on the following libraries:
- Transformers
- PEFT
- TRL
- Datasets
- PyTorch 2.9.0
- CUDA 12.6
## Environment Setup
### Containerization with Singularity
We are going to create a Singularity container with all the necessary libraries and transfer it to MN5.
1. First, clone this repository to your local machine
2. Navigate to the `fine_tuning_llama3.1-8b` folder, which contains the necessary files
3. Build the Singularity container
4. Transfer the content of the `fine_tuning_llama3.1-8b` folder to MN5:

    ```shell
    scp -r /path/to/baif-quickstart/example_files/fine_tuning_llama3.1-8b <username>@transfer1.bsc.es:/gpfs/projects/<project_name>/fine_tuning_llama3.1-8b
    ```

    > **Info:** This will include the container itself (the `.sif` file), as well as the necessary scripts (`.py`, `.sh`) and the training data.
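The build step above relies on a Singularity definition file shipped with the repository, so the exact recipe depends on that file. Purely as an illustration, a definition that installs the libraries listed earlier on top of a CUDA-enabled PyTorch base image could look like the following (the Docker image tag and the unpinned package list are assumptions, not the repository's actual recipe):

```
Bootstrap: docker
From: pytorch/pytorch:2.9.0-cuda12.6-cudnn9-runtime

%post
    # Install the fine-tuning stack on top of the PyTorch/CUDA base image
    pip install transformers peft trl datasets
```

A definition file like this would be built into the `simple_ft.sif` image with `singularity build simple_ft.sif <definition-file>.def` on a machine where you have build privileges.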
## Training Data
The training dataset consists of question-and-answer pairs generated from Deucalion's documentation. It was already uploaded to MN5 in the previous step.
Here is a snippet of the training data:
```json
[...]
{"messages": [{"role": "user", "content": "How do I submit a batch job to Deucalion?"}, {"role": "assistant", "content": "Submit a batch job by creating a batch job script and using the `sbatch` command."}]}
{"messages": [{"role": "user", "content": "How can I check the status of my submitted job on Deucalion?"}, {"role": "assistant", "content": "Use the `squeue` command to view the status of your submitted jobs."}]}
[...]
```
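Each JSONL line is one chat example: a `messages` list with a `user` question followed by an `assistant` answer. If you extend the dataset yourself, a quick stdlib-only format check (this helper is illustrative and not part of the repository) can catch malformed lines before training:

```python
import json

# Two example lines in the same format as the training file (taken from the snippet above)
sample_jsonl = '''\
{"messages": [{"role": "user", "content": "How do I submit a batch job to Deucalion?"}, {"role": "assistant", "content": "Submit a batch job by creating a batch job script and using the `sbatch` command."}]}
{"messages": [{"role": "user", "content": "How can I check the status of my submitted job on Deucalion?"}, {"role": "assistant", "content": "Use the `squeue` command to view the status of your submitted jobs."}]}
'''

def validate_chat_jsonl(text: str) -> int:
    """Parse JSONL chat data, check the user/assistant structure, and return the example count."""
    count = 0
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)  # raises on malformed JSON
        roles = [message["role"] for message in record["messages"]]
        assert roles == ["user", "assistant"], f"unexpected roles: {roles}"
        count += 1
    return count

print(validate_chat_jsonl(sample_jsonl))  # -> 2
```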
## Model
You can request BAIF User Support access to open-weight models such as Qwen, DeepSeek, Llama, Mistral and others.
Once your request is accepted, models will be transferred to your scratch folder (e.g. /gpfs/scratch/epor-aif005/models/llama3.1-8b/).
Alternatively, you can upload your own models to MN5 via scp.
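Before submitting the fine-tuning job, it can be worth verifying that the model directory was transferred completely. The sketch below is a stdlib-only sanity check assuming a typical Hugging Face layout; the expected file names are assumptions and should be adjusted to what your model actually ships:

```python
from pathlib import Path

def check_model_dir(model_dir: str) -> list[str]:
    """Return the names of expected Hugging Face files missing from model_dir."""
    # Typical files in a Hugging Face model directory (adjust to your model)
    expected = ["config.json", "tokenizer_config.json"]
    root = Path(model_dir)
    missing = [name for name in expected if not (root / name).exists()]
    # Weights are usually stored as one or more .safetensors shards
    if not list(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing

# Example usage (replace with your own project's scratch path):
# print(check_model_dir("/gpfs/scratch/<project_name>/models/llama3.1-8b"))
```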
## Fine Tuning
1. SSH to MN5's accelerated partition
2. Change directory to `fine_tuning_llama3.1-8b`, where the files were transferred
3. Edit the bash script `run_ft.sh`, used to define the fine-tuning job:
    - Change the `<project_name>` accordingly (run `bsc_project list` to find your project name).
    - You can use `nano` to edit the file (e.g. `nano run_ft.sh` - use `ctrl + o` to write and `ctrl + x` to exit). Enable `nano` with `module load nano`.

    **Overview of run_ft.sh**

    This is the bash script that is going to be submitted to the SLURM queue. It calls the Python code that runs the fine-tuning (`simple_fine_tuning.py`). The header of the file contains the SLURM job directives.

    ```bash
    #!/bin/bash
    #SBATCH --job-name=simple_ft_llama
    #SBATCH --qos=acc_ehpc
    #SBATCH --nodes=1
    #SBATCH --gpus=1
    #SBATCH --time=00:20:00 # Adjust time limit
    #SBATCH --output=slurm_output.out
    #SBATCH --error=slurm_error.err

    # Define project_name (used to set the correct paths to container and model)
    PROJECT_NAME="<project_name>"

    # Define paths
    CONTAINER_PATH="/gpfs/projects/$PROJECT_NAME/fine_tuning_llama3.1-8b/simple_ft.sif" # Update this to the actual path of your .sif file
    PROJECT_DIR_HOST="./" # assumes this bash script is in the same folder that contains the project files
    PROJECT_DIR_CONTAINER="/app" # Path inside the container where PROJECT_DIR_HOST will be mounted
    MODEL_DIR_HOST="/gpfs/scratch/$PROJECT_NAME/models/llama3.1-8b" # Path to the model directory on the host
    MODEL_DIR_CONTAINER="/app/model" # Where it will be mounted inside container

    singularity exec \
        --nv \
        --pwd $PROJECT_DIR_CONTAINER \
        --bind $PROJECT_DIR_HOST:$PROJECT_DIR_CONTAINER \
        --bind $MODEL_DIR_HOST:$MODEL_DIR_CONTAINER \
        $CONTAINER_PATH \
        python $PROJECT_DIR_CONTAINER/simple_fine_tuning.py
    ```
4. Submit the fine-tuning job to SLURM with `sbatch run_ft.sh`
5. (Optional) Monitor the status of your job with `squeue --me`
6. (Optional) Check for errors and logs in the `slurm_output.out` and `slurm_error.err` files using `cat` (e.g. `cat slurm_error.err`)
7. Check for fine-tuning completion

After successful completion, you should find the following message in the output logs: