Inference with gpt-oss-120b & vLLM
This document describes a simple process for setting up gpt-oss-120b inference on Mare Nostrum 5 (MN5) using the vLLM engine and a Singularity container.
Environment Setup
vLLM Container
- On your local machine, build the Singularity container based on the official vLLM image available on Docker Hub.
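  A minimal sketch of the build step (the image tag is an assumption; use any `vllm/vllm-openai` tag recent enough to support gpt-oss):

    # May require sudo or --fakeroot depending on your local Singularity/Apptainer setup
    singularity build vllm.sif docker://vllm/vllm-openai:latest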
- Send the container to MN5 via `scp`.
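  For example (the transfer hostname and destination path are placeholders; use whatever MN5 transfer node and folder apply to your project):

    scp vllm.sif <user>@<mn5_transfer_node>:/home/<project_folder>/<user>/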
- Copy the OpenAI Harmony vocab file.
  Info: This step is needed to prevent the `openai_harmony.HarmonyError: error downloading or loading vocab file: failed to download or load vocab` error.
  - Download the vocab file available in the openai_harmony_vocab folder to your local machine.
  - Transfer the file to MN5.
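  For example (a sketch; the file name and transfer host are placeholders, and the destination should match the vocab directory that is bound into the container later on):

    # Copy the downloaded vocab file into the directory that will be bound as /vocab
    scp <vocab_file> <user>@<mn5_transfer_node>:/home/<project_folder>/<user>/vocab/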
Model
You can request access to open-weights models such as Qwen, DeepSeek, Llama, Mistral, and GPT-OSS through BAIF User Support.
Once your request is accepted, models will be transferred to your scratch folder (e.g. /gpfs/scratch/epor-aif005/models/gpt-oss-120b/).
Alternatively, you can upload your own models to MN5 via scp.
Serving the model
According to the documentation, the gpt-oss-120b model can run on a single 80 GB GPU (such as an NVIDIA H100). However, the H100s available on MN5 have 64 GB of VRAM each, so we are going to use 2 GPUs to serve the model.
- SSH to MN5's accelerated partition.
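  For example (the login hostname is an assumption; check the MN5 user guide for the accelerated-partition login nodes):

    ssh <user>@alogin1.bsc.es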
- Allocate GPUs on MN5.
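  A sketch of an allocation command matching the options described in the Info note below:

    # On some accounts acc_ehpc is requested with -p (partition) instead of -q (QOS); adjust as needed
    salloc -A <project_name> -q acc_ehpc -n 1 -c 40 --gres=gpu:2 -t 01:00:00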
  Info: This command allocates an interactive session on a single node (`-n 1`) with 40 threads per process (`-c 40`). The session is allocated on a GPU partition (`acc_ehpc`), using 2 GPUs (`--gres=gpu:2`), and is configured to last 1 hour (`-t 01:00:00`). You can find your `<project_name>` by running the command `bsc_project list` on MN5.
- Run the vLLM container
    module load singularity
    singularity shell --nv \
        --bind /gpfs/scratch/epor-aif005/models/gpt-oss-120b/:/gpt-oss-120b \
        --bind /home/<project_folder>/<user>/vocab:/vocab \
        --env TIKTOKEN_RS_CACHE_DIR=/vocab \
        vllm.sif

  Info: This binds the model directory on MN5 to the /gpt-oss-120b directory in the container. Similarly, it binds MN5's vocab directory to /vocab in the container.
- Serve the model
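  Inside the container shell, one way to start the server (the served model name is an arbitrary choice; the flags are standard vLLM options):

    # Serve the model from the bound directory across the 2 allocated GPUs; vLLM listens on port 8000 by default
    vllm serve /gpt-oss-120b --tensor-parallel-size 2 --served-model-name gpt-oss-120b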
If you followed the steps correctly, you should see the message "Application startup complete", indicating that the model is being served successfully.
Inference
You can make requests to the model using the hostname of the GPU node that is serving it, as in:
- Chat completion
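  For example, with curl against vLLM's OpenAI-compatible API (replace <gpu_node_hostname> with the node running the server; the model name must match the one used when serving, and port 8000 assumes the default):

    curl http://<gpu_node_hostname>:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "gpt-oss-120b",
              "messages": [{"role": "user", "content": "Explain MareNostrum 5 in one sentence."}]
            }'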
- Simple completion
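  Similarly, a sketch for the completions endpoint (same placeholder hostname and model name as above):

    curl http://<gpu_node_hostname>:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "gpt-oss-120b",
              "prompt": "MareNostrum 5 is",
              "max_tokens": 50
            }'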
References
For more information, please visit vLLM's official gpt-oss usage guide.