
Inference with gpt-oss-120b & vLLM

This document describes a simple process for setting up gpt-oss-120b inference on Mare Nostrum 5 using the vLLM engine and a Singularity container.

Environment Setup

vLLM Container

  • On your local machine, build the Singularity container from the official vLLM image available on Docker Hub (a quick sanity check of the built image is sketched after this list).

    singularity build --fakeroot vllm.sif docker://vllm/vllm-openai
    

  • Send the container to MN5 via scp

    scp vllm.sif <user>@transfer1.bsc.es:/home/<project_folder>/<user>
    

  • Copy the openai-harmony vocab file

    Info

    This step prevents the following error: openai_harmony.HarmonyError: error downloading or loading vocab file: failed to download or load vocab

    • Download the vocab file from the openai_harmony_vocab folder to your local machine

    • Transfer the file to MN5

      scp fb374d419588a4632f3f557e76b4b70aebbca790 <user>@transfer1.bsc.es:/home/<project_folder>/<user>/vocab
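
As a quick sanity check of the built image (referenced in the build step above), you can confirm that vLLM is importable inside the container. This sketch assumes python3 and the vllm package are on the image's default path, as they are in the official vllm/vllm-openai image:

    # Should print the vLLM version bundled in the image
    singularity exec vllm.sif python3 -c "import vllm; print(vllm.__version__)"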
      

Model

You can request access from BAIF User Support to open-weights models such as Qwen, DeepSeek, Llama, Mistral, and GPT-OSS. Once your request is accepted, the models will be transferred to your scratch folder (e.g. /gpfs/scratch/epor-aif005/models/gpt-oss-120b/).

Alternatively, you can upload your own models to MN5 via scp.
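
If you upload your own model, copy the entire model directory (weights, tokenizer, and config files), not just the weight shards. A minimal sketch, with hypothetical local and remote paths:

    # -r copies the directory recursively; adjust the paths to your setup
    scp -r ./my-model <user>@transfer1.bsc.es:/gpfs/scratch/<project_folder>/models/my-model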

Serving the model

According to the documentation, the gpt-oss-120b model can run on a single 80 GB GPU (such as an NVIDIA H100). However, the H100s available on MN5 have 64 GB of VRAM each, so we will use two GPUs to serve the model.
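
Once you have a GPU allocation (see the steps below), you can confirm the per-GPU memory with nvidia-smi, assuming it is available on the compute node:

    # Prints one line per visible GPU; expect roughly 64 GB (~65000 MiB) per H100 on MN5
    nvidia-smi --query-gpu=name,memory.total --format=csv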

  • SSH to MN5's accelerated partition

    ssh <user>@alogin1.bsc.es
    

  • Allocate GPUs on MN5

    salloc -A <project_name> -n 1 -c 40 -t 01:00:00 -q acc_ehpc --gres=gpu:2
    

    Info

    This command allocates an interactive session on a single node (-n 1) with 40 CPUs per task (-c 40). The session runs on the accelerated partition with the acc_ehpc QoS (-q acc_ehpc), uses two GPUs (--gres=gpu:2), and is limited to one hour (-t 01:00:00).

    You can find your <project_name> by running the command bsc_project list on MN5.

  • Run the vLLM container

    module load singularity
    
    singularity shell --nv \
        --bind /gpfs/scratch/epor-aif005/models/gpt-oss-120b/:/gpt-oss-120b \
        --bind /home/<project_folder>/<user>/vocab:/vocab \
        --env TIKTOKEN_RS_CACHE_DIR=/vocab \
        vllm.sif
    

    Info

    This binds the model directory on MN5 to the /gpt-oss-120b directory in the container and, similarly, MN5's vocab directory to /vocab in the container. Setting TIKTOKEN_RS_CACHE_DIR to /vocab points the tokenizer cache at the pre-downloaded vocab file, avoiding the download error mentioned above.

  • Serve the model

    vllm serve /gpt-oss-120b \
        --host 0.0.0.0 \
        --port 8000 \
        --tensor-parallel-size 2 \
        --max-num-seqs 128 \
        --gpu-memory-utilization 0.75
    

If you followed the steps correctly, you should see the message Application startup complete, indicating that the model is being served successfully.
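
Before sending requests, note the hostname of the compute node running the server. A minimal sketch, assuming a standard Slurm interactive session (run outside the container); vLLM's OpenAI-compatible server also exposes a /health endpoint that returns HTTP 200 once it is ready:

    # In the salloc session, print the allocated node's hostname
    echo $SLURM_NODELIST

    # From a login node, verify that the server is up
    curl http://<NODE_HOSTNAME>:8000/health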

Inference

You can make requests to the model using the hostname of the GPU node that is serving it, as in:

  • Chat completion

    curl http://<NODE_HOSTNAME>:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is the capital of Portugal?"}
            ]
        }'
    

  • Simple completion

    curl http://<NODE_HOSTNAME>:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
        }'
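
Note that vLLM registers the model under the name passed to vllm serve (here, the path /gpt-oss-120b), which is the value used in the "model" field above. You can confirm the served model name by querying the server:

    curl http://<NODE_HOSTNAME>:8000/v1/models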
    

References

For more information, please visit vLLM's official gpt-oss usage guide.