Inference performance issue

Hi there,

I’m running inference on Llama-3.1-8B-Instruct using Modular MAX 25.6 on an H100. I compared its performance with vLLM using NVIDIA GenAI-Perf and noticed a 2x difference in speed, which makes me think I might be missing something.

  • Average Time To First Token (TTFT): 40 ms (vLLM ~20 ms)

  • Output Token Throughput: 800 tokens/sec (vLLM ~1324 tokens/sec)

I’m using the same hardware and default settings, providing only the path to the model.

Could you suggest what I might be missing or what could cause this difference? Any ideas would be appreciated.

Thanks for trying MAX!

Here is our benchmark guide, which should be compatible with how vLLM does benchmarking. I don’t recall using NVIDIA GenAI-Perf before, so could you provide the command you’re using to benchmark them?
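
For reference, the guide’s script is derived from vLLM’s benchmark_serving.py, so a run looks roughly like the sketch below. The flag names follow the upstream vLLM script and may differ slightly in our fork, and the lengths and prompt count are only placeholders, so please check the script’s --help before copying it:

# Sketch only: point it at the already-running server (port 8000 is the default for both servers)
python3 benchmark_serving.py \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --host 127.0.0.1 \
    --port 8000 \
    --model /path/to/Meta-Llama-3.1-8B-Instruct \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 256 \
    --num-prompts 500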

Another important difference: the last time I checked, MAX serve’s default settings were different from vLLM’s, so could you verify that on your end? Also, which vLLM version are you using?
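
A quick way to spot default mismatches is to skim the options (and the defaults listed next to them) that each server exposes. This is just a grep over the help output and assumes both the max and vllm CLIs are on your PATH:

# Compare the knobs that most often differ: batch size, max length, KV cache, chunked prefill
max serve --help | grep -iE 'batch|length|cache|chunk'
vllm serve --help | grep -iE 'batch|length|cache|chunk'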

Hi Ehsan,

Thank you for answering! My GenAI-Perf config (from the tritonserver-25.09-py3-sdk image) is:

export INPUT_SEQUENCE_LENGTH=200
export INPUT_SEQUENCE_STD=0
export OUTPUT_SEQUENCE_LENGTH=200
export CONCURRENCY=10
export MODEL=/fs1/shared/model/llm/Meta-Llama-3.1-8B-Instruct/

genai-perf profile \
    -m $MODEL \
    --endpoint-type chat \
    --streaming \
    -u 127.0.0.1:8000 \
    --warmup-request-count 20 \
    --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
    --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
    --concurrency $CONCURRENCY \
    --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
    --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
    --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
    --extra-inputs ignore_eos:true \
    --tokenizer $MODEL \
    --num-requests 500

I used vLLM version v0.10.1.1 with default options:

python3 -m vllm.entrypoints.openai.api_server --model /fs1/shared/model/llm/Llama-3.1-8B-Instruct/

My results on H100:
The first is MAX, the second is vLLM.


Thanks! Let us reproduce it and we’ll update you on this.

This looks interesting. I haven’t tested MAX or vLLM myself, so I can’t give precise comparisons, and these metrics require rigorous testing. Initial conjecture: the throughput numbers look interesting. Awesome benchmarking!

Yesterday, I had a great opportunity to discuss this issue with NVIDIA experts at the GTC conference. I captured a profile using Nsight Systems, which they reviewed, but no significant issues were identified at first glance. Next, I’ll capture a profile for vLLM so we can compare the results side by side.

By the way, how can I verify if MAX is using FP16 rather than FP32? This could explain the twofold difference in performance.

We were able to reproduce the results. MAX is respecting the dtype, which is bfloat16. The GenAI-Perf tool is new to us and we haven’t investigated it before, so thank you for bringing it up.
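
If you want to double-check the dtype on your side, you can read what the checkpoint’s config.json declares, which is what gets picked up when no dtype is passed explicitly. A minimal sketch, using the model path from your GenAI-Perf config:

# Prints the dtype recorded in the HF checkpoint; for Llama-3.1-8B-Instruct this is "bfloat16"
python3 -c "import json; print(json.load(open('/fs1/shared/model/llm/Meta-Llama-3.1-8B-Instruct/config.json'))['torch_dtype'])"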

One thing we noticed is that in MAX, --enable-prefix-caching helps, but there’s still a gap.
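
If you want to try that flag on your side, it’s just an extra argument on the serve command. A sketch, assuming the --model-path form of the launch command (adjust to however you start MAX):

# Same model as before, with prefix caching turned on
max serve \
    --model-path=/fs1/shared/model/llm/Meta-Llama-3.1-8B-Instruct/ \
    --enable-prefix-caching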

Our internal benchmarking is based on the vLLM benchmarking tool, and we test on more realistic datasets like ShareGPT and arXiv summarization.

I’ve made an internal ticket and we’ll do further investigation on this.
