Inference performance issue

Hi there,

I’m running inference on Llama-3.1-8B-Instruct using Modular MAX 25.6 on an H100. I compared its performance with vLLM using NVIDIA GenAI-Perf and noticed a 2x difference in speed, which makes me think I might be missing something.

  • Average Time To First Token (TTFT): 40 ms (vLLM ~20 ms)

  • Output Token Throughput: 800 tokens/sec (vLLM ~1324 tokens/sec)

I’m using the same hardware and default settings, providing only the path to the model.

Could you suggest what I might be missing or what could cause this difference? Any ideas would be appreciated.

Thanks for trying MAX!

Here is our benchmark guide, which should be compatible with how vLLM does benchmarking. I don’t recall using NVIDIA GenAI-Perf before, so could you provide the command you’re using to benchmark them?
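
For reference, the guide’s script is derived from vLLM’s benchmark_serving.py, so a run looks roughly like the sketch below. The flag names follow the upstream vLLM script and may differ slightly in our fork, and the lengths and prompt count are only placeholders, so please check the script’s --help before copying it:

# Sketch only: point it at the already-running server (port 8000 is the default for both servers)
python3 benchmark_serving.py \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --host 127.0.0.1 \
    --port 8000 \
    --model /path/to/Meta-Llama-3.1-8B-Instruct \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 256 \
    --num-prompts 500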

Another important difference: the last time I checked, MAX serve’s default settings were different from vLLM’s, so could you verify that on your end? Also, which vLLM version are you using?
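
A quick way to spot default mismatches is to skim the options (and the defaults listed next to them) that each server exposes. This is just a grep over the help output and assumes both the max and vllm CLIs are on your PATH:

# Compare the knobs that most often differ: batch size, max length, KV cache, chunked prefill
max serve --help | grep -iE 'batch|length|cache|chunk'
vllm serve --help | grep -iE 'batch|length|cache|chunk'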

Hi Ehsan,

Thank you for answering! My GenAI-Perf config (from the tritonserver-25.09-py3-sdk image) is:

export INPUT_SEQUENCE_LENGTH=200
export INPUT_SEQUENCE_STD=0
export OUTPUT_SEQUENCE_LENGTH=200
export CONCURRENCY=10
export MODEL=/fs1/shared/model/llm/Meta-Llama-3.1-8B-Instruct/

genai-perf profile \
    -m $MODEL \
    --endpoint-type chat \
    --streaming \
    -u 127.0.0.1:8000 \
    --warmup-request-count 20 \
    --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
    --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
    --concurrency $CONCURRENCY \
    --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
    --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
    --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
    --extra-inputs ignore_eos:true \
    --tokenizer $MODEL \
    --num-requests 500

I used vLLM version v0.10.1.1 with default options:

python3 -m vllm.entrypoints.openai.api_server --model /fs1/shared/model/llm/Llama-3.1-8B-Instruct/

My results on H100:
The first is MAX, the second is vLLM.


Thanks! Let us reproduce it and we’ll update you on this.

This looks interesting. I haven’t tested MAX or vLLM myself, so I can’t give precise comparisons, and these metrics require rigorous testing. Initial conjecture: the throughput numbers look interesting. Awesome benchmarking!

Yesterday, I had a great opportunity to discuss this issue with NVIDIA experts at the GTC conference. I captured a profile using Nsight Systems, which they reviewed, but no significant issues were identified at first glance. Next, I’ll capture a profile for vLLM so we can compare the results side by side.

By the way, how can I verify if MAX is using FP16 rather than FP32? This could explain the twofold difference in performance.

We were able to reproduce the results. MAX is respecting the dtype, which is bfloat16. The GenAI-Perf tool is new to us and we haven’t investigated it before, so thank you for bringing it up.
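
If you want to double-check the dtype on your side, you can read what the checkpoint’s config.json declares, which is what gets picked up when no dtype is passed explicitly. A minimal sketch, using the model path from your GenAI-Perf config:

# Prints the dtype recorded in the HF checkpoint; for Llama-3.1-8B-Instruct this is "bfloat16"
python3 -c "import json; print(json.load(open('/fs1/shared/model/llm/Meta-Llama-3.1-8B-Instruct/config.json'))['torch_dtype'])"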

One thing we noticed is that in MAX, --enable-prefix-caching helps, but there’s still a gap.
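
If you want to try that flag on your side, it’s just an extra argument on the serve command. A sketch, assuming the --model-path form of the launch command (adjust to however you start MAX):

# Same model as before, with prefix caching turned on
max serve \
    --model-path=/fs1/shared/model/llm/Meta-Llama-3.1-8B-Instruct/ \
    --enable-prefix-caching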

Our internal benchmarking is based on the vLLM benchmarking tool, and we test on more realistic datasets like ShareGPT and arXiv summarization.

I’ve made an internal ticket and we’ll do further investigation on this.
