I'm running inference on Llama-3.1-8B-Instruct using Modular MAX 25.6 on an H100. I compared its performance with vLLM using NVIDIA GenAI-Perf and saw roughly a 2x speed difference in vLLM's favor, which makes me think I might be missing something.
Average Time To First Token (TTFT): 40 ms with MAX (vs. ~20 ms with vLLM)
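For anyone who wants to reproduce this without GenAI-Perf, here's roughly how I cross-check TTFT by hand against either server's OpenAI-compatible endpoint; the base URL, port, prompt, and model name below are just placeholders for my setup:

```python
import time
from openai import OpenAI

# Both MAX and vLLM expose an OpenAI-compatible API; point base_url at whichever
# server you're measuring. URL, port, and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def average_ttft_ms(prompt: str, n_runs: int = 10) -> float:
    """Average time (ms) from sending the request to the first streamed token."""
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        stream = client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            max_tokens=64,
        )
        for chunk in stream:
            # The first chunk carrying actual text marks time-to-first-token.
            if chunk.choices and chunk.choices[0].delta.content:
                samples.append((time.perf_counter() - start) * 1000)
                break
    return sum(samples) / len(samples)

print(f"average TTFT: {average_ttft_ms('Explain KV caching in one paragraph.'):.1f} ms")
```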
Here is our benchmarking guide, which should be compatible with how vLLM does benchmarking. I don't recall us using NVIDIA GenAI-Perf before, so could you share the command you're using to benchmark the two servers?
Another important difference: the last time I checked, MAX serve's default settings were different from vLLM's, so could you verify the serving settings on your end? Also, which vLLM version are you using?
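For reference, something along these lines is the kind of GenAI-Perf run I'd expect to see. I've wrapped it in Python only so it's easy to repeat per server; the flag names are from memory and vary between genai-perf releases, so treat them as assumptions and double-check against `genai-perf profile --help`:

```python
import subprocess

def run_genai_perf(url: str, concurrency: int) -> None:
    """Drive one GenAI-Perf profiling run against an OpenAI-compatible server.

    Flag names below are assumptions from memory; verify them for your
    genai-perf version before relying on the numbers.
    """
    subprocess.run(
        [
            "genai-perf", "profile",
            "-m", "meta-llama/Llama-3.1-8B-Instruct",
            "--endpoint-type", "chat",               # /v1/chat/completions
            "--streaming",                           # needed for TTFT / inter-token latency
            "--url", url,
            "--concurrency", str(concurrency),
            "--synthetic-input-tokens-mean", "512",  # keep the workload identical per server
            "--output-tokens-mean", "128",
        ],
        check=True,
    )

# Same synthetic workload against both servers, e.g. MAX on :8000 and vLLM on :8001.
run_genai_perf("http://localhost:8000", concurrency=8)
run_genai_perf("http://localhost:8001", concurrency=8)
```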
This looks interesting. I haven't tested MAX or vLLM myself, so I can't offer a precise comparison, and metrics like these require rigorous testing. My initial take is that the throughput numbers deserve a closer look. Awesome benchmarking work!
Yesterday, I had a great opportunity to discuss this issue with NVIDIA experts at the GTC conference. I captured a profile using Nsight Systems, which they reviewed, but no significant issues were identified at first glance. Next, I'll capture a profile for vLLM so we can compare the results side by side.
By the way, how can I verify if MAX is using FP16 rather than FP32? This could explain the twofold difference in performance.
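In the meantime, the only client-side sanity checks I know of are the checkpoint's declared dtype and the rough weight footprint on the GPU (an 8B model is about 16 GB in bf16/fp16 versus about 32 GB in fp32). Here's a small sketch, assuming the gated `meta-llama/Llama-3.1-8B-Instruct` checkpoint and a single GPU; the memory number is only a coarse signal because both servers also pre-allocate KV-cache space:

```python
from transformers import AutoConfig
import pynvml  # pip install nvidia-ml-py

# 1) What dtype does the checkpoint itself declare?
cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print("checkpoint torch_dtype:", cfg.torch_dtype)  # expect bfloat16

# 2) Rough memory check while the server is running: ~16 GB of weights in
#    bf16/fp16 vs ~32 GB in fp32 for an 8B model. KV-cache pre-allocation
#    inflates this, so treat it as a coarse signal only.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU memory in use: {mem.used / 1024**3:.1f} GiB")
pynvml.nvmlShutdown()
```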
We were able to reproduce the results. MAX is respecting the model's dtype, which is bfloat16. The genai-perf tool is new to us and we hadn't investigated it before, so thank you for bringing it up.
One thing we noticed is that in MAX, --enable-prefix-caching helps, but there's still a gap.
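If you want to see the prefix-cache effect from the client side, a quick (admittedly crude) check is to send the same long prompt twice and compare TTFT; with caching enabled, the second request should start streaming noticeably sooner. A sketch against the OpenAI-compatible endpoint, with the URL and model name as placeholders:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
long_prompt = "You are a meticulous assistant. " * 200  # long, repeated prefix

def ttft_ms() -> float:
    """Time (ms) from request send to the first streamed content chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": long_prompt}],
        stream=True,
        max_tokens=16,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - start) * 1000
    return float("nan")

print(f"cold TTFT: {ttft_ms():.1f} ms")  # full prefill
print(f"warm TTFT: {ttft_ms():.1f} ms")  # should drop when the prefix is cached
```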
Our internal benchmarking is based on the vLLM benchmarking tool, and we test on more realistic datasets such as ShareGPT and arXiv summarization.
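For anyone who wants to try the same kind of setup, here's a sketch of the sort of run I mean (not our exact harness); the dataset path, port, and some flag names are assumptions on my part, so verify them against `python benchmarks/benchmark_serving.py --help` in your vLLM checkout:

```python
import subprocess

# One ShareGPT-driven run of vLLM's serving benchmark against an
# OpenAI-compatible chat endpoint (MAX or vLLM). Flags and the dataset
# filename are assumptions; adjust for your vLLM version.
subprocess.run(
    [
        "python", "benchmarks/benchmark_serving.py",
        "--backend", "openai-chat",
        "--base-url", "http://localhost:8000",
        "--endpoint", "/v1/chat/completions",
        "--model", "meta-llama/Llama-3.1-8B-Instruct",
        "--dataset-name", "sharegpt",
        "--dataset-path", "ShareGPT_V3_unfiltered_cleaned_split.json",
        "--num-prompts", "500",
        "--request-rate", "8",
    ],
    check=True,
)
```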
I've filed an internal ticket, and we'll investigate this further.