Help with max serve performance on H100

Hi, I’m just getting into the Modular world, so I’m not sure where I’m going wrong.

I just deployed MAX on an AWS p5.48xlarge, running GPT-OSS. I was going to do some load testing, but I’m getting really bad performance even on trivial prompts, so much so that my reverse proxy is timing out. According to the logs and metrics, the model is producing tokens, just slowly, and more of them than I’d expect in response to “Hello”.

Here are the flags passed to max serve:

--device-memory-utilization=0.70
--devices=gpu:0,1,2,3,4,5,6,7
--enable-prefix-caching
--max-batch-size=128
--max-length=65536
--model-path=openai/gpt-oss-120b
--port=8000
--pretty-print-config
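For reference, those flags correspond to an invocation like this (reconstructed; I’m assuming the standard `max serve` entry point and that the nightly hasn’t renamed any of these flags):

```shell
max serve \
  --model-path=openai/gpt-oss-120b \
  --devices=gpu:0,1,2,3,4,5,6,7 \
  --device-memory-utilization=0.70 \
  --enable-prefix-caching \
  --max-batch-size=128 \
  --max-length=65536 \
  --port=8000 \
  --pretty-print-config
```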

and the pretty-print output:

Metrics initialized.

Pipeline Architecture

 architecture   : GptOssForCausalLM
 pipeline_class : TextGenerationPipeline
 pipeline_model : GptOssModel
 tokenizer      : TextTokenizer

Model Information

Model: main
════════════════════════════════════════
model_path : openai/gpt-oss-120b
huggingface_revision : main
quantization_encoding : float4_e2m1fnx2
weight_path : *.safetensors (22 files)
devices : gpu[0], gpu[1], gpu[2], gpu[3], gpu[4], gpu[5], gpu[6], gpu[7]

 max_seq_len           : 65536

── KV Cache ──
page_size : 128 tokens
prefix_caching : True
kv_connector : null
memory_utilization : 70.0%
available_cache_memory : 250.58 GiB

Pipeline Config

 max_seq_len            : 65536
 max_batch_size         : 128
 chunked_prefill        : True
 max_batch_input_tokens : 8192
 in_flight_batching     : False

Sampling Config

 top_k                  : -1
 top_p                  : 1
 min_p                  : 0.0
 temperature            : 1
 frequency_penalty      : 0.0
 presence_penalty       : 0.0
 repetition_penalty     : 1.0
 max_new_tokens         : None
 min_new_tokens         : 0
 ignore_eos             : False
 detokenize             : True
 stop_strings           : None
 stop_token_ids         : [200002, 199999, 200012]

Server Config

 host                   : 0.0.0.0
 port                   : 8000
 metrics_port           : 8001
 api_types              : openai, sagemaker
 operation_mode         : standard

File System Config

 allowed_image_roots    : None
 max_local_image_bytes  : 20.00 MiB

Metrics and Telemetry Config

 metric_recording       : PROCESS
 metric_level           : BASIC (10)
 detailed_buffer_factor : 20
 disable_telemetry      : False
 transaction_recording  : None

Model Worker Config

 use_heartbeat          : False
 health_fail_timeout    : 60.0s

@BradLarson I don’t see anything clearly wrong with the CLI args, and this seems like a bit of a UX problem if nothing is actually broken here.
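One thing that might help narrow it down: hit the OpenAI-compatible endpoint directly (api_types lists openai above), which takes the reverse proxy out of the loop. The `/v1/chat/completions` path and the model name in the request body are my assumptions about how the server is exposed:

```shell
# Direct request to the server, bypassing the reverse proxy
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```

If this is fast while the proxied path is slow, the problem is in front of MAX rather than inside it.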

Oh, also: MAX seems to want a lot of memory for gpt-oss-120b, needing all 8 GPUs. For comparison, vLLM serves it fine with 4, and SGLang only needs 2. Is this expected, or have I got something configured wrong?

I believe gpt-oss-120b is one of the models that has an optimized 8-GPU pipeline variant. It may be trying to use that under the assumption that you want maximum token throughput at the default latency target, since you’ve provided it with 8 GPUs.
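As a rough sanity check on the numbers in the pretty-print output, assuming memory_utilization is applied per GPU and 80 GiB of HBM per H100 on a p5.48xlarge:

```shell
# Back-of-envelope memory budget vs. reported KV cache size
awk 'BEGIN {
  budget = 80 * 8 * 0.70   # 80 GiB HBM x 8 GPUs x --device-memory-utilization
  cache  = 250.58          # available_cache_memory from the pretty-print output
  printf "usable budget    : %.2f GiB\n", budget
  printf "weights/overhead : %.2f GiB\n", budget - cache
}'
```

That leaves roughly 197 GiB for weights, activations, and overhead out of a 448 GiB budget, which is well above what the fp4 checkpoint alone (on the order of 60 GiB of weights) would need, so it’s plausible the 8-GPU pipeline variant reserves substantial extra working memory.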

Oh, I’d definitely expect it to use all the GPUs I gave it. But if I only give it 4, it dies on startup. By playing with the settings I can make it die at different points in the startup sequence, but I haven’t made it all the way through. :grinning_face_with_smiling_eyes:

Well, I was able to get it working by reverting to 26.2.0; I had been using a nightly build to try out Gemma 4. It’s much faster now, but still not stellar: slower than both vLLM and SGLang.
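For anyone else hitting this, the revert was just a version pin, something like this for a pip-based install (the package name and `max --version` are my best guesses; adjust for magic/pixi setups):

```shell
pip install "modular==26.2.0"  # pin to the stable release instead of the nightly
max --version                  # confirm which build the CLI now resolves to
```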

Anyway, thanks @owenhilyard for looking into it.

Someone from the MAX team should take a look at this, since it’s a really bad performance regression. I’m not a Modular employee, so I can’t summon an 8x H100 system to debug this further with you.