Hi, I’m just getting into the Modular world, so I’m not sure where I’m going wrong.
I just deployed MAX on an AWS p5.48xlarge running GPT-OSS. I was going to do some load testing, but I’m getting really bad performance on trivial prompts - so bad that my reverse proxy is timing out. According to the logs and metrics, the model is producing tokens, just slowly, and producing more of them than I would expect in response to “Hello”.
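For reference, this is roughly how I’m exercising it - a single streamed “Hello” and a crude chunks-per-second count (Python openai client; the base URL, dummy API key, and --run guard are just my test harness, nothing MAX-specific):

```python
import sys
import time


def throughput(n_chunks: int, t_first: float, t_last: float) -> float:
    """Streamed chunks per second over the decode phase (after the first chunk)."""
    return n_chunks / (t_last - t_first) if t_last > t_first else 0.0


if __name__ == "__main__" and "--run" in sys.argv:
    # Hits the server described above; requires `pip install openai`.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    start = time.monotonic()
    t_first, n = None, 0
    stream = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if t_first is None:
                t_first = time.monotonic()  # time to first content chunk
            n += 1
    if t_first is not None:
        print(f"TTFT {t_first - start:.2f}s; "
              f"~{throughput(n, t_first, time.monotonic()):.1f} chunks/s over {n} chunks")
```

(Chunks aren’t exactly tokens, but it’s close enough to show the slowdown.)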
Here are the flags passed to max serve:
--device-memory-utilization=0.70
--devices=gpu:0,1,2,3,4,5,6,7
--enable-prefix-caching
--max-batch-size=128
--max-length=65536
--model-path=openai/gpt-oss-120b
--port=8000
--pretty-print-config
and the pretty-print output:
Metrics initialized.
Pipeline Architecture
architecture : GptOssForCausalLM
pipeline_class : TextGenerationPipeline
pipeline_model : GptOssModel
tokenizer : TextTokenizer
Model Information
Model: main
════════════════════════════════════════
model_path : openai/gpt-oss-120b
huggingface_revision : main
quantization_encoding : float4_e2m1fnx2
weight_path : *.safetensors (22 files)
devices : gpu[0], gpu[1], gpu[2], gpu[3], gpu[4], gpu[5], gpu[6], gpu[7]
max_seq_len : 65536
── KV Cache ──
page_size : 128 tokens
prefix_caching : True
kv_connector : null
memory_utilization : 70.0%
available_cache_memory : 250.58 GiB
Pipeline Config
max_seq_len : 65536
max_batch_size : 128
chunked_prefill : True
max_batch_input_tokens : 8192
in_flight_batching : False
Sampling Config
top_k : -1
top_p : 1
min_p : 0.0
temperature : 1
frequency_penalty : 0.0
presence_penalty : 0.0
repetition_penalty : 1.0
max_new_tokens : None
min_new_tokens : 0
ignore_eos : False
detokenize : True
stop_strings : None
stop_token_ids : [200002, 199999, 200012]
Server Config
host : 0.0.0.0
port : 8000
metrics_port : 8001
api_types : openai, sagemaker
operation_mode : standard
File System Config
allowed_image_roots : None
max_local_image_bytes : 20.00 MiB
Metrics and Telemetry Config
metric_recording : PROCESS
metric_level : BASIC (10)
detailed_buffer_factor : 20
disable_telemetry : False
transaction_recording : None
Model Worker Config
use_heartbeat : False
health_fail_timeout : 60.0s
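In case the numbers help, this is how I’ve been pulling the metrics side (port 8001 per the config above; the /metrics path and Prometheus text format are my assumptions, and the parser is just a minimal sketch that only handles bare name/value lines):

```python
import sys
import urllib.request


def parse_prom(text: str) -> dict[str, float]:
    """Minimal Prometheus text-format parser: keeps only 'name{labels} value' lines."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass  # skip anything that isn't a bare name/value pair
    return out


if __name__ == "__main__" and "--run" in sys.argv:
    # metrics_port=8001 comes from the config dump above; the path is a guess.
    with urllib.request.urlopen("http://localhost:8001/metrics") as resp:
        for name, value in sorted(parse_prom(resp.read().decode()).items()):
            print(name, value)
```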