Hi, I'm trying to run the MAX quickstart tutorial "Run Inference with an Endpoint" on my laptop, but I'm stuck at launching the endpoint:
(quickstart) $ max serve --model-path=modularai/Llama-3.1-8B-Instruct-GGUF
09:57:22.801 INFO: 54276 MainThread: root: Logging initialized: Console: INFO, File: None, Telemetry: None
09:57:22.801 INFO: 54276 MainThread: max.serve: Unsupported recording method. Metrics unavailable in model worker
09:57:22.807 INFO: 54276 MainThread: max.pipelines: Starting download of model: modularai/Llama-3.1-8B-Instruct-GGUF
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.21it/s]
09:57:22.884 INFO: 54276 MainThread: max.pipelines: Finished download of model: modularai/Llama-3.1-8B-Instruct-GGUF in 0.076673 seconds.
09:57:22.950 WARNING: 54276 MainThread: max.pipelines: Insufficient cache memory to support a batch containing one request at the max sequence length of 131072 tokens. Need to allocate at least 1024 pages (32.00 GiB), but only have enough memory for 303 pages (9.47 GiB).
09:57:23.127 INFO: 54276 MainThread: max.pipelines: Paged KVCache Manager allocated 303 device pages using 32.00 MiB per page.
09:57:23.128 INFO: 54276 MainThread: max.pipelines: Building and compiling model...
09:57:34.491 INFO: 54276 MainThread: max.pipelines: Building and compiling model took 11.362761 seconds
instrument is None for maxserve.pipeline_load
Specs:
Apple M1 Pro 2021
16GB Mem
Sequoia 15.5
modular==25.5.0.dev2025070105
Thanks @coffeegriz - it's a bug we're fixing. The server is actually live after that `instrument is None for maxserve.pipeline_load` line, and you can curl to it - it's just a logging issue. We should have a fix out in the next nightly, hopefully.
$ curl http://0.0.0.0:8000
{"detail":"Not Found"}
$ python generate-text.py
The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2.
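For anyone else following along: the `{"detail":"Not Found"}` from curling the root path is expected, since the server exposes its OpenAI-compatible API under `/v1` rather than at `/`. Here's a minimal stdlib-only sketch of what a `generate-text.py` like the one above might look like - the endpoint path, prompt, and helper names here are my assumptions, not the quickstart's exact script:

```python
import json
import urllib.request

SERVER = "http://0.0.0.0:8000"
MODEL = "modularai/Llama-3.1-8B-Instruct-GGUF"


def build_request(prompt: str) -> dict:
    """Build an OpenAI-style chat-completions payload for the endpoint."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }


def generate(prompt: str) -> str:
    # POST to the OpenAI-compatible chat completions route served by `max serve`
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(generate("Who won the World Series in 2020?"))
```

The network call only runs under `__main__`, so you can import `build_request` separately to inspect the payload before the server is up.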