As a follow-on to our recent open-sourcing of the remainder of the MAX Python API, you can now run all MAX Python API unit and integration tests in the modular repository. This means that you can make additions or modifications to these Python APIs, test them locally, and then submit those changes in a PR that will be tested in CI.
This works in the same way as modifications and tests do for the Mojo standard library or MAX Mojo kernels: using Bazel to coordinate building any local changes and running tests against them. For example, you can run all unit tests against the MAX APIs via
./bazelw test //max/tests/tests/...
or against a specific module using syntax like
./bazelw test //max/tests/tests/torch:all
One caveat: this tests local modifications to the Python APIs, but won’t yet pull in changes to the local Mojo standard library or MAX kernel libraries. We’re working on having the graph compiler prefer locally built Mojo libraries over those shipped in the max package; once that is enabled, you’ll also be able to test your own custom kernels and other Mojo enhancements inside graphs and models.
Overall, this means that we’re now open to taking contributions to the MAX Python APIs, models, and more in the modular repository. If you have any difficulties testing locally or submitting changes in pull requests, let us know here or in GitHub issues.
Hi @BradLarson, I am trying to add support for a model in MAX. The model works and produces sensible, expected outputs for a given prompt. However, I have a question about how to properly test it against a PyTorch reference.
I went through the max/tests/integration/pipelines/python directory in the Modular repository to understand the model testing approach. From my understanding, the verify_pipelines.py script functionally verifies a model by comparing MAX outputs against PyTorch reference outputs.
As part of this process, we do two things:
First, we compute tolerance values using the flags --find-tolerances and --print-suggested-tolerances.
Then, we add these tolerances to the pipeline configuration, similar to how it is done for other models in verify_pipelines.py.
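Conceptually, the tolerance-finding step measures the worst-case divergence between MAX outputs and the PyTorch reference, and the verification step then checks that runs stay within those bounds. Here is a minimal NumPy sketch of that idea (the helper name, margin, and arrays are my own illustration, not the actual verify_pipelines.py internals):

```python
import numpy as np

def suggest_tolerances(reference, candidate, margin=2.0):
    """Suggest atol/rtol from the worst observed divergence, padded by a
    safety margin. Hypothetical helper, not the real --find-tolerances logic."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    abs_err = np.abs(candidate - reference)
    # Relative error, guarding against division by zero near-zero reference values.
    rel_err = abs_err / np.maximum(np.abs(reference), np.finfo(np.float64).tiny)
    return {"atol": float(abs_err.max() * margin),
            "rtol": float(rel_err.max() * margin)}

# Example: reference logits vs. a slightly perturbed "MAX" run.
rng = np.random.default_rng(0)
ref_logits = rng.normal(size=(4, 32))
max_logits = ref_logits + rng.normal(scale=1e-5, size=ref_logits.shape)

tol = suggest_tolerances(ref_logits, max_logits)

# The verification step is then just a parity check within those tolerances.
np.testing.assert_allclose(max_logits, ref_logits,
                           rtol=tol["rtol"], atol=tol["atol"])
print("suggested tolerances:", tol)
```

This also hints at the answer to the parity-vs-correctness question: the check only bounds divergence from the reference, it says nothing about whether the reference itself is right.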
Given that the model already produces sensible outputs for the prompt, can I conclude that these tests primarily validate how the model behaves in MAX relative to the PyTorch reference (i.e., numerical parity within tolerances), rather than proving the absolute correctness of the model implementation itself?
I checked with our modeling team, and they recommended using evals over specific logit verification. Thomas suggested the following process (using Gemma 3 as an example):
Start the MAX serving instance for a model: max serve --model-path=google/gemma-3-1b-it
In a separate command line, run evals against the endpoint: uvx --from 'lm-eval[api]' lm_eval --tasks=gsm8k_cot_llama --model=local-chat-completions --model_args=model=google/gemma-3-1b-it,base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=64,max_retries=1 --apply_chat_template --limit=320 --seed=42 --gen_kwargs=seed=42,temperature=0 --fewshot_as_multiturn
The result of this should be the percentage of answers the model got correct, and our exit criterion is parity on that measure with an existing implementation in vLLM, SGLang, etc.
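For a quick manual sanity check before kicking off the full eval, you can hit the same chat-completions endpoint directly. The sketch below only builds the request, mirroring the eval settings above (greedy decoding via temperature=0, fixed seed); actually sending it assumes the max serve instance from step 1 is running at 127.0.0.1:8000, and the helper name is mine, not part of MAX:

```python
import json
from urllib import request

def build_chat_request(base_url="http://127.0.0.1:8000/v1/chat/completions",
                       model="google/gemma-3-1b-it"):
    """Build an OpenAI-compatible chat-completions request
    (hypothetical helper for a quick endpoint sanity check)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "What is 7 * 8?"}],
        "temperature": 0,
        "seed": 42,
    }
    req = request.Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return req, payload

req, payload = build_chat_request()
# To actually send it (requires the serving instance to be running):
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(payload["model"], req.get_method())
```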
Thanks for the reminder. We’re planning to update all of these contribution documents today, and I’ll make sure that section is reworded to reflect where we are now.