Question about the benchmarking tutorial

I was reading and following the great tutorial at Deploy a PyTorch model from Hugging Face | Modular Docs, but there's one thing I don't understand about it.

When you run the server, you can clearly see that MAX Engine is running and compiling the graph (as shown in stdout). Yet at the end of the tutorial it says you can change the backend to vLLM or TensorRT-LLM just by changing the parameters on the benchmarking script, and that's the part I don't really understand.

How is it possible to change the backend from the benchmarking script if the server is already running and using MAX Engine? Doesn't changing the backend imply that the graph has to be compiled again with the newly selected backend, and so on?

Probably I'm missing some key detail here, so any help is highly appreciated. Thanks in advance.


The benchmarking script itself will run against various backends (docs and script here), but that assumes that instances of them are up and running.

You're right that the example there takes you through starting up MAX Serve and leaves an instance of that running. To benchmark against TensorRT-LLM or vLLM, it's assumed that you'd separately spin up a vLLM (or TensorRT-LLM) instance on equivalent hardware and run the same script against it.
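As a rough sketch of that workflow (the script name, port, model name, and every flag except --backend here are my assumptions, not something the tutorial confirms), you'd bring each server up on its own and then point the same benchmarking script at whichever one is listening:

```python
# Hypothetical sketch only: the script name, port, model, and flags other than
# --backend are assumptions, not taken from the tutorial.
import subprocess

common = [
    "python", "benchmark_serving.py",
    "--model", "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model
    "--base-url", "http://localhost:8000",          # wherever the server listens
]

# 1) With a MAX Serve instance already running on that port:
subprocess.run(common + ["--backend", "modular"], check=True)

# 2) Later, after shutting it down and starting vLLM on the same hardware:
subprocess.run(common + ["--backend", "vllm"], check=True)
```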

The benchmark script itself won’t replace the serving backend that is running. The --backend parameter on that script ensures that the output is compatible with the various backends. The server and the benchmark are independent processes. Hopefully, that clears things up a bit.
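To make the "independent processes" point concrete, here's a minimal sketch of what a benchmark client essentially does (the URL, endpoint path, and model name are assumptions for illustration): it just times OpenAI-compatible HTTP requests against whatever server happens to be listening, and it neither knows nor cares which engine sits behind that URL.

```python
# Minimal sketch of a benchmark-style client. It only speaks HTTP to an
# OpenAI-compatible endpoint; whether MAX, vLLM, or TensorRT-LLM is serving
# behind that URL is invisible to this process. The URL, path, and model
# name below are assumptions for illustration.
import time
import requests

BASE_URL = "http://localhost:8000"  # point this at whichever server is running

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Hello, world!",
    "max_tokens": 32,
}

start = time.perf_counter()
resp = requests.post(f"{BASE_URL}/v1/completions", json=payload, timeout=60)
latency = time.perf_counter() - start

resp.raise_for_status()
print(f"Completion latency: {latency:.3f}s")
```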


Yes, that clears everything up, thanks!

