How can I build a minimal Docker image for MAX inference?

Background

I have some models that may only be called a few times per day, so keeping a dedicated GPU server running for them is not cost-effective. GPU serverless is very appealing for this kind of low-frequency inference workload.

In a serverless environment, cold start time matters a lot. Because of that, I’m trying to build the smallest possible Docker inference image, ideally something close to a scratch image that only contains the minimum runtime dependencies required for inference.

For testing, I’m using a BERT-style model (sequence classification / token classification). The target GPU is an NVIDIA RTX A6000.

Previous experiment with MNN

I first tried implementing the inference service in C++ using Alibaba MNN.

The deployment experience with MNN was very good. I was able to build a very clean scratch image containing only the C++ server, the model files, and the dynamic libraries that are actually required at runtime.
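For anyone trying to reproduce this kind of scratch image, one common way to find the dynamic libraries a binary needs is to parse the output of `ldd`. The following is only a minimal sketch, not the script used for the image above: `parse_ldd_output` is a hypothetical helper, and the sample text (including the `/opt/app/lib` path) is made up for illustration. Keep in mind that `ldd` only reports link-time dependencies, so libraries opened at runtime via `dlopen` will not show up.

```python
import re

# Matches lines such as "libfoo.so.1 => /usr/lib/libfoo.so.1 (0x00007f...)"
_ARROW = re.compile(r"^\s*\S+\s+=>\s+(/\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")
# Matches lines such as "/lib64/ld-linux-x86-64.so.2 (0x00007f...)"
_BARE = re.compile(r"^\s*(/\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def parse_ldd_output(text: str) -> list[str]:
    """Return absolute library paths from `ldd` output, in order, deduplicated."""
    paths = []
    for line in text.splitlines():
        m = _ARROW.match(line) or _BARE.match(line)
        if m and m.group(1) not in paths:
            paths.append(m.group(1))
    return paths


# Illustrative sample only; real input would come from running `ldd ./server`.
SAMPLE = """\
    linux-vdso.so.1 (0x00007ffc1a5f2000)
    libMNN.so => /opt/app/lib/libMNN.so (0x00007f3a10000000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3a0fc00000)
    libmissing.so => not found
    /lib64/ld-linux-x86-64.so.2 (0x00007f3a10a00000)
"""

if __name__ == "__main__":
    # Each printed path is a candidate for a COPY into the scratch image.
    for path in parse_ldd_output(SAMPLE):
        print(path)
```

The virtual `linux-vdso` entry and unresolved `not found` entries are skipped on purpose, since neither corresponds to a file that can be copied into the image.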

With the CUDA backend, inference performance was excellent, but it inevitably pulled in the following large dependencies:

/usr/local/cuda/lib64/libcublasLt.so.12                            ~421 MB
/usr/local/cuda/lib64/libcublas.so.12                              ~105 MB
/tmp/build/_deps/mnn-build/source/backend/cuda/libMNN_Cuda_Main.so  ~61 MB

These three files make up most of the runtime footprint for MNN CUDA inference. MNN itself is very friendly for minimal deployment, but I wanted to see whether it would be possible to get an even smaller container while keeping the performance drop relatively small.

I also tried MNN with the Vulkan, OpenCL, and CPU backends. Their dependencies were much smaller, but performance was extremely poor: the Vulkan and OpenCL backends were around 200x slower than the CUDA backend. Surprisingly, CPU inference was the fastest of the non-CUDA options in my tests.

Experiment with MAX C API + C++

My understanding is that MAX can compile and run models at a higher level of abstraction, and I was hoping that it might reduce the need to depend directly on large parts of the CUDA ecosystem. So I tried building a minimal Docker inference image using the MAX C API + C++.

Test repository:

https://github.com/ficapy/mgeo_mojo

The result was very encouraging from a performance perspective: MAX C API + C++ is about 10% faster than my previous MNN CUDA implementation.

However, after trimming the runtime image as much as I could, I found that I still could not get rid of cuBLAS / cuBLASLt. These are the main remaining dependencies:

501.48 MB  /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.12.2.5.6
143.18 MB  /opt/pixi-env/lib/libmax.so
101.73 MB  /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12.2.5.6
 55.61 MB  /opt/pixi-env/lib/libNVPTX.so
  7.94 MB  /opt/pixi-env/lib/libMGPRT.so
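A listing like the one above can be produced with a short script that walks the runtime image's filesystem and sorts files by size. This is a minimal sketch under a few assumptions: `largest_files` and `format_report` are hypothetical helper names, sizes are reported in units of 1024 * 1024 bytes, and symlinks are skipped so that versioned library aliases are not counted twice.

```python
import os


def largest_files(root: str, top: int = 5) -> list[tuple[float, str]]:
    """Return the `top` largest regular files under `root` as (size_mb, path)."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so an alias such as libfoo.so -> libfoo.so.1.2
            # is not counted in addition to its target.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            sizes.append((os.path.getsize(path) / (1024 * 1024), path))
    sizes.sort(reverse=True)
    return sizes[:top]


def format_report(entries) -> str:
    """Format (size_mb, path) pairs one per line, size first."""
    return "\n".join(f"{size_mb:8.2f} MB  {path}" for size_mb, path in entries)


if __name__ == "__main__":
    import sys

    # Usage: python largest_files.py <root directory of the extracted image>
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    print(format_report(largest_files(root)))
```

Running it against the extracted image root before and after each trimming step makes it easy to see which libraries still dominate the image size.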

When I tried removing the cuBLAS-related libraries, the program failed at runtime with:

ABORT: oss/modular/mojo/stdlib/std/ffi/__init__.mojo:628:18: symbol not found: cublasCreate_v2

So at least from my current experiment, it looks like the NVIDIA GPU inference path still depends on both libcublasLt and libcublas, even when using MAX.
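The abort message suggests that the cuBLAS symbol is looked up at runtime rather than at link time (this is my reading of the error, not something confirmed by the MAX documentation), which would also mean `ldd`-style tools cannot see the dependency. A quick way to verify whether a trimmed image still provides a given symbol is to try loading the library and resolving the symbol with Python's `ctypes`. This is a minimal sketch: `exports_symbol` is a hypothetical helper, and the library names in the usage section are the usual CUDA 12 sonames, which may differ in other images.

```python
import ctypes
from typing import Optional


def exports_symbol(library: Optional[str], symbol: str) -> bool:
    """Return True if `library` can be loaded and exports `symbol`.

    Passing None inspects the symbols already visible in the current
    process (dlopen(NULL) on Linux).
    """
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        # The library (or one of its own dependencies) could not be loaded.
        return False
    # ctypes resolves attributes via dlsym and raises AttributeError on a
    # miss, which hasattr turns into False.
    return hasattr(handle, symbol)


if __name__ == "__main__":
    # Assumed sonames for CUDA 12; adjust to what the image actually ships.
    checks = [
        ("libcublas.so.12", "cublasCreate_v2"),
        ("libcublasLt.so.12", "cublasLtCreate"),
    ]
    for lib, sym in checks:
        print(f"{lib}: {sym} -> {exports_symbol(lib, sym)}")
```

Running this inside the trimmed container, with the same library search path as the server, shows whether a removed or replaced library would trigger the same "symbol not found" abort before the server is even started.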

Questions

I’d appreciate any advice on the following:

  1. If inference only needs to run on one specific GPU type, such as RTX A6000 / sm_86, is there any way to further trim the MAX runtime dependencies?

  2. If I’m willing to accept around a 20% performance drop, are there any compilation options, kernel choices, static linking approaches, or feature flags that could reduce the runtime dependency size?

  3. Are there any plans to provide a more serverless-friendly, runtime-only MAX distribution?
    For example, something closer to Go-style single-binary deployment, or a minimal package containing only the required runtime, kernels, and graph execution components.

  4. Are there any plans for MAX to support something similar to CUDA Checkpoint / GPU memory snapshots to further reduce serverless cold start time?
    For example, something like Modal’s GPU memory snapshot feature, described in their post “GPU Memory Snapshots: Supercharging Sub-second Startup”.

Feedback

Overall, the experience of using MAX for model inference has been very good. The performance of MAX C API + C++ was also a pleasant surprise, since it is about 10% faster than my previous MNN CUDA-based implementation.

That said, for GPU serverless deployments, image size, the number of runtime dependencies, model compilation / initialization time, and GPU memory initialization overhead are all critical.

If Modular could further improve the serverless deployment story, for example by providing a smaller runtime-only package, clearer dependency trimming guidance, GPU-architecture-specific builds, or cold start optimizations similar to GPU memory snapshots, I think that would be extremely valuable.

@BradLarson I think this is your area.