MAX kernel launch overhead

I had a question about MAX graphs: I enqueued the same function several times in a MAX graph (see code snippet below) and profiled the execution with nsys (see attached image). There appears to be non-trivial CPU overhead for the launch of each function on the GPU. Does MAX have a way to internally create and execute CUDA graphs when executing its own graphs?

gpu_ctx.enqueue_function[
    conv1d_kernel[
        in_layout, out_layout, conv_layout, input_size, conv_size
    ]
](
    output_tensor,
    input_tensor,
    kernel_tensor,
    grid_dim=BLOCKS_PER_GRID,
    block_dim=(TPB, 1),
)

gpu_ctx.enqueue_function[
    conv1d_kernel[
        out_layout, out_layout, conv_layout, input_size, conv_size
    ]
](
    output_tensor,
    output_tensor,
    kernel_tensor,
    grid_dim=BLOCKS_PER_GRID,
    block_dim=(TPB, 1),
)

gpu_ctx.enqueue_function[
    conv1d_kernel[
        out_layout, out_layout, conv_layout, input_size, conv_size
    ]
](
    output_tensor,
    output_tensor,
    kernel_tensor,
    grid_dim=BLOCKS_PER_GRID,
    block_dim=(TPB, 1),
)

gpu_ctx.enqueue_function[
    conv1d_kernel[
        out_layout, out_layout, conv_layout, input_size, conv_size
    ]
](
    output_tensor,
    output_tensor,
    kernel_tensor,
    grid_dim=BLOCKS_PER_GRID,
    block_dim=(TPB, 1),
)

The first time you run a particular kernel via gpu_ctx.enqueue_function, it gets JIT-compiled. As far as I am aware, there isn’t a hidden cache for function-level compilation, so you’re recompiling the function on every enqueue. You should instead compile it up front with gpu_ctx.compile_function_checked and enqueue the resulting DeviceFunction, as shown on the DeviceFunction | Modular docs page.
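A rough sketch of that compile-once pattern, using the names from your snippet (I may be misremembering the exact type parameters of compile_function_checked / enqueue_function_checked, so check them against the DeviceFunction docs rather than taking this as verified code):

```mojo
# Compile each kernel variant once, outside the hot path. Your snippet has
# two variants: one reading input_tensor (in_layout) and one running in
# place on output_tensor (out_layout).
var conv_first = gpu_ctx.compile_function_checked[
    conv1d_kernel[in_layout, out_layout, conv_layout, input_size, conv_size],
    conv1d_kernel[in_layout, out_layout, conv_layout, input_size, conv_size],
]()
var conv_next = gpu_ctx.compile_function_checked[
    conv1d_kernel[out_layout, out_layout, conv_layout, input_size, conv_size],
    conv1d_kernel[out_layout, out_layout, conv_layout, input_size, conv_size],
]()

# Enqueue the pre-compiled functions; no JIT compilation on this path.
gpu_ctx.enqueue_function_checked(
    conv_first,
    output_tensor,
    input_tensor,
    kernel_tensor,
    grid_dim=BLOCKS_PER_GRID,
    block_dim=(TPB, 1),
)
# The remaining three launches are identical, so a loop suffices.
for _ in range(3):
    gpu_ctx.enqueue_function_checked(
        conv_next,
        output_tensor,
        output_tensor,
        kernel_tensor,
        grid_dim=BLOCKS_PER_GRID,
        block_dim=(TPB, 1),
    )
```

With the compilation hoisted out, what remains in the nsys trace is the per-launch enqueue cost itself.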

Once you compile the functions beforehand, you should be able to see the actual launch overhead. Last I checked, it was in the neighborhood of a few dozen microseconds per launch.