I’m playing with `bench_matmul` (`max/kernels/benchmarks/gpu/bench_matmul.mojo`).

I’m trying to figure out which `matmul` GPU kernel is used when running the Mojo program. To investigate, I modified the relevant `matmul_gpu.mojo` source so that every `enqueue_function` call dumps the kernel assembly, as shown below:
```diff
@@ -1177,7 +1180,7 @@ fn multistage_gemm[
             config=config,
             elementwise_lambda_fn=elementwise_lambda_fn,
         ]
-        ctx.enqueue_function[gemm_kernel_type](
+        ctx.enqueue_function[gemm_kernel_type, dump_asm=Path("out3.asm")](
             tensor_c,
             tensor_a,
             tensor_b,
```
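For context on why I expected this to work: in Mojo, a kernel parameterized on compile-time values is instantiated separately for each distinct parameter set, and each `enqueue_function` compilation writes its assembly to the `dump_asm` path. Below is a minimal, hypothetical sketch of that behavior; the kernel, buffer sizes, and dump paths are my own toy example (only the `dump_asm` usage mirrors the diff above), not code from `matmul_gpu.mojo`:

```mojo
from gpu import block_dim, block_idx, thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer
from pathlib import Path


# Hypothetical kernel (not from matmul_gpu.mojo): `factor` is a
# compile-time parameter, so each distinct value instantiates a
# separate kernel with its own generated assembly.
fn scale_kernel[factor: Int](data: UnsafePointer[Float32], size: Int):
    var i = Int(block_idx.x * block_dim.x + thread_idx.x)
    if i < size:
        data[i] = data[i] * Float32(factor)


def main():
    var ctx = DeviceContext()
    var buf = ctx.enqueue_create_buffer[DType.float32](1024)

    # Two instantiations of the same source kernel: each one is
    # compiled separately and dumps its own assembly.
    ctx.enqueue_function[scale_kernel[2], dump_asm=Path("scale_2.asm")](
        buf.unsafe_ptr(), 1024, grid_dim=4, block_dim=256
    )
    ctx.enqueue_function[scale_kernel[3], dump_asm=Path("scale_3.asm")](
        buf.unsafe_ptr(), 1024, grid_dim=4, block_dim=256
    )
    ctx.synchronize()
```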
Then, run the benchmark as below:
```shell
# In the modular top-level folder
$ ./bazelw run //max/kernels/benchmarks/autotune:kbench bench_matmul.yaml -- --output-dir bench_out
```

with `bench_matmul.yaml` as below:

```yaml
name: bench_matmul
file: $KERNEL_BENCHMARKS_ROOT/gpu/bench_matmul.mojo
params:
  - $M: [3500, 8192]
    N: 4096
    K: 4096
```
Such a benchmark run generates a compiled executable binary, `bench_out/out_0/bench_matmul_N-4096_K-4096`. Note that M does not appear in the binary name; presumably the `$` prefix on `$M` marks it as a runtime parameter. The executable can therefore be reused to test different values of M by passing the `--M` option, for example:
```shell
$ ./bench_out/out_0/bench_matmul_N-4096_K-4096 --M=3500
$ ./bench_out/out_0/bench_matmul_N-4096_K-4096 --M=8192
```
Both of these runs generate an assembly dump (`out3.asm`). Interestingly, I found that the two `out3.asm` files (from M=3500 and M=8192) turned out to be different.

Here are my questions:
- Why are different kernels generated at runtime?
- Is the Mojo build generating multiple specialized variants and embedding them in the executable binary?
- When does specialization happen?
- Is the kernel specialized for some cases of M at compile time (via metaprogramming)?
- Is the kernel specialized for some cases of M at runtime? (See the sketch below for the kind of dispatch I have in mind.)
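To illustrate the kind of behavior I’m asking about, here is a hypothetical sketch (again my own toy kernel and dispatch, not the actual `matmul_gpu.mojo` logic) in which a runtime dimension selects between two compile-time-specialized instantiations, so which assembly gets dumped depends on the runtime shape:

```mojo
from gpu import block_dim, block_idx, thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer
from pathlib import Path


# Hypothetical kernel with a compile-time tile width; the body uses
# `tile` so the two instantiations genuinely differ.
fn tile_kernel[tile: Int](data: UnsafePointer[Float32], m: Int):
    var i = Int(block_idx.x * block_dim.x + thread_idx.x)
    if i < m:
        data[i] = Float32(tile)


# Hypothetical dispatch: a runtime M selects between two compile-time
# instantiations, so which specialized assembly lands in out3.asm
# depends on the value of M at runtime.
def launch(ctx: DeviceContext, data: UnsafePointer[Float32], m: Int):
    if m <= 4096:
        ctx.enqueue_function[tile_kernel[64], dump_asm=Path("out3.asm")](
            data, m, grid_dim=(m + 63) // 64, block_dim=64
        )
    else:
        ctx.enqueue_function[tile_kernel[128], dump_asm=Path("out3.asm")](
            data, m, grid_dim=(m + 127) // 128, block_dim=128
        )
```

If the real matmul path does something similar, the differing `out3.asm` contents would simply reflect whichever specialization the runtime shape selected; I’d like to confirm whether that is what happens, and where.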