I’m still working on GPU acceleration for the pendulum project I started during last weekend’s hackathon, and I’m getting performance results that indicate the CPU is significantly faster than the GPU for Neural Network Inference. For Digital Twin training (matrix multiplication), the GPU is ~150x faster than the CPU. That seems reasonable, but I have doubts about the Neural Network Inference results.
Before I dug too deeply into the implementation details of my pendulum project’s code, I thought it might be useful to benchmark a known kernel implementation, specifically Modular’s Puzzle #4 from mojo-gpu-puzzles.
The results of this analysis are surprising to me (CPU wins!?), and I’m not certain they’re valid. See below for direct links to AI-generated reports and the Mojo benchmark files in my GitHub repo. I’m running this in VS Code on a Lambda A10 instance, MAX 25.5.0.dev2025070105. Please take a look – does any of this make sense? Any help/guidance greatly appreciated!
=== Comprehensive Performance Analysis: add_10_2d Implementations ===
Testing multiple matrix sizes to find GPU vs CPU crossover point
Number of benchmark runs per size: 5
Performance Results:
===================
Testing size 2 x 2 ...
Matrix size: 2 x 2 (4 elements)
CPU Implementation: 3.3e-05 ms
GPU UnsafePointer: 3.1745124 ms
GPU LayoutTensor: 2.146968 ms
CPU vs GPU UnsafePointer: 96197.34545454544 x (CPU is faster)
CPU vs GPU LayoutTensor: 65059.63636363636 x (CPU is faster)
GPU LayoutTensor is 1.4786025688319526 x faster than GPU UnsafePointer
CPU Throughput: 121.2121212121212 M elements/ms
GPU UnsafePointer Throughput: 0.0012600360294702268 M elements/ms
GPU LayoutTensor Throughput: 0.0018630925099954912 M elements/ms
[intermediate results]
Testing size 2048 x 2048 ...
Matrix size: 2048 x 2048 (4194304 elements)
CPU Implementation: 3.1e-05 ms
GPU UnsafePointer: 12.2906808 ms
GPU LayoutTensor: 13.2701772 ms
CPU vs GPU UnsafePointer: 396473.57419354835 x (CPU is faster)
CPU vs GPU LayoutTensor: 428070.2322580645 x (CPU is faster)
GPU UnsafePointer is 1.0796942346757552 x faster than GPU LayoutTensor
CPU Throughput: 135300129.03225806 M elements/ms
GPU UnsafePointer Throughput: 341.2588829090737 M elements/ms
GPU LayoutTensor Throughput: 316.0699316057362 M elements/ms
=== Analysis Summary ===
❌ No crossover point found in tested range
📊 Recommendation: CPU implementation is faster for all tested sizes
- GPU overhead dominates for these matrix sizes
- Consider testing larger matrices or more complex operations
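To make the “GPU overhead dominates” point concrete, here is a toy cost model (in Python, with made-up numbers that are NOT the measured values above): if each implementation costs a fixed overhead plus a per-element term, the crossover size falls out of simple algebra.

```python
# Toy cost model: total_time(n) = fixed_overhead + n / rate.
# All constants below are hypothetical, chosen only to illustrate the shape
# of the curves; they are not measurements from the benchmark.
def total_ms(n, overhead_ms, rate_elems_per_ms):
    return overhead_ms + n / rate_elems_per_ms

cpu = {"overhead_ms": 0.001, "rate": 5e5}  # low call overhead, modest rate
gpu = {"overhead_ms": 2.0,   "rate": 5e8}  # big launch/transfer overhead, huge rate

# Crossover where the GPU becomes faster:
#   o_g + n/r_g = o_c + n/r_c  =>  n = (o_g - o_c) / (1/r_c - 1/r_g)
n_cross = (gpu["overhead_ms"] - cpu["overhead_ms"]) / (
    1 / cpu["rate"] - 1 / gpu["rate"]
)
print(f"estimated crossover ~ {n_cross:.0f} elements")
```

With these illustrative constants the crossover lands around a million elements, i.e. roughly a 1024 x 1024 matrix; if the measured fixed overhead (launch plus transfer) is larger, the crossover moves further out, which is one way a benchmark can show “no crossover in tested range.”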
You’re currently including the data transfer time for the GPU version, which isn’t usually included in GPU benchmark timings as far as I know. You’re also including the data creation time in both the CPU and GPU paths, which you probably don’t want.
Those would be good starting points to fix before diving deeper into the differences.
Thanks for the guidance Seth!
I’ll make some updates to the benchmark code later today and see how they change the results. It’s good to get the benchmarking methodology focused on the specific area of concern. My laptop has an older GPU that isn’t yet supported by Mojo/MAX, so I have to spin up a Lambda instance for GPU compute. I’m still on the free Lambda credits from the hackathon, but logistically it’s still a (small) PITA. Thanks again!
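For anyone following along, here’s the timing structure I’m planning, sketched in Python just to pin down what goes inside vs. outside the timed region (the real code is Mojo; the three functions are stand-ins for the actual MAX calls, not real APIs): data creation and host-to-device transfer happen before the clock starts, and a synchronize brackets the kernel so the GPU work is actually finished when the clock stops.

```python
import time

def create_and_transfer_inputs(n):
    # Stand-in for allocation + host-to-device copy -- NOT timed.
    return list(range(n))

def launch_kernel(buf):
    # Stand-in for the add_10_2d kernel -- the only thing we want to time.
    return [x + 10 for x in buf]

def synchronize():
    # Stand-in for blocking until all queued device work completes.
    pass

buf = create_and_transfer_inputs(4_194_304)  # untimed: setup + transfer
synchronize()                                # ensure setup is done first

t0 = time.perf_counter()
out = launch_kernel(buf)                     # timed region: kernel only
synchronize()                                # block until the kernel finishes
kernel_ms = (time.perf_counter() - t0) * 1e3
print(f"kernel-only time: {kernel_ms:.3f} ms")
```

The same shape should carry over to the Mojo benchmarks: move buffer creation and copies out of the timed loop, and keep the synchronization inside it so asynchronous launches don’t make the kernel look instant.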