I’m still working on GPU acceleration for the pendulum project I started during last weekend’s hackathon, and I’m getting performance results indicating that the CPU is significantly faster than the GPU for neural network inference. For digital twin training (matrix multiplication), the GPU is ~150x faster than the CPU, which seems reasonable, but I have doubts about the neural network inference results.
Before digging too deeply into my pendulum project’s implementation details, I thought it would be useful to benchmark a known kernel, specifically the add_10_2d kernel from Puzzle #4 of Modular’s mojo-gpu-puzzles.
The results of this analysis are surprising to me (CPU wins!?), and I am uncertain whether they are valid. See below for direct links to the AI-generated reports and the Mojo benchmark files in my GitHub repo. I’m running this in VS Code on a Lambda A10 instance with MAX 25.5.0.dev2025070105. Please take a look – does any of this make sense? Any help/guidance is greatly appreciated!
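For context, Puzzle #4’s add_10_2d is a tiny element-wise kernel: one GPU thread per element, each adding 10 to its entry of a SIZE x SIZE matrix. A minimal sketch of the UnsafePointer version, adapted from the puzzle (names and launch details are approximate; see the puzzle repo for the exact code):

```mojo
from gpu import thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

alias SIZE = 2
alias dtype = DType.float32

# One thread per element; the guard skips threads outside the matrix.
fn add_10_2d(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[row * size + col] + 10.0

def main():
    with DeviceContext() as ctx:
        # Device buffers for output and input (filled with constants here).
        out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)
        a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(1)
        # Single block, one thread per matrix element.
        ctx.enqueue_function[add_10_2d](
            out.unsafe_ptr(),
            a.unsafe_ptr(),
            SIZE,
            grid_dim=1,
            block_dim=(SIZE, SIZE),
        )
        ctx.synchronize()
```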
=== Comprehensive Performance Analysis: add_10_2d Implementations ===
Testing multiple matrix sizes to find GPU vs CPU crossover point
Number of benchmark runs per size: 5
Performance Results:
===================
Testing size 2 x 2 ...
Matrix size: 2 x 2 (4 elements)
CPU Implementation: 3.3e-05 ms
GPU UnsafePointer: 3.1745124 ms
GPU LayoutTensor: 2.146968 ms
CPU vs GPU UnsafePointer: 96197.34545454544 x (CPU is faster)
CPU vs GPU LayoutTensor: 65059.63636363636 x (CPU is faster)
GPU LayoutTensor is 1.4786025688319526 x faster than GPU UnsafePointer
CPU Throughput: 121.2121212121212 M elements/ms
GPU UnsafePointer Throughput: 0.0012600360294702268 M elements/ms
GPU LayoutTensor Throughput: 0.0018630925099954912 M elements/ms
[intermediate results]
Testing size 2048 x 2048 ...
Matrix size: 2048 x 2048 (4194304 elements)
CPU Implementation: 3.1e-05 ms
GPU UnsafePointer: 12.2906808 ms
GPU LayoutTensor: 13.2701772 ms
CPU vs GPU UnsafePointer: 396473.57419354835 x (CPU is faster)
CPU vs GPU LayoutTensor: 428070.2322580645 x (CPU is faster)
GPU UnsafePointer is 1.0796942346757552 x faster than GPU LayoutTensor
CPU Throughput: 135300129.03225806 M elements/ms
GPU UnsafePointer Throughput: 341.2588829090737 M elements/ms
GPU LayoutTensor Throughput: 316.0699316057362 M elements/ms
=== Analysis Summary ===
❌ No crossover point found in tested range
📊 Recommendation: CPU implementation is faster for all tested sizes
- GPU overhead dominates for these matrix sizes
- Consider testing larger matrices or more complex operations
You’re currently including the data transfer time for the GPU version, which isn’t usually included in GPU benchmark timings as far as I know. You’re also including the data creation time in both the CPU and GPU versions, which you probably don’t want.
Those would be good starting points to fix before diving deeper into differences.
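Concretely, that means keeping buffer allocation and any host-to-device work outside the timed region and timing only the kernel launch plus a synchronize. A rough sketch of that structure (my illustration, not the thread’s actual benchmark code; it assumes the Puzzle #4 add_10_2d kernel sketched earlier and the DeviceContext API from the puzzles):

```mojo
from gpu.host import DeviceContext
from time import perf_counter_ns

alias SIZE = 2
alias dtype = DType.float32

def main():
    with DeviceContext() as ctx:
        # Setup (untimed): allocate device buffers and fill the input.
        out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)
        a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(1)
        ctx.synchronize()  # make sure all setup work has finished

        # Timed region: kernel launch + wait only, no allocation or transfer.
        var start = perf_counter_ns()
        ctx.enqueue_function[add_10_2d](  # add_10_2d: the Puzzle #4 kernel
            out.unsafe_ptr(),
            a.unsafe_ptr(),
            SIZE,
            grid_dim=1,
            block_dim=(SIZE, SIZE),
        )
        ctx.synchronize()
        elapsed_ns = perf_counter_ns() - start
        print("kernel time (ns):", elapsed_ns)
```

In practice a single launch sits near the resolution limit of host timers, so you’d either loop many launches inside the timed region or hand the whole thing to the bench module, as in the update below.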
Thanks for the guidance Seth!
I’ll make some updates to the benchmark code later today and see how they change the results. It’s good to get the benchmarking methodology right so it’s focused on the specific area of concern. My laptop has an older GPU that isn’t yet supported by Mojo/MAX, so I have to spin up a Lambda instance for GPU compute. I’m still on the free Lambda credits from the hackathon, but logistically it’s still a (small) PITA. Thanks again!
Update: See below for results after converting from the time module to the bench module, as recommended above by Seth. Reference the new file benchmark_add_10_2d_2.mojo. Performance comparisons are similar to the previous results, with the CPU still much faster than the GPU implementations. I’m continuing onward in mojo-gpu-puzzles to learn how to apply GPU programming principles and techniques.
Again, any help/guidance greatly appreciated!
Aside: It took a while to get back to this because of 1) July 4th holiday travel, and 2) a move from a Lambda GPU instance to a new laptop with a Blackwell RTX 5090 GPU, so I now have a local setup for GPU programming [stoked!!].
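For reference, the bench-module version (benchmark_add_10_2d_2.mojo) is structured roughly like the sketch below. This is my approximation of the pattern rather than the file’s exact contents: add_10_2d is assumed to be the Puzzle #4 kernel, and the Bencher.iter_custom overload that takes a DeviceContext comes from recent MAX builds, so signatures may differ across versions.

```mojo
from benchmark import Bench, BenchId, Bencher
from gpu.host import DeviceContext

alias SIZE = 3
alias dtype = DType.float32

def main():
    var bench = Bench()

    with DeviceContext() as ctx:
        # Buffer setup stays outside the measured region.
        out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)
        a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(1)
        ctx.synchronize()

        @parameter
        fn bench_gpu_unsafe_ptr(mut b: Bencher) raises:
            @parameter
            fn kernel_launch(launch_ctx: DeviceContext) raises:
                # add_10_2d: the Puzzle #4 kernel (assumed defined as in the puzzle).
                launch_ctx.enqueue_function[add_10_2d](
                    out.unsafe_ptr(),
                    a.unsafe_ptr(),
                    SIZE,
                    grid_dim=1,
                    block_dim=(SIZE, SIZE),
                )

            # iter_custom times repeated launches on the device context.
            b.iter_custom[kernel_launch](ctx)

        bench.bench_function[bench_gpu_unsafe_ptr](
            BenchId("add_10_2d/gpu_unsafe_ptr")
        )

    # Prints the results table (name / met (ms) / iters).
    bench.dump_report()
```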
=== Comprehensive add_10_2d Benchmark ===
Matrix size: 3 x 3
GPU blocks per grid: 1
GPU threads per block: 3 x 3
Running add_10_2d/cpu
Running add_10_2d/gpu_unsafe_ptr
Running add_10_2d/gpu_layout_tensor
| name | met (ms) | iters |
| --------------------------- | --------------------- | ------- |
| add_10_2d/cpu | 2.154067e-06 | 1000000 |
| add_10_2d/gpu_unsafe_ptr | 0.002115296680245406 | 525822 |
| add_10_2d/gpu_layout_tensor | 0.0020489643238379898 | 524608 |
=== Performance Comparison ===
Average execution times:
CPU: 2.154067e-06 ms
GPU UnsafePointer: 0.002115296680245406 ms
GPU LayoutTensor: 0.0020489643238379898 ms
GPU UnsafePointer vs CPU speedup: 0.0010183285494260294 x
GPU LayoutTensor vs CPU speedup: 0.0010512955130253993 x
GPU LayoutTensor is 1.0323736024271846 x faster than GPU UnsafePointer
Fastest implementation: CPU with 2.154067e-06 ms average
Success! I updated the comprehensive benchmark file to use the bench module and am now seeing realistic results, with a clear CPU-to-GPU crossover at matrix size 128 x 128. See the updated report PERFORMANCE_REPORT_COMPREHENSIVE.md, derived from comprehensive_performance_analysis_2.mojo (excerpt below).
Executive Summary
This comprehensive analysis tested CPU vs GPU performance across matrix sizes from 2x2 to 4096x4096 (16.7M elements) using Mojo’s official benchmark module to identify crossover points where GPU implementations become advantageous. The analysis includes three implementations: CPU-only, GPU UnsafePointer, and GPU LayoutTensor. Key discovery: GPU implementations become faster than CPU at 128x128 matrices, with LayoutTensor showing superior performance for larger workloads.
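For anyone following along, the size sweep boils down to launching the kernel with a grid sized to the matrix, since a single block tops out at 1024 threads (32 x 32 with one thread per element). Below is a rough sketch of the host-side scaffolding; TPB and the multi-block kernel variant add_10_2d_blocks are my own placeholders patterned on the later puzzles, not the report’s exact code.

```mojo
from math import ceildiv
from gpu import block_dim, block_idx, thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

alias dtype = DType.float32
alias TPB = 16  # threads per block per dimension (placeholder choice)

# Multi-block variant: global row/col come from block and thread indices,
# so matrices larger than a single block still get covered.
fn add_10_2d_blocks(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[row * size + col] + 10.0

def main():
    sizes = List[Int](2, 16, 128, 1024, 4096)
    with DeviceContext() as ctx:
        for i in range(len(sizes)):
            size = sizes[i]
            # Fresh buffers per size (kept outside any timed region).
            out = ctx.enqueue_create_buffer[dtype](size * size).enqueue_fill(0)
            a = ctx.enqueue_create_buffer[dtype](size * size).enqueue_fill(1)
            blocks = ceildiv(size, TPB)
            ctx.enqueue_function[add_10_2d_blocks](
                out.unsafe_ptr(),
                a.unsafe_ptr(),
                size,
                grid_dim=(blocks, blocks),
                block_dim=(TPB, TPB),
            )
            ctx.synchronize()
            # ... benchmark this launch with the bench module and compare
            #     against the CPU loop to locate the crossover size ...
```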