I’m still working on GPU acceleration for the pendulum project I started during last weekend’s hackathon, and I’m getting performance results indicating that the CPU is significantly faster than the GPU for neural network inference. For digital twin training (matrix multiplication), the GPU is ~150x faster than the CPU, which seems reasonable, but I have doubts about the neural network inference results.
Before digging too deeply into my pendulum project’s implementation details, I thought it would be useful to benchmark a known kernel, specifically the add_10_2d kernel from Puzzle #4 of Modular’s mojo-gpu-puzzles.
The results of this analysis are surprising to me (CPU wins!?), and I am uncertain whether they are valid. See below for direct links to the AI-generated reports and the Mojo benchmark files in my GitHub repo. I’m running this in VS Code on a Lambda A10 instance with MAX 25.5.0.dev2025070105. Please take a look – does any of this make sense? Any help/guidance is greatly appreciated!
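For context, Puzzle #4’s add_10_2d is a tiny element-wise kernel: one GPU thread per element, each adding 10 to its entry of a SIZE x SIZE matrix. A minimal sketch of the UnsafePointer version, adapted from the puzzle (names and launch details are approximate; see the puzzle repo for the exact code):

```mojo
from gpu import thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

alias SIZE = 2
alias dtype = DType.float32

# One thread per element; the guard skips threads outside the matrix.
fn add_10_2d(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[row * size + col] + 10.0

def main():
    with DeviceContext() as ctx:
        # Device buffers for output and input (filled with constants here).
        out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)
        a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(1)
        # Single block, one thread per matrix element.
        ctx.enqueue_function[add_10_2d](
            out.unsafe_ptr(),
            a.unsafe_ptr(),
            SIZE,
            grid_dim=1,
            block_dim=(SIZE, SIZE),
        )
        ctx.synchronize()
```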
=== Comprehensive Performance Analysis: add_10_2d Implementations ===
Testing multiple matrix sizes to find GPU vs CPU crossover point
Number of benchmark runs per size: 5
Performance Results:
===================
Testing size 2 x 2 ...
Matrix size: 2 x 2 (4 elements)
CPU Implementation: 3.3e-05 ms
GPU UnsafePointer: 3.1745124 ms
GPU LayoutTensor: 2.146968 ms
CPU vs GPU UnsafePointer: 96197.34545454544 x (CPU is faster)
CPU vs GPU LayoutTensor: 65059.63636363636 x (CPU is faster)
GPU LayoutTensor is 1.4786025688319526 x faster than GPU UnsafePointer
CPU Throughput: 121.2121212121212 M elements/ms
GPU UnsafePointer Throughput: 0.0012600360294702268 M elements/ms
GPU LayoutTensor Throughput: 0.0018630925099954912 M elements/ms
[intermediate results]
Testing size 2048 x 2048 ...
Matrix size: 2048 x 2048 (4194304 elements)
CPU Implementation: 3.1e-05 ms
GPU UnsafePointer: 12.2906808 ms
GPU LayoutTensor: 13.2701772 ms
CPU vs GPU UnsafePointer: 396473.57419354835 x (CPU is faster)
CPU vs GPU LayoutTensor: 428070.2322580645 x (CPU is faster)
GPU UnsafePointer is 1.0796942346757552 x faster than GPU LayoutTensor
CPU Throughput: 135300129.03225806 M elements/ms
GPU UnsafePointer Throughput: 341.2588829090737 M elements/ms
GPU LayoutTensor Throughput: 316.0699316057362 M elements/ms
=== Analysis Summary ===
❌ No crossover point found in tested range
📊 Recommendation: CPU implementation is faster for all tested sizes
- GPU overhead dominates for these matrix sizes
- Consider testing larger matrices or more complex operations
You’re currently including the data transfer time for the GPU version, which isn’t usually included in GPU benchmark timings as far as I know. You’re also including the data creation time in both the CPU and GPU versions, which you probably don’t want.
Those would be good starting points to fix before diving deeper into differences.
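Concretely, that means keeping buffer allocation and any host-to-device work outside the timed region and timing only the kernel launch plus a synchronize. A rough sketch of that structure (my illustration, not the thread’s actual benchmark code; it assumes the Puzzle #4 add_10_2d kernel sketched earlier and the DeviceContext API from the puzzles):

```mojo
from gpu.host import DeviceContext
from time import perf_counter_ns

alias SIZE = 2
alias dtype = DType.float32

def main():
    with DeviceContext() as ctx:
        # Setup (untimed): allocate device buffers and fill the input.
        out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)
        a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(1)
        ctx.synchronize()  # make sure all setup work has finished

        # Timed region: kernel launch + wait only, no allocation or transfer.
        var start = perf_counter_ns()
        ctx.enqueue_function[add_10_2d](  # add_10_2d: the Puzzle #4 kernel
            out.unsafe_ptr(),
            a.unsafe_ptr(),
            SIZE,
            grid_dim=1,
            block_dim=(SIZE, SIZE),
        )
        ctx.synchronize()
        elapsed_ns = perf_counter_ns() - start
        print("kernel time (ns):", elapsed_ns)
```

In practice a single launch sits near the resolution limit of host timers, so you’d either loop many launches inside the timed region or hand the whole thing to the bench module, as in the update below.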
Thanks for the guidance Seth!
I’ll make some updates to the benchmark code later today and see how they change the results. It’s good to get the benchmarking methodology right so it’s focused on the specific area of concern. My laptop has an older GPU that isn’t yet supported by Mojo/MAX, so I have to spin up a Lambda instance for GPU compute. I’m still on the free Lambda credits from the hackathon, but logistically it’s still a (small) PITA. Thanks again!
Update: See below for results after converting from the time module to the bench module, as recommended above by Seth. Reference the new file benchmark_add_10_2d_2.mojo. Performance comparisons are similar to the previous results, with the CPU still much faster than the GPU implementations. I’m continuing onward in mojo-gpu-puzzles to learn how to apply GPU programming principles and techniques.
Again, any help/guidance greatly appreciated!
Aside: It took a while to get back to this because of 1) July 4th holiday travel, and 2) a move from a Lambda GPU instance to a new laptop with a Blackwell RTX 5090 GPU, so I now have a local setup for GPU programming [stoked!!].
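For reference, the bench-module version (benchmark_add_10_2d_2.mojo) is structured roughly like the sketch below. This is my approximation of the pattern rather than the file’s exact contents: add_10_2d is assumed to be the Puzzle #4 kernel, and the Bencher.iter_custom overload that takes a DeviceContext comes from recent MAX builds, so signatures may differ across versions.

```mojo
from benchmark import Bench, BenchId, Bencher
from gpu.host import DeviceContext

alias SIZE = 3
alias dtype = DType.float32

def main():
    var bench = Bench()

    with DeviceContext() as ctx:
        # Buffer setup stays outside the measured region.
        out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)
        a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(1)
        ctx.synchronize()

        @parameter
        fn bench_gpu_unsafe_ptr(mut b: Bencher) raises:
            @parameter
            fn kernel_launch(launch_ctx: DeviceContext) raises:
                # add_10_2d: the Puzzle #4 kernel (assumed defined as in the puzzle).
                launch_ctx.enqueue_function[add_10_2d](
                    out.unsafe_ptr(),
                    a.unsafe_ptr(),
                    SIZE,
                    grid_dim=1,
                    block_dim=(SIZE, SIZE),
                )

            # iter_custom times repeated launches on the device context.
            b.iter_custom[kernel_launch](ctx)

        bench.bench_function[bench_gpu_unsafe_ptr](
            BenchId("add_10_2d/gpu_unsafe_ptr")
        )

    # Prints the results table (name / met (ms) / iters).
    bench.dump_report()
```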
=== Comprehensive add_10_2d Benchmark ===
Matrix size: 3 x 3
GPU blocks per grid: 1
GPU threads per block: 3 x 3
Running add_10_2d/cpu
Running add_10_2d/gpu_unsafe_ptr
Running add_10_2d/gpu_layout_tensor
| name | met (ms) | iters |
| --------------------------- | --------------------- | ------- |
| add_10_2d/cpu | 2.154067e-06 | 1000000 |
| add_10_2d/gpu_unsafe_ptr | 0.002115296680245406 | 525822 |
| add_10_2d/gpu_layout_tensor | 0.0020489643238379898 | 524608 |
=== Performance Comparison ===
Average execution times:
CPU: 2.154067e-06 ms
GPU UnsafePointer: 0.002115296680245406 ms
GPU LayoutTensor: 0.0020489643238379898 ms
GPU UnsafePointer vs CPU speedup: 0.0010183285494260294 x
GPU LayoutTensor vs CPU speedup: 0.0010512955130253993 x
GPU LayoutTensor is 1.0323736024271846 x faster than GPU UnsafePointer
Fastest implementation: CPU with 2.154067e-06 ms average
Success! I updated the comprehensive benchmark file to use the bench module and am now seeing realistic results, with a clear CPU-to-GPU crossover at matrix size 128 x 128. See the updated report PERFORMANCE_REPORT_COMPREHENSIVE.md, derived from comprehensive_performance_analysis_2.mojo (excerpt below).
Executive Summary
This comprehensive analysis tested CPU vs GPU performance across matrix sizes from 2x2 to 4096x4096 (16.7M elements) using Mojo’s official benchmark module to identify crossover points where GPU implementations become advantageous. The analysis includes three implementations: CPU-only, GPU UnsafePointer, and GPU LayoutTensor. Key discovery: GPU implementations become faster than CPU at 128x128 matrices, with LayoutTensor showing superior performance for larger workloads.
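For anyone following along, the size sweep boils down to launching the kernel with a grid sized to the matrix, since a single block tops out at 1024 threads (32 x 32 with one thread per element). Below is a rough sketch of the host-side scaffolding; TPB and the multi-block kernel variant add_10_2d_blocks are my own placeholders patterned on the later puzzles, not the report’s exact code.

```mojo
from math import ceildiv
from gpu import block_dim, block_idx, thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

alias dtype = DType.float32
alias TPB = 16  # threads per block per dimension (placeholder choice)

# Multi-block variant: global row/col come from block and thread indices,
# so matrices larger than a single block still get covered.
fn add_10_2d_blocks(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[row * size + col] + 10.0

def main():
    sizes = List[Int](2, 16, 128, 1024, 4096)
    with DeviceContext() as ctx:
        for i in range(len(sizes)):
            size = sizes[i]
            # Fresh buffers per size (kept outside any timed region).
            out = ctx.enqueue_create_buffer[dtype](size * size).enqueue_fill(0)
            a = ctx.enqueue_create_buffer[dtype](size * size).enqueue_fill(1)
            blocks = ceildiv(size, TPB)
            ctx.enqueue_function[add_10_2d_blocks](
                out.unsafe_ptr(),
                a.unsafe_ptr(),
                size,
                grid_dim=(blocks, blocks),
                block_dim=(TPB, TPB),
            )
            ctx.synchronize()
            # ... benchmark this launch with the bench module and compare
            #     against the CPU loop to locate the crossover size ...
```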