I’ve never really understood whether Mojo’s GPU capability is useful for graphics programming.
I don’t have much experience with GPUs and other fancy processing units, so forgive me if the question is stupid, but is this GPU programming useful for graphics, or is it limited to matrix-multiplication-like operations?
Do you think that Mojo can become a good driver for a game engine in terms of graphics? With the low-levelness of SIMD instructions, it only seems logical to draw triangles on the screen.
The use of GPUs as general-purpose accelerators (GPGPU) arose primarily because GPUs were designed as linear algebra accelerators to help with 3D math. They got adopted by people doing science because it turns out that having 500 smaller cores working on a parallel task is better than having 8 bigger ones. We have access to LLVM intrinsics, which means that, given a sufficient amount of headache, we should be able to do anything that Vulkan can do.
Since you’ve opened this up to more general GPU questions, are there plans to support persistent kernels? A lot of the GPGPU usage in my domain is in areas where it’s not a great idea to do a full kernel launch for each batch of operations, for instance when using Nvidia’s GPU Direct Ethernet.
The other question I have is about what provisions have been made for CXL in the API, since GPUs being able to easily pull data from an SSD or treat host memory as a separate but accessible address space seems like it would have some API implications for use cases that want to stream data onto the GPU.
Our current focus is on AI kernels and GPGPU programming. From my understanding as a non-graphics expert, modern graphics compute workloads are not always implemented using shading languages to feed the rasterization pipeline. Instead, some are better suited to a more general GPGPU programming model, which can be easier and/or faster for certain tasks. For example, I have seen CUDA kernels used for ray tracing!
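To give a flavor of what that means, here is a deliberately tiny sketch of that style of kernel, written in plain CUDA rather than Mojo, with the camera, scene, and buffer layout all invented for the example: one thread per pixel, each intersecting its camera ray with a hard-coded sphere, no rasterization pipeline involved.

```cpp
// Toy "graphics via GPGPU" example: one thread shades one pixel by
// intersecting a camera ray with a single hard-coded unit sphere.
// Everything here (scene, camera, grayscale output) is made up for illustration.
#include <cuda_runtime.h>

__global__ void trace_sphere(float* image, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Camera ray through pixel (x, y), looking down -z from the origin.
    float u = 2.0f * x / width - 1.0f;
    float v = 2.0f * y / height - 1.0f;
    float3 dir = make_float3(u, v, -1.0f);
    float len = sqrtf(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);
    dir = make_float3(dir.x / len, dir.y / len, dir.z / len);

    // Unit sphere centered at (0, 0, -3): solve |t*dir - c|^2 = 1 for t.
    float3 c = make_float3(0.0f, 0.0f, -3.0f);
    float b = dir.x * c.x + dir.y * c.y + dir.z * c.z;            // dot(dir, c)
    float disc = b * b - (c.x * c.x + c.y * c.y + c.z * c.z) + 1.0f;

    // Simple depth-based gray if we hit, black background otherwise.
    float shade = 0.0f;
    if (disc >= 0.0f) {
        float t = b - sqrtf(disc);                                // nearest hit
        if (t > 0.0f) shade = 1.0f / t;
    }
    image[y * width + x] = shade;
}
```

You would launch it with a 2D grid (e.g. 16x16 blocks covering the image) and copy the grayscale buffer back; the point is that nothing in it touches vertex shaders or the rasterizer.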
In general, we don’t have any assumptions that prevent us from supporting persistent kernels; you can write a kernel that is launched once and continuously processes data streamed through the device’s HBM.
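As a rough illustration of the pattern (again in CUDA rather than Mojo, with the flag protocol, names, and per-batch “work” all invented for the example), a persistent kernel is just a grid that stays resident and spins on a host-visible flag for new batches:

```cpp
// Hedged sketch of a persistent kernel: launched once (here with a single
// block, to keep the synchronization trivial) and driven by the host via
// flags in host-visible memory.
#include <cuda_runtime.h>

__global__ void persistent_worker(volatile int* ready,  // host writes a new batch id, or -1 to quit
                                  volatile int* done,   // device echoes the finished batch id
                                  float* data, int n) {
    int last_seen = 0;
    __shared__ int batch;
    while (true) {
        // One thread polls for new work, then the whole block proceeds.
        if (threadIdx.x == 0) {
            int b;
            do { b = *ready; } while (b == last_seen);
            batch = b;
        }
        __syncthreads();
        if (batch < 0) return;          // shutdown signal from the host
        last_seen = batch;

        // Stand-in for real per-batch work: elementwise update of the buffer.
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            data[i] = data[i] * 2.0f + 1.0f;
        }

        __threadfence_system();         // publish this thread's results
        __syncthreads();                // wait until every thread has finished the batch
        if (threadIdx.x == 0) *done = batch;
    }
}
```

On the host you would put `ready`, `done`, and `data` in mapped pinned or managed memory, initialize `*ready = 0`, launch `persistent_worker<<<1, 256>>>(...)` once on its own stream, and then drive it by bumping `*ready` and polling `*done` from the CPU. A multi-block version would use a cooperative launch with grid-wide sync instead of the single-block simplification.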
For the second question, I’m not sure I fully understand. Are you referring to unified memory and how that affects the performance of streaming data to a GPU?
For graphics rendering you need more than SIMD and GPGPU / CUDA-like stuff. My non-expert understanding of modern game engines is that they’re a mixture of shaders for basic rendering, which require access to special hardware function units, plus GPGPU code for accelerating things like physics simulation, etc.
Glad to hear that about persistent kernels.
CXL is another protocol on top of PCIe that new Intel and AMD CPUs support (I’m not aware of any ARM implementations yet), and it enables a few things via various device types. The short version is that it lets PCIe devices participate in the memory hierarchy. The longer version is that there are a few different device types:
- Type 1 (CXL.io and CXL.cache): This lets a device with no memory of its own join the CPU cache coherence protocol. For instance, a network card becomes capable of delivering packets into L3 cache on any CXL-capable CPU, and is notified when the CPU modifies cache lines the card has in its cache. This allows you to implement host->device atomics with normal atomic instructions (sketched below, after this list).
- Type 2 (CXL.io, CXL.cache and CXL.mem): A device with its own cache and memory, like a GPU. This means the GPU gets full cache coherence with the host, and vice versa, and the host can easily map the GPU’s address space into its own, again with no special mechanisms like the ones CUDA needs. An NVMe drive with an SRAM cache might also implement this protocol, allowing byte-granularity IO with little to no performance loss. The important part of CXL.mem is that any CXL.io device can read from it, so this also gives you the underlying capabilities for GPU Direct Storage and GPU Direct RDMA directly via CXL; for example, you would be able to issue RDMA reads or writes to an NVMe drive (which is of great interest to disaggregated storage people).
- Type 3 (CXL.io and CXL.mem): Like Type 2, but with no cache (at least not one that can keep up with a CPU cache). This covers DRAM-cache or no-cache NVMe drives, or just a PCIe card with more RAM on it (yes, you could make a PCIe card that acts as nothing but a big bank of HBM).
When you combine all of these you get a system where a GPU becomes much more of an equal peer to the CPU, and can direct other devices on the system as needed, provided they also implement CXL.
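I can’t demo CXL itself from user space, but for a feel of the programming model it enables, here is roughly what “the host and the GPU share a counter using normal atomic instructions” looks like today with CUDA’s system-scope atomics on managed memory. This is a hedged sketch: the names and launch sizes are invented, and it assumes an sm_60+ NVIDIA GPU with managed memory support; CXL.cache would put real cache coherence underneath this kind of sharing.

```cpp
// Hedged sketch (CUDA, not Mojo) of host and device touching the same
// counter with ordinary atomic operations, via libcu++ system-scope
// atomics placed in managed memory.
#include <cuda/atomic>      // libcu++ heterogeneous atomics
#include <cuda_runtime.h>
#include <new>
#include <cstdio>

using SysCounter = cuda::atomic<int, cuda::thread_scope_system>;

__global__ void bump(SysCounter* counter, int iters) {
    // Plain fetch_add from every GPU thread; no driver-specific atomics API.
    for (int i = 0; i < iters; ++i) {
        counter->fetch_add(1);
    }
}

int main() {
    SysCounter* counter = nullptr;
    cudaMallocManaged(&counter, sizeof(SysCounter));
    new (counter) SysCounter(0);        // construct the atomic in managed memory

    bump<<<32, 128>>>(counter, 10);     // 32 * 128 threads, 10 increments each
    cudaDeviceSynchronize();

    counter->fetch_add(1);              // the host uses the very same object
    printf("count = %d\n", counter->load());   // expect 32*128*10 + 1 = 40961
    cudaFree(counter);
    return 0;
}
```

With CXL.cache / Type 1 and 2 devices, sharing like this would be backed by hardware cache coherence between the CPU and the device, so host->device flags and counters stop depending on the specifics of managed memory and PCIe atomics.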
Interesting talk highlighting Mojo’s capabilities:
- Metaprogramming is more expressive and easier to work with than C++ templates
- Mojo can drop down to MLIR, which is easier than inline assembly
- Parametric Tensor is a useful abstraction on top of SIMT, which would work well with autotuning for optimization
There are other aspects that went over my head, but these are my takeaways.