I’ve never really understood whether Mojo’s GPU capability is useful for graphics programming.
I don’t have much experience with GPUs and other fancy processing units, so forgive me if the question is stupid, but is this GPU programming useful for graphics, or is it limited to matrix-multiplication-like operations?
Do you think that Mojo can become a good driver for a game engine in terms of graphics? With the low-levelness of SIMD instructions, it only seems logical to draw triangles on the screen.
The use of GPUs as general-purpose accelerators (GPGPU) arose primarily because GPUs were designed as linear algebra accelerators to help with 3D math. They got adopted by people doing science because it turns out that having 500 smaller cores working on a parallel task is better than having 8 bigger ones. We have access to LLVM intrinsics, which means that, given a sufficient amount of headache, we should be able to do anything that Vulkan can do.
Since you’ve opened this up to more general GPU questions, are there plans to support persistent kernels? A lot of the GPGPU usage in my domain is in areas where it’s not a great idea to do a full kernel launch for each batch of operations, for instance when using Nvidia’s GPU Direct Ethernet.
The other question I have is about what provisions have been made for CXL in the API, since GPUs being able to easily pull data from an SSD or treat host memory as a separate but accessible address space seems like it would have some API implications for use cases that want to stream data onto the GPU.
Our current focus is on AI kernels and GPGPU programming. From my understanding as a non-graphics expert, modern graphics compute workloads are not always implemented using shading languages to feed the rasterization pipeline. Instead, some are better suited to a more general GPGPU programming model, which can be easier and/or faster for certain tasks. For example, I have seen CUDA kernels used for ray tracing!
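To give a flavor of what that means, here is a deliberately tiny sketch of that style of kernel, written in plain CUDA rather than Mojo, with the camera, scene, and buffer layout all invented for the example: one thread per pixel, each intersecting its camera ray with a hard-coded sphere, no rasterization pipeline involved.

```cpp
// Toy "graphics via GPGPU" example: one thread shades one pixel by
// intersecting a camera ray with a single hard-coded unit sphere.
// Everything here (scene, camera, grayscale output) is made up for illustration.
#include <cuda_runtime.h>

__global__ void trace_sphere(float* image, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Camera ray through pixel (x, y), looking down -z from the origin.
    float u = 2.0f * x / width - 1.0f;
    float v = 2.0f * y / height - 1.0f;
    float3 dir = make_float3(u, v, -1.0f);
    float len = sqrtf(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);
    dir = make_float3(dir.x / len, dir.y / len, dir.z / len);

    // Unit sphere centered at (0, 0, -3): solve |t*dir - c|^2 = 1 for t.
    float3 c = make_float3(0.0f, 0.0f, -3.0f);
    float b = dir.x * c.x + dir.y * c.y + dir.z * c.z;            // dot(dir, c)
    float disc = b * b - (c.x * c.x + c.y * c.y + c.z * c.z) + 1.0f;

    // Simple depth-based gray if we hit, black background otherwise.
    float shade = 0.0f;
    if (disc >= 0.0f) {
        float t = b - sqrtf(disc);                                // nearest hit
        if (t > 0.0f) shade = 1.0f / t;
    }
    image[y * width + x] = shade;
}
```

You would launch it with a 2D grid (e.g. 16x16 blocks covering the image) and copy the grayscale buffer back; the point is that nothing in it touches vertex shaders or the rasterizer.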
In general, we don’t have any assumptions that prevent us from supporting persistent kernels; you can write a kernel that is launched once and continuously processes data streamed through the device’s HBM.
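As a rough illustration of the pattern (again in CUDA rather than Mojo, with the flag protocol, names, and per-batch “work” all invented for the example), a persistent kernel is just a grid that stays resident and spins on a host-visible flag for new batches:

```cpp
// Hedged sketch of a persistent kernel: launched once (here with a single
// block, to keep the synchronization trivial) and driven by the host via
// flags in host-visible memory.
#include <cuda_runtime.h>

__global__ void persistent_worker(volatile int* ready,  // host writes a new batch id, or -1 to quit
                                  volatile int* done,   // device echoes the finished batch id
                                  float* data, int n) {
    int last_seen = 0;
    __shared__ int batch;
    while (true) {
        // One thread polls for new work, then the whole block proceeds.
        if (threadIdx.x == 0) {
            int b;
            do { b = *ready; } while (b == last_seen);
            batch = b;
        }
        __syncthreads();
        if (batch < 0) return;          // shutdown signal from the host
        last_seen = batch;

        // Stand-in for real per-batch work: elementwise update of the buffer.
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            data[i] = data[i] * 2.0f + 1.0f;
        }

        __threadfence_system();         // publish this thread's results
        __syncthreads();                // wait until every thread has finished the batch
        if (threadIdx.x == 0) *done = batch;
    }
}
```

On the host you would put `ready`, `done`, and `data` in mapped pinned or managed memory, initialize `*ready = 0`, launch `persistent_worker<<<1, 256>>>(...)` once on its own stream, and then drive it by bumping `*ready` and polling `*done` from the CPU. A multi-block version would use a cooperative launch with grid-wide sync instead of the single-block simplification.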
For the second question, I’m not sure I fully understand. Are you referring to unified memory and how that affects the performance of streaming data to a GPU?
For graphics rendering you need more than SIMD and GPGPU / CUDA-like stuff. My non-expert understanding of modern game engines is that they’re a mixture of shaders for basic rendering, which require access to special hardware function units, plus GPGPU code for accelerating things like physics simulation, etc.
Glad to hear that about persistent kernels.
CXL is another protocol on top of PCIe that new Intel and AMD CPUs support (I’m not aware of any ARM implementations yet), and it enables a few things via various device types. The short version is that it lets PCIe devices participate in the memory hierarchy. The longer version is that there are a few different device types:
- Type 1 (CXL.io and CXL.cache): This lets a device with no memory of its own join the CPU cache coherence protocol. For instance, a network card becomes capable of delivering packets into L3 cache on any CXL-capable CPU, and is notified when the CPU modifies cache lines the card has in its cache. This allows you to implement host->device atomics with normal atomic instructions (sketched below, after this list).
- Type 2 (CXL.io, CXL.cache and CXL.mem): A device with its own cache and memory, like a GPU. This means the GPU gets full cache coherence with the host, and vice versa, and the host can easily map the GPU’s address space into its own, again with no special mechanisms like the ones CUDA needs. An NVMe drive with an SRAM cache might also implement this protocol, allowing byte-granularity IO with little to no performance loss. The important part of CXL.mem is that any CXL.io device can read from it, so this also gives you the underlying capabilities for GPU Direct Storage and GPU Direct RDMA directly via CXL; for example, you would be able to issue RDMA reads or writes to an NVMe drive (which is of great interest to disaggregated storage people).
- Type 3 (CXL.io and CXL.mem): Like Type 2, but with no cache (at least not one that can keep up with a CPU cache). This covers DRAM-cache or no-cache NVMe drives, or just a PCIe card with more RAM on it (yes, you could make a PCIe card that acts as nothing but a big bank of HBM).
When you combine all of these you get a system where a GPU becomes much more of an equal peer to the CPU, and can direct other devices on the system as needed, provided they also implement CXL.
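I can’t demo CXL itself from user space, but for a feel of the programming model it enables, here is roughly what “the host and the GPU share a counter using normal atomic instructions” looks like today with CUDA’s system-scope atomics on managed memory. This is a hedged sketch: the names and launch sizes are invented, and it assumes an sm_60+ NVIDIA GPU with managed memory support; CXL.cache would put real cache coherence underneath this kind of sharing.

```cpp
// Hedged sketch (CUDA, not Mojo) of host and device touching the same
// counter with ordinary atomic operations, via libcu++ system-scope
// atomics placed in managed memory.
#include <cuda/atomic>      // libcu++ heterogeneous atomics
#include <cuda_runtime.h>
#include <new>
#include <cstdio>

using SysCounter = cuda::atomic<int, cuda::thread_scope_system>;

__global__ void bump(SysCounter* counter, int iters) {
    // Plain fetch_add from every GPU thread; no driver-specific atomics API.
    for (int i = 0; i < iters; ++i) {
        counter->fetch_add(1);
    }
}

int main() {
    SysCounter* counter = nullptr;
    cudaMallocManaged(&counter, sizeof(SysCounter));
    new (counter) SysCounter(0);        // construct the atomic in managed memory

    bump<<<32, 128>>>(counter, 10);     // 32 * 128 threads, 10 increments each
    cudaDeviceSynchronize();

    counter->fetch_add(1);              // the host uses the very same object
    printf("count = %d\n", counter->load());   // expect 32*128*10 + 1 = 40961
    cudaFree(counter);
    return 0;
}
```

With CXL.cache / Type 1 and 2 devices, sharing like this would be backed by hardware cache coherence between the CPU and the device, so host->device flags and counters stop depending on the specifics of managed memory and PCIe atomics.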
Interesting talk highlighting Mojo’s capabilities:
- Metaprogramming is more expressive and easier to work with than C++ templates
- Mojo can drop down to MLIR, which is easier than inline assembly
- Parametric Tensor is a useful abstraction on top of SIMT, which would work well with autotuning for optimization
There are other aspects that went over my head, but these are my takeaways.