Why can't I declare a List for a GPU function?

Recently I posted about the problem I ran into when updating my first Mojo GPU code.

mut for GPU pointers

On thinking about it, that’s a symptom rather than the real problem. What I’m doing is writing the same Julia fractal algorithm twice, once for CPU and once for GPU. The function calculates RGBA values and stores them in a big array of 32-bit unsigned ints.

For the CPU code, I can just declare this array as a mutable list:

fn fractal(mut pixels: List[UInt32], size: Int):

For the GPU code, I had to re-arrange the code a little for parallel execution, which was fine. But I also had to change the parameter definition:

fn julia(pixels: UnsafePointer[UInt32, MutAnyOrigin], …

And yes I understand that the actual implementation on the GPU has to be a bit different from the CPU side. My point is that as a developer, I want to minimise the amount of rewriting from CPU to GPU. Change the loop structure for massive parallelism, sure. Change parameter types, no.

For comparison, in CUDA the standard type declarations do work. I have a similar CPU/GPU comparison program where I can use float4 newPos[] for both C++ and CUDA dialect function parameters.

It’s fine for the Mojo compiler to tell me “sorry, type X isn’t possible for a GPU kernel” if I’m doing something weird with pointers or unions. And also fine for the Mojo compiler to tell me “nope, you can’t extend a list at runtime on the GPU.” When fixed size arrays of basic SIMD types work on both, I shouldn’t have to change the code.

So can we please have List parameters for GPU functions?

In CUDA and C++, newPos will decay to a float4* pointer when passed to any function, not just a GPU function. Using List[UInt32] is roughly equivalent to passing std::vector<uint32_t>, which CUDA will complain about.

Neither Mojo nor CUDA really handles the case of a single data structure being shared between lanes of a GPU warp, and going cross-warp is firmly in “here be dragons” territory, since neither List nor std::vector is a thread-safe data structure. If you construct a List inside a lane on the GPU, it works exactly as you would expect, but having a mutable list shared among lanes, and potentially warps, is definitely unsound. We might be able to discuss making an immutable one work, since nobody has really asked about that yet, and being able to pass an immutable span sounds reasonable.
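The immutable option might look something like this sketch (Span and ImmutableAnyOrigin are real stdlib names, but whether a Span can be passed across the host/device boundary like this is exactly the open question, not settled API):

```mojo
# Hypothetical sketch only: a kernel taking a read-only view of the
# pixel buffer instead of a raw mutable pointer. Every lane can read,
# no lane can write, so the warp-sharing hazards above don't arise.
fn julia_readonly(pixels: Span[UInt32, ImmutableAnyOrigin], size: Int):
    # ... read pixels[i], never write ...
    pass
```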

For the general case, I think what you’re asking for should be doable for thread-safe data structures or immutable ones, although the logistics of transferring a tree may be a bit interesting. Unified address space support (introduced with Pascal on the NVIDIA side) does help a bit here as far as dealing with more complex data structures, but it doesn’t solve the thread-safety issues. NVIDIA’s implementation is also fairly high latency compared to other GPU vendors, which means it needs to be used with extreme care to avoid large performance hits, except where it’s being used purely as a compatibility feature to make things work at all (e.g. large models on consumer GPUs using a big pool of host memory).


I know that lists/arrays are not thread-safe on the GPU. Unless I’m missing something, they’re not thread-safe on the CPU either. Forcing the programmer to explicitly rewrite List[UInt32] into UnsafePointer[UInt32] doesn’t help in the slightest: if I don’t know at least a little about GPU parallel execution, I’ll still write dodgy code.

Yes, I’m aware that CUDA has the easier job. Over the years I’ve written a lot of Python and C/C++ code that passes arrays to GPU shaders. It’s always a pain, and easy to get wrong in subtle ways. (Example: memory layout of arrays containing 3 x 32 SIMD float vectors.) CUDA is popular because the compiler does its best to hide some of this low-level detail from developers. Mojo should be even better at automatically handling the translation from one architecture to another. If I’m passing a List to a GPU function, the Mojo compiler knows enough to insert the equivalent of a C++ std::vector.data() for me - which would also make it less likely for me to do the wrong thing, e.g. &v[0].

I don’t want the general case. I just want the simple case of a fixed size array of numbers to be simple. Automatic handling of simple lists on the GPU will make developing in Mojo easier and less error-prone. Perfect is the enemy of good enough.

(It isn’t in my code, but as another example, I should be able to use len(pixels) within a GPU function. The Mojo compiler knows about the array and the function, so it can decompose the one list argument into two GPU parameters: a pointer to the start of the array, and a length. Easier for me, and it reduces the chance of error.)
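For concreteness, here is a sketch of what the kernel side looks like today, with that decomposition done by hand (identifier names are mine, and the gpu.id import path matches recent Mojo releases but may shift between nightlies):

```mojo
from gpu.id import block_dim, block_idx, thread_idx

# Sketch: the List has already been split by hand into a raw pointer
# plus an explicit length, passed as two separate kernel arguments.
fn julia(pixels: UnsafePointer[UInt32, MutAnyOrigin], num_pixels: Int):
    var i = Int(block_idx.x * block_dim.x + thread_idx.x)
    if i < num_pixels:          # manual bounds check, standing in for len(pixels)
        pixels[i] = 0xFF000000  # opaque black, RGBA packed into one UInt32
```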


InlineArray, which is a fixed-sized array, is device passable and can be tossed to the GPU with no issues.
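A minimal sketch of that, assuming a small compile-time-sized palette (the names and sizes are mine, purely illustrative):

```mojo
alias PALETTE_SIZE = 16

# Sketch: InlineArray is a fixed-size value type, so it can be passed
# to a kernel by value like any other trivially-copyable struct. The
# size must be known at compile time.
fn shade(palette: InlineArray[UInt32, PALETTE_SIZE],
         pixels: UnsafePointer[UInt32, MutAnyOrigin], size: Int):
    # ... look colours up in palette, write results through pixels ...
    pass
```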

The reason List is more of a hazard on GPUs is because if you launch more than one thread you get unsynchronized access to it. Until we have a better way to manage “GPU thread safety” (which is a bit different than normal thread safety), I don’t think it’s a good idea to expose something that, if you use it like you would in CPU code, will cause UB very quickly. Passing a Span might be a reasonable compromise.

LayoutTensor and TileTensor are also things you should look into. They handle a lot of what List does, apart from being resizable, which helps a lot with this issue, and they’re the blessed solution to many of these problems.
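A rough sketch of the LayoutTensor shape of this, going by the layout package as it exists today (exact parameter lists vary between nightlies, so treat this as illustrative rather than definitive):

```mojo
from gpu.id import block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor

alias SIZE = 1024
alias pixel_layout = Layout.row_major(SIZE)

# Sketch: the kernel sees a typed, shaped view instead of a bare
# pointer, and the layout is known at compile time.
fn julia(pixels: LayoutTensor[mut=True, DType.uint32, pixel_layout]):
    var i = Int(block_idx.x * block_dim.x + thread_idx.x)
    if i < SIZE:
        pixels[i] = 0xFF000000  # opaque black, RGBA packed in a UInt32
```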


I can’t use InlineArray because, while fixed size is not a problem for this and many similar programs, compile-time fixed size makes them unusable. (Unless I want to go back to Ye Olde FORTRAN style, declare a humongous billion-element array and hope the swapper can overcommit.)

Looking at some of my other code, plain old Python is better at this than Mojo! I can use array from the standard library, or numpy arrays, and they work fine with GPU code in both directions since they are also MemoryView objects. In Python I do have to change the type declarations in the GPU functions, but that’s only because Python itself doesn’t know about GPUs and I must translate into C or GLSL.

I don’t want to resize lists during calculation, I don’t want thread safe data structures, I just want to pass a big block of ints/floats between CPU and GPU. Mojo, unlike Python or C/C++, knows about the data structures on both sides. It should be easier.

The main reason this doesn’t work is that the header of the list is not actually in memory that the GPU can read, and unless you’re using the fork of the stdlib where I’m working on allocators, the list body isn’t either. This means you must copy the data somewhere the GPU can see first (something I am working on fixing, although there will be performance penalties).

Python and Julia both hide this problem for you: they do the copy and/or hide the fact that they are allocating in GPU memory. For a systems language, having the language hide multi-gigabyte copies and allocations is completely unacceptable, so Mojo instead requires that you have something that is a GPU-safe allocation. Even then, you can probably only pass the list body immutably, because there is still the thread-safety issue, plus the issue that the GPU cannot access the stack to “fix” the list after appends happen.
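Today’s explicit version of that workflow looks roughly like this (DeviceContext and its enqueue_* buffer methods are real API, but take the exact call shapes as a sketch):

```mojo
from gpu.host import DeviceContext

alias SIZE = 1024

def main():
    var ctx = DeviceContext()
    # The allocation in GPU-visible memory is explicit -- no hidden
    # multi-gigabyte copies behind an innocent-looking assignment.
    var dev_pixels = ctx.enqueue_create_buffer[DType.uint32](SIZE)
    var host_pixels = ctx.enqueue_create_host_buffer[DType.uint32](SIZE)
    # ... enqueue the kernel against dev_pixels here ...
    # The copy back to the host is explicit too.
    dev_pixels.enqueue_copy_to(host_pixels)
    ctx.synchronize()
```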

Anything used by a GPU kernel needs to be either immutable or what I’ll call “GPU thread safe”: able to handle not only multiple threads in a warp writing to it, but also the weaker memory model of GPUs. Anything that is read-only is by definition thread safe. We’ve carved out an exception for flat buffers to be mutable because it’s required to do useful things, but ideally this will also be made memory safe later on.

LayoutTensor or TileTensor (WIP but easier to use) are the built-in facility for doing what you ask in a way that handles GPUs properly and performantly. List does not handle GPUs properly and making that happen would have some fairly severe performance consequences right now, so we can’t really make List work until I get unified memory support working. Even once that happens, passing a mutable List instead of a mutable Span is still likely going to be wildly unsound.

My comment about Python being easier is badly written and unclear. (Although not the initial bit about InlineArray.) Let me try again.

My GPU background is not AI, it’s 3D graphics. I’ve written a lot of shaders for pure graphics and for computation on 1D/2D arrays. So I know about copying from CPU to/from GPU, and that GPU memory access patterns are different from CPU.

I want to be able to use the same List declaration for arrays of numbers being passed back and forth between CPU and GPU. If I can declare an alias (oops, now comptime) PixelArray = List[UInt32] and use that type name in both CPU-side and GPU-side code, that makes it obvious to both humans and the Mojo compiler that they are intended to be the same. (Presumably DeviceContext.enqueue_create_host_buffer would be modified to accept such type names.)
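In sketch form, the wish is just this (the GPU-side signature is the part that doesn’t exist today):

```mojo
# One name for the pixel buffer type, used on both sides.
alias PixelArray = List[UInt32]

# CPU side: works today.
fn fractal(mut pixels: PixelArray, size: Int):
    pass

# GPU side: the feature request -- this does not compile today.
# fn julia(pixels: PixelArray, size: Int):
#     ...
```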

One of the major pain points in writing shaders, whether for Python or for C/C++, is getting the declarations and layout to match in two different programming languages. I would have thought that using the same type name in the same language for both CPU and GPU code would make it easier for the Mojo compiler to generate code. I’d trust a compiler to get it right more often than I do.

I also want to have arrays of numbers/SIMD values in purely GPU Mojo code, without any CPU/GPU transfer. CPU side I would use a List, so why not GPU side? It’s Mojo, so the compiler knows about the GPU, knows about differences in alignment, knows how to decompose a List into a GPU side pointer and length register or whatever. I won’t care if the List memory layout is actually different from what it would be on the CPU.

I also spend a lot of time testing functions in C on the CPU, then rewriting them in shader language for the GPU. Yes, I recognise that some rewriting is always going to be necessary to take advantage of GPU parallelism, and because of GPU differences. But the less I have to rewrite, the fewer errors I can make. Changing Lists into Pointers is something I shouldn’t have to do; the Mojo compiler will be better at it than I am.

CUDA is so popular because it reduces the amount of rewriting C/C++ programmers need to do. When people ask me why I’m enthusiastic about Mojo, I often say it is “CUDA for Python programmers”. Type name equivalence - not memory equivalence - between CPU and GPU code will help a lot.

See the other thread: GPUs cannot actually read the top-level struct of List in most cases, so you cannot pass a List to a GPU in a way that lets the GPU modify it in place. This is why graphics APIs have dedicated buffer-creation functions - you need to actually allocate memory somewhere the GPU can see - and why Mojo has functions to do the same thing. List is not allocated in GPU-visible memory right now because doing so causes a lot of problems for non-GPU code, and the allocator API is unfinished.
