Hi everyone, thank you for the awesome work on the GPU kernel development. A more generic and easy-to-distribute alternative to CUDA cannot come soon enough. I am new here, but I am experimenting with rewriting some of my CUDA kernels in Mojo to evaluate it. I am trying to understand the best way to structure this interface, especially when it comes to `LayoutTensor`.
I want to turn these kernels into a reusable package that can accept tensors of any size. Given this, should the package create a new `LayoutTensor` alias at every call, since the shapes can change at any point?
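One direction I have been toying with is making the kernel itself generic over the layout instead of exporting a fixed alias from the package, something like this (just a rough sketch with made-up names, and I am not sure this is the idiomatic way to do it):

```mojo
from gpu import thread_idx
from layout import Layout, LayoutTensor

alias dtype = DType.float32

# Rough sketch: the layout is a compile-time parameter, so every call site
# instantiates the kernel for its own shape instead of the package
# exporting one fixed LayoutTensor alias.
fn awesome_kernel[layout: Layout](t: LayoutTensor[mut=True, dtype, layout]):
    i = thread_idx.x
    if i < layout.size():  # total element count, known at compile time
        t[i] = t[i] + 1.0
```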
Also, when it comes to feeding tensors into the package and getting results out: it is a bit impractical to pass these types around, since every caller then needs to import the alias, which also makes it hard to create tensors on the fly. How would you deal with passing this information into and out of the package, considering the sizes can change?
For example, one could imagine doing something like this:
```mojo
from my_pkg import awesome_kernel

def main():
    ...
    ctx.enqueue_function[awesome_kernel](
        arbitrary_size_tensor,
        grid_dim=1,
        block_dim=VECTOR_WIDTH,
    )
```
But I am not sure how to combine `LayoutTensor` with arbitrarily sized tensors in that setup.
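The closest I can picture for the call site is building the layout and tensor there and instantiating the generic kernel with it, roughly like this (again just a sketch; the size still has to be compile-time known here, which is exactly the part I am unsure about):

```mojo
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

from my_pkg import awesome_kernel  # the layout-parameterized kernel sketched above

alias dtype = DType.float32

def main():
    alias n = 1024  # still needs to be compile-time known in this sketch
    alias layout = Layout.row_major(n)

    with DeviceContext() as ctx:
        buf = ctx.enqueue_create_buffer[dtype](n).enqueue_fill(1)
        t = LayoutTensor[mut=True, dtype, layout](buf.unsafe_ptr())
        ctx.enqueue_function[awesome_kernel[layout]](
            t,
            grid_dim=1,
            block_dim=n,
        )
        ctx.synchronize()
```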
Another idea would be to have a wrapper function that serves as the entry point to the kernel. That function could then create the `LayoutTensor` on the fly on every call:
```mojo
from my_pkg import awesome_kernel_wrapper

def main():
    ...
    awesome_kernel_wrapper(
        ctx,  # so the wrapper can schedule the kernel
        arbitrary_size_tensor,
    )
```
I prefer this second option, but it implies passing the device context to a wrapper function that lives in another package.
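Inside my_pkg I imagine the wrapper looking roughly like this (again made-up names, assuming `awesome_kernel` is the layout-parameterized kernel from the first sketch; the size ends up as a compile-time parameter, and I am not sure building the `LayoutTensor` from the buffer's raw pointer is the intended way):

```mojo
from gpu.host import DeviceBuffer, DeviceContext
from layout import Layout, LayoutTensor

alias dtype = DType.float32

# Rough sketch: the wrapper builds the LayoutTensor on the fly from a plain
# device buffer, then launches the layout-parameterized awesome_kernel
# (defined elsewhere in the package) on the caller's device context.
def awesome_kernel_wrapper[size: Int](
    ctx: DeviceContext,
    buf: DeviceBuffer[dtype],
):
    alias layout = Layout.row_major(size)
    t = LayoutTensor[mut=True, dtype, layout](buf.unsafe_ptr())
    ctx.enqueue_function[awesome_kernel[layout]](
        t,
        grid_dim=1,
        block_dim=size,
    )
```

This still bakes the size in as a compile-time parameter, so it only partly solves the arbitrary-size goal; maybe runtime layouts with unknown dimensions are the right tool here, but I have not figured out how they fit.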
I am curious to see what code structures others are considering, or how one of these approaches could be achieved.