LeetGPU, Tensara: how to handle shared memory?

dolewhip · June 20, 2025, 12:47am

Hey, I'm looking at the Mojo GPU Puzzles, LeetGPU, and Tensara.
These are great.

solutions/p11/p11.mojo


from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb
from sys import sizeof, argv
from testing import assert_equal

alias TPB = 8
alias SIZE = 6
alias CONV = 3
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias in_layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(SIZE)
alias conv_layout = Layout.row_major(CONV)


# ANCHOR: conv_1d_simple_solution
fn conv_1d_simple[

This file has been truncated.

github.com/modular/mojo-gpu-puzzles

solutions/p15/op/conv1d.mojo


from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb
from sys import sizeof, argv
from testing import assert_equal

alias TPB = 15
alias BLOCKS_PER_GRID = (2, 1)


fn conv1d_kernel[
    in_layout: Layout,
    out_layout: Layout,
    conv_layout: Layout,
    input_size: Int,
    conv_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[mut=True, dtype, out_layout],

This file has been truncated.

I think these should be simple, but I'm spinning a bit on how to handle the shared memory.

I'm doing:
var shared_kernel = tb[dtype]().row_major[kernel_size]().shared().alloc()
var shared_input = tb[dtype]().row_major[input_size]().shared().alloc()

but I'm missing something about how to handle the parameterized functions.

solve gives us input_size: Int32 and kernel_size: Int32.

It seems simple, but I am missing something. Do I need to convert to Int? Or how do we handle the tb shared alloc?

I did a brute-force version, which passed the initial tests but fails on submit, so I'm hoping to fix it up with a shared-memory approach like the one in the GPU puzzles.

dolewhip · June 26, 2025, 11:40pm

On LeetGPU, the brute-force version passed for me today, and I got a shared-memory version working.

gist.github.com

https://gist.github.com/quinnavila/592f3ac1d97d2b790b7e8bff2cbf106f

1DConvolution.mojo

from gpu.host import DeviceContext
from gpu.id import block_dim, block_idx, thread_idx
from gpu import barrier
from memory import UnsafePointer, stack_allocation
from gpu.memory import AddressSpace
from math import ceildiv

alias BLOCK_SIZE = 256

fn convolution_1d_kernel(input: UnsafePointer[Float32], kernel: UnsafePointer[Float32],

This file has been truncated.

Keeping it basic with just stack-allocated shared memory worked fine. I did create an alias for the kernel size, and I'm curious how we can improve it further.
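For anyone landing here with the same question, here is a rough sketch of the pattern described above. It is not the code from the gist (which is truncated here), just a minimal illustration. The assumptions: the shared-memory size has to be a compile-time value, so the runtime kernel_size: Int32 cannot be used as the allocation size directly; MAX_KERNEL_SIZE is a hypothetical alias I picked as an upper bound on the filter length, and conv1d_shared_sketch is a made-up name. It also assumes a "valid" convolution (output length input_size - kernel_size + 1) and the stack_allocation / AddressSpace imports shown in the gist preview. Mojo's GPU API changes between releases, so check the signatures against your version.

```mojo
from gpu import barrier
from gpu.id import block_dim, block_idx, thread_idx
from gpu.memory import AddressSpace
from memory import UnsafePointer, stack_allocation

alias BLOCK_SIZE = 256
# Hypothetical compile-time upper bound on the filter length (assumption).
alias MAX_KERNEL_SIZE = 2048


fn conv1d_shared_sketch(
    input: UnsafePointer[Float32],
    kernel: UnsafePointer[Float32],
    output: UnsafePointer[Float32],
    input_size: Int32,
    kernel_size: Int32,
):
    # The allocation size is a compile-time parameter, so it comes from the
    # alias, not from the runtime kernel_size argument.
    var shared_kernel = stack_allocation[
        MAX_KERNEL_SIZE,
        Float32,
        address_space = AddressSpace.SHARED,
    ]()

    # Convert the runtime Int32 arguments to Int for indexing and loops.
    var n = Int(input_size)
    var k = Int(kernel_size)
    var tid = Int(thread_idx.x)
    var gid = Int(block_idx.x * block_dim.x + thread_idx.x)

    # Cooperative, strided copy of the filter into shared memory.
    # Assumes k <= MAX_KERNEL_SIZE.
    var i = tid
    while i < k:
        shared_kernel[i] = kernel[i]
        i += Int(block_dim.x)

    # Every thread must reach the barrier before anyone reads shared_kernel.
    barrier()

    var out_size = n - k + 1
    if gid < out_size:
        var acc: Float32 = 0
        for j in range(k):
            acc += input[gid + j] * shared_kernel[j]
        output[gid] = acc
```

Only the filter is staged in shared memory here. Staging an input tile plus its halo per block would be the next step, but it needs the same trick: a compile-time tile size such as BLOCK_SIZE + MAX_KERNEL_SIZE - 1 rather than the runtime input_size. The host-side launch is left out on purpose.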
