Hi, I want to check whether `var local_sum: output.element_type = 0` creates a thread-local variable, and whether, in general, all variables are thread-local unless otherwise specified in the Mojo GPU programming model.
A Claude Code search told me that all variables, unless explicitly declared otherwise, are thread-local on the GPU, which makes sense and would be efficient, but I couldn't find this stated in the documentation for `var` (Variables | Modular).
This question came up while I was working through GPU puzzle no. 11. This is the given solution, which includes the `var local_sum: output.element_type = 0` line:
```mojo
fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    output: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared_a = tb[dtype]().row_major[SIZE]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV]().shared().alloc()
    if global_i < SIZE:
        shared_a[local_i] = a[global_i]
    else:
        shared_a[local_i] = 0
    if global_i < CONV:
        shared_b[local_i] = b[global_i]
    barrier()
    if global_i < SIZE:
        # Note: using `var` lets us include an explicit type annotation;
        # `output.element_type` is available on LayoutTensor
        var local_sum: output.element_type = 0
        # Note: the `@parameter` decorator unrolls the loop at compile time,
        # given that `CONV` is a compile-time constant
        # See: https://docs.modular.com/mojo/manual/decorators/parameter/#parametric-for-statement
        @parameter
        for j in range(CONV):
            # Bonus: do we need this check for this specific example with fixed SIZE, CONV?
            if local_i + j < SIZE:
                local_sum += shared_a[local_i + j] * shared_b[j]
        output[global_i] = local_sum
```
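To state my assumption concretely (this sketch and its names are hypothetical, not from the puzzle): I expect explicit `.shared()` allocations to be shared across the block, and everything else, like `var` locals, to be private to each thread:

```mojo
fn scope_demo(output: LayoutTensor[mut=True, dtype, out_layout]):
    # One buffer per block, visible to every thread in the block:
    shared_buf = tb[dtype]().row_major[SIZE]().shared().alloc()
    # Assumed thread-local: each thread gets its own independent copy
    # (e.g. in a register), so no barrier() is needed around it.
    var private_val: output.element_type = 0
    private_val += 1
    # Each thread writes only its own element, using its own copy:
    output[thread_idx.x] = private_val
```

Is that the right mental model?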
A big thank you in advance!