Hi! I’m trying to run MAX graphs on GPU but hitting failures on even simple operations. Looking for guidance on the correct API pattern.
Environment:
- MAX version: modular-26.1.0.dev2026011405 (nightly)
- GPU: NVIDIA 3090
- OS: Linux
```python
import numpy as np
import os

from max import engine, driver
from max.graph import Graph, ops, TensorType, DeviceRef
from max.dtype import DType

os.environ["MODULAR_DEVICE_CONTEXT_SYNC_MODE"] = "true"


def test_sum_gpu():
    device = driver.Accelerator(id=0)
    device_ref = DeviceRef.GPU(0)
    session = engine.InferenceSession(devices=[device])

    SIZE = 1024

    def simple_sum(x):
        return ops.sum(x)

    graph = Graph(
        "test_sum",
        simple_sum,
        input_types=[TensorType(DType.float32, (SIZE,), device_ref)],
    )
    model = session.load(graph)

    data = np.random.randn(SIZE).astype(np.float32)
    gpu_tensor = driver.Tensor(data, device)

    output = model.execute(gpu_tensor)
    print(f"Sum: {output[0].to_numpy()}")


test_sum_gpu()
```
```
ValueError: At oss/modular/mojo/stdlib/std/gpu/host/device_context.mojo:1967:17:
CUDA call failed: CUDA_ERROR_ILLEGAL_ADDRESS (an illegal memory access was encountered)
```
Questions:
1. Is there something wrong with how I’m setting up the graph or tensors for GPU execution?
2. Is there a different pattern I should use for reductions like `ops.sum` on GPU?
3. Should I be using a stable release instead of nightly for GPU support?
The first thing that jumps out at me is that you need to move your tensors to and from the device explicitly. Constructing a device tensor directly from host memory can lead to illegal-address errors like this one. For your source input tensor, create it on the host first and then transfer it to the GPU.
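A minimal sketch of that pattern, based on the move-to-device idiom in the modular repository examples (`Tensor.from_numpy(...).to(device)` and `driver.CPU()` are taken from those examples; exact names may differ slightly between nightlies):

```python
import numpy as np

# Guarded import so the host-side portion of this sketch runs even
# without a MAX installation; on a real setup MAX will be present.
try:
    from max import driver
    HAVE_MAX = True
except ImportError:
    HAVE_MAX = False

# Build the input on the host first...
data = np.random.randn(1024).astype(np.float32)

if HAVE_MAX:
    device = driver.Accelerator(id=0)

    # ...then move it to the accelerator explicitly, rather than
    # constructing a device tensor directly from host memory:
    gpu_tensor = driver.Tensor.from_numpy(data).to(device)

    # After model.execute(gpu_tensor), move the result back to the
    # host before reading it:
    #   result = output[0].to(driver.CPU()).to_numpy()
```

The same applies on the output side: transfer the result back to the CPU before calling `to_numpy()`.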
There are examples in the modular repository that demonstrate this. Note that `driver.Tensor` recently changed to `driver.Buffer` in the latest nightlies (to make room for a more general `Tensor` type we’re using in new APIs).
Got it! That makes sense. I switched to stable for now to reduce any doubt, but I did notice that rename in the changelog somewhere. Thanks for your swift response. I’ll give it a shot and report back here.