MAX Graph Python API built-in ops fail to compile for GPU - what's the correct pattern?

Hi! I’m trying to run MAX graphs on GPU but getting compilation failures for even simple operations. Looking for guidance on the correct API pattern.

Environment:

  • MAX version: modular-26.1.0.dev2026011405 (nightly)

  • GPU: NVIDIA 3090

  • OS: Linux

import numpy as np
import os
from max import engine, driver
from max.graph import Graph, ops, TensorType, DeviceRef
from max.dtype import DType

os.environ["MODULAR_DEVICE_CONTEXT_SYNC_MODE"] = "true"

def test_sum_gpu():
    device = driver.Accelerator(id=0)
    device_ref = DeviceRef.GPU(0)
    session = engine.InferenceSession(devices=[device])
    
    SIZE = 1024
    
    def simple_sum(x):
        return ops.sum(x)
    
    graph = Graph("test_sum", simple_sum, 
                  input_types=[TensorType(DType.float32, (SIZE,), device_ref)])
    
    model = session.load(graph)
    
    data = np.random.randn(SIZE).astype(np.float32)
    gpu_tensor = driver.Tensor(data, device)
    
    output = model.execute(gpu_tensor)
    print(f"Sum: {output[0].to_numpy()}")

test_sum_gpu()
This fails with:

ValueError: At oss/modular/mojo/stdlib/std/gpu/host/device_context.mojo:1967:17:
CUDA call failed: CUDA_ERROR_ILLEGAL_ADDRESS (an illegal memory access was encountered)

Questions:

  1. Is there something wrong with how I’m setting up the graph or tensors for GPU execution?

  2. Is there a different pattern I should use for reductions like ops.sum on GPU?

  3. Should I be using a stable release instead of nightly for GPU support?

Thanks for any guidance!

The first thing that jumps out at me is that you likely need to explicitly move your tensors to and from the device. Constructing a device tensor directly around host memory can lead to illegal-address errors like this one. For your source input tensor:

gpu_tensor = driver.Tensor.from_numpy(data).to(device)

and then to copy the output to the host:

output = model.execute(gpu_tensor)[0]
output = output.to(driver.CPU())

There are some examples in the modular repository that might help demonstrate this. Note that driver.Tensor recently changed to driver.Buffer in the latest nightlies (to allow for a more general Tensor type we’re using in new APIs).
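If it helps build intuition, here's a host-side analogy of the failure mode in plain numpy (no MAX required). Wrapping an existing buffer shares memory, while copying creates an independent allocation — my guess is that constructing a device tensor directly over the numpy array leaves the kernel chasing a host pointer, which is consistent with the CUDA_ERROR_ILLEGAL_ADDRESS message, whereas `from_numpy(...).to(device)` performs a real copy onto the device:

```python
import numpy as np

host = np.arange(4, dtype=np.float32)

view = np.asarray(host)            # wraps the same buffer (no copy)
copy = np.array(host, copy=True)   # independent allocation (real copy)

host[0] = 99.0
print(view[0])   # 99.0 -- the view aliases host memory
print(copy[0])   # 0.0  -- the copy is unaffected
```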

Got it, that makes sense! I switched to stable for now to rule out any nightly issues, and I did notice that rename in the changelog. Thanks for your swift response. I'll give it a shot and report back here.

This worked!! Thank you. Pasting the full script here for future reference.

import numpy as np
import os
from max import engine, driver
from max.graph import Graph, ops, TensorType, DeviceRef
from max.dtype import DType

os.environ["MODULAR_DEVICE_CONTEXT_SYNC_MODE"] = "true"

def test_sum_gpu():
    device = driver.Accelerator(id=0)
    device_ref = DeviceRef.GPU(0)
    session = engine.InferenceSession(devices=[device])
    
    SIZE = 1024
    
    def simple_sum(x):
        return ops.sum(x)
    
    graph = Graph("test_sum", simple_sum, 
                  input_types=[TensorType(DType.float32, (SIZE,), device_ref)])
    
    model = session.load(graph)
    
    np.random.seed(42)
    data = np.random.randn(SIZE).astype(np.float32)
    expected = np.sum(data)
    
    # FIX: Use from_numpy().to(device) pattern
    gpu_tensor = driver.Tensor.from_numpy(data).to(device)
    
    output = model.execute(gpu_tensor)
    
    # FIX: Copy output to CPU before reading
    cpu_output = output[0].to(driver.CPU()).to_numpy()
    result = float(cpu_output.item() if cpu_output.ndim == 0 else cpu_output[0])
    
    print(f"Sum: {result:.4f}")
    print(f"Expected: {expected:.4f}")
    print(f"Match: {'✅ PASSED' if abs(result - expected) < 0.01 else '❌ FAILED'}")

test_sum_gpu()
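One note for future readers on the tolerance check in the script above: float32 reductions are order-dependent, so a GPU reduction tree generally won't bit-match a host-side `np.sum`, and an absolute tolerance is the right comparison. A quick numpy-only sketch of the effect (no MAX required; the left-to-right loop stands in for "some other summation order"):

```python
import numpy as np

np.random.seed(42)
data = np.random.randn(1024).astype(np.float32)

# np.sum uses pairwise summation; a plain left-to-right loop accumulates
# in a different order, much as a GPU reduction tree would in yet another.
sequential = np.float32(0.0)
for v in data:
    sequential = np.float32(sequential + v)

pairwise = data.sum(dtype=np.float32)

print(f"sequential: {sequential:.6f}")
print(f"pairwise:   {pairwise:.6f}")
# The two orders agree only up to rounding, hence the tolerance check.
assert abs(float(sequential) - float(pairwise)) < 0.01
```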