MAX Graph Python API built-in ops fail to compile for GPU - what's the correct pattern?

Hi! I’m trying to run MAX graphs on GPU but getting compilation failures for even simple operations. Looking for guidance on the correct API pattern.

Environment:

  • MAX version: modular-26.1.0.dev2026011405 (nightly)

  • GPU: NVIDIA 3090

  • OS: Linux

import numpy as np
import os
from max import engine, driver
from max.graph import Graph, ops, TensorType, DeviceRef
from max.dtype import DType

os.environ["MODULAR_DEVICE_CONTEXT_SYNC_MODE"] = "true"

def test_sum_gpu():
    device = driver.Accelerator(id=0)
    device_ref = DeviceRef.GPU(0)
    session = engine.InferenceSession(devices=[device])
    
    SIZE = 1024
    
    def simple_sum(x):
        return ops.sum(x)
    
    graph = Graph("test_sum", simple_sum, 
                  input_types=[TensorType(DType.float32, (SIZE,), device_ref)])
    
    model = session.load(graph)
    
    data = np.random.randn(SIZE).astype(np.float32)
    gpu_tensor = driver.Tensor(data, device)
    
    output = model.execute(gpu_tensor)
    print(f"Sum: {output[0].to_numpy()}")

test_sum_gpu()
This fails with:

ValueError: At oss/modular/mojo/stdlib/std/gpu/host/device_context.mojo:1967:17:
CUDA call failed: CUDA_ERROR_ILLEGAL_ADDRESS (an illegal memory access was encountered)

Questions:

  1. Is there something wrong with how I’m setting up the graph or tensors for GPU execution?

  2. Is there a different pattern I should use for reductions like ops.sum on GPU?

  3. Should I be using a stable release instead of nightly for GPU support?

Thanks for any guidance!

The first thing that jumps out at me is that you likely need to explicitly move your tensors to and from the device. Constructing a device tensor directly around host memory can lead to illegal-address errors like this one. For your source input tensor:

gpu_tensor = driver.Tensor.from_numpy(data).to(device)

and then to copy the output to the host:

output = model.execute(gpu_tensor)[0]
output = output.to(driver.CPU())

There are some examples in the modular repository that might help demonstrate this. Note that driver.Tensor recently changed to driver.Buffer in the latest nightlies (to allow for a more general Tensor type we’re using in new APIs).
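If it helps build intuition, here's a host-side analogy of the failure mode in plain numpy (no MAX required). Wrapping an existing buffer shares memory, while copying creates an independent allocation — my guess is that constructing a device tensor directly over the numpy array leaves the kernel chasing a host pointer, which is consistent with the CUDA_ERROR_ILLEGAL_ADDRESS message, whereas `from_numpy(...).to(device)` performs a real copy onto the device:

```python
import numpy as np

host = np.arange(4, dtype=np.float32)

view = np.asarray(host)            # wraps the same buffer (no copy)
copy = np.array(host, copy=True)   # independent allocation (real copy)

host[0] = 99.0
print(view[0])   # 99.0 -- the view aliases host memory
print(copy[0])   # 0.0  -- the copy is unaffected
```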

Got it, that makes sense! I switched to stable for now to rule out any nightly issues, and I did notice that rename in the changelog. Thanks for your swift response. I'll give it a shot and report back here.

This worked!! Thank you. Pasting the full script here for future reference.

import numpy as np
import os
from max import engine, driver
from max.graph import Graph, ops, TensorType, DeviceRef
from max.dtype import DType

os.environ["MODULAR_DEVICE_CONTEXT_SYNC_MODE"] = "true"

def test_sum_gpu():
    device = driver.Accelerator(id=0)
    device_ref = DeviceRef.GPU(0)
    session = engine.InferenceSession(devices=[device])
    
    SIZE = 1024
    
    def simple_sum(x):
        return ops.sum(x)
    
    graph = Graph("test_sum", simple_sum, 
                  input_types=[TensorType(DType.float32, (SIZE,), device_ref)])
    
    model = session.load(graph)
    
    np.random.seed(42)
    data = np.random.randn(SIZE).astype(np.float32)
    expected = np.sum(data)
    
    # FIX: Use from_numpy().to(device) pattern
    gpu_tensor = driver.Tensor.from_numpy(data).to(device)
    
    output = model.execute(gpu_tensor)
    
    # FIX: Copy output to CPU before reading
    cpu_output = output[0].to(driver.CPU()).to_numpy()
    result = float(cpu_output.item() if cpu_output.ndim == 0 else cpu_output[0])
    
    print(f"Sum: {result:.4f}")
    print(f"Expected: {expected:.4f}")
    print(f"Match: {'✅ PASSED' if abs(result - expected) < 0.01 else '❌ FAILED'}")

test_sum_gpu()
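One note for future readers on the tolerance check in the script above: float32 reductions are order-dependent, so a GPU reduction tree generally won't bit-match a host-side `np.sum`, and an absolute tolerance is the right comparison. A quick numpy-only sketch of the effect (no MAX required; the left-to-right loop stands in for "some other summation order"):

```python
import numpy as np

np.random.seed(42)
data = np.random.randn(1024).astype(np.float32)

# np.sum uses pairwise summation; a plain left-to-right loop accumulates
# in a different order, much as a GPU reduction tree would in yet another.
sequential = np.float32(0.0)
for v in data:
    sequential = np.float32(sequential + v)

pairwise = data.sum(dtype=np.float32)

print(f"sequential: {sequential:.6f}")
print(f"pairwise:   {pairwise:.6f}")
# The two orders agree only up to rounding, hence the tolerance check.
assert abs(float(sequential) - float(pairwise)) < 0.01
```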