Slow tensor creation with experimental Tensor

When creating tensors directly from NumPy arrays, I observe a significant latency gap between the experimental Tensor approach and the torch.tensor → from_dlpack pathway.
Are there recommended ways to optimize or mitigate this overhead? I have a strict constraint against using torch, so the faster path is not available to me.

Sharing a gist (Compare tensor creation time · GitHub), which produces:

Benchmarking Tensor Creation | Shape: (1, 512) | Device: gpu:0

— Tensor_v3.constant (Numpy CPU → Max GPU) —
Avg Latency: 0.2157 ms
Min Latency: 0.1367 ms
Max Latency: 1.1027 ms
— Numpy → Torch(GPU) → Max.from_dlpack —
Avg Latency: 0.0690 ms
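For reference, per-call numbers like the ones above can be gathered with a simple warmup-then-time loop. This is a minimal sketch, not the gist's actual code; `make_tensor` is a placeholder for whichever creation path is being timed:

```python
import time
import numpy as np

def benchmark(make_tensor, arr, warmup=10, iters=100):
    # Warm up first so one-time costs (caching, compilation) don't skew results.
    for _ in range(warmup):
        make_tensor(arr)
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        make_tensor(arr)
        times.append((time.perf_counter() - start) * 1e3)  # ms per call
    return sum(times) / len(times), min(times), max(times)

arr = np.zeros((1, 512), dtype=np.float32)
# np.asarray stands in for the real creation path being measured.
avg, lo, hi = benchmark(np.asarray, arr)
print(f"Avg Latency: {avg:.4f} ms | Min: {lo:.4f} ms | Max: {hi:.4f} ms")
```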


This is the main reason Tensor is still experimental :smiley: we’re actively working on improving this and will have more to share soon.

The main issue is that Tensor.constant() is going through a graph + compiling and executing a kernel, while from_dlpack goes directly through the driver. Tensor.constant also does some checking to validate that you’re not losing precision on your inputs, etc.
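The precision validation mentioned here can be illustrated with plain NumPy: a round-trip cast reveals whether a narrowing conversion loses information. This is only a sketch of the kind of check involved, not MAX's actual implementation:

```python
import numpy as np

def loses_precision(arr, target_dtype):
    """Return True if casting arr to target_dtype cannot round-trip exactly."""
    cast = arr.astype(target_dtype)
    return not np.array_equal(arr, cast.astype(arr.dtype))

a = np.array([1.0, 2.0], dtype=np.float64)    # exactly representable in float32
b = np.array([1e-300], dtype=np.float64)      # underflows to 0.0 in float32

print(loses_precision(a, np.float32))  # False: values round-trip exactly
print(loses_precision(b, np.float32))  # True: value is lost in the cast
```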

Thank you so much for the benchmark! We’ll add this to our internal product benchmark suite so we’ll be measuring it and expect it to improve over time.

In the meantime, I recommend

data = driver.Tensor.from_dlpack(numpy_array).to(driver.Accelerator())
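The reason the from_dlpack path is cheap is that the consumer adopts the producer's buffer via the DLPack protocol instead of copying it. NumPy implements both ends of the protocol, so the zero-copy behavior is easy to verify on its own (the producer side works the same way when handing an array to `driver.Tensor.from_dlpack`):

```python
import numpy as np

src = np.arange(8, dtype=np.float32)
view = np.from_dlpack(src)   # zero-copy: adopts src's buffer via DLPack

src[0] = 42.0                # a write to the source is visible in the view,
print(view[0])               # proving no copy was made
assert np.shares_memory(src, view)
```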

Totally expected. Tensor.constant is slow because it builds a graph, compiles a kernel, and runs validation checks on every call. At small sizes, that extra work dominates everything else.

If you can’t use torch, don’t use the experimental Tensor on hot paths; always use the driver API instead. Use Tensor.from_dlpack from NumPy and reuse GPU tensors whenever you can. The experimental Tensor isn’t optimized for fast creation yet.
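The "reuse tensors" advice amounts to hoisting allocation out of the hot loop and copying into the existing buffer each iteration. A minimal sketch of the pattern with a NumPy array standing in for a preallocated GPU tensor:

```python
import numpy as np

# Allocate the destination once, outside the hot loop ...
buf = np.empty((1, 512), dtype=np.float32)

def step(new_data):
    # ... then copy new values into the existing buffer rather than
    # creating a fresh tensor on every call. The same idea applies to a
    # preallocated GPU tensor: write into it instead of recreating it.
    np.copyto(buf, new_data)
    return buf

out1 = step(np.ones((1, 512), dtype=np.float32))
out2 = step(np.zeros((1, 512), dtype=np.float32))
assert out1 is out2          # same storage reused across calls
```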