Slow tensor creation with experimental Tensor

When creating tensors directly from NumPy arrays, I observe a significant latency gap between the experimental Tensor approach and the torch.tensor → from_dlpack pathway.
Are there recommended ways to optimize or mitigate this overhead? I have a strict constraint against using torch, so the faster path is not available to me.

Sharing a gist (Compare tensor creation time · GitHub), which produces:

Benchmarking Tensor Creation | Shape: (1, 512) | Device: gpu:0

— Tensor_v3.constant (Numpy CPU → Max GPU) —
Avg Latency: 0.2157 ms
Min Latency: 0.1367 ms
Max Latency: 1.1027 ms
— Numpy → Torch(GPU) → Max.from_dlpack —
Avg Latency: 0.0690 ms
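For reference, per-call numbers like the ones above can be gathered with a simple warmup-then-time loop. This is a minimal sketch, not the gist's actual code; `make_tensor` is a placeholder for whichever creation path is being timed:

```python
import time
import numpy as np

def benchmark(make_tensor, arr, warmup=10, iters=100):
    # Warm up first so one-time costs (caching, compilation) don't skew results.
    for _ in range(warmup):
        make_tensor(arr)
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        make_tensor(arr)
        times.append((time.perf_counter() - start) * 1e3)  # ms per call
    return sum(times) / len(times), min(times), max(times)

arr = np.zeros((1, 512), dtype=np.float32)
# np.asarray stands in for the real creation path being measured.
avg, lo, hi = benchmark(np.asarray, arr)
print(f"Avg Latency: {avg:.4f} ms | Min: {lo:.4f} ms | Max: {hi:.4f} ms")
```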


This is the main reason Tensor is still experimental :smiley: we’re actively working on improving this and will have more to share soon.

The main issue is that Tensor.constant() is going through a graph + compiling and executing a kernel, while from_dlpack goes directly through the driver. Tensor.constant also does some checking to validate that you’re not losing precision on your inputs, etc.
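The precision validation mentioned here can be illustrated with plain NumPy: a round-trip cast reveals whether a narrowing conversion loses information. This is only a sketch of the kind of check involved, not MAX's actual implementation:

```python
import numpy as np

def loses_precision(arr, target_dtype):
    """Return True if casting arr to target_dtype cannot round-trip exactly."""
    cast = arr.astype(target_dtype)
    return not np.array_equal(arr, cast.astype(arr.dtype))

a = np.array([1.0, 2.0], dtype=np.float64)    # exactly representable in float32
b = np.array([1e-300], dtype=np.float64)      # underflows to 0.0 in float32

print(loses_precision(a, np.float32))  # False: values round-trip exactly
print(loses_precision(b, np.float32))  # True: value is lost in the cast
```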

Thank you so much for the benchmark! We’ll add this to our internal product benchmark suite so we’ll be measuring it and expect it to improve over time.

In the meantime, I recommend

data = driver.Tensor.from_dlpack(numpy_array).to(driver.Accelerator())
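The reason the from_dlpack path is cheap is that the consumer adopts the producer's buffer via the DLPack protocol instead of copying it. NumPy implements both ends of the protocol, so the zero-copy behavior is easy to verify on its own (the producer side works the same way when handing an array to `driver.Tensor.from_dlpack`):

```python
import numpy as np

src = np.arange(8, dtype=np.float32)
view = np.from_dlpack(src)   # zero-copy: adopts src's buffer via DLPack

src[0] = 42.0                # a write to the source is visible in the view,
print(view[0])               # proving no copy was made
assert np.shares_memory(src, view)
```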

Totally expected. Tensor.constant is slow because it builds a graph, compiles a kernel, and runs validation checks on every call. At small sizes, that extra work dominates everything else.

If you can’t use torch, don’t use the experimental Tensor on hot paths; always use the driver API instead. Use Tensor.from_dlpack from NumPy and reuse GPU tensors whenever you can. The experimental Tensor isn’t optimized for fast creation yet.
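The "reuse tensors" advice amounts to hoisting allocation out of the hot loop and copying into the existing buffer each iteration. A minimal sketch of the pattern with a NumPy array standing in for a preallocated GPU tensor:

```python
import numpy as np

# Allocate the destination once, outside the hot loop ...
buf = np.empty((1, 512), dtype=np.float32)

def step(new_data):
    # ... then copy new values into the existing buffer rather than
    # creating a fresh tensor on every call. The same idea applies to a
    # preallocated GPU tensor: write into it instead of recreating it.
    np.copyto(buf, new_data)
    return buf

out1 = step(np.ones((1, 512), dtype=np.float32))
out2 = step(np.zeros((1, 512), dtype=np.float32))
assert out1 is out2          # same storage reused across calls
```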