Tensor Creation Allocates Memory on All GPUs and Fails with CUDA_VISIBLE_DEVICES

Package Versions

max                                      26.1.0.dev2026010820
max-core                                 26.1.0.dev2026010820
max-mojo-libs                            26.1.0.dev2026010820
max-shmem-libs                           26.1.0.dev2026010820

1. Tensor Allocates GPU Memory on All Devices

When creating a simple CPU tensor, the process allocates approximately 616 MiB of GPU memory on every available GPU (8 GPUs in my case).

Reproduction Code:

from max.dtype import DType
from max.experimental.tensor import Tensor
from max.driver import CPU, Accelerator

gpu_tensor = Tensor.ones([2, 3], dtype=DType.float32, device=Accelerator())
# cpu_tensor = Tensor.ones([2, 3], dtype=DType.float32, device=CPU())  # behaves the same as the GPU tensor

After running this code, nvidia-smi shows that the Python process has allocated memory on all 8 GPUs:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          843022      C   python                                 3894MiB |
|    1   N/A  N/A          843022      C   python                                  616MiB |
|    2   N/A  N/A          843022      C   python                                  616MiB |
|    3   N/A  N/A          843022      C   python                                  616MiB |
|    4   N/A  N/A          843022      C   python                                  616MiB |
|    5   N/A  N/A          843022      C   python                                  616MiB |
|    6   N/A  N/A          843022      C   python                                  616MiB |
|    7   N/A  N/A          843022      C   python                                  616MiB |
+-----------------------------------------------------------------------------------------+

2. Setting CUDA_VISIBLE_DEVICES Causes Failure

When CUDA_VISIBLE_DEVICES is set, the code fails with an error whether a CPU or an Accelerator device tensor is requested:

$ CUDA_VISIBLE_DEVICES=1 python tests/tensor_test.py

Traceback (most recent call last):
  File "/home/jovyan/eunikpark/modular_workspace/tests/tensor_test.py", line 8, in <module>
    cpu_tensor = Tensor.ones([2,3], dtype=DType.float32, device=CPU())
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/site-packages/max/experimental/tensor.py", line 768, in ones
    return cls.full(shape, value=1, dtype=dtype, device=device)
           ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/site-packages/max/experimental/tensor.py", line 612, in full
    cls.constant(value, dtype=dtype, device=device), shape
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/site-packages/max/experimental/tensor.py", line 570, in constant
    return F.constant(value, dtype, device)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/site-packages/max/experimental/functional.py", line 185, in wrapped
    with contextlib.ExitStack() as stack:
         ~~~~~~~~~~~~~~~~~~~~^^
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/contextlib.py", line 619, in __exit__
    raise exc
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/contextlib.py", line 604, in __exit__
    if cb(*exc_details):
       ~~^^^^^^^^^^^^^^
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/site-packages/max/experimental/realization_context.py", line 315, in __exit__
    F._run(self.realize_all())
    ~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/site-packages/max/experimental/functional.py", line 79, in _run
    return asyncio.run(coro)
           ~~~~~~~~~~~^^^^^^
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/asyncio/base_events.py", line 725, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/site-packages/max/experimental/realization_context.py", line 224, in realize_all
    model = _session().load(graph)
            ~~~~~~~~^^
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/site-packages/max/experimental/realization_context.py", line 128, in _session
    devices = driver.load_devices(device_specs)
  File "/home/jovyan/eunikpark/.conda/envs/max/lib/python3.13/site-packages/max/driver/driver.py", line 94, in load_devices
    devices.append(Accelerator(device_spec.id))
                   ~~~~~~~~~~~^^^^^^^^^^^^^^^^
ValueError: failed to create device: No supported "gpu" device available.
  CUDA information: CUDA call failed: CUDA_ERROR_INVALID_DEVICE (invalid device ordinal)
To get more accurate error information, set MODULAR_DEVICE_CONTEXT_SYNC_MODE=true.
  HIP information:  Failed to open library "libamdhip64.so": libamdhip64.so: cannot open shared object file: No such file or directory
To get more accurate error information, set MODULAR_DEVICE_CONTEXT_SYNC_MODE=true.
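The invalid-ordinal error is consistent with the runtime enumerating all physical device IDs while CUDA_VISIBLE_DEVICES remaps them (my assumption, not confirmed against MAX internals). With CUDA_VISIBLE_DEVICES=1, the single visible GPU must be opened as logical ordinal 0; opening ordinal 1 is invalid. A minimal sketch of that remapping, using a hypothetical helper:

```python
import os

def visible_device_ordinals() -> list[int]:
    """Return the logical CUDA ordinals a process may legally open.

    Hypothetical helper for illustration. With CUDA_VISIBLE_DEVICES=1,
    the one visible GPU is logical ordinal 0, so attempting ordinal 1
    raises CUDA_ERROR_INVALID_DEVICE.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return []  # driver exposes all physical devices; count unknown here
    ids = [v for v in visible.split(",") if v.strip()]
    # Logical ordinals are always 0..N-1, regardless of which
    # physical IDs were listed in the environment variable.
    return list(range(len(ids)))
```

So if load_devices iterates over all 8 physical IDs, `Accelerator(1)` fails as soon as only one device is visible.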

Thanks for your questions!

You are seeing these memory allocations on all GPUs because MAX uses its own memory manager for device memory. When you initialize data on any GPU, MAX initializes that memory manager on all GPUs. This is a performance optimization, made on the assumption that the code will soon be allocating memory on all of the GPUs.

I agree that it makes your case look odd, but it should help performance as your code grows more complex and uses more GPUs.
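To make the trade-off concrete, here is a toy model of an eager per-device memory pool. This is my own illustrative sketch, not MAX's actual allocator: it only shows why touching one device can reserve memory on every device up front.

```python
class PoolAllocator:
    """Toy model of an eager per-device memory pool (assumption: MAX's
    real allocator is far more sophisticated; this only illustrates the
    eager-initialization strategy described above)."""

    def __init__(self, num_devices: int, reserve_mib: int = 616):
        # Eagerly reserve a pool on every device at startup, so later
        # allocations on any device avoid slow driver calls.
        self.pools = {dev: reserve_mib for dev in range(num_devices)}

    def allocate(self, device: int, mib: int) -> None:
        if mib > self.pools[device]:
            # Grow the pool (a driver call) only when the reserve
            # on this device is exhausted.
            self.pools[device] += mib
```

The first allocation on any device is cheap because the reserve already exists; the cost is the up-front footprint on GPUs the program may never use.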

This looks like a bug on our side. Could you open a new issue on GitHub? We haven't tested CUDA_VISIBLE_DEVICES in our stack yet, but it should work, so we will build out the testing and fix this.

Thank you.
I have opened a new issue!
[BUG]: Setting CUDA_VISIBLE_DEVICES Causes Failure #5761


Thanks, we will have a look at this.

The 616 MiB may come from MAX creating a CUDA context on each GPU; each CUDA context implicitly uses some memory. We may try to support creating the CUDA context only on the accelerator requested, rather than on all accelerators.
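The alternative suggested above can be sketched as a lazy context cache: a context is created only on first use of a device, so untouched GPUs pay no memory cost. This is a hypothetical illustration of the strategy, not MAX code.

```python
class LazyContextCache:
    """Sketch of lazy per-device context creation (hypothetical;
    illustrates the strategy of creating a CUDA context only for the
    accelerator actually requested)."""

    def __init__(self):
        self._contexts: dict[int, object] = {}

    def context_for(self, device: int) -> object:
        # The first touch of a device pays the context cost
        # (~616 MiB in the report above); untouched devices pay nothing.
        if device not in self._contexts:
            self._contexts[device] = object()  # stand-in for cuCtxCreate
        return self._contexts[device]
```

With this scheme, the single-GPU example in the original report would only show memory usage on the one device it actually allocates on.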