I am working through the book at llm.modular.com, and I am seeing the same error on both a MacBook Pro M1 and an Nvidia DGX Spark. This is the error when I attempt to run “pixi run s05”:
(Modular) toddb@fidspark max-llm-book % pixi run s05
Pixi task (s05): python tests/test.step_05.py
Running tests for Step 05: Token Embeddings...
Results:
✅ Embedding is correctly imported from max.nn.module_v3
✅ Module is correctly imported from max.nn.module_v3
✅ GPT2Embeddings class exists
✅ GPT2Embeddings inherits from Module
✅ self.wte embedding layer is created correctly
✅ config.vocab_size is used correctly
✅ config.n_embd is used correctly
✅ self.wte is called with input_ids in __call__ method
✅ All placeholder 'None' values have been replaced
✅ GPT2Embeddings class can be instantiated
✅ GPT2Embeddings.wte is initialized
tmb: (1) We are on track.
✅ GPT2Embeddings forward pass executes without errors
tmb: (2) We are on track.
✅ Output shape is correct: (2, 4, 768)
tmb: (3) We are on track.
❌ Functional test failed: Failed to compile and execute graph! Please file an issue. This error should have been caught at op creation time.
Failed to compile and execute graph! Please file an issue. This error should have been caught at op creation time.
============================================================
⚠️ Some checks failed. Review the hints above and try again.
============================================================
Please let me know what my next step should be.
Thank you.
-Todd B.
P.S. To get to this point on the DGX Spark, I had to modify the step_05 test, but the same thing happened on the MacBook Pro with no modification.
It could be a different issue with the nightlies, but one possibility is that compilation is failing when targeting the DGX Spark’s GPU. Unfortunately, we don’t yet support the brand-new devices in the DGX Spark and Jetson Thor: we use an internal version of libnvptxcompiler to do ahead-of-time compilation of PTX for NVIDIA hardware, and it hasn’t yet been updated to support CUDA 13. CUDA 13 drops support for NVIDIA hardware older than Turing, as well as driver versions below 580, so we’re working on a solution that lets us add support for the new hardware without dropping the older NVIDIA GPUs.
In the meantime, if that’s the case, you may be able to change this line in main.py to read
device = CPU()
and force execution on the CPU rather than the GPU. I thought this would default to the CPU on the MacBook Pro as well, though, so there may be something else going on here.
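As a concrete sketch of that edit: this assumes `CPU` lives in MAX’s `max.driver` module (as in current releases), and the import is guarded so the snippet is harmless on a machine without MAX installed.

```python
# Hypothetical sketch of the suggested main.py change. The guarded
# import means this runs even where MAX isn't available; in the real
# main.py you would simply replace the accelerator selection.
try:
    from max.driver import CPU  # assumed MAX driver API
    device = CPU()  # pin execution to the CPU instead of the default GPU
    print("Using device:", device)
except ImportError:
    print("MAX not installed; in main.py, set `device = CPU()` to force CPU execution.")
```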
Yes, I fixed the problem on the DGX. The MacBook Pro is failing with the llm.modular.com book right out of the box. I ran git status to verify I had the newest version (no changes since Friday).
If you come up with anything, please let me know. Until then, I’ll see what I can do to fix the problem on the DGX (graph compiler), but I can’t guarantee I’ll fix anything, since I do this in my free time.
It would be cool to get it working on both so I can compare the performance.
Also, I should be able to work around the graph compiler problem by disabling it, if there is an easy way to do that.
Just putting this here as Orin was not mentioned specifically – I am hitting the same error on nightly with an Orin Nano. I have yet to confirm whether the suggested workaround of using the CPU works, but I anticipate that it should.
I think the problem with steps 5+ is that on GPU-equipped systems there’s a device mismatch: the input tensors in the test cases were being placed on the CPU, while the graph was running by default on the GPU. Additionally, when running on GPU a datatype of bfloat16 is assumed, and NumPy can’t handle that datatype by default.
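As a quick illustration of the NumPy side of this (plain NumPy, no MAX required): stock NumPy has no native bfloat16 dtype, so a graph output in bfloat16 has to be cast (e.g. to float32) on the way out before it can be viewed as a NumPy array.

```python
import numpy as np

# Stock NumPy does not recognize bfloat16 as a dtype name, so asking
# for it raises TypeError unless an extension such as ml_dtypes has
# registered it.
try:
    np.dtype("bfloat16")
    print("bfloat16 available (a dtype extension is registered)")
except TypeError:
    print("bfloat16 not understood by stock NumPy")
```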
I’ve put up a PR here that should fix all but steps 8 and 12. Those require a little more investigation; they were failing with rebind errors and shape mismatches even after these fixes.
Thanks @BradLarson! So if I understand right, this should unblock the noted steps for Apple silicon, but newer Nvidia edge hardware still suffers from the incompatibility you specified before, is that right?
If you pull the latest from the repository, the Orin Nano should now work for all but steps 8 and 12. We weren’t handling the case of an attached NVIDIA GPU correctly. I’m working on the last two steps now.
Apple silicon should have been working, because we were treating those systems as if they didn’t have a MAX-supported GPU just yet.