Custom MultiHead Self Attention Transformer Training Phase using AMD RX 9070 XT 16GB. Python/Pythorch Vs Mojo

Andrea · January 23, 2026, 8:35pm

I wrote a program in python that uses the pytorch package that performs the four steps to build an LLM and use it.

Preparation of the data corpus.
Training on unsupervised data (wikipedia dump).
Training on supervised data (csv file with q&a).
Inference on the model.
As an algorithm I built a custom multi head self attention transformer (I didn’t use hugging face packages but only pytorch).
At the end of the training I get a model (.safetensor) of 1.3 GB with 314.32 million parameters. It works and the inference also gives some sensible answers on the topic of fine tuning. The problem is the training phases which take a very long time (ranging from 6 hours to several days. It depends on the number of epochs and other parameters).
My setup is:
OS: Ubuntu 24.04.3 LTS (x86_64) Desktop
CPU: AMD Ryzen 7 7800X3D 8-Core Processor 32GB
GPU: AMD Radeon RX 9070 XT 16GB GFX=12.0.1
SSD: 4TB
Rocm: 7.1.1.git351ff442
Python version: 3.12.3
torch_version_hip_ = 7.1.52802-26aae437f6
I wonder if writing the custom multi head self attention transformer algorithm in the mojo language is worth it, i.e. it leads to training times at least 10 times less.
I started practicing mojo by following the puzzle tutorial. Mojo works fine so far on the RX 9070 XT GPU.
I’ve seen that with mojo you get 2.5 TFLOPS performance with a tiled kernel. But is it enough?
PyTorch likely hits 30-50 TFLOPS because uses MIOpen (AMD’s deep learning library) written in Assembly language by AMD.
I hope someone can help me clarify this point. Otherwise I don’t see the point in studying a new language like mojo to use the GPU if it practically offers no advantages over traditional python in AI applications.

DarinSimmons · January 23, 2026, 11:52pm

Welcome to the forum. @BradLarson recently posted Calling all AMD RDNA users: help us bring full MAX support to your GPUs! In other words, the RDNA experience is currently functional but the performance is still being tweaked.

When you say that mojo will get 2.5TFLOPS with a tiled and Pytorch will likely get 30-50, I’m curious where you got those numbers from.

BradLarson · January 24, 2026, 3:32am

I’ll say upfront that if your requirement is “leads to training times at least 10 times less”, you’re probably not going to achieve that by replacing a single operation, even one as intensive as multi-head attention, unless the original implementation is not using the hardware properly.

Where you might see wins that large is in looking at the whole model holistically. For example, if this is eager PyTorch there might be opportunities to accelerate things by moving to a graph and making the most out of kernel fusion, or manually fusing things into large kernels. Our MAX graphs are written in Python, so you don’t need to step away from a language you are familiar with.

Mojo as a language will let you get the most out of your RX 9070 XT, because we let you get down to the intrinsics and program the accelerator however you want. For accelerated matrix multiplication, we have added support for the WMMA intrinsics that RDNA3+ uses, but we haven’t yet completed a fully optimized multi-head attention implementation in our kernels library. You have all the tools you need in Mojo to write one, though.

Again, if your objective is to reduce overall training time you may be able to profile your current implementation and find unanticipated bottlenecks. We’ve frequently seen that in some of the models we’ve optimized for inference, slowdowns can come from surprising places.

Andrea · January 24, 2026, 8:28pm

Thank you Darin for your prompt answer. I joined the forum because I’m really interested in mojo.

My goal is to get the most out of my hardware (which actually belongs to my son and he uses it for gaming) so I can learn to work with AI. I can’t afford anything else right now, and I’m not even sure what to buy (suggestions welcome). Even working on the cloud with AWS using the boto3 package seems too expensive at my learning stage.
That said:
The 2.5 TFLOPS value was achieved with my own mojo script.
The 30-50 TFLOPS value (actually, my script reports 1.37 TFLOPS for Float32 (Standard), and 70 TFLOPS for Float16 (AI Accelerators), but that seems overly optimistic) was achieved with my own Python script.
Both of these scripts, as well as my other projects (such as the custom multihead self-attention transformer with all the phases, including the onnx distribution of the model), are on github.com. All these projects are currently private, but if you’re interested, I can add you as a member. I have something similar. I’ve never worked on a GitHub team because I’ve always worked for large companies that didn’t allow it, so I’m not very experienced. I’d really appreciate working with someone who could help me get into the world of AI.
As for the IDE, I use Pycharm Community Edition (which is no longer called that) for Python and VSC for Mojo.

The AMD RX 9070 XT is RDNA 4.

maxchisto · February 6, 2026, 2:27am

I think the following bit of background info is relevant to the convo (from Pytorch Transformer docs):

This Transformer layer implements the original Transformer architecture described in the Attention Is All You Need paper. The intent of this layer is as a reference implementation for foundational understanding and thus it contains only limited features relative to newer Transformer architectures. Given the fast pace of innovation in transformer-like architectures, we recommend exploring this tutorial to build an efficient transformer layer from building blocks in core or using higher level libraries from the PyTorch Ecosystem.

Which to me sounds like that Transformer module is there for illustration / educational purposes and shouldn’t be used for anything serious and we shouldn’t expect it to be optimized.

Naturally, a question arises: which implementation should I use instead then? If there is one implemented in Mojo that is optimized, it will prob run faster than the built-in pytorch Transformer

Andrea · February 7, 2026, 4:29pm

1)About my project
Yes, you’re right when you say this project is for educational purposes only. However, it works. It should work with both CUDA and ROCm.
In any case, I’ve created a new project (pytorch based) and I’am rewriting it professionally so it can be used for production. I hope to make it public soon.

About Mojo
As for mojo I made a repository public on github. Andrea-Ceccherini/MojoTutorial_01

In this repo you can find some “pure” .mojo files which I wrote to check compatibility with ROCm and my AMD Radeon RX 9070 XT 16GB GFX=12.0.1 GPU and test its performance.
For example I wrote some kernel for matrix/tensor multiplication(GEMM) with comparisons using “RDNA4” Instruction Set Architecture.
See https://docs.amd.com/v/u/en-US/rdna4-instruction-set-architecture
So these scripts, that allow direct gpu programming (using LLVM Backend) without the use of pytorch, are written for AMD architectures.
These scripts gives very fast performance when doing computations with tensors.

About rewriting all the custom multihead self attention transformer using mojo.
In my opinion I think it is possible but likely you would be a pioneer.

Pros
Absolute control over:
Memory layout
kernel fusion
Packing strategies
Wave/block configs
Attention variants

Cons
A lot of code to write.
Dropping pytorch you have to write a lot of library (Autograd, Tensor library, Optimizers, Dataloading, Checkpointing & serialization, and so on)
Huge engineering effort
Huge debug
If you want to write code hardware agnostic I think you have to add a layer and this will likely decrease performance.
As far as I know, Mojo doesn’t have all these features at the moment. Maybe it will in the future.

Final consideration
Today python/pytorch has all these features and are well optimized.
It’s worth a try if rewriting them in mojo gives any performance benefits. In my opinion yes but needs a lot of work. Is anyone probably doing it?

owenhilyard · February 7, 2026, 4:57pm

I would use the max kernels for as much as you can. Modular’s kernel team has put a lot of effort into them and most are either SOTA or near SOTA.

@Andrea

Dropping pytorch you have to write a lot of library (Autograd, Tensor library, Optimizers, Dataloading, Checkpointing & serialization, and so on)

MAX gets you back a tensor lib, and a lot of optimizations. Some optimizations just aren’t really necessary because you can just pick the datatypes, so you don’t need a compiler to try to figure it out.

For dataloading, checkpointing and serialization you can still call out to Python in Mojo, and Mojo has zero-copy interop with pytorch tensors and numpy tensors. Ideally, all of that should live in Mojo code in the future, but for now nothing stops you from pointing pytorch at the tensors at the start of every epoch and using it to serialize them.

Autograd is a little messy, but it should be doable to do it by hand for a limited set of operations.

If you want to write code hardware agnostic I think you have to add a layer and this will likely decrease performance.

Mojo has metaprogramming so we don’t need to do that. You can share all of the stuff that shares well and avoid sharing anything that doesn’t.

Topic		Replies	Views
Tenmo — A lean tensor + NN library in pure Mojo Community Showcase	10	335	May 7, 2026
Metaprogramming with Python in Mojo Mojo discussion	8	864	May 26, 2025
Examples of custom CPU / GPU operations in Mojo MAX discussion , 24_6	28	1701	April 9, 2025
Initial support for writing PyTorch custom ops in Mojo Python Interop gpu	1	446	July 17, 2025
MojoDiffusion: Pure Mojo Autograd and Tensor lib with a image/video trainer and inference Community Showcase	0	88	June 6, 2026

Custom MultiHead Self Attention Transformer Training Phase using AMD RX 9070 XT 16GB. Python/Pythorch Vs Mojo

Related topics