You’re already forced to go through a C-ish API to get data over to a GPU, even in Mojo, so that provides a logical point to work from. You should be able to call Mojo kernels from C++ through this mechanism at roughly the same performance as calling CUDA kernels from C++, provided you keep the Mojo code entirely on the GPU. If you want Mojo to come over to the CPU side, then there are problems, since C++ is not a simple language to integrate with. At present, the best path forward is likely to wait for ClangIR and then use Mojo’s ability to interface with arbitrary MLIR to handle the interop layer.
C++ is one heck of a language to try to do interop with, and bidirectional interop is likely going to take a long time. Carbon (the language) is trying, but it’s causing them a lot of headaches.
In C++/CUDA, we don’t actually have to go through a C-ish API for GPU kernels, because we can capture wrapper classes in device lambdas; those wrappers overload operator[] to access device buffers directly. It looks like that is not possible in Mojo today.