[Project] Mojo for Robotics: Porting GPU Navigation Kernels (Jetson / Strix Halo)

Hello everyone! Migrating a discussion that started in the Discord over to the forums so we can track progress and collaborate more easily.

Background

We are building EMOS (the open-source Embodied OS for Physical AI). A major bottleneck in robotics is autonomous navigation—specifically, running high-frequency control loops in unstructured environments. Traditional stacks (like ROS2 Nav2) rely on CPU-bound, sequential polling.

We recently rebuilt our navigation engine (Kompass) as a parallel event engine, moving the entire control stack to the GPU using SYCL (via AdaptiveCpp). The results are great:

  • 3,106x speedup on trajectory cost evaluation
  • 1,850x speedup on dense occupancy grid mapping (Full context and benchmark charts in this X thread)

Goal: Mojo Port

We want to replicate these benchmarks natively in Mojo. Robotics is the ultimate stress test for heterogeneous compute, particularly on shared-memory edge systems.

Since EMOS handles pub/sub and I/O through middleware (e.g. ROS2/Zenoh), right now we are not targeting Mojo to handle any low-level drivers or sockets. We strictly need Mojo for high-performance compute kernel generation. We want to pass data buffers from C++/Python into compiled Mojo kernels, execute the parallel math, and return the result.
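To make that buffer-handoff pattern concrete, here is a minimal Python sketch using ctypes. Since the kompass-mojo library and its symbol names are not fixed yet, libc's memcpy stands in for a compiled Mojo kernel below; the calling convention (raw pointers into caller-owned NumPy buffers, no copies) is the part being illustrated.

```python
import ctypes
import ctypes.util

import numpy as np

# Stand-in for a Mojo kernel compiled to a shared library with a C ABI.
# In the real project this would be something like libkompass_mojo.so;
# here we load libc and use memcpy so the sketch is runnable anywhere.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.memcpy.restype = ctypes.c_void_p
libc.memcpy.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t]

src = np.arange(8, dtype=np.float32)  # "input" buffer owned by the caller
dst = np.empty_like(src)              # "output" buffer owned by the caller

# The "kernel" only sees raw pointers; NumPy keeps ownership of the memory,
# so nothing is serialized or copied at the language boundary.
libc.memcpy(dst.ctypes.data, src.ctypes.data, src.nbytes)
assert np.array_equal(dst, src)
```

The same shape works from C++ by linking against the shared library directly; only the pointer-plus-length convention at the boundary matters.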

Scope (What we are benchmarking):

To get real numbers against our SYCL baseline, we are looking to port three specific, compute-heavy navigation operations:

  1. Trajectory Cost Evaluator: Evaluating 5,000 generated trajectories over a 10-second horizon against a dynamic costmap.
  2. Local Mapper (Raycasting): Projecting 3,600 LiDAR points into a dense 400x400 occupancy grid at 5cm resolution.
  3. Critical Zone Checker: Checking a 100,000-point 3D cloud against the robot’s kinematic footprint for safety stops.
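For a sense of the structure of workload 1: each trajectory's cost is an independent reduction over the costmap values its points land in, which is what makes it embarrassingly parallel. A rough NumPy sketch of just that structure (the costmap and point data are random placeholders, and the real evaluator combines four weighted cost functions rather than a single lookup):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trajs, n_points = 5000, 100         # 5,000 trajectories, sampled horizon
res, cells = 0.05, 400                # 5 cm resolution, 400x400 grid
costmap = rng.random((cells, cells))  # placeholder dynamic costmap

# (x, y) points in metres for every trajectory sample.
xy = rng.uniform(0.0, cells * res, size=(n_trajs, n_points, 2))

# Metric coordinates -> grid indices (nearest cell, clamped to the grid).
ij = np.minimum((xy / res).astype(np.int64), cells - 1)

# One independent reduction per trajectory: on the GPU this maps to one
# thread/work-item per trajectory accumulating its own cost.
costs = costmap[ij[..., 1], ij[..., 0]].sum(axis=1)
best = int(np.argmin(costs))          # cheapest trajectory wins
```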

Hardware Targets:

We are specifically targeting the shared/unified-memory architectures that are currently standard in edge robotics:

  • NVIDIA Jetson: Orin, Xavier AGX, Thor
  • AMD: Strix Halo APUs

How to Collaborate:

I am currently setting up the scaffolding repository to make it easy to write, compile, and test these specific kernels.

All are welcome, especially those interested in:

  • Writing the initial Mojo kernels for these math operations.
  • Hardware-specific kernel optimizations for Jetson/Strix Halo.
  • Zero-copy memory management strategies between C++ and Mojo.

If you have Jetson boards, access to Strix Halo, or just love squeezing raw performance out of compiler architectures, we’d love your help. Let’s see what Mojo can do in the physical world! :mechanical_arm:

(Repo link for the scaffolding will be posted in this thread shortly).


Quick update on the Mojo port we kicked off in this thread.

What’s Done

Scaffolding repo is up: https://github.com/aleph-ra/kompass-mojo

The first of the three kernel groups from the original post, the Trajectory Cost Evaluator, is now running end-to-end in Mojo. Six kernels ported from our SYCL baseline, wired up through an @exported C ABI, and called from a standalone C++ benchmark binary that mirrors the kompass-core benchmarking harness.

If you have a Mojo-supported GPU, you can reproduce everything below with:

git clone https://github.com/aleph-ra/kompass-mojo
cd kompass-mojo
pixi install
./scripts/run_benchmarks.sh <platform_name>

Results from my devbox (NVIDIA RTX A5000, Ampere sm_86)

Same workload we use in EMOS kompass-core: 5,001 trajectories × 1,000 points, 10 s horizon, 4 cost functions enabled.

  • kompass-core SYCL (AdaptiveCpp / CUDA): 16.358 ms (±0.12)
  • kompass-mojo (Mojo 0.26.1 / CUDA): 15.973 ms (±0.09)

I should note that I used Claude (with the official Mojo skills) to translate the SYCL kernels to Mojo, and no optimization work has been done on the Mojo side yet. That makes this initial result quite impressive.

Next Steps

  • Port the Local Mapper (Bresenham raycasting) and Critical Zone Checker kernels into the same FFI layer.
  • Run on the real targets from the original post: Jetson Orin, Thor, and AMD Strix Halo APUs. Strix Halo should be an interesting case since its shared-memory APU architecture also provides a zero-copy path.

Matching AdaptiveCpp on a first-pass port was not something I expected, and it’s a strong early signal for the Mojo-for-robotics question.


Hopefully you’ll eliminate the middleman bottleneck present in ROS2/BlackBerry QNX, since those are operating systems designed to run vision-language and vision-action models while just sitting on top of a Linux kernel.

With EMOS, if you build this robotics OS using Mojo, you’ll have to ensure several things:

  • Microkernel architecture.
  • A non-wormable OS for Physical AI, mostly to prevent the robot’s AI logic from being compromised through the UART (i.e. to prevent hacking).

Mojo is perfect for your case. These are just additional points on top of your previous stack, and the performance metrics can be higher than your estimates if you perform those optimizations.

Source: ACM Digital Library https://share.google/KXhiSnOsiWouKyhPj

What do you think?

Great to have the scaffolding started to work against! I ran this locally on my Strix Halo system, and it reported a mean of 7.85 ms for this Mojo benchmark, which I believe also compares favorably with the 8.23 ms previously reported for that GPU.

As one suggestion: the project has a slightly nontraditional build infrastructure; I believe the scripts can be replaced with dedicated Pixi build commands to further simplify the build process.

Also, the tests currently don’t verify correctness of the results from the trajectory cost evaluator, so I’m worried we’re benchmarking something that’s possibly non-functional. I’d love to make sure that the results are correct through the right unit tests.

The project itself is using an older version of Mojo (26.1), and if you instead set that to the latest nightly, you can update your syntax by installing our latest skills and prompting the agent to update to the current Mojo syntax. Doesn’t seem to impact the results any, but that’ll align you to the latest Mojo.

Thanks a lot for the feedback and for running this on Strix Halo. The result matches the pattern, and it’s a really encouraging early data point for the shared-memory APU case.

For your suggestions:

Pixi build commands: Agreed, the shell scripts are a rough first pass. I will replace them with pixi run tasks.

Unit tests for the kernels: Fair concern, and I will ship proper tests. Initially I linked libkompass_mojo.so directly into kompass-core’s own SYCL benchmark binary and invoked both cost evaluators back-to-back on the same in-memory TrajectorySamples2D. Both produce exactly the same cost and minimum index (the 0th trajectory is a straight line and has the minimum cost), to bit-exact precision. So I have good confidence the kernels are computing the right thing, but it was a one-off parity test. I will add per-kernel tests against reference data.

Mojo nightly: Will do, thank you for bringing this up as I didn’t know about the significant syntax changes.

Will post another update here once I have more stuff done.

Another update on the Mojo port.

What’s Done

Both remaining kernel groups from the original post are now ported and running in the same FFI layer as the cost evaluator:

  • Local Mapper - per-ray super-cover Bresenham kernel projecting a 2D laserscan into an occupancy grid.
  • Critical Zone Checker - two kernels: one that takes pre-computed laserscan ranges through a sparse cone-index LUT, one that takes raw PointCloud2 bytes and does the z-filter, body-frame transform, and cone test per point on-device.
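For readers unfamiliar with the mapper kernel’s core loop, here is a minimal pure-Python sketch of one ray: plain integer Bresenham marking traversed cells free and the endpoint occupied. (The actual kernel uses the super-cover variant, which also visits cells the ray merely clips, and runs one ray per GPU thread; this sketch is only the serial skeleton.)

```python
def raycast_bresenham(grid, x0, y0, x1, y1):
    """March one ray through the grid: traversed cells -> free (0),
    the endpoint (the LiDAR return) -> occupied (1).

    Plain Bresenham for clarity; the kernel in the post is super-cover.
    """
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx - dy
    x, y = x0, y0
    while (x, y) != (x1, y1):
        grid[y][x] = 0              # cell the ray passes through: free
        e2 = 2 * err
        if e2 > -dy:
            err -= dy
            x += sx
        if e2 < dx:
            err += dx
            y += sy
    grid[y1][x1] = 1                # cell containing the return: occupied

grid = [[-1] * 8 for _ in range(8)]    # -1 = unknown
raycast_bresenham(grid, 0, 0, 7, 3)    # sensor at (0, 0), hit at (7, 3)
```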

Also, based on the feedback from the last update:

  • Shell scripts replaced with pixi run tasks (build / test / benchmark).
  • Per-kernel correctness tests for all three groups. The mapper test also renders the grid as ASCII so regressions are visible at a glance.
  • Migrated to Mojo nightly.
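In the same spirit as that ASCII render, here is a tiny sketch of what such a glance-check can look like (the character mapping below is made up for illustration, not the one kompass-mojo actually uses):

```python
def render_grid_ascii(grid):
    """Occupancy grid -> ASCII: '#' occupied, '.' free, ' ' unknown."""
    chars = {1: "#", 0: ".", -1: " "}
    return "\n".join("".join(chars[c] for c in row) for row in grid)

grid = [[-1,  0, 1],
        [ 0,  0, 1],
        [-1, -1, 1]]
print(render_grid_ascii(grid))  # a wall of '#' down the right column
```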

Results from my devbox (NVIDIA RTX A5000, Ampere sm_86)

Same workloads kompass-core uses for its own benchmark suite. Benchmark names match on both sides so the JSON drops straight into plot_benchmarks.py.

  Benchmark                  kompass-core (SYCL)   kompass-mojo (Mojo nightly)
  CostEvaluator_5k_Trajs     16.358 ms             15.973 ms
  Mapper_Dense_400x400       0.247 ms              0.290 ms
  CriticalZone_Dense_Scan    0.146 ms              0.026 ms
  CriticalZone_100k_Cloud    0.519 ms              0.331 ms

Mojo scores better on three of the four. These are still straight translations of the SYCL kernels produced with Claude plus some manual pruning; no hand optimization has been done yet. The CriticalZone Dense Scan task has the smallest parallel workload, and its CPU variant (part of the original benchmark) sometimes scores equal to or better than the GPU kernel, possibly because the GPU path is host/device data-transfer bound. That task shows a marked improvement in the Mojo port.

Next Steps

Now that the kernels are all in place, the interesting part begins, i.e. running the full suite on the real targets from the original post:

  • NVIDIA Jetson - Orin, Thor
  • AMD Strix Halo APUs

We will share the cross-hardware charts in this thread once we have those numbers.
