[Project] Mojo for Robotics: Porting GPU Navigation Kernels (Jetson / Strix Halo)

Hello everyone! Migrating a discussion that started in the Discord over to the forums so we can track progress and collaborate more easily.

Background

We are building EMOS (the open-source Embodied OS for Physical AI). A major bottleneck in robotics is autonomous navigation—specifically, running high-frequency control loops in unstructured environments. Traditional stacks (like ROS2 Nav2) rely on CPU-bound, sequential polling.

We recently rebuilt our navigation engine (Kompass) as a parallel event engine, moving the entire control stack to the GPU using SYCL (via AdaptiveCpp). The results are great:

  • 3,106x speedup on trajectory cost evaluation
  • 1,850x speedup on dense occupancy grid mapping (Full context and benchmark charts in this X thread)

Goal: Mojo Port

We want to replicate these benchmarks natively in Mojo. Robotics is the ultimate stress test for heterogeneous compute, particularly on shared-memory edge systems.

Since EMOS handles pub/sub and I/O through middleware (e.g. ROS2/Zenoh), we are not currently targeting Mojo for low-level drivers or sockets. We need Mojo strictly for high-performance compute kernel generation: we want to pass data buffers from C++/Python into compiled Mojo kernels, execute the parallel math, and return the result.

Scope (What we are benchmarking):

To get real numbers against our SYCL baseline, we are looking to port three specific, compute-heavy navigation operations:

  1. Trajectory Cost Evaluator: Evaluating 5,000 generated trajectories over a 10-second horizon against a dynamic costmap.
  2. Local Mapper (Raycasting): Projecting 3,600 LiDAR points into a dense 400x400 occupancy grid at 5cm resolution.
  3. Critical Zone Checker: Checking a 100,000-point 3D cloud against the robot’s kinematic footprint for safety stops.
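To make the first operation concrete, here is a minimal CPU reference sketch in Python of what scoring trajectories against a costmap can look like. The function name, the single summed-cell cost term, and the tiny 4x4 grid are illustrative assumptions; the actual Kompass cost functions are not described in this post.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def evaluate_trajectories(
    trajectories: Sequence[Sequence[Point]],
    costmap: Sequence[Sequence[float]],
    resolution: float,
) -> Tuple[List[float], int]:
    """Sum the costmap value under every point of every trajectory.

    Returns the per-trajectory costs and the index of the cheapest one.
    Each trajectory is scored independently of the others.
    """
    rows, cols = len(costmap), len(costmap[0])
    costs = []
    for traj in trajectories:
        total = 0.0
        for x, y in traj:
            col = math.floor(x / resolution)
            row = math.floor(y / resolution)
            if 0 <= row < rows and 0 <= col < cols:
                total += costmap[row][col]
            else:
                total += float("inf")  # assumption: leaving the map is rejected
        costs.append(total)
    best = min(range(len(costs)), key=costs.__getitem__)
    return costs, best


# Hypothetical 4x4 costmap at 1 m resolution with one expensive cell.
costmap = [[0.0] * 4 for _ in range(4)]
costmap[1][2] = 10.0
trajectories = [
    [(0.5, 0.5), (1.5, 0.5), (2.5, 0.5)],  # stays on free cells
    [(0.5, 1.5), (1.5, 1.5), (2.5, 1.5)],  # crosses the expensive cell
]
costs, best = evaluate_trajectories(trajectories, costmap, resolution=1.0)
```

Because no trajectory depends on another, the outer loop is the natural place to parallelize, with a final reduction to find the minimum.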

Hardware Targets:

We are specifically targeting shared/unified memory architectures, which are currently the standard in edge robotics:

  • NVIDIA Jetson: Orin, Xavier AGX, Thor
  • AMD: Strix Halo APUs

How to Collaborate:

I am currently setting up the scaffolding repository to make it easy to write, compile, and test these specific kernels.

All are welcome, especially those interested in:

  • Writing the initial Mojo kernels for these math operations.
  • Hardware-specific kernel optimizations for Jetson/Strix Halo.
  • Zero-copy memory management strategies between C++ and Mojo.

If you have Jetson boards, access to Strix Halo, or just love squeezing raw performance out of compiler architectures, we’d love your help. Let’s see what Mojo can do in the physical world! :mechanical_arm:

(Repo link for the scaffolding will be posted in this thread shortly).

Quick update on the Mojo port we kicked off in this thread.

What’s Done

Scaffolding repo is up: https://github.com/aleph-ra/kompass-mojo

The first of the three kernel groups from the original post, the Trajectory Cost Evaluator, is now running end-to-end in Mojo. Six kernels ported from our SYCL baseline, wired up through an @exported C ABI, and called from a standalone C++ benchmark binary that mirrors the kompass-core benchmarking harness.
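For readers new to this kind of boundary, here is a small sketch of what calling through a C ABI looks like from the host side: flat buffers and a length go in, a status code comes back. It uses Python's ctypes with a Python callback standing in for the compiled kernel so that it runs without the shared library. The signature and the name `_stand_in_kernel` are hypothetical and are not the actual kompass-mojo exports.

```python
import ctypes

# Hypothetical C signature:
#   int sum_costs(const float* costs, size_t n, float* out)
KernelFn = ctypes.CFUNCTYPE(
    ctypes.c_int,
    ctypes.POINTER(ctypes.c_float),
    ctypes.c_size_t,
    ctypes.POINTER(ctypes.c_float),
)


def _stand_in_kernel(costs, n, out):
    """Python stand-in for a compiled kernel entry point."""
    total = 0.0
    for i in range(n):
        total += costs[i]
    out[0] = total
    return 0  # 0 = success, mirroring a typical C status code


kernel = KernelFn(_stand_in_kernel)

# The caller owns both buffers; only pointers and a length cross the boundary.
values = [1.0, 2.0, 3.5]
in_buf = (ctypes.c_float * len(values))(*values)
out_buf = (ctypes.c_float * 1)()
status = kernel(in_buf, len(values), out_buf)
result = out_buf[0]
```

With a real shared library, `kernel` would instead be looked up on a `ctypes.CDLL(...)` handle, with `argtypes` and `restype` set to match whatever the export actually declares.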

If you have a Mojo-supported GPU, you can reproduce everything below with:

git clone https://github.com/aleph-ra/kompass-mojo
cd kompass-mojo
pixi install
./scripts/run_benchmarks.sh <platform_name>

Results from my devbox (NVIDIA RTX A5000, Ampere sm_86)

Same workload we use in EMOS kompass-core: 5,001 trajectories × 1,000 points, 10 s horizon, 4 cost functions enabled.

  • kompass-core SYCL (AdaptiveCpp / CUDA): 16.358 ms (±0.12)
  • kompass-mojo (Mojo 0.26.1 / CUDA): 15.973 ms (±0.09)

I should note that I used Claude, with the official Mojo skills, to translate the kernels to Mojo, and that no optimization work has been done on the Mojo side yet. Given that, the initial result is quite impressive.

Next Steps

  • Port the Local Mapper (Bresenham raycasting) and Critical Zone Checker kernels into the same FFI layer.
  • Run on the real targets from the original post: Jetson Orin, Thor, and AMD Strix Halo APUs. Strix Halo should be an interesting case since its shared-memory APU architecture also provides a zero-copy path.

Matching AdaptiveCpp on a first-pass port was not something I expected, and it’s a strong early signal for the Mojo-for-robotics question.

Hopefully you’ll eliminate the middleman bottleneck that exists in ROS2/BlackBerry QNX, because these are operating systems designed to run vision-language and vision-action models while just sitting on top of a Linux kernel.

With EMOS, if you can build this robotics OS using Mojo, you’ll have to ensure several things:

  • Microkernel Architecture.
  • Non-wormable OS for physical AI. This is mostly to prevent the robot’s AI logic from being compromised through the UART, i.e. to prevent hacking.

Mojo is perfect for your case. These are just additional points on top of your previous stack, and the performance metrics can be higher than your estimates if you perform those optimizations.

Source: ACM Digital Library https://share.google/KXhiSnOsiWouKyhPj

What do you think?

Great to have the scaffolding started to work against! I ran this locally on my Strix Halo system, and it reported a mean of 7.85 ms for this Mojo benchmark, which I believe also compares favorably with the 8.23 ms previously reported for that GPU.

As one suggestion: the project has a slightly nontraditional build infrastructure. I believe the scripts can be replaced with dedicated Pixi build commands to further simplify the build process.

Also, the tests currently don’t verify correctness of the results from the trajectory cost evaluator, so I’m worried we’re benchmarking something that’s possibly non-functional. I’d love to make sure that the results are correct through the right unit tests.

The project itself is using an older version of Mojo (26.1), and if you instead set that to the latest nightly, you can update your syntax by installing our latest skills and prompting the agent to update to the current Mojo syntax. Doesn’t seem to impact the results any, but that’ll align you to the latest Mojo.

Thanks a lot for the feedback and for running this on Strix Halo. The result matches the pattern, and it’s a really encouraging early data point for the shared-memory APU case.

For your suggestions:

Pixi build commands: Agreed, the shell scripts are a rough first pass. I will replace them with pixi run tasks.

Unit tests for the kernels: Fair concern, and I will ship proper tests. Initially I linked libkompass_mojo.so directly into kompass-core’s own SYCL benchmark binary and invoked both cost evaluators back-to-back on the same in-memory TrajectorySamples2D. Both produce the exact same cost and minimum index (the 0th trajectory is a straight line and has the minimum cost), to bit-exact precision. So I have good confidence the kernels are computing the right thing, but it was a one-off parity test. I will add per-kernel tests against reference data.
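As a rough illustration of what a bit-exact parity check means in practice, the sketch below compares two cost vectors by their float32 bit patterns and by their minimum index. The helper names and the sample values are made up for the example; the real comparison was done inside the C++ benchmark binary, not in Python.

```python
import struct
from typing import Sequence


def float32_bits(value: float) -> int:
    """Return the IEEE-754 single-precision bit pattern of a value."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bit_exact_parity(reference: Sequence[float], candidate: Sequence[float]) -> bool:
    """True when both outputs share float32 bit patterns and the same argmin."""
    if len(reference) != len(candidate):
        return False
    if not reference:
        return True  # two empty outputs trivially agree
    same_bits = all(
        float32_bits(r) == float32_bits(c) for r, c in zip(reference, candidate)
    )
    argmin_ref = min(range(len(reference)), key=reference.__getitem__)
    argmin_cand = min(range(len(candidate)), key=candidate.__getitem__)
    return same_bits and argmin_ref == argmin_cand


# Hypothetical cost vectors standing in for the two evaluators' outputs.
sycl_costs = [0.25, 1.5, 3.0]
mojo_costs = [0.25, 1.5, 3.0]
drifted = [0.25, 1.5000001, 3.0]  # one float32 ulp away in the second entry
```

A tolerance-based comparison would accept `drifted`; the bit-pattern comparison does not, which is the stricter property claimed for the parity run.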

Mojo nightly: Will do, thank you for bringing this up as I didn’t know about the significant syntax changes.

Will post another update here once I have more stuff done.

Another update on the Mojo port.

What’s Done

Both remaining kernel groups from the original post are now ported and running in the same FFI layer as the cost evaluator:

  • Local Mapper - per-ray super-cover Bresenham kernel projecting a 2D laserscan into an occupancy grid.
  • Critical Zone Checker - two kernels: one that takes pre-computed laserscan ranges through a sparse cone-index LUT, one that takes raw PointCloud2 bytes and does the z-filter, body-frame transform, and cone test per point on-device.
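For anyone unfamiliar with the per-point steps named for the PointCloud2 path (z-filter, body-frame transform, cone test), here is a minimal CPU sketch in Python. The parameter names, the yaw-only rotation, and the forward-facing cone are assumptions for illustration and are not taken from the kompass-mojo kernels.

```python
import math
from typing import Iterable, Tuple

Point3 = Tuple[float, float, float]


def in_critical_zone(
    points: Iterable[Point3],
    sensor_offset: Point3,
    sensor_yaw: float,
    z_range: Tuple[float, float],
    cone_half_angle: float,
    critical_distance: float,
) -> bool:
    """Return True if any point lands inside the forward critical cone."""
    cos_y, sin_y = math.cos(sensor_yaw), math.sin(sensor_yaw)
    for x, y, z in points:
        # Body-frame transform: rotate about z, then translate (assumed model).
        bx = cos_y * x - sin_y * y + sensor_offset[0]
        by = sin_y * x + cos_y * y + sensor_offset[1]
        bz = z + sensor_offset[2]
        # z-filter: ignore points below or above the robot's height band.
        if not (z_range[0] <= bz <= z_range[1]):
            continue
        # Cone test: close enough and within the angular half-width.
        distance = math.hypot(bx, by)
        bearing = math.atan2(by, bx)
        if distance <= critical_distance and abs(bearing) <= cone_half_angle:
            return True
    return False


# Hypothetical points in the sensor frame: far ahead, close ahead, too high.
far, close, high = (5.0, 0.0, 0.2), (0.4, 0.05, 0.3), (0.3, 0.0, 2.0)
params = dict(
    sensor_offset=(0.1, 0.0, 0.0),
    sensor_yaw=0.0,
    z_range=(0.05, 1.0),
    cone_half_angle=math.radians(30.0),
    critical_distance=0.6,
)
stop_needed = in_critical_zone([far, close, high], **params)
clear = in_critical_zone([far, high], **params)
```

Each point is tested independently, so on a GPU the loop body would typically become one work item per point, with the results reduced into a single flag.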

Also, based on the feedback from the last update:

  • Shell scripts replaced with pixi run tasks (build / test / benchmark).
  • Per-kernel correctness tests for all three groups. The mapper test also renders the grid as ASCII so regressions are visible at a glance.
  • Migrated to Mojo nightly.
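To show the idea behind a mapper test that renders the grid as ASCII, here is a small Python sketch of per-ray grid traversal. Note that it uses the plain Bresenham line algorithm, not the super-cover variant used in the actual kernel, and the cell encoding and glyphs are assumptions made for the example.

```python
from typing import List, Tuple

UNKNOWN, FREE, OCCUPIED = 0, 1, 2


def bresenham(x0: int, y0: int, x1: int, y1: int) -> List[Tuple[int, int]]:
    """Cells visited by the classic integer Bresenham line, endpoints included."""
    cells = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    x, y = x0, y0
    while True:
        cells.append((x, y))
        if x == x1 and y == y1:
            return cells
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x += sx
        if e2 <= dx:
            err += dx
            y += sy


def cast_ray(grid: List[List[int]], origin: Tuple[int, int], hit: Tuple[int, int]) -> None:
    """Mark cells along the ray as free and the hit cell as occupied.

    Assumes both origin and hit are inside the grid.
    """
    cells = bresenham(origin[0], origin[1], hit[0], hit[1])
    for cx, cy in cells[:-1]:
        if grid[cy][cx] != OCCUPIED:
            grid[cy][cx] = FREE
    hx, hy = cells[-1]
    grid[hy][hx] = OCCUPIED


def render(grid: List[List[int]]) -> str:
    """Render the grid as text: '?' unknown, '.' free, '#' occupied."""
    glyphs = {UNKNOWN: "?", FREE: ".", OCCUPIED: "#"}
    return "\n".join("".join(glyphs[c] for c in row) for row in grid)


grid = [[UNKNOWN] * 5 for _ in range(5)]
cast_ray(grid, origin=(0, 2), hit=(4, 2))
cast_ray(grid, origin=(0, 2), hit=(2, 0))
picture = render(grid)
```

Comparing `picture` against a stored string makes a regression show up as a readable diff rather than a list of mismatching indices.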

Results from my devbox (NVIDIA RTX A5000, Ampere sm_86)

Same workloads kompass-core uses for its own benchmark suite. Benchmark names match on both sides so the JSON drops straight into plot_benchmarks.py.

| Benchmark | kompass-core (SYCL) | kompass-mojo (Mojo nightly) |
|---|---|---|
| CostEvaluator_5k_Trajs | 16.358 ms | 15.973 ms |
| Mapper_Dense_400x400 | 0.247 ms | 0.290 ms |
| CriticalZone_Dense_Scan | 0.146 ms | 0.026 ms |
| CriticalZone_100k_Cloud | 0.519 ms | 0.331 ms |

Mojo is faster on three of the four. These are still straight translations of the SYCL kernels using Claude plus some manual pruning; no hand optimization has been done yet. The CriticalZone Dense Scan task has the smallest parallel workload, and its CPU variant (part of the original benchmark) at times scores equal to or better than the GPU kernel, possibly because the GPU path is bound by host/device data transfer. The Mojo version shows a marked improvement on it (0.026 ms vs 0.146 ms).

Next Steps

Now that the kernels are all in place, the interesting part begins, i.e. running the full suite on the real targets from the original post:

  • NVIDIA Jetson - Orin, Thor
  • AMD Strix Halo APUs

We will share the cross-hardware charts in this thread once we have those numbers.

Another update on the Mojo port. Two things in this one:

  1. Tested the Mojo port on NVIDIA Jetson Orin AGX
  2. Added a fifth kernel, a pointcloud to laserscan stage that recently landed in kompass-core, so the mapper now covers both the laserscan input and the raw PointCloud2 paths.
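For context on the pointcloud to laserscan stage, the general technique is to bin points by bearing and keep the nearest range in each bin. The Python sketch below shows that idea only; the function name, the full 360 degree coverage, and the parameters are assumptions and may differ from what kompass-core does.

```python
import math
from typing import Iterable, List, Tuple

Point3 = Tuple[float, float, float]


def pointcloud_to_laserscan(
    points: Iterable[Point3],
    num_bins: int,
    z_range: Tuple[float, float],
    max_range: float,
) -> List[float]:
    """Collapse a 3D cloud into per-bearing minimum ranges over 360 degrees."""
    ranges = [max_range] * num_bins
    bin_width = 2.0 * math.pi / num_bins
    for x, y, z in points:
        if not (z_range[0] <= z <= z_range[1]):
            continue  # outside the height band of interest
        distance = math.hypot(x, y)
        if distance >= max_range:
            continue
        angle = math.atan2(y, x) % (2.0 * math.pi)  # map to [0, 2*pi)
        index = min(int(angle / bin_width), num_bins - 1)
        # Keep the closest return per bin.
        ranges[index] = min(ranges[index], distance)
    return ranges


# Hypothetical cloud: two points in the first quadrant, one in the second,
# and one that the z-filter should drop.
cloud = [(3.0, 4.0, 0.1), (1.5, 2.0, 0.1), (-3.0, 4.0, 0.1), (0.0, -2.0, 5.0)]
scan = pointcloud_to_laserscan(cloud, num_bins=4, z_range=(0.0, 1.0), max_range=10.0)
```

In a parallel version, several points can fall into the same bin at once, so the per-bin minimum would need an atomic min or an equivalent reduction.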

Getting Mojo to run on Jetson Orin AGX (JetPack 6)

Mojo 1.0+ requires NVIDIA driver 580 / CUDA 13. JetPack 6.x for Orin AGX ships driver 540 / CUDA 12.6. One needs NVIDIA’s cuda-compat-orin-13-2 forward-compatibility package, i.e. user-mode CUDA 13.2 driver libs sitting on top of the existing 540 kernel driver. Reproduction steps and the link to NVIDIA’s forum thread are in the README.

Results from Jetson Orin AGX (Ampere sm_87, normal power profile)

| Benchmark | kompass-core (SYCL) | kompass-mojo (Mojo nightly) |
|---|---|---|
| CostEvaluator_5k_Trajs | 36.10 ms | 43.67 ms |
| Mapper_Dense_400x400 | 1.123 ms | 0.715 ms |
| Mapper_PointCloud_100k | 1.737 ms | 1.041 ms |
| CriticalZone_Dense_Scan | 0.171 ms | 0.120 ms |
| CriticalZone_100k_Cloud | 0.690 ms | 0.373 ms |

Results from Jetson Orin AGX (Ampere sm_87, max power profile / 50 W)

| Benchmark | kompass-core (SYCL) | kompass-mojo (Mojo nightly) |
|---|---|---|
| CostEvaluator_5k_Trajs | 42.77 ms | 38.96 ms |
| Mapper_Dense_400x400 | 0.667 ms | 0.620 ms |
| Mapper_PointCloud_100k | 0.755 ms | 0.975 ms |
| CriticalZone_Dense_Scan | 0.167 ms | 0.105 ms |
| CriticalZone_100k_Cloud | 0.790 ms | 0.441 ms |

Mojo wins four of five at both power profiles, but the one SYCL win flips between profiles: CostEvaluator_5k_Trajs at the normal profile, Mapper_PointCloud_100k at max power. Neck and neck either way on kernels with a good amount of parallel compute load.

Next Steps

  • Same suite on AMD Strix Halo APUs, the other edge target.
  • Cross-platform charts.

Another update on the Mojo port. This one is the AMD Strix Halo APU run from the original post, which closes the loop on the three target platforms we started with.

Setup

Strix Halo (Ryzen AI MAX+ 395, 40 CU Radeon 8060S, RDNA 3.5 gfx1151) via ROCm. Same five-benchmark suite as the Jetson Orin AGX runs. Mojo’s ROCm path worked out of the box on this hardware, no compat package needed (unlike Jetson Orin).

Results from Strix Halo (AMD RDNA 3.5 gfx1151)

| Benchmark | kompass-core (SYCL/ROCm) | kompass-mojo (Mojo nightly) |
|---|---|---|
| CostEvaluator_5k_Trajs | 8.18 ms | 7.69 ms |
| Mapper_Dense_400x400 | 0.226 ms | 0.192 ms |
| Mapper_PointCloud_100k | 0.238 ms | 0.196 ms |
| CriticalZone_Dense_Scan | 0.026 ms | 0.023 ms |
| CriticalZone_100k_Cloud | 0.056 ms | 0.062 ms |

Mojo wins four of five, same pattern we have seen earlier.

Edge-platform head-to-head - Jetson Orin AGX vs Strix Halo

Both are shared-memory edge platforms from the original post, so this is the comparison that matters most for the EMOS robotics use case. Same kompass-mojo binary, same kernel sources, different vendor stacks. It’s awesome that such apples-to-apples cross-vendor comparisons of the same code are possible now (thanks to Mojo and SYCL).

| Benchmark | Orin AGX (50 W) | Strix Halo | Strix speedup |
|---|---|---|---|
| CostEvaluator_5k_Trajs | 38.96 ms | 7.69 ms | 5.1x |
| Mapper_Dense_400x400 | 0.620 ms | 0.192 ms | 3.2x |
| Mapper_PointCloud_100k | 0.975 ms | 0.196 ms | 5.0x |
| CriticalZone_Dense_Scan | 0.105 ms | 0.023 ms | 4.6x |
| CriticalZone_100k_Cloud | 0.441 ms | 0.062 ms | 7.1x |

Strix is between 3.2x and 7.1x faster than the Orin AGX at max power on every kernel in the suite. The gap is largest on the smallest kernels (CriticalZone_100k_Cloud at 7.1x), where the balance of launch, atomic, and memory-bandwidth costs favors the wider RDNA 3.5 GPU and the LPDDR5x bandwidth on Strix. The cost evaluator and mapper kernels, which actually saturate compute, still see a 3-5x gap. Worth highlighting that this is the exact same Mojo source on both targets.
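For anyone who wants to recompute the speedup column, it is simply the Orin AGX mean time divided by the Strix Halo mean time for the same benchmark. A short Python check using the numbers from the head-to-head comparison (the variable names are mine):

```python
# Mean times in ms: Orin AGX at 50 W and Strix Halo, both running kompass-mojo.
orin_ms = {
    "CostEvaluator_5k_Trajs": 38.96,
    "Mapper_Dense_400x400": 0.620,
    "Mapper_PointCloud_100k": 0.975,
    "CriticalZone_Dense_Scan": 0.105,
    "CriticalZone_100k_Cloud": 0.441,
}
strix_ms = {
    "CostEvaluator_5k_Trajs": 7.69,
    "Mapper_Dense_400x400": 0.192,
    "Mapper_PointCloud_100k": 0.196,
    "CriticalZone_Dense_Scan": 0.023,
    "CriticalZone_100k_Cloud": 0.062,
}

# Speedup = slower time / faster time, rounded to one decimal place.
speedup = {name: round(orin_ms[name] / strix_ms[name], 1) for name in orin_ms}
smallest, largest = min(speedup.values()), max(speedup.values())
```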

Since Strix Halo is a newer platform, I am quite eager to see the results against Jetson Thor (I don’t have one right now).

Observations across the two edge platforms

A couple of things are becoming recognizable now that the data spans 2 vendors:

  • CriticalZone_100k_Cloud is the one SYCL win on Strix Halo (0.056 ms vs 0.062 ms), although Mojo is faster on this kernel in the Orin AGX and A5000 runs. It’s the kernel with the smallest per-point work, so any per-launch or per-atomic overhead has the most influence. Will dig into this.
  • Mojo’s run-to-run variance is much tighter on Strix Halo, e.g. CostEval stddev ±0.02 vs SYCL ±0.09, CZ_Dense_Scan ±0.0005 vs ±0.010. That is a ~4-20x lower noise floor across the suite. Not sure yet whether that’s Mojo’s scheduling path being more deterministic on this APU or measurement noise on the SYCL side.
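The noise-floor ratio quoted above is the SYCL standard deviation divided by the Mojo one. A quick Python sketch of that arithmetic, plus a helper showing how a mean and standard deviation pair could be derived from raw timings. The helper uses the sample standard deviation, which is an assumption about how the benchmark harness reports it, and the three timing samples are made up.

```python
import statistics
from typing import Sequence, Tuple


def summarize(samples_ms: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a list of timings in ms."""
    return statistics.mean(samples_ms), statistics.stdev(samples_ms)


# Ratios of the standard deviations quoted in the post (SYCL / Mojo).
noise_ratio = {
    "CostEval": 0.09 / 0.02,
    "CZ_Dense_Scan": 0.010 / 0.0005,
}

# Made-up samples, only to show the helper in use.
mean_ms, stddev_ms = summarize([7.68, 7.69, 7.70])
```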

Next Steps

  • Benchmark run on Jetson Thor and comparison with existing numbers.

Incredible results!