Hello everyone! Migrating a discussion that started in the Discord over to the forums so we can track progress and collaborate more easily.
Background
We are building EMOS (the open-source Embodied OS for Physical AI). A major bottleneck in robotics is autonomous navigation—specifically, running high-frequency control loops in unstructured environments. Traditional stacks (like ROS2 Nav2) rely on CPU-bound, sequential polling.
We recently rebuilt our navigation engine (Kompass) as a parallel event engine, moving the entire control stack to the GPU using SYCL (via AdaptiveCpp). The results are great:
- 3,106x speedup on trajectory cost evaluation
- 1,850x speedup on dense occupancy grid mapping (Full context and benchmark charts in this X thread)
Goal: Mojo Port
We want to replicate these benchmarks natively in Mojo. Robotics is the ultimate stress test for heterogeneous compute, particularly on shared-memory edge systems.
Since EMOS handles pub/sub and I/O through middleware (e.g. ROS2/Zenoh), we are not currently targeting Mojo for low-level drivers or sockets. We need Mojo strictly for high-performance compute kernel generation: pass data buffers from C++/Python into compiled Mojo kernels, execute the parallel math, and return the result.
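To make the buffer-passing idea concrete, here is a minimal sketch of the zero-copy pattern in Python. The "kernel" side is a stand-in (we just write through a raw ctypes pointer), but the memory mechanics are the same a compiled Mojo/C++ kernel would rely on: the caller hands over a pointer into an existing NumPy buffer, and both sides see the same bytes with no copy.

```python
import ctypes
import numpy as np

# A contiguous NumPy array owns a plain C buffer; passing its raw pointer
# to a compiled kernel (Mojo, C++, ...) shares memory instead of copying.
buf = np.zeros(8, dtype=np.float32)

# ctypes view onto the SAME memory -- this is the pointer a foreign
# kernel would receive across the FFI boundary.
ptr = buf.ctypes.data_as(ctypes.POINTER(ctypes.c_float))

# Stand-in for the compiled kernel: write through the raw pointer...
for i in range(8):
    ptr[i] = float(i)

# ...and the NumPy array observes the writes, since no copy happened.
```

The same discipline applies on unified-memory hardware like Jetson: as long as ownership and lifetime are clear, host and device code can operate on one allocation.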
Scope (What we are benchmarking):
To get real numbers against our SYCL baseline, we are looking to port three specific, compute-heavy navigation operations:
- Trajectory Cost Evaluator: Evaluating 5,000 generated trajectories over a 10-second horizon against a dynamic costmap.
- Local Mapper (Raycasting): Projecting 3,600 LiDAR points into a dense 400x400 occupancy grid at 5cm resolution.
- Critical Zone Checker: Checking a 100,000-point 3D cloud against the robot’s kinematic footprint for safety stops.
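For contributors who want to start on the kernels before the scaffolding repo lands, here is a plain NumPy reference for the trajectory cost evaluator. The names and the nearest-cell costmap sampling are our assumptions (the real evaluator may interpolate and combine several cost terms); the point is to have a CPU baseline that a Mojo kernel's output can be validated against.

```python
import numpy as np

def evaluate_trajectory_costs(trajectories, costmap, resolution, origin):
    """Score each trajectory by summing the costmap cells it passes through.

    trajectories: (N, T, 2) world-frame (x, y) samples, N trajectories
                  over a T-step horizon
    costmap:      (H, W) array of per-cell costs
    resolution:   meters per cell
    origin:       (2,) world coordinates of cell (0, 0)
    """
    # World -> grid indices (nearest cell), vectorized over all N*T points.
    cells = np.floor((trajectories - origin) / resolution).astype(np.int64)
    ix = np.clip(cells[..., 0], 0, costmap.shape[1] - 1)
    iy = np.clip(cells[..., 1], 0, costmap.shape[0] - 1)
    # Gather the sampled costs and reduce along each trajectory's horizon.
    return costmap[iy, ix].sum(axis=1)

# Tiny smoke test: 3 trajectories, 4 steps, 10x10 map at 0.05 m/cell.
rng = np.random.default_rng(0)
costmap = np.zeros((10, 10))
costmap[5, 5] = 100.0  # one expensive cell
trajs = rng.uniform(0.0, 0.5, size=(3, 4, 2))
costs = evaluate_trajectory_costs(trajs, costmap, 0.05, np.zeros(2))
best = int(np.argmin(costs))
```

At the benchmark scale (N=5,000 trajectories, per-point cost lookups over a 10 s horizon) every trajectory is independent, which is what makes this embarrassingly parallel on the GPU.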
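Similarly, a stripped-down reference for the local mapper, under stated assumptions: a 2D scan, nearest-cell endpoint projection into a robot-centred grid, and no free-space raytracing between the sensor and the hit cell (the real mapper traces the full ray; this only marks occupied endpoints).

```python
import numpy as np

def project_scan(ranges, angles, grid_size=400, resolution=0.05):
    """Mark the cells hit by a 2D LiDAR scan in a robot-centred grid.

    ranges, angles: (M,) polar scan returns.
    Returns a grid_size x grid_size occupancy grid (1 = hit).
    """
    grid = np.zeros((grid_size, grid_size), dtype=np.uint8)
    # Polar -> Cartesian in the robot frame, then shift to the grid centre.
    xs = ranges * np.cos(angles)
    ys = ranges * np.sin(angles)
    ix = np.floor(xs / resolution).astype(int) + grid_size // 2
    iy = np.floor(ys / resolution).astype(int) + grid_size // 2
    # Discard returns that land outside the local grid.
    ok = (ix >= 0) & (ix < grid_size) & (iy >= 0) & (iy < grid_size)
    grid[iy[ok], ix[ok]] = 1
    return grid

# 3,600 beams at a uniform 5 m range: a circle of hits around the robot.
angles = np.linspace(-np.pi, np.pi, 3600, endpoint=False)
ranges = np.full(3600, 5.0)
grid = project_scan(ranges, angles)
```

Note the scatter write at the end: on the GPU, multiple beams can land in the same cell, so the Mojo version will need atomics or a deterministic reduction there.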
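And for the critical zone checker, a deliberately simplified baseline: we stand in an axis-aligned box for the robot's kinematic footprint (the real check accounts for velocity and the actual robot shape), since the kernel structure — a parallel point-wise test followed by a global any-reduction over ~100,000 points — is the same.

```python
import numpy as np

def any_point_in_footprint(points, half_extents):
    """Return True if any cloud point lies inside an axis-aligned box
    centred on the robot. The box is a stand-in for the real kinematic
    footprint; the parallel test + any-reduction shape is what matters."""
    inside = np.all(np.abs(points) <= half_extents, axis=1)
    return bool(inside.any())

rng = np.random.default_rng(1)
cloud = rng.uniform(-10.0, 10.0, size=(100_000, 3))
cloud[0] = 0.0  # guarantee at least one point inside the footprint
stop = any_point_in_footprint(cloud, np.array([0.4, 0.3, 1.0]))
```

Because this gates safety stops, the GPU version should favour an early-exit or warp-level reduction rather than materialising the full boolean mask.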
Hardware Targets:
We are specifically targeting shared/unified memory architectures, which are currently the standard in edge robotics:
- NVIDIA Jetson: Orin, AGX Xavier, Thor
- AMD: Strix Halo APUs
How to Collaborate:
I am currently setting up the scaffolding repository to make it easy to write, compile, and test these specific kernels.
All are welcome, especially those interested in:
- Writing the initial Mojo kernels for these math operations.
- Hardware-specific kernel optimizations for Jetson/Strix Halo.
- Zero-copy memory management strategies between C++ and Mojo.
If you have Jetson boards, access to Strix Halo, or just love squeezing raw performance out of compiler architectures, we’d love your help. Let’s see what Mojo can do in the physical world!
(Repo link for the scaffolding will be posted in this thread shortly).