CPU benchmark finding: Mojo vs Numba sensitive to default thread/runtime behavior — best practices for Mojo defaults?

Hi Mojo team,

I’ve been benchmarking Mojo against NumPy/Numba/JAX/CuPy on a single HPC node (96 cores visible, H100s present), mostly with xarray-backed weather data and float64 kernels.

I’m seeing a consistent pattern on CPU stencils (e.g., 2D Laplacian / grad2d):

  • Mojo is very strong vs NumPy.

  • Mojo can be close to or better than Numba at some sizes.

  • But results are sensitive to runtime/thread defaults, and Numba behavior changes a lot depending on environment/thread setup.

I added a controlled sweep (multiple sizes, p50/p90/mean, repeated trials) and standardized envs (NUMBA_NUM_THREADS, OMP_NUM_THREADS) for fairness. This made outcomes much more stable and changed conclusions for some sizes.

My questions for Mojo runtime/compiler experts

  1. What are Mojo’s default CPU threading/scheduling policies for parallelize?

    • How is worker count chosen by default?

    • Is it based on logical cores, affinity mask, cpuset, or something else?

    • Is there work-stealing / chunking policy documentation?

  2. Is there a Mojo-native equivalent of “set thread count” at runtime (like Numba’s set_num_threads) that users should call explicitly in benchmarks and production?

  3. Could Mojo expose a stronger default policy for CPU kernels in shared HPC environments (NUMA/cgroups/cpuset) so users get stable near-optimal behavior without manual tuning?

  4. Are there recommended env vars or flags today for:

    • worker count,

    • chunk size / grain size,

    • affinity / pinning,

    • scheduler mode (throughput vs latency)?

  5. Compiler/codegen side: for memory-bound stencils, are there known current limitations or preferred patterns to help auto-vectorization and vector stores in Mojo?

Why I’m asking

From a user perspective, Numba feels “optimized by default” more often for this class of CPU kernels, while Mojo appears to benefit from more explicit runtime control and kernel shaping. I’d love guidance on:

  • best-practice defaults today,

  • and whether Mojo could adopt smarter out-of-the-box CPU runtime defaults so users get closer to peak performance without manual setup.

If useful, I can share a minimal reproducer (Laplacian kernel + sweep harness).

Thanks!


CPU has been a little neglected in Mojo, so it’s not really at a place I’m super happy with. Right now, there’s a lack of NUMA awareness, and it’s one thread per core (ignoring little cores on big.LITTLE architectures) without work stealing. Mojo should respect cpuset, but I don’t know if it’s actually documented to respect anything else. You can control the worker count for parallelize by passing different values of num_workers, and it will hand each worker num_work_items / num_workers tasks.
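For reference, a minimal sketch of what that looks like in code (Mojo’s APIs shift between releases, so treat the exact signatures here as approximate rather than canonical):

```mojo
from algorithm import parallelize

fn main():
    var n = 1_000_000
    var data = List[Float64](length=n, fill=0.0)

    @parameter
    fn body(i: Int):
        data[i] = Float64(i) * 2.0

    # Default: the runtime picks the worker count (roughly one per core).
    parallelize[body](n)

    # Explicit: pin the worker count yourself. Each worker gets a block of
    # about n / num_workers items, and there's no work stealing to
    # rebalance if some blocks finish early.
    parallelize[body](n, 8)
```

For benchmarking, passing num_workers explicitly is the closest current analogue to Numba’s set_num_threads.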

Quite honestly, Mojo hasn’t put a lot of effort into the CPU side, so a lot of things are much slower than they should be, and it doesn’t handle CPUs (especially modern ones with split L3) very well by the standards that Mojo tries to hold itself to. At present, you’re effectively looking at a very naive implementation of threading, to the degree that I’m surprised Mojo is holding up at all.

MAX, the graph compiler, may be slightly better at handling NUMA due to some recent work and is also generally a decent amount faster than normal Mojo since it allows the compiler to focus on just mathematical operations. Some of the CPU-based HPC work I’ve tossed at MAX (mainly plasma simulations) has substantially outperformed existing libraries used by my institution, so I’d highly recommend using it if you aren’t already.

Mojo does not have autovectorization. Right now, doing it properly is still basically an unsolved problem, so instead it was decided to go the route of building the entire language around portable SIMD. If you’re familiar with Highway or xsimd, it’s a bit like that, except that the entire language and standard library are designed around it. If you’re using UnsafePointer, store has a width parameter. This was done so that users always get the behavior they expect, instead of having to hope that the compiler can figure out what they meant to do, without sacrificing portability the way raw intrinsics do.
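As a concrete sketch of that style (again, exact method signatures vary a bit across stdlib versions, so this is illustrative):

```mojo
from memory import UnsafePointer

fn scale_in_place(p: UnsafePointer[Float64], n: Int, a: Float64):
    alias width = 4  # you choose the vector width; the compiler doesn't guess

    var i = 0
    while i + width <= n:
        # Explicit SIMD load/store of `width` lanes at a time.
        var v = p.load[width=width](i)
        p.store(i, v * a)
        i += width

    # Scalar tail for the leftover elements.
    while i < n:
        p[i] = p[i] * a
        i += 1
```

The stdlib’s vectorize helper in the algorithm module wraps this pattern for you, including the tail handling, so hand-rolling the loop like this is only needed when you want full control.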

Absolutely. One of the proposals I’ve put forward, for when the language is a bit more complete, is to allow users to make use of hwloc (from OpenMPI) data at compile time. This would let you metaprogram kernels for arbitrary NUMA topologies. Given that Mojo, via MAX’s JIT, can have access to information about work sizes at compile time, this should enable library developers to make more informed decisions about whether using multiple threads/chiplets/sockets is warranted for a particular specialization of a kernel. My hope is to make the kind of environment-variable tuning you’re talking about less necessary, since kernel authors could write code that works through that process to design a kernel for one particular input shape on a particular piece of hardware.
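To illustrate the kind of specialization that enables, here’s a hypothetical sketch (the 4096 threshold and the shape parameter are made up for illustration; real thresholds would come from topology data):

```mojo
from algorithm import parallelize
from memory import UnsafePointer

fn scale[n: Int](p: UnsafePointer[Float64], a: Float64):
    @parameter
    if n < 4096:
        # Small shapes: fork/join overhead dominates, so stay serial.
        for i in range(n):
            p[i] = p[i] * a
    else:
        # Large shapes: spread the work across the runtime's workers.
        @parameter
        fn body(i: Int):
            p[i] = p[i] * a

        parallelize[body](n)
```

Because n is a compile-time parameter, the branch is resolved during compilation, so each specialization contains only the code path appropriate for its shape; hwloc data could feed machine-specific thresholds into exactly this kind of decision.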


Thank you for clarifying.

I am so excited for the future of Mojo but can you perhaps clarify how hard it would be for Mojo/Max to be available for CUDA 12 and say driver 550+ as they are right now (stdetc included)?

I don’t think there’s a lot stopping it. The upgrades to CUDA 13 were mainly for DGX Spark compatibility as far as I’m aware. I’ll point you to @BradLarson for what Modular can do for you here.

The claim is valid. Mojo doesn’t have to hope the compiler handles loop parallelization. But since we have MLIR, we could still sneak auto-vectorization into Mojo. :joy:

To use MAX and Mojo with an older NVIDIA driver or CUDA version, you need to first install the CUDA toolkit for your local version of CUDA and then set the environment variable MODULAR_NVPTX_COMPILER_PATH to the path where your toolkit installed ptxas:

export MODULAR_NVPTX_COMPILER_PATH=/usr/local/cuda/bin/ptxas

That’ll let you use whatever NVIDIA driver and CUDA version you have on your system. I’ve used that successfully across Mojo and MAX with CUDA 12 and driver versions older than 550.


I work on an HPC system where I cannot install or update anything, and I tried setting MODULAR_NVPTX_COMPILER_PATH based on readlink -f "$(which ptxas)", but that still did not work for me, unfortunately.

I just answered so nobody tries this and wastes their time :slight_smile:

For

(runtime unavailable: At oss/modular/mojo/stdlib/std/gpu/host/device_context.mojo:3290:17: MAX doesn't support your current NVIDIA GPU driver. MAX requires a minimum driver version of 580 and CUDA version 13.0. Your driver version is 570.124.06)

I did

export MODULAR_NVPTX_COMPILER_PATH=/usr/local/cuda-12.8/bin/ptxas

And got

(runtime unavailable: At oss/modular/mojo/stdlib/std/gpu/host/device_context.mojo:1915:17: CUDA call failed: CUDA_ERROR_INVALID_IMAGE (device kernel image is invalid)

I think that might be an unfortunate side-effect of compilation caching. When you hit the first error, it had attempted to compile binaries for CUDA 13, but setting the environment variable doesn’t invalidate the cache and so it tried to use those with your CUDA 12 driver.

If you’re in a Pixi environment, you can run a pixi clean to wipe the local caches and rebuild with that environment variable set. Otherwise, they’re sometimes located in ~/.modular/cache/ and can be deleted and re-built. You can also change your Mojo or MAX code to trigger a rebuild. Sorry about the hassle, changes in environment variables don’t yet register as cache invalidation conditions for compiled Mojo code.