Marrow — Apache Arrow in Mojo


Apache Arrow is a universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics: a language-independent, column-oriented memory format organized for efficient analytic operations on modern hardware. It powers Pandas 2.0, Polars, Spark, DataFusion, and virtually every modern data tool, and it works alongside formats like Apache Parquet. PyArrow alone (just one implementation out of a dozen, and just one distribution channel) is downloaded 300 million times a month. Arrow isn’t a niche format; it’s infrastructure. Mojo needs it as a first-class citizen.

That’s why I started Marrow: a native Apache Arrow implementation in Mojo.

Where it is today

The core abstractions are in place — arrays, builders, compute kernels, Python bindings, and zero-copy interop with PyArrow via the C Data Interface. The implementation is actively growing toward feature parity with other Arrow implementations, with new types and kernels added regularly. There is also early experimental GPU support via Mojo’s DeviceContext.
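For anyone new to the C Data Interface, here is a rough Python sketch of the idea using only `ctypes`. It is not Marrow's code: the field layout follows the `struct ArrowArray` definition in the Arrow spec, while `export_int64`, `view_int64`, and the `_live` bookkeeping are hypothetical helpers made up for illustration. The companion `ArrowSchema` struct, which describes the type, is left out to keep it short.

```python
import ctypes
import itertools


class ArrowArray(ctypes.Structure):
    """Mirror of `struct ArrowArray` from the Arrow C Data Interface spec."""


# C signature: void (*release)(struct ArrowArray*)
RELEASE_FN = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", RELEASE_FN),
    ("private_data", ctypes.c_void_p),
]

# Hypothetical bookkeeping: keeps exported buffers alive until release runs.
_live = {}
_next_key = itertools.count(1)


@RELEASE_FN
def _release(ptr):
    arr = ptr.contents
    _live.pop(arr.private_data, None)  # drop the producer's references
    arr.release = RELEASE_FN()         # a NULL release marks the struct as released


def export_int64(values):
    """Producer side: describe an int64 array without copying its data again."""
    data = (ctypes.c_int64 * len(values))(*values)
    # Primitive layout: buffers[0] is the validity bitmap (NULL when there
    # are no nulls), buffers[1] holds the values.
    buffers = (ctypes.c_void_p * 2)(None, ctypes.addressof(data))
    key = next(_next_key)
    _live[key] = (data, buffers)

    arr = ArrowArray()
    arr.length = len(values)
    arr.null_count = 0
    arr.offset = 0
    arr.n_buffers = 2
    arr.n_children = 0
    arr.buffers = ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p))
    arr.release = _release
    arr.private_data = key
    return arr


def view_int64(arr):
    """Consumer side: view the values in place through the exported pointer."""
    addr = arr.buffers[1] + arr.offset * ctypes.sizeof(ctypes.c_int64)
    return (ctypes.c_int64 * arr.length).from_address(addr)
```

The consumer never copies the values: it reads them through the pointer the producer handed over, then calls `arr.release(ctypes.byref(arr))` once it is done so the producer can free the memory.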

Performance

Early benchmarks are promising — take them with a grain of salt since PyArrow backed by Arrow C++ is a heavily optimized library and benchmarking is still ongoing. On Python-to-Arrow conversions, Marrow is already 1.3–3.9x faster than PyArrow for numeric, string, and nested list types. Bitmap operations (popcount, AND, OR, invert) benefit nicely from Mojo’s clean SIMD abstractions. Even at this experimental stage, computations with pre-loaded Arrow arrays on GPU show promising numbers.
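To make the bitmap operations concrete, here is a small scalar Python sketch of what they compute. It is an illustration only, not Marrow's kernels: the function names are made up, and a SIMD version would process many bytes per instruction instead of one at a time. The one fact taken from the Arrow format is that validity bitmaps are packed least-significant-bit first within each byte.

```python
def pack(bits):
    """Pack an iterable of 0/1 flags into an Arrow-style LSB-first bitmap."""
    bits = list(bits)
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i >> 3] |= 1 << (i & 7)
    return bytes(out)


def popcount(bitmap, length):
    """Count set bits among the first `length` bits, ignoring padding bits."""
    full, rem = divmod(length, 8)
    count = sum(bin(byte).count("1") for byte in bitmap[:full])
    if rem:
        mask = (1 << rem) - 1  # keep only the low `rem` bits of the last byte
        count += bin(bitmap[full] & mask).count("1")
    return count


def bitmap_and(a, b):
    return bytes(x & y for x, y in zip(a, b))


def bitmap_or(a, b):
    return bytes(x | y for x, y in zip(a, b))


def bitmap_invert(a):
    # Padding bits flip too, which is why popcount takes an explicit length.
    return bytes(~x & 0xFF for x in a)
```

For example, `popcount(bitmap_and(a, b), n)` gives the number of rows that are valid in both of two `n`-element arrays.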

Come contribute

Lots of room to grow — datetime/decimal/dictionary types, C Data Interface completion, more kernels. If you’re into systems programming, data formats, or GPU compute in Mojo, jump in.

https://github.com/kszucs/marrow

This is awesome, I’ve heard of a lot of people interested in Mojo for data processing. I’m thrilled to see this!

This is great. Have you considered taking advantage of Mojo’s support for custom Literals to do something like:

var x: Int64Array = [10, 20, 40]

ref: modular/mojo/proposals/collection-literal-design.md at main · modular/modular · GitHub

Good idea, I wasn’t aware of that. Implemented at feat(arrays): add list literal support for PrimitiveArray and StringA… · kszucs/marrow@04b68b7 · GitHub

Glad to hear that! Mojo sits in the sweet spot for data processing, and Arrow is the bridge to that ecosystem.

amazing! :fire:

I’ve had the same idea in mind for months now: building an Arrow lib in Mojo, since it is fundamental to today’s data stack :nerd_face:

Initially I was thinking (very ambitious / unrealistic) of rebuilding (the basic functionality of) data processing frameworks like polars / spark / DataFusion, which ofc is yeeeeeears++ of work, but depending on how much AI/agents evolve it’s slightly more realistic :laughing:

Do you see Marrow as simply the implementation of the Arrow standard in Mojo, or do you see it becoming more tightly coupled to higher-level abstractions, like the relationship between arrow-rs and `polars`?

I would love to see, and collaborate with someone on, geospatial observation data ingest, QC, and even post-processing.

I originally planned to implement a low-level-only Arrow library suitable for the memory representation and interoperability, but I was too curious about what compute performance it could achieve, so I went ahead and implemented a basic expression system (in the long run it could be somewhat similar to ibis) and a compute layer. After some tuning I managed to reach, for example, 2x better single-threaded join performance than Polars (even though Polars is highly optimized) and several times better performance than Arrow C++’s compute layer. Roughly 80-90 percent of my existing benchmarks show better performance than the other two libraries, which makes me pretty optimistic.
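To show what "expression system" means here, below is a toy Python sketch of deferred expressions evaluated column by column. It is only an illustration of the general idea, not Marrow's API: `Col`, `Lit`, `Binary`, and `evaluate` are hypothetical names, and a real compute layer would dispatch to typed kernels rather than Python lists.

```python
import operator


class Expr:
    """Base class: operators build a tree instead of computing right away."""

    def _binary(self, op, other):
        rhs = other if isinstance(other, Expr) else Lit(other)
        return Binary(op, self, rhs)

    def __add__(self, other):
        return self._binary(operator.add, other)

    def __mul__(self, other):
        return self._binary(operator.mul, other)

    def __gt__(self, other):
        return self._binary(operator.gt, other)


class Col(Expr):
    def __init__(self, name):
        self.name = name

    def evaluate(self, table):
        return table[self.name]


class Lit(Expr):
    def __init__(self, value):
        self.value = value

    def evaluate(self, table):
        # Broadcast the scalar to the table's row count.
        n_rows = len(next(iter(table.values())))
        return [self.value] * n_rows


class Binary(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def evaluate(self, table):
        left = self.left.evaluate(table)
        right = self.right.evaluate(table)
        return [self.op(l, r) for l, r in zip(left, right)]
```

Writing `(Col("a") * 2 + Col("b")) > 20` builds a tree without touching any data; calling `.evaluate(table)` on it then runs the whole thing over the columns, which is the point where an execution layer can plan and optimize.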

So it will probably include an execution layer (it already does). I’m generally open to including relevant features.