Cairo and Datashader rewrites in Mojo: opinions needed

I work with Dask-backed Xarray Datasets that contain numerical weather prediction (NWP) model outputs (think a temperature field at 67 levels on a global grid).

I want to accelerate two things.

  1. Rasterization and visualization of these Xarray Datasets. I currently use Datashader-backed Bokeh plots through HoloViz for interactive work, and Cairo-backed Matplotlib for static image production. Is Gemini right to suggest that Mojo could accelerate rasterizing and visualizing both my interactive and static plots?
  2. Regridding. Essentially I want to accelerate what xESMF does and take advantage of the GPU. xESMF lets you pass an input and an output Xarray Dataset plus a regridding method (say nearest neighbour), and it calculates the regridding weights for you. I want to replace the underlying C++/Fortran library for obvious reasons, but once again I wonder whether Mojo is a good fit for that right now.
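
To make the "compute the weights once, apply them to every field" idea concrete, here is a minimal, dependency-free sketch of nearest-neighbour regridding. It is not how xESMF or ESMF implement it; the function names (`haversine`, `nearest_neighbour_weights`, `apply_weights`) and the tiny grids are made up for illustration, and the search is a brute-force loop where a real library would use a spatial index and sparse weights.

```python
import math

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in radians between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * math.asin(min(1.0, math.sqrt(a)))

def nearest_neighbour_weights(src_points, dst_points):
    """For each destination point, find the index of the closest source point.

    Brute force, O(len(src) * len(dst)). For nearest neighbour the "weights"
    reduce to one source index per destination point.
    """
    return [
        min(range(len(src_points)), key=lambda j: haversine(*dst, *src_points[j]))
        for dst in dst_points
    ]

def apply_weights(index_map, field):
    """Regrid one flattened field by reusing the precomputed index map."""
    return [field[j] for j in index_map]

# Hypothetical (lat, lon) points in degrees: a 4-point source grid
# and 3 destination points.
src_points = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0), (10.0, 10.0)]
dst_points = [(1.0, 1.0), (9.0, 8.0), (2.0, 9.5)]

# Expensive step: done once per pair of grids.
index_map = nearest_neighbour_weights(src_points, dst_points)

# Cheap step: repeated for every level, variable and time step.
temperature_level_0 = [280.0, 281.5, 279.0, 283.2]
temperature_level_1 = [270.0, 271.5, 269.0, 273.2]
regridded_0 = apply_weights(index_map, temperature_level_0)
regridded_1 = apply_weights(index_map, temperature_level_1)
```

The split matters for acceleration: the weight search is the part with irregular memory access, while applying the weights is a plain gather that maps naturally onto SIMD or a GPU.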

Any comments are welcome and I am hoping to hear why or why not :blush:.

PS I understand Mojo is in active development and might be missing some important parts but that’s part of my question.

As far as plotting software goes, Matplotlib is very, very slow, even by Python’s standards. Mojo doesn’t really have an equivalent yet unless you’re willing to write your own, so I’d suggest Plotly for the actual “drawing the graph” part, since I’ve found that it handles large volumes of data much better.

For the number crunching part, Mojo should be able to help you out, since pure number crunching is the “most complete” area of Mojo. However, xESMF has a lot of code behind it, so replacing that is a non-trivial task. If you’re up to it, I would take a look at a rewrite, since a lot of the assumptions made in CPU-based libraries don’t hold up that well once you want to become GPU capable. Using Mojo’s native LayoutTensor will also help reduce friction a bit, so long as it does what you need.

@duck_tape Any thoughts?

I wish this was a topic I knew much about! Topically, I fully agree with Owen that Matplotlib is generally going to be pretty slow. Outside of that, I just don’t know enough about visualization stacks to weigh in here.

Starting the process of Mojo-native plotting would be a big lift, but incredibly valuable.

Thank you so much for the feedback.

Part of geospatial AI is regridding, spherical harmonics, grid-based calculus and, at the end, visualisation.

I am a huge fan of what Modular is doing, but I guess this will be a weekend project of mine. I am not good enough to be a maintainer of a Mojo equivalent of Cairo or ESMPy, but hopefully I can get a prototype going that will spark the interest of someone who is :blush:

I wouldn’t sell yourself short! If you start on it and want help setting up the project structure / packaging once you have things working-ish, let me know and I’d be happy to help out on that end!

I took a first stab last weekend at auto-generating bindings for Cairo in Mojo, and I have the low-level Cairo API functional. I ran into some problems when trying to wrap it in a higher-level API, probably due to Mojo origins. If you are interested, I can open the repository on GitHub as a starting point.

Could you please? I am very curious as to how it works. I am unfortunately unable to test out Mojo at work (HPC-RedHat) but I can definitely test it out at home and see if I can contribute.

Here are the bindings to libcairo at their current stage

Interesting, you seem to have the bindings to the Cairo library. I was wondering, though, whether parts or maybe the whole of Cairo is worth rewriting so we can take advantage of Mojo’s strong points.

I have not architected such large projects, but it seems to me you wouldn’t get all the benefits unless you rewrite the library and structure it in a certain way (data structures, control flow, compute flow, etc.).

What is the largest Mojo library at the moment that is complex but very well written and could serve as a guide for large libraries in the future?

The community package repo is

where you can find a variety of repos in varying states. Remember that Mojo + MAX is constantly evolving.

Thank you. I imagine NuMojo is the best one to model my work on, since it does the whole zero-copy array calculation bit that would accelerate a lot of what Cairo and Datashader are doing.

As to my other question: is there a particular way I should think about architecting Mojo libraries to take full advantage of the magic, or can I simply replace individual functions or create bindings for parts of a library in Mojo and expect it to perform at its best? (I assumed that to take advantage of the MLIR compiler magic I would need to have global data structures or use Mojo as the driver of my code.)

I am not sure the problem is in Cairo. Cairo is a pretty well optimized and battle-tested C library. The problem with Python visualization is the layers of indirection on top of the backend. Matplotlib is pretty slow with large data because it has to take the data, figure out its shape and mapping, convert the data points into plot coordinates, and build the plot layout itself. This is very slow to do in Python. Also, in my experience Dask is not a good solution for out-of-core data either.
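
To illustrate the "convert the data points into plot coordinates" step, here is a rough pure-Python sketch of the kind of per-point loop a rasterizer has to run: map each data coordinate to a pixel, then aggregate per pixel. The function name `rasterize_counts` and the sample data are hypothetical, and this is only the simplest count aggregation, not Datashader's or Matplotlib's actual code.

```python
def rasterize_counts(xs, ys, x_range, y_range, width, height):
    """Bin (x, y) points into a height x width grid of per-pixel counts.

    Row 0 corresponds to the lowest y values, column 0 to the lowest x values.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    x_scale = width / (x_max - x_min)
    y_scale = height / (y_max - y_min)
    for x, y in zip(xs, ys):
        # Skip points that fall outside the plotted range.
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Data coordinates -> pixel indices; clamp so the upper edge
        # lands in the last pixel instead of one past it.
        col = min(int((x - x_min) * x_scale), width - 1)
        row = min(int((y - y_min) * y_scale), height - 1)
        grid[row][col] += 1
    return grid

# Six made-up points, one of them outside the x range.
xs = [0.1, 0.2, 0.9, 0.95, 0.5, 2.0]
ys = [0.1, 0.15, 0.9, 0.99, 0.5, 0.5]
grid = rasterize_counts(xs, ys, (0.0, 1.0), (0.0, 1.0), width=2, height=2)
print(grid)  # [[2, 0], [0, 3]]
```

In interpreted Python this loop costs several bytecode dispatches per point, which is why Datashader compiles its equivalent with Numba. A compiled language runs the same arithmetic as a tight loop, and that is the part where Mojo would plausibly help.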

I think the best acceleration would come from a plotting library in Mojo (potentially GPU accelerated), similar to Makie.jl in the Julia ecosystem. This would dramatically enhance the experience of your application.

What the Mojo ecosystem needs is a way to model tabular or N-dimensional data; NuMojo could be a really good candidate for the underlying library. It also needs bindings to graphics backends, and finally we need the plotting library itself, to transform data points into plot coordinates with all the needed transformations. Imagine a grammar-of-graphics plotting library, but GPU-accelerated and in a native language.

This is where I am at as well. I work with geospatial data, but NDArrays are big in bio and AI as well, so having a library like NuMojo as a backend for Xarray would be super helpful.

However, wouldn’t we lose the SIMD optimization if we use Mojo to pass data to Cairo and such?

And it seems like Mojo lacks a couple of other features needed for such a library; see NuMojo/numojo/core/ndarray.mojo on the main branch of Mojo-Numerics-and-Algorithms-group/NuMojo on GitHub.

Cairo is slower than other backends because it is a vector graphics library. There are faster options like AGG, or even GPU-accelerated ones.

AFAIK, the main bottleneck in Python visualization is Python itself.

Yes I use AGG for some semi-production level batch jobs, but I can’t shake the feeling that I could have gone faster if I could be more efficient with my IO.

Have you given DuckDB a try?

Absolutely! Love DuckDB and ClickHouse and DeltaLake (most of all) for columnar data but NWP model outputs are not compatible with DuckDB Spatial unfortunately.

can I simply replace individual functions or create bindings for parts of a library in Mojo and expect it to perform at its best

You can probably get 90% of the way there just doing that, but you probably want to leverage MAX for the last 10% (which could actually be most of the performance in some cases).

Right, I guess I am a weird case because I care disproportionately about the TB-to-PB scale, the batch scale, or the HPC side of these.

I do not know how much faster a NuMojo-backed Xarray would be than a NumPy- or Dask-backed one. Anyway, Mojo has so much potential in my field (geospatial data science and AI) that I am just excited to see it progress :slight_smile: