On the new json module

I see in the recent nightlies that a json module was introduced to the stdlib, which I think is a good thing, but I’m curious why we didn’t consider porting in the EmberJSON implementation. At the time I started the project I was told that a json module was unlikely to be included in the stdlib, so naturally I built it as a community library. Since that is no longer the case, I think it would make sense to take advantage of the work already done rather than reinvent the wheel.

At the time of writing, EmberJSON has the following benefits:

  • 0 Dependencies
  • 114 existing tests
  • An existing benchmarking suite
  • Superior performance to the current stdlib implementation
  • UTF-8 support
  • Fast floating-point parsing using the simdjson algorithm, with a truncating fallback implementation for floats with many decimal digits
  • Distinct parsing of integers and floats to avoid precision loss for large integers
  • SIMD-accelerated minifier
  • SIMD-accelerated string parsing
  • Tree-based object representation to improve large object parsing performance
  • Pretty-print formatting
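To illustrate why the distinct integer/float parsing in the list above matters, here is a minimal sketch in Python (not Mojo, and not EmberJSON's actual code). It only demonstrates the general precision issue: a 64-bit float cannot represent every integer above 2**53, so routing integer literals through a float path silently changes their value.

```python
import json

# 2**53 + 1 is the smallest positive integer that a 64-bit float cannot represent.
text = "9007199254740993"

# Python's json module keeps integer literals as int, so the value is exact.
as_int = json.loads(text)

# Forcing the same text through a float path rounds it to the nearest double.
as_float = float(text)

print(as_int)          # exact integer
print(int(as_float))   # off by one after rounding
```

A parser that decides between the integer and float paths up front avoids this rounding for any value that fits the integer type.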

Granted, I do employ a fair amount of unsafe code to achieve this, so if the team is unwilling to adopt it for that reason, then I respect that. I just figured I would prod a bit here out of curiosity.

- Brian

We should at least blindly port all the tests.

Also, as I’ve pointed out on Discord, the current naming is far from good. load as a free function is just a strict copy of Python, whose API is confusing. load and loads are not informative and only make sense with the module name as context.

from json import load

var _ = load(...) # not enough context to make `load` unambiguous

As stated here, there are no objections to having a different API on the Mojo side, with extension support for Python compatibility.
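For context on the naming complaint, this is how the two functions differ in Python's own json module, which is where the names come from. The wrapper names `parse_json_string` and `read_json_file` are hypothetical, chosen only to show what a more self-describing API could look like; they are not a proposal from the stdlib team.

```python
import io
import json

document = '{"name": "EmberJSON", "tests": 114}'

# loads: parse from a str. load: parse from a file-like object.
# The only hint at the difference is the trailing "s".
from_text = json.loads(document)
from_file = json.load(io.StringIO(document))

# Hypothetical, more descriptive names that stay readable when imported bare.
def parse_json_string(text):
    return json.loads(text)

def read_json_file(fp):
    return json.load(fp)

print(from_text == from_file)
```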

Overall I am +1 for using EmberJSON as a base for future improvements.

@joe gentle ping, as you are the stdlib team lead.

FWIW if anyone ever builds something that uses json for compute-heavy workloads, then they are probably going to search for libraries like EmberJSON.

I’m guessing, but I imagine the thought process behind the decision was somewhere along the lines of: “less lines, less maintenance”. The stdlib implementation currently stands at around 1270 lines.

I do, however, sincerely hope that EmberJSON is fully incorporated into the stdlib: it is faster and much better organized. I think many in the community would be willing to step in as maintainers. Unstructured JSON parsing is something that should be SOTA in the stdlib; maybe we can leave structured JSON to external libraries.

FWIW I also feel that I would become quite distraught if something similar were to happen with an inferior datetime library being incorporated into the stdlib instead of the one I’ve worked on for months with the express goal of integrating it into the stdlib. IMO Modular should not disregard the community like this…

Hi Bgreni,

I wasn’t aware of the EmberJSON project - thanks for bringing it to my attention. I’d definitely encourage you to submit a PR replacing our existing JSON module with what you’ve developed. Your implementation looks to be of higher quality than what I’ve put together.

To be transparent, I wrote the current JSON code during a plane ride as a quick solution, so there’s nothing particularly special about it. I’d be happy to see it replaced with your more robust implementation.

Looking forward to your contribution!

Best regards

Thanks Abdul, I appreciate the update! I’ll touch base with the stdlib team later to figure out how best to move forward with that, since it would be a fairly hefty contribution.

Just FYI: No need to integrate anything. I’d be happy to delete the json.mojo file and replace it with your implementation if that’s the easiest approach.

I’ve done a few things that Joe and others might take issue with from a maintenance-burden point of view (like writing number parsing and stringification from scratch), so there might need to be some discussion around that.

Purely for coordination purposes, here is the PR for fast float parsing. If everything goes well, I should get a review this week.

Hey @bgreni I’ll take a look at your library and let’s work together with @gabrieldemarmiesse as there’s likely room for overlap in creating space for a “format” module to contain these shared utilities.

Thanks Joe! Admittedly the documentation is a bit sparse, so let me know if you have any questions.

I think there might be room for a JSON-specific implementation, or at least for splitting up or parametrizing atof, because there are things like 1e10_0 which aren’t part of the JSON standard.

If I had it my way I would probably prefer keeping them separate for such reasons, and also because I am inlining things much more aggressively than we probably should for the general-use atof function, but in the end it’s up to the team of course.
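The 1e10_0 point can be shown with Python as a stand-in: its general-purpose float() accepts digit separators, while the JSON number grammar (RFC 8259) does not. The regular expression below is a sketch of that grammar, and `is_json_number` is a hypothetical helper name; none of this reflects how Mojo's atof or EmberJSON is actually implemented.

```python
import json
import re

# JSON number grammar per RFC 8259: optional minus, integer part without
# leading zeros, optional fraction, optional exponent. No underscores.
JSON_NUMBER = re.compile(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?")

def is_json_number(text):
    """Return True only if the whole string is a valid JSON number."""
    return JSON_NUMBER.fullmatch(text) is not None

# A general-purpose float parser may accept digit separators...
permissive = float("1e10_0")

# ...but a JSON parser has to reject the same text.
try:
    json.loads("1e10_0")
    strict_rejected = False
except json.JSONDecodeError:
    strict_rejected = True

print(is_json_number("1e10"), is_json_number("1e10_0"), strict_rejected)
```

This is the kind of divergence that argues for either a JSON-specific number parser or a general one with a strictness parameter.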

@bgreni @joe @adakkak, I’m loving the discussion about potentially including EmberJSON in the stdlib. I’d like to suggest a few thoughts related to data serialization / deserialization and data format support, though I’m not sure if this is the right place to have that discussion.

The topics can roughly be organized as follows:

  1. Ser/De and data-formats beyond json
  2. What formats belong in std-lib

1. Serde beyond json

One of the things that both Go and Rust did really well was to expose common mechanisms / traits for struct serde to and from various data formats. This is something that’s lacking in pure Python but would be great to have in the stdlib for Mojo.

  • Go exposed the conventional way of using tags plus a few common interfaces (the various Marshaler and Unmarshaler interfaces in package encoding) to drive data marshaling. This is probably one of the language features that has allowed Go to be used heavily to define APIs and data types across the cloud-native ecosystem (e.g. the entire Kubernetes API is canonically defined as Go structs)

  • Rust has obviously had the serde crate for a long time now. I haven’t followed recent developments of alternatives (e.g. sud, merde, etc.), so some of the thinking around the state of the art might have evolved here
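As a rough sketch of the idea in the two bullets above, the example below separates "how a struct describes itself" from "how a format writes that description", which is the split that Go's marshaling interfaces and Rust's serde both rely on. It is written in Python for illustration only; the names `Format`, `JsonFormat`, `KeyValueFormat`, `to_tree`, `serialize`, and `Pod` are all hypothetical and do not correspond to any existing Mojo, Go, or Rust API.

```python
import json
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Protocol

class Format(Protocol):
    """What a data format must provide: turn a plain tree into text."""
    def dumps(self, tree: Any) -> str: ...

class JsonFormat:
    def dumps(self, tree):
        return json.dumps(tree, sort_keys=True)

class KeyValueFormat:
    """Toy flat format, included only to show that formats are swappable."""
    def dumps(self, tree):
        return "\n".join(f"{key}={value}" for key, value in sorted(tree.items()))

def to_tree(value):
    """Format-independent step: reduce dataclass instances to dicts and lists."""
    if is_dataclass(value) and not isinstance(value, type):
        return {f.name: to_tree(getattr(value, f.name)) for f in fields(value)}
    if isinstance(value, (list, tuple)):
        return [to_tree(item) for item in value]
    return value

def serialize(value, fmt):
    """Any struct that to_tree understands works with any Format."""
    return fmt.dumps(to_tree(value))

@dataclass
class Pod:
    name: str
    replicas: int

print(serialize(Pod("web", 3), JsonFormat()))
print(serialize(Pod("web", 3), KeyValueFormat()))
```

With this split, only the shared traits would need to live in the stdlib, while individual formats could ship as separate packages.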

2. Implementations

I think the ecosystem would benefit from defining the core ser/de or marshalling traits in the stdlib, but not all data formats need to live there. I’m not sure if something like this has been discussed yet, but in a world where it’s easy to fetch packages with a package manager, it might be nice to keep the stdlib smaller and set up a set of official extension packages, similar to how Go structures its various “x” packages (see Sub-repositories - Go Packages), i.e. those that are part of the Go project but out of tree.

For things that might have an open-ended number of variants, like data formats or hashing algorithms, especially ones that might go stale over the years, it might be nice to set up an out-of-tree package structure that is proactively resilient to future changes while still providing “official” implementations.