Lazy parsing arbitrary structs in EmberJSON

With all the powerful new type reflection utilities that have come down the pipeline, I was able to easily implement structured de/serialization in EmberJSON. Then recently I got the idea to explore adding support for lazy parsing of arbitrary types as well!

Before I dig into that, I'll give a quick overview of how I've implemented structured parsing so far, for those unfamiliar with Mojo's current reflection toolkit.

Reflection-based JSON parsing

My approach uses a recursive strategy to traverse the fields of a struct until it finds a type that conforms to JsonDeserializable. This trait has been implemented for most reasonable stdlib types via the experimental __extension syntax. Plain structs are simply treated as JSON objects; the only requirement is that they conform to Movable & ImplicitlyDestructible, and types with fields that have non-trivial destructors must also conform to Defaultable.

from emberjson import *


struct Foo(Movable, Writable):
    var a: Int
    var b: Float64


struct Bar(JsonDeserializable, Writable):
    var a: Int
    var b: Float64

    @staticmethod
    fn deserialize_as_array() -> Bool:
        return True


fn main() raises:
    var ob = '{"a": 10, "b": 345.234532}'
    var arr = "[1234, 2.435]"

    # Foo(a=10, b=345.234532)
    # Bar(a=1234, b=2.435)
    print(deserialize[Foo](ob))
    print(deserialize[Bar](arr))
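
As an aside on the Defaultable requirement mentioned earlier, a struct with a field that has a non-trivial destructor (such as String) might look something like this (a hedged sketch with made-up field names):

struct Baz(Movable, Defaultable):
    var name: String
    var count: Int

    fn __init__(out self):
        # A default state lets the parser construct the struct
        # before every field has been parsed
        self.name = ""
        self.count = 0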

Let's dig deeper and see how this is actually implemented. The bread and butter are these stdlib functions.

from std.reflection import (
    struct_field_count,
    struct_field_types,
    struct_field_names,
    get_type_name,
)


struct Foo:
    var a: Int
    var b: Bool


fn main():
    print(struct_field_count[Foo]())  # 2
    comptime types = struct_field_types[Foo]()
    comptime names = struct_field_names[Foo]()
    comptime for i in range(struct_field_count[Foo]()):
        comptime name = names[i]
        # Int a
        # Bool b
        print(get_type_name[types[i]](), name)

Now that we have a way of programmatically inspecting the fields of a struct, without needing explicit, ahead-of-time knowledge of which struct we are working with, we can start building the logic for parsing arbitrary Mojo structs.

The deserialize function is a thin wrapper around _deserialize_impl, which either dispatches to a type's own from_json implementation, or recursively calls _default_deserialize to check all of the target struct's fields for types that do conform to the trait.

fn _deserialize_impl[
    origin: ImmutOrigin, options: ParseOptions, //, T: _Base
](mut p: Parser[origin, options], out s: T) raises:
    comptime assert is_struct_type[T](), non_struct_error

    comptime if conforms_to(T, JsonDeserializable):
        s = downcast[T, JsonDeserializable].from_json(p)
    else:
        s = _default_deserialize[T, False](p)

The JsonDeserializable trait also houses extra configuration to customize parsing behaviour without the need to create a completely custom from_json implementation (for now it only supports deserializing a struct from an array instead of an object).

comptime _Base = ImplicitlyDestructible & Movable


trait JsonDeserializable(_Base):
    @staticmethod
    fn from_json[
        origin: ImmutOrigin, options: ParseOptions, //
    ](mut p: Parser[origin, options], out s: Self) raises:
        s = _default_deserialize[Self, Self.deserialize_as_array()](p)

    @staticmethod
    fn deserialize_as_array() -> Bool:
        return False

@always_inline
fn _default_deserialize[
    origin: ImmutOrigin,
    options: ParseOptions,
    //,
    T: _Base,
](mut p: Parser[origin, options], out s: T) raises:
    ...

    comptime field_count = struct_field_count[T]()
    comptime field_names = struct_field_names[T]()
    comptime field_types = struct_field_types[T]()

    comptime if is_array:
        ...
    else:
        p.expect(`{`)

        var seen = InlineArray[Bool, field_count](fill=False)

        while p.peek() != `}`:
            var ident = p.read_string()
            p.expect(`:`)

            var matched = False
            comptime for i in range(field_count):
                comptime name = field_names[i]

                if ident == name:
                    if unlikely(seen[i]):
                        raise Error("Duplicate key: ", name)
                    seen[i] = True
                    matched = True
                    ref field = __struct_field_ref(i, s)
                    comptime TField = downcast[type_of(field), _Base]

                    field = _deserialize_impl[TField](p)
              ...
        p.expect(`}`)

Despite how intimidating the type system wizardry may appear, the logic here is actually quite simple. We loop through each field in the object string, read the identifier string, and try to match that identifier against the field names in our target struct. Upon finding a match, we fetch a reference to that particular field using __struct_field_ref. Then we use downcast to confirm that the type of the target field is ImplicitlyDestructible & Movable, so the type checker accepts it as a parameter to _deserialize_impl, where the value of the field will be parsed and returned. This recursive process continues until the entire JSON structure has been parsed.

Lazy parsing

With all that in our toolkit, let's turn our attention to how we can use it to lazily parse arbitrary structs as well. We already have everything we need to perform the final parsing of these structures, so all that's missing is an additional layer that first collects a view of the bytes containing the target value.

Introducing the Lazy wrapper struct.


comptime ReadBytesFn[origin: ImmutOrigin] = fn(
    mut Parser[origin]
) raises -> Span[Byte, origin]
comptime ParseFn[T: _Base, origin: ImmutOrigin] = fn(
    Span[Byte, origin]
) raises -> T


fn __pick_byte_expect[T: _Base, origin: ImmutOrigin]() -> ReadBytesFn[origin]:
    comptime if conforms_to(T, JsonDeserializable) and downcast[
        T, JsonDeserializable
    ].deserialize_as_array():
        return _get_array_bytes[origin]
    else:
        return _get_object_bytes[origin]

@fieldwise_init
struct Lazy[
    T: _Base,
    origin: ImmutOrigin,
    parse_value: ReadBytesFn[origin] = __pick_byte_expect[T, origin](),
    extract_value: ParseFn[T, origin] = _deserialize_bytes[T, origin],
](Hashable, JsonDeserializable, JsonSerializable, TrivialRegisterPassable):
    var _data: Span[Byte, Self.origin]

    @staticmethod
    fn from_json[
        o: ImmutOrigin, options: ParseOptions, //
    ](mut p: Parser[o, options], out s: Self) raises:
        s = {Self.parse_value(rebind[Parser[Self.origin]](p))}

    fn write_json(self, mut writer: Some[Serializer]):
        writer.write(StringSlice(unsafe_from_utf8=self._data))

    fn get(self) raises -> Self.T:
        return Self.extract_value(self._data)

Once again this snippet may seem intimidating, but it is actually fairly simple. The Lazy struct takes four parameters: T, the target type for when we need to fully deserialize the value; origin, the origin of the source data being parsed; and parse_value and extract_value, which are each responsible for one of the two steps in the parsing process. parse_value is a function that, given a Parser instance, returns a Span containing the byte representation of the target value. For example, if T is some plain struct, then the default _get_object_bytes will return all the bytes for the next JSON object.
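
To make the first step concrete, here is a conceptual sketch (not EmberJson's actual implementation) of how a byte collector like _get_object_bytes can work: track brace depth until the opening brace is balanced, and return the covered bytes.

fn object_bytes_sketch[origin: ImmutOrigin](
    data: Span[Byte, origin]
) -> Span[Byte, origin]:
    # Conceptual only: a real implementation must also skip braces
    # that appear inside string values
    var depth = 0
    for i in range(len(data)):
        if data[i] == `{`:
            depth += 1
        elif data[i] == `}`:
            depth -= 1
            if depth == 0:
                return data[: i + 1]
    return data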

When the user needs the concrete value, they can invoke the get() method, which simply invokes extract_value and returns the result. Aliases for the baseline JSON types are already implemented like so.

comptime LazyInt[origin: ImmutOrigin] = Lazy[
    Int64, origin, _get_int_bytes[origin]
]

As a result users can easily choose particular fields in a struct to be evaluated lazily.

struct Foo[origin: ImmutOrigin](Movable, Writable):
    var a: Int
    var b: LazyFloat[Self.origin]


fn main() raises:
    var j = '{"a": 12, "b": 3.435}'

    var f = deserialize[Foo[origin_of(j)]](j)

    print(f.a) # 12
    print(f.b.get()) # 3.435

Or just lazily parse an entire arbitrary struct.

struct Foo(Movable, Writable):
    var a: Int
    var b: Float64


fn main() raises:
    var j = '{"a": 12, "b": 3.435}'

    var f = deserialize[Lazy[Foo, origin_of(j)]](j)

    print(f.get().a) # 12

Thanks to @joe for pushing reflection forward in Mojo. I have been having a blast seeing how far I can push these new features!

If anyone would like to try out these new features, you can depend on the emberjson git repo on Mojo nightly directly, using the pixi-build Mojo backend (thank you @duck_tape):
emberjson = {git = "https://github.com/bgreni/EmberJson.git"}


Wow, really cool!

One question concerning syntax. Did you consider using [] (__getitem__()) instead of .get() for lazy evaluation? This is common in Mojo for resolving an indirection like in Pointer etc. and would make the code more concise:

print(f.b[]) 
#vs.
print(f.b.get())

print(f[].a) 
#vs.
print(f.get().a)

Really amazing work (as I said, I use a limited version of your reflection based recursive deserialization idea to decode DuckDB query results into Mojo types).

With these ingredients, a trait like JsonDeserializable, static reflection and __extension I feel like we’re getting close to having actual type class support (including some automatic type class derivation support for structs) in Mojo:

You can define some capability in a trait (like from_json in JsonDeserializable), then implement them for any concrete type or let the compiler do it for us via reflection (_default_deserialize).

With __extension we can add JsonDeserializable to any struct after the fact (i.e. for stdlib types), enabling the ad-hoc polymorphism of type classes.

But the real power comes from the fact that once we implement JsonDeserializable for generic types like List (in fact any generic struct), we now can deserialize into List[T] for any T that implements JsonDeserializable itself.

I think this is a really powerful concept and is heavily used in languages like Haskell or Scala. In Scala, essentially all JSON libraries like circe use type classes.


I did, but I wasn't sure if that was appropriate, since it is a much more expensive operation than [] usually represents.


But the real power comes from the fact that once we implement JsonDeserializable for generic types like List (in fact any generic struct), we now can deserialize into List[T] for any T that implements JsonDeserializable itself.

This actually already works: EmberJson/emberjson/_deserialize/reflection.mojo at main · bgreni/EmberJson · GitHub

And even better, T can be any struct, even one that doesn't implement the trait, as long as it eventually decomposes into compliant types.
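
For instance, something like this should work (Point is a hypothetical type that never implements the trait itself):

from emberjson import *


struct Point(Movable):
    var x: Int
    var y: Int


fn main() raises:
    # List implements JsonDeserializable generically, and Point
    # decomposes into compliant field types
    var pts = deserialize[List[Point]]('[{"x": 1, "y": 2}, {"x": 3, "y": 4}]')
    print(pts[0].y)  # 2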

I’ve also implemented the trait for IntLiteral and FloatLiteral, so one could do something like turn version mismatches into a parsing error.

struct APIV1(Movable):
    var a: Int
    var version: IntLiteral[(1).value]


fn main() raises:
    var j = '{"a": 12, "version": 1}'
    var j2 = '{"a": 12, "version": 2}'  # fails

    var f = deserialize[APIV1](j)

    print(f.a)

I’ve been playing around with this concept more today, and it turns out you can even compensate for the lack of enums and sum types in Mojo (for now):

var o1 = deserialize[OneOf[String, "red", "green", "blue"]]('"red"')
assert_equal(o1.value, "red")