Structured Mojo Kernels - Context Managers vs RAII or Linear Types

I’ve just finished reading through Modular: Structured Mojo Kernels Part 1 - Peak Performance, Half the Code, and I was wondering why context managers were chosen over linear types for this. Context managers force resources to be released in the reverse of the order in which they were acquired, which is often sub-optimal. With linear types, simple “must release” types are easy to handle and can provide the same sequencing when combined with typestate, while allowing more flexibility in resource acquire/release ordering.
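To make the ordering point concrete, here is a minimal sketch in Python rather than Mojo, used only because its `with` statement has the same scoped enter/exit shape. The names `scoped`, `Handle`, `nested_with`, and `explicit_release` are hypothetical and do not come from the kernels. Python cannot enforce “must release” at compile time the way a linear type would, so `Handle` only approximates the idea with a runtime check.

```python
from contextlib import contextmanager

events = []  # records acquire/release order for inspection


@contextmanager
def scoped(name):
    """Hypothetical scoped resource: released when its `with` block exits."""
    events.append(f"acquire {name}")
    try:
        yield name
    finally:
        events.append(f"release {name}")


class Handle:
    """Hypothetical explicitly released resource (stand-in for a linear type)."""

    def __init__(self, name):
        self.name = name
        self.released = False
        events.append(f"acquire {name}")

    def release(self):
        # A linear type would reject a missing or repeated release at compile
        # time; this runtime assert only guards against a double release.
        assert not self.released, "double release"
        self.released = True
        events.append(f"release {self.name}")


def nested_with():
    """Nested `with` blocks: release order is forced to be last-in, first-out."""
    events.clear()
    with scoped("a"):
        with scoped("b"):
            pass
    return list(events)


def explicit_release():
    """Explicit release: the caller chooses the order, here first-in, first-out."""
    events.clear()
    a = Handle("a")
    b = Handle("b")
    a.release()  # released first even though it was acquired first
    b.release()
    return list(events)
```

With nested `with` blocks, `b` is always released before `a`. With explicit handles, `a` can be released as soon as it is no longer needed, regardless of what was acquired after it.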

After looking into a lot of the types, many seem like they could be handled with RAII in a similar manner to lock guards. For instance, EpilogueWarpContext’s __exit__ is:

    @always_inline
    fn __exit__(self):
        self.dealloc_barrier.signal_complete()

As far as I can tell, there isn’t anywhere in the public kernels where this couldn’t be shoved into the __del__, and this appeared to be true for the other types I checked. I can see the argument that RAII makes it easy to accidentally drop resources, in which case linear types fill that gap nicely. Linear types also handle raising from a destructor, so that shouldn’t cause issues.

I’m clearly missing something, since I’d expect people to want to keep flexibility around resource management ordering or take advantage of ASAP RAII to clean things up as quickly as possible. Can someone explain the reasoning for this in a little more detail?

Hi, structured kernels already support linear type contexts, and yes, they allow more flexibility in the acquire-release sequence. It is a recent feature and, so far, has only been adopted in the BlackwellBlockwiseFP8MatmulKernel:

    var producer = input_pipeline.producer()
    while work_iter.has_work():
        with work_iter.next() as current:
            work_iter.throttle_signal(ctx.is_first_cta_in_cluster)
    
            for i in range(num_iters):
                # Acquire tiles (waits for consumer to free slot)
                var tiles = producer.acquire_stage()
                Self.load_input_tiles(
                    a_loader,
                    b_loader,
                    a_scales_loader,
                    tiles,
                    ctx.peer_cta_coord,
                    current.coord(),
                    i,
                    ctx.elect_one_cta,
                )
                tiles^.release()  # Advance producer stage
    
    producer.drain()  # Wait for consumer before CTA exits

We’re planning to use this style more extensively in all kernels; we just haven’t gotten around to it yet :sweat_smile:

Thanks for the clarification!
