[Hackathon] Mojo-Lapper: GPU-Accelerated Interval Overlap Detection

TL;DR: I built an interval overlap detection library in Mojo with both CPU and GPU implementations, achieving 60-140x speedups on GPU vs. CPU using the BITS algorithm. The same Lapper data structure is usable in host code and device kernel code.

What I Built

Mojo-Lapper is an implementation of the BITS (Binary Interval Search) algorithm for fast interval overlap detection. Think genomics (finding overlapping genes), time series analysis (detecting event overlaps), or resource scheduling (conflict detection).

There’s still lots of performance to be gained here, but once again it was impressive how code written for the CPU could pretty much just be dropped on a GPU thread for huge gains. The cool thing here is that the Lapper data structure works on both the GPU and CPU, so you can instantiate it in your kernel and call the same methods as on the CPU.

Core Contribution: GPU BITS Algorithm

The main deliverable is a GPU kernel implementing the BITS algorithm for parallel interval counting:

```mojo
fn count_overlaps_kernel(...):
    # Each GPU thread processes one query independently
    var lapper = Lapper[owns_data=False](starts, stops, vals, stops_sorted, length, max_len)
    var thread_id = (block_idx.x * block_dim.x) + thread_idx.x

    # Extract query range for this thread
    var start = keys[thread_id * 2]
    var stop = keys[thread_id * 2 + 1]

    # Execute BITS algorithm: O(log n) dual binary search
    var count = lapper.count(start, stop)
    output[thread_id] = count
```

Algorithm: Each GPU thread independently executes the BITS algorithm using dual binary searches. The BITS approach counts overlaps as total_intervals - excluded_before - excluded_after, avoiding linear scans.
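To make that counting identity concrete, here is a minimal sketch of the dual binary search, assuming integer coordinates, half-open [start, stop) intervals, and two plain sorted arrays (starts sorted by start; stops sorted independently). Illustrative only, not the library's actual implementation:

```mojo
fn lower_bound(values: List[Int], key: Int) -> Int:
    # First index whose value is >= key (values must be sorted ascending).
    var lo = 0
    var hi = len(values)
    while lo < hi:
        var mid = lo + (hi - lo) // 2
        if values[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo

fn bits_count(starts: List[Int], sorted_stops: List[Int], q_start: Int, q_stop: Int) -> Int:
    var n = len(starts)
    # Intervals ending at or before the query start can never overlap it.
    var excluded_before = lower_bound(sorted_stops, q_start + 1)
    # Intervals beginning at or after the query stop can never overlap it.
    var excluded_after = n - lower_bound(starts, q_stop)
    return n - excluded_before - excluded_after
```

Because neither search depends on the other, each GPU thread can run both back to back with no shared state.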

Performance Results

Test hardware: Intel Xeon Platinum 8358 (CPU) vs. NVIDIA A10 (GPU)

| Operation | CPU (ms) | GPU (ms) | Speedup |
| --- | --- | --- | --- |
| Count overlaps (BITS) | 1.12 | 0.008 | 140x |
| Count overlaps (naive) | 10.89 | 0.11 | 99x |

Note: GPU times exclude data transfer; including transfer overhead, the GPU is still 60-120x faster. Speedups compare like with like: CPU BITS vs. GPU BITS, and CPU naive vs. GPU naive.

Example Usage

```mojo
# CPU version - owns memory
var cpu_lapper = Lapper(intervals)
var count = cpu_lapper.count(1200, 1800)  # O(log n) BITS algorithm

# GPU version - external memory pointers
var gpu_lapper = Lapper[owns_data=False](
    starts, stops, vals, stops_sorted, length, max_len
)
var count = gpu_lapper.count(1200, 1800)
# Process thousands of queries in parallel
```
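For context, here is a hedged sketch of what launching the counting kernel from the host could look like with Mojo's `DeviceContext` API. The buffer setup and kernel argument order are assumptions reconstructed from the kernel shown earlier, not the library's exact code:

```mojo
from gpu.host import DeviceContext

def launch_count_overlaps(n_intervals: Int, max_len: Int, n_queries: Int):
    alias BLOCK = 256
    var ctx = DeviceContext()
    # Interval arrays in the layout the GPU Lapper expects.
    var starts = ctx.enqueue_create_buffer[DType.int32](n_intervals)
    var stops = ctx.enqueue_create_buffer[DType.int32](n_intervals)
    var vals = ctx.enqueue_create_buffer[DType.int32](n_intervals)
    var stops_sorted = ctx.enqueue_create_buffer[DType.int32](n_intervals)
    # Flattened (start, stop) query pairs plus one count slot per query.
    var keys = ctx.enqueue_create_buffer[DType.int32](n_queries * 2)
    var output = ctx.enqueue_create_buffer[DType.int32](n_queries)
    # (Fill the device buffers here, e.g. with ctx.enqueue_copy from host arrays.)
    ctx.enqueue_function[count_overlaps_kernel](
        starts, stops, vals, stops_sorted, n_intervals, max_len,
        keys, output,
        grid_dim=(n_queries + BLOCK - 1) // BLOCK,  # one thread per query
        block_dim=BLOCK,
    )
    ctx.synchronize()
```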

Try It Yourself

Full details and code: GitHub Repository with Complete README

```bash
git clone https://github.com/sstadick/mojo-lapper.git
cd mojo-lapper
pixi run t        # Run 50 tests
pixi run bl && ./lapper_bench  # Build and run benchmarks
```

Or Install It!

Until this is on modular-community, you can depend on it directly from git.

```toml
# pixi.toml
[workspace]
authors = ["Mojo <mojo@hotmail.com>"]
channels = ["https://conda.modular.com/max-nightly", "conda-forge", "https://prefix.dev/pixi-build-backends", "https://repo.prefix.dev/modular-community",]
platforms = ["osx-arm64"]
preview = ["pixi-build"]

[package]
name = "example"
version = "0.1.0"

[package.build]
backend = { name = "pixi-build-rattler-build", version = "0.1.*" }

[tasks]

[dependencies]
modular = "=25.4.0"
lapper = {git = "https://github.com/sstadick/mojo-lapper.git"}
```

```mojo
# main.mojo
from lapper.lapper import Lapper, Interval as Iv


def main():
    var intervals: List[Iv] = [Iv(1, 5, 0), Iv(2, 6, 0), Iv(6, 10, 0)]
    var lapper = Lapper(intervals)
    print(lapper.count(2, 6))
```

Bonus Details

I started out wanting to implement a k-ary search and lost most of a day trying and failing to get it working. That wasn't due to Mojo or the GPU; it was just hard, and I can't find a reference implementation out there, only one or two papers about it.

I wanted k-ary search because I assumed that binary search on the GPU was just going to thrash global memory. After giving up on k-ary search, I figured why not try Eytzinger layout, since theoretically it's cache-oblivious. The gpu_bench code shows the result: Eytzinger-layout binary search has the same performance as regular binary search on the GPU. I'm assuming that's mostly dominated by thread divergence. I did not test sorting the queries first, which might improve memory coalescing.
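For reference, this is the textbook shape of an Eytzinger-layout lower bound, assuming a 1-indexed array (index 0 unused) produced by a BFS reordering of the sorted data; it is a sketch, not the gpu_bench code:

```mojo
fn eytzinger_lower_bound(tree: List[Int], key: Int) -> Int:
    # tree[1] is the root; children of node k live at 2k and 2k + 1, so the
    # first few levels of the search all pack into a handful of cache lines.
    var k = 1
    while k < len(tree):
        if tree[k] < key:
            k = 2 * k + 1  # go right: everything in this subtree is < key
        else:
            k = 2 * k      # go left
    # Undo the trailing right turns (plus one step) to land on the last node
    # where we went left: the first element >= key.
    while (k & 1) == 1:
        k = k >> 1
    k = k >> 1
    return k  # Eytzinger index of the lower bound; 0 means no element >= key
```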

Surprisingly (to me), the GPU binary search was still very performant compared to the CPU version.

Once I had a binary search in place for finding the lower bounds of my intervals, I was ready to start on my Lapper data structure, which I've implemented before in Rust. The Lapper is cool because it's simple: just a couple of sorted arrays, and binary search to locate a possible starting point.
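That starting-point trick, sketched from the rust-lapper design (reusing the lower_bound helper from the earlier sketch; names are illustrative): no interval that begins more than the longest interval length before the query start can still reach it, so you binary-search there and scan forward.

```mojo
fn find_overlaps(starts: List[Int], stops: List[Int], max_len: Int,
                 q_start: Int, q_stop: Int) -> List[Int]:
    # Intervals are sorted by start; max_len is the longest interval length,
    # so nothing starting before q_start - max_len can still cover q_start.
    var hits = List[Int]()
    var i = lower_bound(starts, q_start - max_len)
    while i < len(starts) and starts[i] < q_stop:
        if stops[i] > q_start:  # half-open [start, stop) overlap test
            hits.append(i)
        i += 1
    return hits
```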

That all came together nicely and I then had to spend some time working on the BITS algorithm, a fast method for counting all the intervals that your query interval overlaps without having to visit every overlapping interval.

I got that working along with making the Lapper flexible, so that it can take pointers to data and work on the GPU or CPU, which was kind of neat and easy to do.

Counting overlaps is very useful, but getting all the overlapping intervals is also a common use for interval data structures. I wanted to (and will in the near future) implement the following (a sketch of the host-side bookkeeping follows the list):

  1. Load everything on the GPU.
  2. First pass: for each query, run the BITS count kernel to determine how many intervals it overlaps.
  3. Allocate output space based on those counts.
  4. Run the find kernel, collecting found intervals into the output space.
    4.1. Instead of a linear search, do one search per warp and a SIMT scan to find overlapping intervals.
    4.2. Add a SIMD linear scan to the CPU scan.
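The glue between steps 2 and 3 is just an exclusive prefix sum over the per-query counts; a minimal sketch (hypothetical helper name):

```mojo
fn output_offsets(counts: List[Int]) -> List[Int]:
    # Exclusive prefix sum: query i writes its hits starting at offsets[i],
    # and the running total is the size of the flat output buffer to allocate.
    var offsets = List[Int]()
    var total = 0
    for i in range(len(counts)):
        offsets.append(total)
        total += counts[i]
    return offsets
```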

Overall, I'm continually impressed by how easy it has been to get ideas into code, and to get that code onto GPUs, with Mojo :fire:

This is super cool! Is this an offshoot of the work you presented at the last community meeting, or in a different domain?

Thanks Brad! Different domain! Still under the bioinformatics umbrella in general, but different area than alignment.

This is really cool!
