Mojo-Lapper: GPU-Accelerated Interval Overlap Detection
TL;DR: I built an interval overlap detection library in Mojo with both CPU and GPU implementations, achieving 60-140x speedups on the GPU vs. the CPU using the BITS algorithm. The same Lapper data structure is usable in both host code and device kernel code.
What I Built
Mojo-Lapper is an implementation of the BITS (Binary Interval Search) algorithm for fast interval overlap detection. Think genomics (finding overlapping genes), time series analysis (detecting event overlaps), or resource scheduling (conflict detection).
There’s still lots of performance to be gained here, but once again it was impressive how code written for the CPU could pretty much just be dropped on a GPU thread for huge gains. The cool thing here is that the Lapper data structure works on both the GPU and CPU, so you can instantiate it in your kernel and call the same methods as on the CPU.
Core Contribution: GPU BITS Algorithm
The main deliverable is a GPU kernel implementing the BITS algorithm for parallel interval counting:
```mojo
fn count_overlaps_kernel(...):
    # Each GPU thread processes one query independently
    var lapper = Lapper[owns_data=False](starts, stops, vals, stops_sorted, length, max_len)
    var thread_id = (block_idx.x * block_dim.x) + thread_idx.x

    # Extract the query range for this thread
    var start = keys[thread_id * 2]
    var stop = keys[thread_id * 2 + 1]

    # Execute the BITS algorithm: O(log n) dual binary search
    var count = lapper.count(start, stop)
    output[thread_id] = count
```
Algorithm: Each GPU thread independently executes the BITS algorithm using dual binary searches. The BITS approach counts overlaps as `total_intervals - excluded_before - excluded_after`, avoiding linear scans.
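To make the counting trick concrete, here is a minimal sketch of the idea on plain sorted arrays. This is my own illustration, not the library's code; `lower_bound` and `bits_count` are hypothetical helper names, and I'm assuming integer, half-open `[start, stop)` intervals.

```mojo
fn lower_bound(values: List[Int], key: Int) -> Int:
    # Index of the first value >= key (== len(values) if none).
    var lo = 0
    var hi = len(values)
    while lo < hi:
        var mid = (lo + hi) // 2
        if values[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo


fn bits_count(starts: List[Int], stops_sorted: List[Int], start: Int, stop: Int) -> Int:
    # starts: interval starts, sorted; stops_sorted: interval stops, sorted independently.
    var n = len(starts)
    # Intervals whose stop <= query start end before the query begins.
    var excluded_before = lower_bound(stops_sorted, start + 1)
    # Intervals whose start >= query stop begin after the query ends.
    var excluded_after = n - lower_bound(starts, stop)
    return n - excluded_before - excluded_after
```

The two searches are independent and branch-light, which is part of why one thread per query maps well onto the GPU.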
Performance Results
Test Hardware: Intel Xeon Platinum 8358 vs NVIDIA A10
| Operation | CPU (ms) | GPU (ms) | Speedup |
|---|---|---|---|
| Count overlaps (BITS) | 1.12 | 0.008 | 140x |
| Count overlaps (Naive) | 10.89 | 0.11 | 99x |
Note: GPU times exclude data transfer; including transfer overhead, the GPU is still 60-120x faster. Speedups compare like for like: CPU BITS vs. GPU BITS, and CPU naive vs. GPU naive.
Example Usage
```mojo
# CPU version - owns memory
var cpu_lapper = Lapper(intervals)
var count = cpu_lapper.count(1200, 1800)  # O(log n) BITS algorithm

# GPU version - external memory pointers
var gpu_lapper = Lapper[owns_data=False](
    starts, stops, vals, stops_sorted, length, max_len
)
var count = gpu_lapper.count(1200, 1800)
# Process thousands of queries in parallel
```
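For context, driving a kernel like this over a batch of queries follows the standard Mojo `DeviceContext` pattern. Below is a hedged sketch with a stand-in kernel; `demo_kernel`, the buffer sizes, and the query values are placeholders of mine, and the real `count_overlaps_kernel` in the repo also takes the interval arrays and builds a `Lapper[owns_data=False]` per thread.

```mojo
from gpu import block_dim, block_idx, thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer


fn demo_kernel(keys: UnsafePointer[UInt32], out: UnsafePointer[UInt32], n: Int):
    # One thread per query; a real kernel would call
    # lapper.count(keys[tid * 2], keys[tid * 2 + 1]) here.
    var tid = Int(block_idx.x * block_dim.x + thread_idx.x)
    if tid < n:
        out[tid] = keys[tid * 2 + 1] - keys[tid * 2]


def main():
    var num_queries = 1024
    var ctx = DeviceContext()
    var keys_buf = ctx.enqueue_create_buffer[DType.uint32](num_queries * 2)
    var out_buf = ctx.enqueue_create_buffer[DType.uint32](num_queries)

    # Fill the query buffer on the host (start/stop pairs).
    with keys_buf.map_to_host() as host_keys:
        for i in range(num_queries):
            host_keys[i * 2] = UInt32(i * 10)          # query start
            host_keys[i * 2 + 1] = UInt32(i * 10 + 5)  # query stop

    # Launch one thread per query.
    var block = 256
    var grid = (num_queries + block - 1) // block
    ctx.enqueue_function[demo_kernel](
        keys_buf.unsafe_ptr(), out_buf.unsafe_ptr(), num_queries,
        grid_dim=grid, block_dim=block,
    )
    ctx.synchronize()
```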
Try It Yourself
Full details and code: [mojo-lapper on GitHub](https://github.com/sstadick/mojo-lapper) (complete README included).

```bash
git clone https://github.com/sstadick/mojo-lapper
cd mojo-lapper
pixi run t                      # Run 50 tests
pixi run bl && ./lapper_bench   # Build and run benchmarks
```
Or Install it!
Until this is on modular-community, you can depend on it directly from git.
```toml
# pixi.toml
[workspace]
authors = ["Mojo <mojo@hotmail.com>"]
channels = ["https://conda.modular.com/max-nightly", "conda-forge", "https://prefix.dev/pixi-build-backends", "https://repo.prefix.dev/modular-community",]
platforms = ["osx-arm64"]
preview = ["pixi-build"]
[package]
name = "example"
version = "0.1.0"
[package.build]
backend = { name = "pixi-build-rattler-build", version = "0.1.*" }
[tasks]
[dependencies]
modular = "=25.4.0"
lapper = {git = "https://github.com/sstadick/mojo-lapper.git"}
```

```mojo
# main.mojo
from lapper.lapper import Lapper, Interval as Iv


def main():
    var intervals: List[Iv] = [Iv(1, 5, 0), Iv(2, 6, 0), Iv(6, 10, 0)]
    var lapper = Lapper(intervals)
    print(lapper.count(2, 6))
```
Bonus Details
I started out wanting to implement a k-ary search and lost most of a day trying and failing to get it working. That wasn't due to Mojo or the GPU; it was just hard, and I couldn't find a reference implementation out there, only one or two papers about it.
I wanted k-ary search because I assumed that binary search on the GPU was just going to thrash global memory access. After giving up on k-ary search I figured, why not try an Eytzinger layout, which is theoretically cache-oblivious? The `gpu_bench` code shows the result: Eytzinger-layout binary search has the same performance as regular binary search on the GPU. I'm assuming that's mostly dominated by thread divergence. I did not test sorting the queries first, which might improve memory coalescing.
Surprisingly (to me), the GPU binary search was still very performant compared to the CPU version.
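For reference, the Eytzinger idea is just a reordering of a sorted array into BFS (heap) order so a binary search walks down an implicit tree instead of jumping around. Here's a rough sketch of the layout and search; this is my own illustration, not the `gpu_bench` code, and all the function names are hypothetical.

```mojo
fn _fill_eytzinger(sorted_vals: List[Int], mut eytz: List[Int], mut next: Int, k: Int):
    # An in-order walk of the implicit 1-indexed tree writes the sorted
    # values into BFS order: node k's children live at 2k and 2k + 1.
    if k <= len(sorted_vals):
        _fill_eytzinger(sorted_vals, eytz, next, 2 * k)
        eytz[k - 1] = sorted_vals[next]
        next += 1
        _fill_eytzinger(sorted_vals, eytz, next, 2 * k + 1)


fn eytzinger_from_sorted(sorted_vals: List[Int]) -> List[Int]:
    var eytz = List[Int](capacity=len(sorted_vals))
    for _ in range(len(sorted_vals)):
        eytz.append(0)
    var next = 0
    _fill_eytzinger(sorted_vals, eytz, next, 1)
    return eytz


fn eytzinger_lower_bound(eytz: List[Int], key: Int) -> Int:
    # Index (in Eytzinger order) of the smallest value >= key, or -1 if none.
    var k = 1
    var result = -1
    while k <= len(eytz):
        if eytz[k - 1] >= key:
            result = k - 1
            k = 2 * k        # a smaller satisfying value can only be in the left subtree
        else:
            k = 2 * k + 1
    return result
```

My read is that this layout mostly helps CPU caches; on the GPU, divergent per-thread paths through global memory seem to dominate either way, which is consistent with the result above.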
Once I had a binary search algorithm in place for finding the lower bounds of my intervals, I was ready to start on my Lapper data structure, which I've implemented before in Rust. The Lapper is cool because it's simple: just a couple of sorted arrays, and a binary search to locate a possible starting point.
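Roughly, a Lapper-style lookup binary searches the sorted starts for a safe place to begin scanning (`start - max_len`) and then walks forward. Below is a sketch of that general technique, not this repo's exact API; `find_overlaps` is a hypothetical name, and it reuses the `lower_bound` helper sketched earlier.

```mojo
fn find_overlaps(
    starts: List[Int], stops: List[Int], max_len: Int, start: Int, stop: Int
) -> List[Int]:
    # Assumes intervals sorted by start, a parallel stops array, half-open
    # [start, stop) coordinates, and max_len = length of the longest interval.
    var hits = List[Int]()
    # No interval beginning before (start - max_len) can reach the query,
    # so a binary search gives a safe origin for the linear scan.
    var i = lower_bound(starts, start - max_len)  # lower_bound as sketched earlier
    while i < len(starts) and starts[i] < stop:
        if stops[i] > start:
            hits.append(i)
        i += 1
    return hits
```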
That all came together nicely and I then had to spend some time working on the BITS algorithm, a fast method for counting all the intervals that your query interval overlaps without having to visit every overlapping interval.
I got that working, along with making the Lapper flexible enough to take pointers to external data and run on either the GPU or CPU, which was kind of neat and easy to do.
Counting overlaps is very useful, but getting all the overlapping intervals is also a common use for interval data structures. I wanted to (and will in the near future) implement the following:
1. Load everything onto the GPU.
2. First pass: for each query, run the BITS count kernel to determine how many intervals each query will overlap.
3. Allocate output space based on those counts.
4. Run the find kernel, collecting found intervals into the output space.
   - 4.1. Instead of a linear search, do one search per warp and a SIMT scan to find overlapping intervals.
   - 4.2. Add a SIMD linear scan to the CPU scan.
Overall, I'm continually impressed by how easy it has been to get ideas into code, and to get that code onto GPUs, with Mojo.