BlazeSeq: Zero-copy, GPU-friendly FASTQ parsing

Hello all,

I’d like to share a project I started prototyping back in the early days of Mojo (around v0.5). The language has evolved significantly since then, and I am happy that this project has finally evolved out of the prototype stage after a long hiatus.

BlazeSeq is a high-throughput FASTQ parser intended as a foundational building block for bioinformatics pipelines. Check the repository for the full feature list, extensive examples, and benchmarks:
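For readers unfamiliar with the format: FASTQ stores each record as four lines (an `@`-prefixed header, the sequence, a `+` separator, and a quality string of the same length as the sequence). A minimal pure-Python sketch of that record structure, purely illustrative and not BlazeSeq's implementation:

```python
def parse_fastq(text: str):
    """Parse FASTQ text into (header, sequence, quality) tuples.

    Illustrative only: one record = 4 lines
    (@header, sequence, '+' separator, quality string).
    """
    lines = text.strip().split("\n")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i : i + 4]
        assert header.startswith("@") and plus.startswith("+")
        assert len(seq) == len(qual), "quality must match sequence length"
        records.append((header[1:], seq, qual))
    return records

sample = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n"
print(parse_fastq(sample))
```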

Performance

  • Single-threaded, zero-copy parsing: 5+ GB/s (I/O permitting)
  • Thread-safe batched parsing: 4+ GB/s

Multiple iterator APIs

To accommodate different workload patterns, there are three iteration modes:

  • Owned records
  • Zero-copy record references
  • Batched records (SoA layout)

var parser = FastqParser(FileReader(Path("data.fastq")), batch_size=4096)
for record in parser.records():  # Owned records
    pass
for ref_record in parser.ref_records():  # Zero-copy record references, single-threaded
    pass
for batch in parser.batches():  # Batched records (SoA, GPU-friendly)
    pass

GPU-friendly batch API

BlazeSeq provides a Structure-of-Arrays (SoA) batch representation designed for accelerator workflows, with helper methods for Host ↔ Device transfers, so the following is possible:

from gpu.host import DeviceContext

ctx = DeviceContext()
parser = FastqParser(FileReader(Path("file.fastq")))
for batch in parser.batches():
    device_batch = batch.to_device(ctx)
    # run_my_gpu_kernel(device_batch)

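Conceptually, an SoA batch flattens many variable-length records into a few contiguous buffers plus an offsets array, which is what makes bulk host→device copies cheap: one transfer per buffer instead of one per record. A rough pure-Python model of that layout (hypothetical helper names `to_soa_batch` and `record`, not BlazeSeq's actual types):

```python
def to_soa_batch(records):
    """Flatten (seq, qual) records into contiguous buffers + offsets.

    Contiguous buffers can be shipped to a device in a single copy
    each, instead of one small copy per record.
    """
    seq_buf = bytearray()
    qual_buf = bytearray()
    offsets = [0]
    for seq, qual in records:
        seq_buf += seq
        qual_buf += qual
        offsets.append(len(seq_buf))
    return bytes(seq_buf), bytes(qual_buf), offsets

def record(batch, i):
    """Recover record i from the SoA batch via the offsets array."""
    seqs, quals, offs = batch
    return seqs[offs[i]:offs[i + 1]], quals[offs[i]:offs[i + 1]]

batch = to_soa_batch([(b"ACGT", b"IIII"), (b"GG", b"!!")])
print(record(batch, 1))  # → (b'GG', b'!!')
```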
Parallel Gzip decompression

In bioinformatics workloads, decompression of gzipped files is in most cases the bottleneck. BlazeSeq integrates rapidgzip, allowing parallel decompression of standard .gz files. The result is up to 5× faster end-to-end throughput starting from compressed files compared to zlib- or libdeflate-based tools.
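As a rough illustration of why parallel gzip decompression is possible at all: a stream made of independently compressed gzip members (as bgzf produces by construction) can be decompressed member-by-member in parallel; rapidgzip goes further and also parallelizes ordinary single-member streams. A stdlib-only Python sketch of the simpler multi-member case (conceptual only, not the rapidgzip API):

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def compress_chunks(data: bytes, chunk_size: int):
    """Compress each chunk as an independent gzip member."""
    return [gzip.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def parallel_decompress(members, workers=4):
    """Decompress independent gzip members concurrently.

    zlib releases the GIL while decompressing large buffers,
    so threads can give real parallelism here.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(gzip.decompress, members))

data = b"ACGTACGTAA" * 1000
members = compress_chunks(data, 2048)
assert parallel_decompress(members) == data
```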

Python bindings

There are experimental Python bindings available, so you can:

pip install blazeseq

and use it directly from Python for scripting or integration with other tools.


Please check out the repository and the examples; I would be happy to hear any feedback or suggestions.


“Impressive throughput on the zero-copy parsing! Since you’ve moved to a Structure-of-Arrays (SoA) layout for GPU-friendly batches, have you explored how this memory alignment behaves when bypassing the standard kernel stack—specifically using Direct DMA or NVMe-over-Fabrics? I’m curious if BlazeSeq could handle raw Ethernet frames at wire speed if the source wasn’t a .fastq file, but a high-speed SmartNIC buffer.”

I tested the effect of I/O overhead a bit with an in-process memory buffer, a ramfs (to isolate syscall cost), and direct reads from NVMe; the differences are not that large (maybe ~10-15%).
I can't really say anything about streaming parsing because I think this area of Mojo is still evolving.

This is awesome work @MMabrouk! I really want to try wiring up the batch API you have for the GPU to ish.

And cool note on rapidgzip. I tried pugz a longgg time ago and never got it working. If that lives up to the promises made that’s an awesome lib to have some bindings for.


Yes, decompression was always the bottleneck that kills the benchmark numbers.
I played with your zlib binding but it was too slow.
The rapidgzip paper makes big claims about scaling decompression speed with the number of cores:
https://arxiv.org/pdf/2308.08955

I added benchmark numbers for single- vs multi-threaded scenarios against zlib (+kseq) and libdeflate (+ seq_io and needletail):

Multithreaded (4 threads): BlazeSeq/assets/parser_gzip.png at main · MoSafi2/BlazeSeq · GitHub
single threaded: BlazeSeq/assets/parser_gzip_single.png at main · MoSafi2/BlazeSeq · GitHub

The speed scales well up to 12 cores (all I have), and at 12 cores the throughput is around ~5-6× faster than kseq+zlib, but that is a lot of cores to throw at one file :smiley:

It would be great to integrate the batching support with Ish and see how convenient it would be. I tried to split the host→device transfer into staging and copying steps, so double buffering to the GPU could be possible.

If you can get your hands on a relatively recent Intel server CPU, give intel/qatlib on GitHub a try. QAT is generally very, very good at decompressing gzip.