BlazeSeq: Zero-copy, GPU-friendly FASTQ parsing

Hello all,

I’d like to share a project I started prototyping back in the early days of Mojo (around v0.5). The language has evolved significantly since then, and I am happy that this project has finally evolved out of the prototype stage after a long hiatus.

BlazeSeq is a high-throughput FASTQ parser intended as a foundational building block for bioinformatics pipelines. Check the repository for the full feature list, extensive examples, and benchmarks:
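For readers unfamiliar with the format: FASTQ stores each record as four lines (an `@`-prefixed header, the sequence, a `+` separator, and a quality string of the same length as the sequence). A minimal pure-Python sketch of that record structure, purely illustrative and not BlazeSeq's implementation:

```python
def parse_fastq(text: str):
    """Parse FASTQ text into (header, sequence, quality) tuples.

    Illustrative only: one record = 4 lines
    (@header, sequence, '+' separator, quality string).
    """
    lines = text.strip().split("\n")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i : i + 4]
        assert header.startswith("@") and plus.startswith("+")
        assert len(seq) == len(qual), "quality must match sequence length"
        records.append((header[1:], seq, qual))
    return records

sample = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n"
print(parse_fastq(sample))
```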

Performance

  • Single-threaded, zero-copy parsing: 5+ GB/s (I/O permitting)
  • Thread-safe batched parsing: 4+ GB/s

Multiple iterator APIs

To accommodate different workload patterns, there are three iteration modes:

  • Owned records
  • Zero-copy record references
  • Batched records (SoA layout)

var parser = FastqParser(FileReader(Path("data.fastq")), batch_size=4096)
for record in parser.records():  # Owned records
    pass
for ref_record in parser.ref_records():  # Zero-copy record references, single-threaded
    pass
for batch in parser.batches():  # Batched records (SoA, GPU-friendly)
    pass

GPU-friendly batch API

BlazeSeq provides a Structure-of-Arrays (SoA) batch representation designed for accelerator workflows, with helper methods for Host ↔ Device transfers, so the following is possible:

from gpu.host import DeviceContext

ctx = DeviceContext()
parser = FastqParser(FileReader(Path("file.fastq")))
for batch in parser.batches():
    device_batch = batch.to_device(ctx)
    # run_my_gpu_kernel(device_batch)

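Conceptually, an SoA batch flattens many variable-length records into a few contiguous buffers plus an offsets array, which is what makes bulk host→device copies cheap: one transfer per buffer instead of one per record. A rough pure-Python model of that layout (hypothetical helper names `to_soa_batch` and `record`, not BlazeSeq's actual types):

```python
def to_soa_batch(records):
    """Flatten (seq, qual) records into contiguous buffers + offsets.

    Contiguous buffers can be shipped to a device in a single copy
    each, instead of one small copy per record.
    """
    seq_buf = bytearray()
    qual_buf = bytearray()
    offsets = [0]
    for seq, qual in records:
        seq_buf += seq
        qual_buf += qual
        offsets.append(len(seq_buf))
    return bytes(seq_buf), bytes(qual_buf), offsets

def record(batch, i):
    """Recover record i from the SoA batch via the offsets array."""
    seqs, quals, offs = batch
    return seqs[offs[i]:offs[i + 1]], quals[offs[i]:offs[i + 1]]

batch = to_soa_batch([(b"ACGT", b"IIII"), (b"GG", b"!!")])
print(record(batch, 1))  # → (b'GG', b'!!')
```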
Parallel Gzip decompression

In bioinformatics workloads, decompression of gzipped files is in most cases the bottleneck. BlazeSeq integrates rapidgzip, allowing parallel decompression of standard .gz files. The result is up to 5× faster end-to-end throughput starting from compressed files compared to zlib- or libdeflate-based tools.
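As a rough illustration of why parallel gzip decompression is possible at all: a stream made of independently compressed gzip members (as bgzf produces by construction) can be decompressed member-by-member in parallel; rapidgzip goes further and also parallelizes ordinary single-member streams. A stdlib-only Python sketch of the simpler multi-member case (conceptual only, not the rapidgzip API):

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def compress_chunks(data: bytes, chunk_size: int):
    """Compress each chunk as an independent gzip member."""
    return [gzip.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def parallel_decompress(members, workers=4):
    """Decompress independent gzip members concurrently.

    zlib releases the GIL while decompressing large buffers,
    so threads can give real parallelism here.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(gzip.decompress, members))

data = b"ACGTACGTAA" * 1000
members = compress_chunks(data, 2048)
assert parallel_decompress(members) == data
```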

Python bindings

There are experimental Python bindings available, so you can:

pip install blazeseq

and use it directly from Python for scripting or integration with other tools.


Please check out the repository and the examples; I would be happy to hear any feedback or suggestions.


“Impressive throughput on the zero-copy parsing! Since you’ve moved to a Structure-of-Arrays (SoA) layout for GPU-friendly batches, have you explored how this memory alignment behaves when bypassing the standard kernel stack—specifically using Direct DMA or NVMe-over-Fabrics? I’m curious if BlazeSeq could handle raw Ethernet frames at wire speed if the source wasn’t a .fastq file, but a high-speed SmartNIC buffer.”

I tested the effect of I/O overhead a bit with an in-process memory buffer, a ramfs (to isolate syscall cost), and direct reads from NVMe; the differences are not that large (maybe ~10-15%).
I can't really say anything about streaming parsing because I think this area of Mojo is still evolving.

This is awesome work @MMabrouk! I really want to try wiring up the batch API you have for the GPU to ish.

And cool note on rapidgzip. I tried pugz a longgg time ago and never got it working. If that lives up to the promises made that’s an awesome lib to have some bindings for.


Yes, decompression was always the bottleneck that kills the benchmark numbers.
I played with your zlib binding but it was too slow.
The rapidgzip paper makes big claims about scaling decompression speed with the number of cores:
https://arxiv.org/pdf/2308.08955

I added benchmark numbers for single- vs multi-threaded scenarios against zlib (+kseq) and libdeflate (+ seq_io and needletail):

Multithreaded (4 threads): BlazeSeq/assets/parser_gzip.png at main · MoSafi2/BlazeSeq · GitHub
single threaded: BlazeSeq/assets/parser_gzip_single.png at main · MoSafi2/BlazeSeq · GitHub

The speed scales well up to 12 cores (all I have), and at 12 cores the throughput is around ~5-6× faster than kseq+zlib, but that is a lot of cores to throw at one file :smiley:

It would be great to integrate the batching support with Ish and see how convenient it would be. I tried to split the host→device transfer into staging and copying steps, so double buffering to the GPU could be possible.

If you can get your hands on a relatively recent Intel server CPU, give intel/qatlib on GitHub a try. QAT is generally very, very good at decompressing gzip.