[Showcase] A Benchmark with Files and Bytes (standard benchmark warnings apply)

WARNING: It’s Friday. This was just for fun. Please remember that all benchmarks are lies and not to be trusted. Mostly this was a useful exercise for improving the performance and APIs in ExtraMojo.

REMINDER: The point of this “benchmark” wasn’t to optimize the problem, but just to follow the template to write code that I often end up writing for parsing files.

I have a common contrived benchmark I like to try in various languages to get a feel for how they might perform in “real” cases, compared to most other benchmarks out there. The repo is here (note: just the “records” benchmark for now), but I’ll include all relevant code below.

The input data is a big TSV generated with the following command:

awk 'BEGIN{for (i=0; i<2000000; i++){print "abcdef\tghijk\tlmnop\tqrstuv\twxyz1234\tABCDEF\tHIJK\tLMNOP\tQRSTUV\tWXYZ123\tabcdef\tghijk\tlmnop\tqrstuv\twxyz1234\tABCDEF\tHIJK\tLMNOP\tQRSTUV\tWXYZ123\tabcdef\tghijk\tlmnop\tqrstuv\twxyz1234\tABCDEF\tHIJK\tLMNOP\tQRSTUV\tWXYZ123\tabcdef\tghijk\tlmnop\tqrstuv\twxyz1234\tABCDEF\tHIJK\tLMNOP\tQRSTUV\tWXYZ123\tabcdef\tghijk\tlmnop\tqrstuv\twxyz1234\tABCDEF\tHIJK\tLMNOP\tQRSTUV\tWXYZ123\tabcdef\tghijk\tlmnop\tqrstuv\twxyz1234\tABCDEF\tHIJK\tLMNOP\tQRSTUV\tWXYZ123"}}' > big.tsv

The Python reference implementation is the following:

import sys

class Record(object):
    __slots__ = ["name", "count"]
    def __init__(self, name, count):
        self.name = name
        self.count = count

def create_record(vals):
    count = len([val for val in vals if "bc" in val[1:4].lower()])
    return Record(vals[0], count)

def main():
    records = []
    for line in sys.stdin:
        records.append(create_record(line.split('\t')))

    print(sum([r.count for r in records]))

if __name__ == '__main__':
    main()
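As a sanity check on what the program should print, the reference logic can be applied to a single generated line. This is a sketch that re-uses the predicate from the implementation above; the repeated field group mirrors the awk command (six groups of ten fields per line, two million lines):

```python
class Record:
    __slots__ = ["name", "count"]
    def __init__(self, name, count):
        self.name = name
        self.count = count

def create_record(vals):
    # A field matches when the lowercased slice of characters 1..3
    # contains "bc"; only "abcdef" and "ABCDEF" qualify in this data.
    count = len([val for val in vals if "bc" in val[1:4].lower()])
    return Record(vals[0], count)

group = "abcdef\tghijk\tlmnop\tqrstuv\twxyz1234\tABCDEF\tHIJK\tLMNOP\tQRSTUV\tWXYZ123"
line = "\t".join([group] * 6)  # one line of big.tsv
record = create_record(line.split("\t"))
print(record.count)              # 12 matching fields per line (2 per group x 6 groups)
print(record.count * 2_000_000)  # 24000000: the expected program output
```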

The Rust impl is:

use std::io::prelude::*;
use std::io::{stdin, BufReader};

struct Record {
    pub name: String,
    pub count: usize,
}

impl Record {
    pub fn new(name: String, count: usize) -> Record {
        Record { name, count }
    }
}

fn create_record(line: &str) -> Record {
    let mut iter = line.split('\t');
    let name = iter.next().unwrap();
    let count = std::iter::once(name)
        .chain(iter)
        .filter(|s| s[1..4].contains("bc"))
        .count();
    Record::new(name.to_string(), count)
}

fn main() {
    let mut records = vec![];
    let mut buffer = String::new();
    let stdin = stdin();
    let mut input = BufReader::new(stdin.lock());
    while let Ok(bytes_read) = input.read_line(&mut buffer) {
        if bytes_read == 0 {
            break;
        }
        buffer.make_ascii_lowercase();
        records.push(create_record(&buffer));
        buffer.clear();
    }
    let count: usize = records.iter().map(|r| r.count).sum();
    println!("{}", count);
}

And finally, the Mojo impl, using my library ExtraMojo to cover all the things that aren’t in the stdlib yet:

from ExtraMojo.fs.file import FileReader
from ExtraMojo.bstr.bstr import SplitIterator, find, to_ascii_lowercase
from memory import Span

alias TAB = ord("\t")


@value
struct Record:
    var name: String
    var count: Int


fn create_record(line: Span[UInt8]) raises -> Record:
    var name = String()
    var iter = SplitIterator(line, TAB)
    name.write_bytes(iter.peek().value())

    var count = 0
    for value in SplitIterator(line, TAB):
        if find(value[1:4], "bc".as_bytes()):
            count += 1

    return Record(name, count)


fn main() raises:
    var fh = open("/dev/stdin", "r")
    var reader = FileReader(fh^)

    var buffer = List[UInt8]()
    var records = List[Record]()
    while reader.read_until(buffer) != 0:
        to_ascii_lowercase(buffer)
        records.append(create_record(buffer))

    var count = 0
    for record in records:
        count += record[].count
    print(count)

Results from hyperfine:

# Run on a MacBook Pro with an M1
❯ hyperfine --warmup 3 '< big.tsv ./mojo/count_lines' '< big.tsv ./rust/target/release/count_lines' 'time < big.tsv python3 ./count_lines.py'
Benchmark 1: < big.tsv ./mojo/count_lines
  Time (mean ± σ):      2.070 s ±  0.012 s    [User: 1.970 s, System: 0.087 s]
  Range (min … max):    2.061 s …  2.100 s    10 runs

Benchmark 2: < big.tsv ./rust/target/release/count_lines
  Time (mean ± σ):      2.453 s ±  0.015 s    [User: 2.315 s, System: 0.121 s]
  Range (min … max):    2.439 s …  2.486 s    10 runs

Benchmark 3: time < big.tsv python3 ./count_lines.py
  Time (mean ± σ):     11.492 s ±  0.098 s    [User: 11.178 s, System: 0.196 s]
  Range (min … max):   11.343 s … 11.615 s    10 runs

Summary
  < big.tsv ./mojo/count_lines ran
    1.18 ± 0.01 times faster than < big.tsv ./rust/target/release/count_lines
    5.55 ± 0.06 times faster than time < big.tsv python3 ./count_lines.py

Mojo is FAST. I did take the time to implement a SIMD to-ASCII-lowercase, as well as a SIMD search for the first occurrence of a character in a buffer. If I turn both of those off, the perf is a tad slower than Rust’s. However, I think it’s fair to leave them in, since writing SIMD is one of the major selling points of Mojo.
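The SIMD to-ASCII-lowercase is, at its core, a branch-free byte transform. Here is a scalar Python sketch of the same idea, not the actual ExtraMojo implementation; the SIMD version computes the in-range mask for a whole vector of bytes at once and ORs `0x20` into the masked lanes:

```python
def to_ascii_lowercase(buf: bytearray) -> None:
    # ASCII 'A'..'Z' differ from 'a'..'z' only in bit 0x20, so
    # lowercasing is "OR in 0x20 when the byte is in 'A'..'Z'".
    for i, b in enumerate(buf):
        is_upper = ord("A") <= b <= ord("Z")
        buf[i] = b | (0x20 if is_upper else 0)

data = bytearray(b"ABCdef\tGHIjk")
to_ascii_lowercase(data)
print(data)  # bytearray(b'abcdef\tghijk')
```

Non-letter bytes (tabs, digits, already-lowercase letters) fall outside the `'A'..'Z'` range and pass through unchanged, which is what makes the mask-and-OR formulation safe to apply to the whole buffer.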


After fixing my SIMD code to use aligned reads, and switching Rust to memchr so the comparison was fairer, the benchmarks are as follows:

❯ hyperfine '< big.tsv python3 ./count_lines.py' '< big.tsv ./mojo/count_lines' '< big.tsv ./rust/target/release/count_lines'
Benchmark 1: < big.tsv python3 ./count_lines.py
  Time (mean ± σ):     12.689 s ±  0.092 s    [User: 12.462 s, System: 0.184 s]
  Range (min … max):   12.590 s … 12.892 s    10 runs
 
Benchmark 2: < big.tsv ./mojo/count_lines
  Time (mean ± σ):      1.333 s ±  0.010 s    [User: 1.259 s, System: 0.070 s]
  Range (min … max):    1.325 s …  1.353 s    10 runs
 
Benchmark 3: < big.tsv ./rust/target/release/count_lines
  Time (mean ± σ):      1.559 s ±  0.010 s    [User: 1.455 s, System: 0.098 s]
  Range (min … max):    1.546 s …  1.574 s    10 runs
 
Summary
  < big.tsv ./mojo/count_lines ran
    1.17 ± 0.01 times faster than < big.tsv ./rust/target/release/count_lines
    9.52 ± 0.10 times faster than < big.tsv python3 ./count_lines.py

That puts Mojo about 17% faster than Rust and close to 10x faster than Python.

That said, there are some flaws with this benchmark and data that I think bias it in Mojo’s favor: my naive methods work better than the more complex Rust methods on simple, small data.

Based on previous experiments, vectorized operations in Rust on ARM Macs are less performant than in Mojo. It would be interesting to see how those results differ on Intel x86-64.


Interesting. I don’t have an Intel machine on hand, but I will look for an opportunity! It’s odd that it’s faster. I think it partially has to do with my “naive” search being a SIMD search for the first character of the substring, followed by a serial check for the subsequent characters, which works really well when the needle is only two characters long.
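The "first character via fast scan, rest verified serially" strategy can be sketched in Python, with `bytes.find` standing in for the SIMD memchr (this is an illustration of the idea, not the ExtraMojo code):

```python
def find_substring(haystack: bytes, needle: bytes) -> int:
    # Scan for the needle's first byte with a fast search
    # (bytes.find here plays the role of the SIMD memchr),
    # then verify the remaining bytes serially. With a two-byte
    # needle like b"bc" the serial check is a single comparison.
    first = needle[0]
    i = haystack.find(first)
    while i != -1:
        if haystack[i:i + len(needle)] == needle:
            return i
        i = haystack.find(first, i + 1)
    return -1

print(find_substring(b"abcdef", b"bc"))    # 1
print(find_substring(b"wxyz1234", b"bc"))  # -1
```

For short needles over data with few false first-byte hits, almost all the work happens in the vectorized scan, which is why this beats more sophisticated substring algorithms on this benchmark's data.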

Now a bit more Pythonic with context managers (ExtraMojo 0.4.0):

from ExtraMojo.fs.file import FileReader
from ExtraMojo.bstr.bstr import SplitIterator, find, to_ascii_lowercase
from memory import Span

alias TAB = ord("\t")


@value
struct Record:
    var name: String
    var count: Int


fn create_record(line: Span[UInt8]) raises -> Record:
    var name = String()
    var iter = SplitIterator(line, TAB)
    name.write_bytes(iter.peek().value())

    var count = 0
    for value in SplitIterator(line, TAB):
        if find(value[1:4], "bc".as_bytes()):
            count += 1

    return Record(name, count)


fn main() raises:
    var buffer = List[UInt8]()
    var records = List[Record]()
    with FileReader(open("/dev/stdin", "r")) as reader:
        while reader.read_until(buffer) != 0:
            to_ascii_lowercase(buffer)
            records.append(create_record(buffer))

    var count = 0
    for record in records:
        count += record[].count
    print(count)

Here are the results on an Intel machine: i7-13700K, Ubuntu 24.04, rustc 1.83, Mojo 24.6.

Here are the hyperfine results after fixing a small import bug in the Rust script (Rust 1.83, Mojo 24.6):
Benchmark 1: < big.tsv ./mojo/count_lines
  Time (mean ± σ):      1.216 s ±  0.007 s    [User: 1.123 s, System: 0.095 s]
  Range (min … max):    1.209 s …  1.225 s    10 runs
 
Benchmark 2: < big.tsv ./rust/target/release/count_lines
  Time (mean ± σ):      1.006 s ±  0.008 s    [User: 0.901 s, System: 0.105 s]
  Range (min … max):    0.997 s …  1.016 s    10 runs
 
Summary
  < big.tsv ./rust/target/release/count_lines ran
    1.21 ± 0.01 times faster than < big.tsv ./mojo/count_lines

As expected, the Rust implementation is more performant on Intel x86 than on ARM Mac, and it now beats Mojo.


Thanks for running that!

For any future readers:

I’ve updated ExtraMojo with a better implementation of memchr and some small improvements to the buffered reader. I’d expect ExtraMojo to now perform better on Intel, as it’s around 30-40% faster on my M-series machine.


Here are the results from my Intel machine (same specs as before):

hyperfine --warmup 3 '< big.tsv ./mojo/count_lines' '< big.tsv ./rust/target/release/count_lines'
Benchmark 1: < big.tsv ./mojo/count_lines
  Time (mean ± σ):     893.3 ms ±   8.0 ms    [User: 793.7 ms, System: 101.4 ms]
  Range (min … max):   883.1 ms … 911.5 ms    10 runs
 
Benchmark 2: < big.tsv ./rust/target/release/count_lines
  Time (mean ± σ):      1.012 s ±  0.004 s    [User: 0.897 s, System: 0.116 s]
  Range (min … max):    1.004 s …  1.020 s    10 runs

Summary
  < big.tsv ./mojo/count_lines ran
    1.13 ± 0.01 times faster than < big.tsv ./rust/target/release/count_lines

The memchr function is much better now, with a ~40% increase in performance. I think I will use it as well :smiley:


Nice! That’s better than I was expecting on Intel!