HyperLogLog in Mojo

:fire: HyperLogLog in Mojo – let’s make it fly

TL;DR
I’ve implemented HyperLogLog in Mojo to stress-test the language on bit-twiddly, SIMD-friendly code.
Repo → GitHub - axiomhq/mojo-hyperloglog
I need fresh eyes on perf hotspots, API feel, and a solid benchmark story.

What’s inside

  • HLL core in ~300 LOC Mojo (precision 4–16)
  • LogLog-Beta bias correction :white_check_mark:
  • Sparse ⇆ dense switch for tiny vs. huge sets
  • Merge-ready registers for sharded counts
  • Raw-byte serde
  • Unit tests + micro-bench skeleton (benchmark_hyperloglog.mojo)

Code stays vanilla for readability - plenty of low-hanging fruit for optimisation.
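For anyone who hasn't met HLL before, the core update step is roughly this - a standalone sketch for context only, not the exact code in the repo (the constant and names are made up):

from bit import count_leading_zeros

fn main():
    alias P = 14                                   # precision: 2^14 registers
    alias m = 1 << P
    var registers = List[Int](capacity=m)
    for _ in range(m):
        registers.append(0)

    var h: UInt64 = 0x9E3779B97F4A7C15             # stand-in for a hashed element
    var bucket = Int(h >> UInt64(64 - P))          # top P bits pick the register
    var rest = (h << UInt64(P)) | UInt64(1 << (P - 1))  # guard bit bounds the zero count
    var rank = Int(count_leading_zeros(rest)) + 1  # position of the first 1-bit
    if rank > registers[bucket]:
        registers[bucket] = rank                   # each register keeps the max rank it has seen
    print("bucket:", bucket, "rank:", rank)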

Why Mojo?

  • Python vibes, LLVM muscle
  • First-class SIMD intrinsics
  • Predictable memory layout (cache-friendly)
  • Compile-time generics incoming - perfect for prob-DS hacks

Where I need help

| Area       | Current state       | What would be awesome                                                                      |
| ---------- | ------------------- | ------------------------------------------------------------------------------------------ |
| Hashing    | plain murmur3_64    | Swap in XXH3/Metro64 - compare speed + collisions                                           |
| Bit maths  | naïve loops         | vector[i] = max(vector[i], r) via 128/256-bit SIMD (see the sketch right after this table)  |
| Benchmarks | single-thread micro | Repeatable harness vs. CPython, Go (axiomhq/hyperloglog), Rust                              |
| Docs       | tiny README         | Diagrams / blog-style walkthrough                                                           |
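For the bit-maths row, the register merge could use Mojo's SIMD type directly - a toy sketch with an assumed block width of 16, not code from the repo:

fn merge_block(a: SIMD[DType.uint8, 16], b: SIMD[DType.uint8, 16]) -> SIMD[DType.uint8, 16]:
    # element-wise max over 16 registers at once; a dense merge would walk the
    # register array in blocks like this instead of byte by byte
    return max(a, b)

fn main():
    var a = SIMD[DType.uint8, 16](1)
    var b = SIMD[DType.uint8, 16](3)
    print(merge_block(a, b))  # every lane becomes 3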

(Bias correction is done - no estimator bikeshedding right now.)

How to run

# Needs Mojo SDK ≥ 0.7.0
git clone https://github.com/axiomhq/mojo-hyperloglog
cd mojo-hyperloglog
mojo run test_hyperloglog.mojo        # all green
mojo run benchmark_hyperloglog.mojo   # ops/s + error stats

Contributing

  • PRs / discussions for micro-opts, API tweaks, benchmark scripts (Go, Rust, C++… bring ’em).
  • Faster and cleaner? :100:
  • MIT licence.

Roadmap (up for debate)

  1. Vectorised add_hash (AVX-512 / Neon flag)
  2. Streaming memory-profiler hook
  3. Blog post: Mojo vs. Go HLL - war stories & numbers

Big thanks in advance - curious what the Mojo crew can squeeze out!

4 Likes

Good job!
Is there a reason for directly calling __setitem__ like:

self.registers.__setitem__(bucket, zeros)

rather than using syntax sugar:

self.registers[bucket] = zeros

Magic methods are not meant to be called directly; that’s why they’re ugly.

Good call! Changing it… or if you want you can create a PR :folded_hands:

Hey Seif,

Nice idea - it's actually the first time I've heard of HyperLogLog (shame on me :grinning_face:), and it seems like a very interesting and cool data structure.

At first glance, I would suggest converting p (the precision) into a type parameter. This way you can turn many runtime vars into compile-time aliases, e.g. max_zeros, alpha, and the multiple instances of m. The get_beta function should also take the precision as a compile-time argument so you can add @parameter in front of the if/else cascade; the compiler will then emit only the specialised computation and you eliminate the branching completely. For the computation of alpha in the __init__ method of HyperLogLog you would use the same trick, and a similar trick can be applied in the cardinality method. You can also annotate the for i in range(m): with @parameter, as m would be compile-time known, and this will unroll the loop. Those are micro-optimisations, so I am not sure how much benefit they will provide, but they are the kind of things that are unique to Mojo.
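Something like this is what I have in mind - hypothetical names, not your actual code:

fn alpha[P: Int]() -> Float64:
    """Bias-correction constant, fully resolved at compile time for a fixed P."""
    alias m = 1 << P
    @parameter
    if P == 4:
        return 0.673
    elif P == 5:
        return 0.697
    elif P == 6:
        return 0.709
    else:
        return 0.7213 / (1.0 + 1.079 / Float64(m))

fn main():
    # only the P == 14 branch is compiled in; the call folds to a constant
    print(alpha[14]())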

I also see that you implemented murmur3_64 in the unit test. I would suggest using the hash function shipped with the standard library. The nightly default implementation uses AHash as the hashing function, which should be faster and provide better-quality hash values.

You are also welcome to implement Murmur3_64 as a Hasher and compare it against AHasher. The hash function in the stdlib allows swapping the hashing algorithm.

I would also recommend thinking about using the SIMD type to vectorise some computations. If it is OK with you, I could also create a PR at some point.

1 Like

I’d love a PR. Thank you so much for the suggestions :smiley:

Hello, thanks, that is awesome.
I too did not know about this very useful HyperLogLog,
and also learned about the harmonic average along the way :+1:

Performance Summary

Modified Implementation (Compile-time parameters):

  • Sparse add: 0.0159 ms
  • Sparse cardinality: 0.0157 ms
  • Sparse merge: 0.0533 ms
  • Dense add: 1.0367 ms
  • Dense cardinality: 1.0686 ms
  • Dense merge: 2.1071 ms

Original Implementation (Runtime parameters):

  • Sparse add: 0.0175 ms
  • Sparse cardinality: 0.0173 ms
  • Sparse merge: 0.0519 ms
  • Dense add: 1.1174 ms
  • Dense cardinality: 1.1582 ms
  • Dense merge: 2.2151 ms

Performance Improvements:

  • Sparse add: 9.4% faster
  • Sparse cardinality: 9.2% faster
  • Sparse merge: 2.7% slower
  • Dense add: 7.2% faster
  • Dense cardinality: 7.7% faster
  • Dense merge: 4.9% faster

The compile-time parameter optimization provides 5-10% performance improvement for most operations, with dense operations showing the most consistent gains.

3 Likes

Very nice!

I have another idea for you: vectorizing the beta function. Here is something I tried out on my machine:

from math import log2

fn get_beta2[P: Int](ez: Float64) -> Float64:
    """Get the beta constant for a given precision and empirical zero ratio."""
    @parameter
    if 4 <= P <= 16:
        var zl = log2(ez + 1)
        var zl2 = zl * zl
        var zl3 = zl2 * zl
        var zl4 = zl3 * zl
        var zl5 = zl4 * zl
        var zl6 = zl5 * zl
        var zl7 = zl6 * zl
        
        var z = SIMD[DType.float64, 8](ez, zl, zl2, zl3, zl4, zl5, zl6, zl7)

        @parameter
        if P == 4:
            var c = SIMD[DType.float64, 8](-0.582581413904517, -1.935300357560050, 
                                            11.079323758035073, -22.131357446444323, 
                                            22.505391846630037, -12.000723834917984, 
                                            3.220579408194167, -0.342225302271235)
            return (c * z).reduce_add()
        elif P == 5:
            var c = SIMD[DType.float64, 8](-0.7518999460733967, -0.9590030077748760, 
                                            5.5997371322141607, -8.2097636999765520, 
                                            6.5091254894472037, -2.6830293734323729, 
                                            0.5612891113138221, -0.0463331622196545)
            return (c * z).reduce_add()
        elif P == 6:
            var c = SIMD[DType.float64, 8](29.8257900969619634, -31.3287083337725925, 
                                            -10.5942523036582283, -11.5720125689099618, 
                                            3.8188754373907492, -2.4160130328530811, 
                                            0.4542208940970826, -0.0575155452020420)
            return (c * z).reduce_add()
        elif P == 7:
            var c = SIMD[DType.float64, 8](2.8102921290820060, -3.9780498518175995, 
                                            1.3162680041351582, -3.9252486335805901, 
                                            2.0080835753946471, -0.7527151937556955, 
                                            0.1265569894242751, -0.0109946438726240)
            return (c * z).reduce_add()
        elif P == 8:
            var c = SIMD[DType.float64, 8](1.00633544887550519, -2.00580666405112407, 
                                            1.64369749366514117, -2.70560809940566172, 
                                            1.39209980244222598, -0.46470374272183190, 
                                            0.07384282377269775, -0.00578554885254223)
            return (c * z).reduce_add()
        elif P == 9:
            var c = SIMD[DType.float64, 8](-0.09415657458167959, -0.78130975924550528, 
                                            1.71514946750712460, -1.73711250406516338, 
                                            0.86441508489048924, -0.23819027465047218, 
                                            0.03343448400269076, -0.00207858528178157)
            return (c * z).reduce_add()
        elif P == 10:
            var c = SIMD[DType.float64, 8](-0.25935400670790054, -0.52598301999805808, 
                                            1.48933034925876839, -1.29642714084993571, 
                                            0.62284756217221615, -0.15672326770251041, 
                                            0.02054415903878563, -0.00112488483925502)
            return (c * z).reduce_add()
        elif P == 11:
            var c = SIMD[DType.float64, 8](-0.432325553856025, -0.108450736399632, 
                                            0.609156550741120, -0.0165687801845180, 
                                            -0.0795829341087617, 0.0471830602102918, 
                                            -0.00781372902346934, 0.000584268708489995)
            return (c * z).reduce_add()
        elif P == 12:
            var c = SIMD[DType.float64, 8](-0.384979202588598, 0.183162233114364, 
                                            0.130396688841854, 0.0704838927629266, 
                                            -0.0089589397146453, 0.0113010036741605, 
                                            -0.00194285569591290, 0.000225435774024964)
            return (c * z).reduce_add()
        elif P == 13:
            var c = SIMD[DType.float64, 8](-0.41655270946462997, -0.22146677040685156, 
                                            0.38862131236999947, 0.45340979746062371, 
                                            -0.36264738324476375, 0.12304650053558529, 
                                            -0.01701540384555510, 0.00102750367080838)
            return (c * z).reduce_add()
        elif P == 14:
            var c = SIMD[DType.float64, 8](-0.371009760230692, 0.00978811941207509, 
                                            0.185796293324165, 0.203015527328432, 
                                            -0.116710521803686, 0.0431106699492820, 
                                            -0.00599583540511831, 0.000449704299509437)
            return (c * z).reduce_add()
        elif P == 15:
            var c = SIMD[DType.float64, 8](-0.38215145543875273, -0.89069400536090837, 
                                            0.37602335774678869, 0.99335977440682377, 
                                            -0.65577441638318956, 0.18332342129703610, 
                                            -0.02241529633062872, 0.00121399789330194)
            return (c * z).reduce_add()
        elif P == 16:
            var c = SIMD[DType.float64, 8](-0.37331876643753059, -1.41704077448122989, 
                                            0.40729184796612533, 1.56152033906584164, 
                                            -0.99242233534286128, 0.26064681399483092, 
                                            -0.03053811369682807, 0.00155770210179105)
            return (c * z).reduce_add()
        # Unreachable
        alias num_registers = 1 << P
        return 0.7213 / (1.0 + 1.079 / Float64(num_registers)) 
    else:
        # For larger register counts, use the standard beta correction
        alias num_registers = 1 << P
        return 0.7213 / (1.0 + 1.079 / Float64(num_registers)) 

It seems to be a bit faster for dense cardinality computation, although I tested it on an Apple M1, which has only 128-bit wide SIMD instructions; on an architecture with 512-bit SIMD it should be even faster.

Wanna do a PR :smiley: ?

Sadly I am not sure I will have time to make a proper PR, as I would need to do a bit more testing, documentation, etc… The experiment above was just a 20-minute distraction; a PR is more of a commitment :grinning_face_with_smiling_eyes:. And then I would feel bad, as I would be starting something new instead of working on the Mojo standard library, maintaining my old Mokk projects, or porting Dagr (GitHub - mzaks/dagr: A high-performance, type-safe binary serialization framework for Swift.) to Mojo.

Hey Seif,

I spent a bit more time on HyperLogLog; I wanted to see if I could optimise the cardinality for dense sets. And I think it worked quite well, although some of the stuff I did was a bit questionable, like reducing the precision while calculating the count of empty registers and the beta.

Sorry, no direct PR again, but I have a gist:

First things first, I isolated the benchmark to measure just the computation of cardinality; before, it measured cardinality + addition of elements to the set. See: Mojo HyperLogLog performance optimisation · GitHub

Then I tried branch elimination. See: Mojo HyperLogLog performance optimisation · GitHub

I rewrote the computation logic to be branchless and introduced partial loop unrolling.
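Roughly, the branchless part boils down to something like this (a sketch with made-up names; the gist is more involved and also unrolls the loop):

fn count_empty(registers: List[UInt8]) -> Float64:
    """Count empty registers without a data-dependent branch in the loop body."""
    var ez: Float64 = 0.0
    for i in range(len(registers)):
        # contributes 1.0 only when the register is 0, via arithmetic instead of `if`
        ez += 1.0 - min(Float64(Int(registers[i])), 1.0)
    return ez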

After that, as we already have a branchless algorithm, I went for vectorisation, which did not bring much with Float64 but did reduce the execution time quite a bit when I switched to Float32. This might be because I am on an Apple M1 and the chip cannot vectorise Float64 very well, so it might work better on chips with better (wider) SIMD support. See: Mojo HyperLogLog performance optimisation · GitHub
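To give an idea of the vectorised inner step: it is essentially 2^-reg over a whole block of registers at once, something like this toy version (assumed width of 16, loading the block from the register list omitted):

from math import exp2

fn harmonic_block(regs: SIMD[DType.float32, 16]) -> Float32:
    # 2^-reg for 16 registers in one go, then a horizontal sum
    return exp2(-regs).reduce_add()

fn main():
    var regs = SIMD[DType.float32, 16](2.0)  # pretend every register holds rank 2
    print(harmonic_block(regs))              # 16 * 2^-2 = 4.0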

So, all in all, here are the numbers on my Apple M1 chip.

Before optimisations, but already with Float32 instead of Float64:

| name                        | met (ms)             | iters |
| --------------------------- | -------------------- | ----- |
| benchmark_cardinality_dense | 0.032504722316480036 | 35364 |
| benchmark_cardinality_dense | 0.03001104703506339  | 41097 |
| benchmark_cardinality_dense | 0.02934371618741598  | 30499 |
| benchmark_cardinality_dense | 0.03166067746686303  | 37345 |
| benchmark_cardinality_dense | 0.030180023398654582 | 41028 |

With branch elimination:

| name                        | met (ms)             | iters |
| --------------------------- | -------------------- | ----- |
| benchmark_cardinality_dense | 0.023303263947348284 | 52268 |
| benchmark_cardinality_dense | 0.02517179833737572  | 54011 |
| benchmark_cardinality_dense | 0.02423603127368168  | 54103 |
| benchmark_cardinality_dense | 0.02228566186411217  | 51778 |
| benchmark_cardinality_dense | 0.02306314483847325  | 35474 |

With vectorisation and reduction to Float16 while computing exp2 for reg:

| name                        | met (ms)             | iters  |
| --------------------------- | -------------------- | ------ |
| benchmark_cardinality_dense | 0.0096209            | 100000 |
| benchmark_cardinality_dense | 0.007627987633937971 | 180494 |
| benchmark_cardinality_dense | 0.007857564241366358 | 181036 |
| benchmark_cardinality_dense | 0.007399178592552056 | 180909 |
| benchmark_cardinality_dense | 0.007546237119402241 | 180407 |

I also ran the test for 100K elements; the result is:
Estimated: 101094 Actual: 100000 Error margin: 0.01094

So it is still under the 2% error margin.

Hey Maxim,

Thanks so much, and sorry for the late reply! I picked up your changes and merged them. I wasn't sure about the Float16, but as long as it doesn't push the error rate to >= 2%, I think we are good.

Thanks again

1 Like

Oh, I see you just incorporated the changes I described here. I also created a PR (isolated benchmarks and improved dense cardinality computation by about 10x by mzaks · Pull Request #1 · axiomhq/mojo-hyperloglog · GitHub), which had a bit of refactoring, specifically in the benchmark code.

Anyways, I have one more thing which could be improved. In the vectorised cardinality algorithm I hard-coded the upper bound of vector_width to 64, which works fine on Mac M-series chips, but when I ran the benchmark on an AMD Zen 5 laptop the best width was 4096, so I think we need to introduce max_width as a parameter which defaults to `simd_byte_width() * 4`.
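If it helps, the rough shape of that default - assuming the stdlib helper is sys.info.simdbytewidth(); the exact spelling may differ between Mojo versions:

from sys.info import simdbytewidth

fn main():
    # derive the cap from the hardware instead of hard-coding 64
    var max_width = simdbytewidth() * 4
    print("default max vector width:", max_width)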

Should I just create a PR on top of the current main and close my previous PR?

Yes please! I’d love a PR :heart: