HyperLogLog in mojo

:fire: HyperLogLog in Mojo – let’s make it fly

TL;DR
I’ve implemented HyperLogLog in Mojo to stress-test the language on bit-twiddly, SIMD-friendly code.
Repo → GitHub - axiomhq/mojo-hyperloglog
I need fresh eyes on perf hotspots, API feel, and a solid benchmark story.

What’s inside

  • HLL core in ~300 LOC Mojo (precision 4–16)
  • LogLog-Beta bias correction :white_check_mark:
  • Sparse ⇆ dense switch for tiny vs. huge sets
  • Merge-ready registers for sharded counts
  • Raw-byte serde
  • Unit tests + micro-bench skeleton (benchmark_hyperloglog.mojo)

Code stays vanilla for readabilit - plenty of low-hanging fruit for optimisation.

Why Mojo?

  • Python vibes, LLVM muscle
  • First-class SIMD intrinsics
  • Predictable memory layout (cache-friendly)
  • Compile-time generics incoming - perfect for prob-DS hacks

Where I need help

Area Current state What would be awesome
Hashing plain murmur3_64 Swap in XXH3/Metro64 - compare speed + collision
Bit maths naïve loops vector[i] = max(vector[i], r) via 128/256-bit SIMD
Benchmarks single-thread micro Repeatable harness vs. CPython, Go (axiomhq/hyperloglog), Rust
Docs tiny README Diagrams / blog-style walkthrough

(Bias correction is done - no estimator bikeshedding right now.)

How to run

# Needs Mojo SDK ≥ 0.7.0
git clone https://github.com/axiomhq/mojo-hyperloglog
cd mojo-hyperloglog
mojo run test_hyperloglog.mojo        # all green
mojo run benchmark_hyperloglog.mojo   # ops/s + error stats

Contributing

  • PRs / discussions for micro-opts, API tweaks, benchmark scripts (Go, Rust, C++… bring ’em).
  • Faster and clean? :100:
  • MIT licence.

Roadmap (up for debate)

  1. Vectorised add_hash (AVX-512 / Neon flag)
  2. Streaming memory-profiler hook
  3. Blog post: Mojo vs. Go HLL - war stories & numbers

Big thanks in advance - curious what the Mojo crew can squeeze out!

3 Likes

Good job!
Is there a reason for directly calling __setitem__ like:

self.registers.__setitem__(bucket, zeros)

rather than using syntax sugar:

self.registers[bucket] = zeros

Magic methods are not meant to be called directly; that’s why they’re ugly.

good call! changing it… or if you want you can create a PR :folded_hands:

Hey Seif,

nice idea, it’s actually first time that I heard of HyperLogLog (shame on me :grinning_face:) and it seems like a very interesting and cool data structure.

From the first glance I would suggest to convert p (the precision) into a type parameter. This way you can turn many run time vars into compile time aliases. e.g max_zeros, alpha‎ and multiple instances of m. The get_beta function should also get precision as comp time argument so you can add @parameter in front of the if/else cascade, so that the compiler will emit only the specialised computation and you will eliminate the branching completely. For the computation of the alpha in init method of HyperLogLog you would use the same trick. Similar trick can be applied in cardinality method. You can annotate the for i in range(m): with @parameter as m would be compile time known and this will inline the loop. Those are micro optimisations, so I am not sure how much benefit they will provide, but those are the things that are unique to Mojo.

I also see that you implemented murmur3_64 in the unit test. I would suggest to use the hash function shipped with standard library. The nightly default implementation uses AHash as hashing function which should be faster and provided better quality hash values.

You are also welcome to implement Murmur3_64 as a Hasher and compare it against AHasher. The hash function in stdlib allows swapping of the hashing algorithm.

I would also recommend to think about using the SIMD type in order to vectorise some computations. If it is ok with you I could also create a PR at some point.

1 Like

I’d love a PR. Thank you so much for the suggestions :smiley:

Hello, thanks that is awesome.
I too did not knew about this very useful HyperLogLog,
also learned about harmonic average on the way :+1:

Performance Summary

Modified Implementation (Compile-time parameters):

  • Sparse add: 0.0159 ms
  • Sparse cardinality: 0.0157 ms
  • Sparse merge: 0.0533 ms
  • Dense add: 1.0367 ms
  • Dense cardinality: 1.0686 ms
  • Dense merge: 2.1071 ms

Original Implementation (Runtime parameters):

  • Sparse add: 0.0175 ms
  • Sparse cardinality: 0.0173 ms
  • Sparse merge: 0.0519 ms
  • Dense add: 1.1174 ms
  • Dense cardinality: 1.1582 ms
  • Dense merge: 2.2151 ms

Performance Improvements:

  • Sparse add: 9.4% faster
  • Sparse cardinality: 9.2% faster
  • Sparse merge: 2.7% slower
  • Dense add: 7.2% faster
  • Dense cardinality: 7.7% faster
  • Dense merge: 4.9% faster

The compile-time parameter optimization provides 5-10% performance improvement for most operations, with dense operations showing the most consistent gains.

2 Likes

Very nice!

I have another idea for you, vectorizing the beta function. here is something I tried out on my machine:

fn get_beta2[P: Int](ez: Float64) -> Float64:
    """Get the beta constant for a given precision and empirical zero ratio."""
    @parameter
    if 4 <= P <= 16:
        var zl = log2(ez + 1)
        var zl2 = zl * zl
        var zl3 = zl2 * zl
        var zl4 = zl3 * zl
        var zl5 = zl4 * zl
        var zl6 = zl5 * zl
        var zl7 = zl6 * zl
        
        var z = SIMD[DType.float64, 8](ez, zl, zl2, zl3, zl4, zl5, zl6, zl7)

        @parameter
        if P == 4:
            var c = SIMD[DType.float64, 8](-0.582581413904517, -1.935300357560050, 
                                            11.079323758035073, -22.131357446444323, 
                                            22.505391846630037, -12.000723834917984, 
                                            3.220579408194167, -0.342225302271235)
            return (c * z).reduce_add()
        elif P == 5:
            var c = SIMD[DType.float64, 8](-0.7518999460733967, -0.9590030077748760, 
                                            5.5997371322141607, -8.2097636999765520, 
                                            6.5091254894472037, -2.6830293734323729, 
                                            0.5612891113138221, -0.0463331622196545)
            return (c * z).reduce_add()
        elif P == 6:
            var c = SIMD[DType.float64, 8](29.8257900969619634, -31.3287083337725925, 
                                            -10.5942523036582283, -11.5720125689099618, 
                                            3.8188754373907492, -2.4160130328530811, 
                                            0.4542208940970826, -0.0575155452020420)
            return (c * z).reduce_add()
        elif P == 7:
            var c = SIMD[DType.float64, 8](2.8102921290820060, -3.9780498518175995, 
                                            1.3162680041351582, -3.9252486335805901, 
                                            2.0080835753946471, -0.7527151937556955, 
                                            0.1265569894242751, -0.0109946438726240)
            return (c * z).reduce_add()
        elif P == 8:
            var c = SIMD[DType.float64, 8](1.00633544887550519, -2.00580666405112407, 
                                            1.64369749366514117, -2.70560809940566172, 
                                            1.39209980244222598, -0.46470374272183190, 
                                            0.07384282377269775, -0.00578554885254223)
            return (c * z).reduce_add()
        elif P == 9:
            var c = SIMD[DType.float64, 8](-0.09415657458167959, -0.78130975924550528, 
                                            1.71514946750712460, -1.73711250406516338, 
                                            0.86441508489048924, -0.23819027465047218, 
                                            0.03343448400269076, -0.00207858528178157)
            return (c * z).reduce_add()
        elif P == 10:
            var c = SIMD[DType.float64, 8](-0.25935400670790054, -0.52598301999805808, 
                                            1.48933034925876839, -1.29642714084993571, 
                                            0.62284756217221615, -0.15672326770251041, 
                                            0.02054415903878563, -0.00112488483925502)
            return (c * z).reduce_add()
        elif P == 11:
            var c = SIMD[DType.float64, 8](-0.432325553856025, -0.108450736399632, 
                                            0.609156550741120, -0.0165687801845180, 
                                            -0.0795829341087617, 0.0471830602102918, 
                                            -0.00781372902346934, 0.000584268708489995)
            return (c * z).reduce_add()
        elif P == 12:
            var c = SIMD[DType.float64, 8](-0.384979202588598, 0.183162233114364, 
                                            0.130396688841854, 0.0704838927629266, 
                                            -0.0089589397146453, 0.0113010036741605, 
                                            -0.00194285569591290, 0.000225435774024964)
            return (c * z).reduce_add()
        elif P == 13:
            var c = SIMD[DType.float64, 8](-0.41655270946462997, -0.22146677040685156, 
                                            0.38862131236999947, 0.45340979746062371, 
                                            -0.36264738324476375, 0.12304650053558529, 
                                            -0.01701540384555510, 0.00102750367080838)
            return (c * z).reduce_add()
        elif P == 14:
            var c = SIMD[DType.float64, 8](-0.371009760230692, 0.00978811941207509, 
                                            0.185796293324165, 0.203015527328432, 
                                            -0.116710521803686, 0.0431106699492820, 
                                            -0.00599583540511831, 0.000449704299509437)
            return (c * z).reduce_add()
        elif P == 15:
            var c = SIMD[DType.float64, 8](-0.38215145543875273, -0.89069400536090837, 
                                            0.37602335774678869, 0.99335977440682377, 
                                            -0.65577441638318956, 0.18332342129703610, 
                                            -0.02241529633062872, 0.00121399789330194)
            return (c * z).reduce_add()
        elif P == 16:
            var c = SIMD[DType.float64, 8](-0.37331876643753059, -1.41704077448122989, 
                                            0.40729184796612533, 1.56152033906584164, 
                                            -0.99242233534286128, 0.26064681399483092, 
                                            -0.03053811369682807, 0.00155770210179105)
            return (c * z).reduce_add()
        # Unreachable
        alias num_registers = 1 << P
        return 0.7213 / (1.0 + 1.079 / Float64(num_registers)) 
    else:
        # For larger register counts, use the standard beta correction
        alias num_registers = 1 << P
        return 0.7213 / (1.0 + 1.079 / Float64(num_registers)) 

It seems to be a bit faster for dense cardinality computation, although I tested it on Apple M1 which has only 128 bit wide SIMD instructions, on an architecture with 512bit it should be even faster.