HyperLogLog in Mojo

:fire: HyperLogLog in Mojo – let’s make it fly

TL;DR
I’ve implemented HyperLogLog in Mojo to stress-test the language on bit-twiddly, SIMD-friendly code.
Repo → GitHub - axiomhq/mojo-hyperloglog
I need fresh eyes on perf hotspots, API feel, and a solid benchmark story.

What’s inside

  • HLL core in ~300 LOC Mojo (precision 4–16)
  • LogLog-Beta bias correction :white_check_mark:
  • Sparse ⇆ dense switch for tiny vs. huge sets
  • Merge-ready registers for sharded counts
  • Raw-byte serde
  • Unit tests + micro-bench skeleton (benchmark_hyperloglog.mojo)

Code stays vanilla for readability - plenty of low-hanging fruit for optimisation.
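For anyone who hasn't met HLL before, the core update step is roughly this - a standalone sketch for context only, not the exact code in the repo (the constant and names are made up):

from bit import count_leading_zeros

fn main():
    alias P = 14                                   # precision: 2^14 registers
    alias m = 1 << P
    var registers = List[Int](capacity=m)
    for _ in range(m):
        registers.append(0)

    var h: UInt64 = 0x9E3779B97F4A7C15             # stand-in for a hashed element
    var bucket = Int(h >> UInt64(64 - P))          # top P bits pick the register
    var rest = (h << UInt64(P)) | UInt64(1 << (P - 1))  # guard bit bounds the zero count
    var rank = Int(count_leading_zeros(rest)) + 1  # position of the first 1-bit
    if rank > registers[bucket]:
        registers[bucket] = rank                   # each register keeps the max rank it has seen
    print("bucket:", bucket, "rank:", rank)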

Why Mojo?

  • Python vibes, LLVM muscle
  • First-class SIMD intrinsics
  • Predictable memory layout (cache-friendly)
  • Compile-time generics incoming - perfect for prob-DS hacks

Where I need help

| Area       | Current state       | What would be awesome                                                                      |
| ---------- | ------------------- | ------------------------------------------------------------------------------------------ |
| Hashing    | plain murmur3_64    | Swap in XXH3/Metro64 - compare speed + collisions                                           |
| Bit maths  | naïve loops         | vector[i] = max(vector[i], r) via 128/256-bit SIMD (see the sketch right after this table)  |
| Benchmarks | single-thread micro | Repeatable harness vs. CPython, Go (axiomhq/hyperloglog), Rust                              |
| Docs       | tiny README         | Diagrams / blog-style walkthrough                                                           |
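For the bit-maths row, the register merge could use Mojo's SIMD type directly - a toy sketch with an assumed block width of 16, not code from the repo:

fn merge_block(a: SIMD[DType.uint8, 16], b: SIMD[DType.uint8, 16]) -> SIMD[DType.uint8, 16]:
    # element-wise max over 16 registers at once; a dense merge would walk the
    # register array in blocks like this instead of byte by byte
    return max(a, b)

fn main():
    var a = SIMD[DType.uint8, 16](1)
    var b = SIMD[DType.uint8, 16](3)
    print(merge_block(a, b))  # every lane becomes 3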

(Bias correction is done - no estimator bikeshedding right now.)

How to run

# Needs Mojo SDK ≥ 0.7.0
git clone https://github.com/axiomhq/mojo-hyperloglog
cd mojo-hyperloglog
mojo run test_hyperloglog.mojo        # all green
mojo run benchmark_hyperloglog.mojo   # ops/s + error stats

Contributing

  • PRs / discussions for micro-opts, API tweaks, benchmark scripts (Go, Rust, C++… bring ’em).
  • Faster and cleaner? :100:
  • MIT licence.

Roadmap (up for debate)

  1. Vectorised add_hash (AVX-512 / Neon flag)
  2. Streaming memory-profiler hook
  3. Blog post: Mojo vs. Go HLL - war stories & numbers

Big thanks in advance - curious what the Mojo crew can squeeze out!

4 Likes

Good job!
Is there a reason for directly calling __setitem__ like:

self.registers.__setitem__(bucket, zeros)

rather than using syntax sugar:

self.registers[bucket] = zeros

Magic methods are not meant to be called directly; that’s why they’re ugly.

Good call! Changing it… or if you want you can create a PR :folded_hands:

Hey Seif,

Nice idea - it's actually the first time I've heard of HyperLogLog (shame on me :grinning_face:), and it seems like a very interesting and cool data structure.

At first glance, I would suggest converting p (the precision) into a type parameter. This way you can turn many runtime vars into compile-time aliases, e.g. max_zeros, alpha, and the multiple instances of m. The get_beta function should also take the precision as a compile-time argument so you can add @parameter in front of the if/else cascade; the compiler will then emit only the specialised computation and you eliminate the branching completely. For the computation of alpha in the __init__ method of HyperLogLog you would use the same trick, and a similar trick can be applied in the cardinality method. You can also annotate the for i in range(m): with @parameter, as m would be compile-time known, and this will unroll the loop. Those are micro-optimisations, so I am not sure how much benefit they will provide, but they are the kind of things that are unique to Mojo.
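Something like this is what I have in mind - hypothetical names, not your actual code:

fn alpha[P: Int]() -> Float64:
    """Bias-correction constant, fully resolved at compile time for a fixed P."""
    alias m = 1 << P
    @parameter
    if P == 4:
        return 0.673
    elif P == 5:
        return 0.697
    elif P == 6:
        return 0.709
    else:
        return 0.7213 / (1.0 + 1.079 / Float64(m))

fn main():
    # only the P == 14 branch is compiled in; the call folds to a constant
    print(alpha[14]())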

I also see that you implemented murmur3_64 in the unit test. I would suggest using the hash function shipped with the standard library. The nightly default implementation uses AHash as the hashing function, which should be faster and provide better-quality hash values.

You are also welcome to implement Murmur3_64 as a Hasher and compare it against AHasher. The hash function in the stdlib allows swapping the hashing algorithm.

I would also recommend thinking about using the SIMD type to vectorise some computations. If it is OK with you, I could also create a PR at some point.

1 Like

I’d love a PR. Thank you so much for the suggestions :smiley:

Hello, thanks, that is awesome.
I too did not know about this very useful HyperLogLog,
and also learned about the harmonic average along the way :+1:

Performance Summary

Modified Implementation (Compile-time parameters):

  • Sparse add: 0.0159 ms
  • Sparse cardinality: 0.0157 ms
  • Sparse merge: 0.0533 ms
  • Dense add: 1.0367 ms
  • Dense cardinality: 1.0686 ms
  • Dense merge: 2.1071 ms

Original Implementation (Runtime parameters):

  • Sparse add: 0.0175 ms
  • Sparse cardinality: 0.0173 ms
  • Sparse merge: 0.0519 ms
  • Dense add: 1.1174 ms
  • Dense cardinality: 1.1582 ms
  • Dense merge: 2.2151 ms

Performance Improvements:

  • Sparse add: 9.4% faster
  • Sparse cardinality: 9.2% faster
  • Sparse merge: 2.7% slower
  • Dense add: 7.2% faster
  • Dense cardinality: 7.7% faster
  • Dense merge: 4.9% faster

The compile-time parameter optimization provides 5-10% performance improvement for most operations, with dense operations showing the most consistent gains.

3 Likes

Very nice!

I have another idea for you: vectorizing the beta function. Here is something I tried out on my machine:

from math import log2

fn get_beta2[P: Int](ez: Float64) -> Float64:
    """Get the beta constant for a given precision and empirical zero ratio."""
    @parameter
    if 4 <= P <= 16:
        var zl = log2(ez + 1)
        var zl2 = zl * zl
        var zl3 = zl2 * zl
        var zl4 = zl3 * zl
        var zl5 = zl4 * zl
        var zl6 = zl5 * zl
        var zl7 = zl6 * zl
        
        var z = SIMD[DType.float64, 8](ez, zl, zl2, zl3, zl4, zl5, zl6, zl7)

        @parameter
        if P == 4:
            var c = SIMD[DType.float64, 8](-0.582581413904517, -1.935300357560050, 
                                            11.079323758035073, -22.131357446444323, 
                                            22.505391846630037, -12.000723834917984, 
                                            3.220579408194167, -0.342225302271235)
            return (c * z).reduce_add()
        elif P == 5:
            var c = SIMD[DType.float64, 8](-0.7518999460733967, -0.9590030077748760, 
                                            5.5997371322141607, -8.2097636999765520, 
                                            6.5091254894472037, -2.6830293734323729, 
                                            0.5612891113138221, -0.0463331622196545)
            return (c * z).reduce_add()
        elif P == 6:
            var c = SIMD[DType.float64, 8](29.8257900969619634, -31.3287083337725925, 
                                            -10.5942523036582283, -11.5720125689099618, 
                                            3.8188754373907492, -2.4160130328530811, 
                                            0.4542208940970826, -0.0575155452020420)
            return (c * z).reduce_add()
        elif P == 7:
            var c = SIMD[DType.float64, 8](2.8102921290820060, -3.9780498518175995, 
                                            1.3162680041351582, -3.9252486335805901, 
                                            2.0080835753946471, -0.7527151937556955, 
                                            0.1265569894242751, -0.0109946438726240)
            return (c * z).reduce_add()
        elif P == 8:
            var c = SIMD[DType.float64, 8](1.00633544887550519, -2.00580666405112407, 
                                            1.64369749366514117, -2.70560809940566172, 
                                            1.39209980244222598, -0.46470374272183190, 
                                            0.07384282377269775, -0.00578554885254223)
            return (c * z).reduce_add()
        elif P == 9:
            var c = SIMD[DType.float64, 8](-0.09415657458167959, -0.78130975924550528, 
                                            1.71514946750712460, -1.73711250406516338, 
                                            0.86441508489048924, -0.23819027465047218, 
                                            0.03343448400269076, -0.00207858528178157)
            return (c * z).reduce_add()
        elif P == 10:
            var c = SIMD[DType.float64, 8](-0.25935400670790054, -0.52598301999805808, 
                                            1.48933034925876839, -1.29642714084993571, 
                                            0.62284756217221615, -0.15672326770251041, 
                                            0.02054415903878563, -0.00112488483925502)
            return (c * z).reduce_add()
        elif P == 11:
            var c = SIMD[DType.float64, 8](-0.432325553856025, -0.108450736399632, 
                                            0.609156550741120, -0.0165687801845180, 
                                            -0.0795829341087617, 0.0471830602102918, 
                                            -0.00781372902346934, 0.000584268708489995)
            return (c * z).reduce_add()
        elif P == 12:
            var c = SIMD[DType.float64, 8](-0.384979202588598, 0.183162233114364, 
                                            0.130396688841854, 0.0704838927629266, 
                                            -0.0089589397146453, 0.0113010036741605, 
                                            -0.00194285569591290, 0.000225435774024964)
            return (c * z).reduce_add()
        elif P == 13:
            var c = SIMD[DType.float64, 8](-0.41655270946462997, -0.22146677040685156, 
                                            0.38862131236999947, 0.45340979746062371, 
                                            -0.36264738324476375, 0.12304650053558529, 
                                            -0.01701540384555510, 0.00102750367080838)
            return (c * z).reduce_add()
        elif P == 14:
            var c = SIMD[DType.float64, 8](-0.371009760230692, 0.00978811941207509, 
                                            0.185796293324165, 0.203015527328432, 
                                            -0.116710521803686, 0.0431106699492820, 
                                            -0.00599583540511831, 0.000449704299509437)
            return (c * z).reduce_add()
        elif P == 15:
            var c = SIMD[DType.float64, 8](-0.38215145543875273, -0.89069400536090837, 
                                            0.37602335774678869, 0.99335977440682377, 
                                            -0.65577441638318956, 0.18332342129703610, 
                                            -0.02241529633062872, 0.00121399789330194)
            return (c * z).reduce_add()
        elif P == 16:
            var c = SIMD[DType.float64, 8](-0.37331876643753059, -1.41704077448122989, 
                                            0.40729184796612533, 1.56152033906584164, 
                                            -0.99242233534286128, 0.26064681399483092, 
                                            -0.03053811369682807, 0.00155770210179105)
            return (c * z).reduce_add()
        # Unreachable
        alias num_registers = 1 << P
        return 0.7213 / (1.0 + 1.079 / Float64(num_registers)) 
    else:
        # For larger register counts, use the standard beta correction
        alias num_registers = 1 << P
        return 0.7213 / (1.0 + 1.079 / Float64(num_registers)) 

It seems to be a bit faster for dense cardinality computation, although I tested it on an Apple M1, which has only 128-bit wide SIMD instructions; on an architecture with 512-bit SIMD it should be even faster.

Wanna do a PR :smiley: ?

Sadly I am not sure I will have time to make a proper PR, as I would need to do a bit more testing, documentation, etc… The experiment above was just a 20-minute distraction; a PR is more of a commitment :grinning_face_with_smiling_eyes:. And then I would feel bad, as I would be starting something new instead of working on the Mojo standard library, maintaining my old Mokk projects, or porting Dagr (GitHub - mzaks/dagr: A high-performance, type-safe binary serialization framework for Swift.) to Mojo.

Hey Seif,

I spent a bit more time on HyperLogLog; I wanted to see if I could optimise the cardinality for dense sets. And I think it worked quite well, although some of the stuff I did was a bit questionable, like reducing the precision while calculating the count of empty registers and the beta.

Sorry, no direct PR again, but I have a gist:

First things first, I isolated the benchmark to measure just the computation of cardinality; before, it measured cardinality + addition of elements to the set. See: Mojo HyperLogLog performance optimisation · GitHub

Then I tried branch elimination. See: Mojo HyperLogLog performance optimisation · GitHub

I rewrote the computation logic to be branchless and introduced partial loop unrolling.
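Roughly, the branchless part boils down to something like this (a sketch with made-up names; the gist is more involved and also unrolls the loop):

fn count_empty(registers: List[UInt8]) -> Float64:
    """Count empty registers without a data-dependent branch in the loop body."""
    var ez: Float64 = 0.0
    for i in range(len(registers)):
        # contributes 1.0 only when the register is 0, via arithmetic instead of `if`
        ez += 1.0 - min(Float64(Int(registers[i])), 1.0)
    return ez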

After that, as we already have a branchless algorithm, I went for vectorisation, which did not bring much with Float64 but did reduce the execution time quite a bit when I switched to Float32. This might be because I am on an Apple M1 and the chip cannot vectorise Float64 very well, so it might work better on chips with better (wider) SIMD support. See: Mojo HyperLogLog performance optimisation · GitHub
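To give an idea of the vectorised inner step: it is essentially 2^-reg over a whole block of registers at once, something like this toy version (assumed width of 16, loading the block from the register list omitted):

from math import exp2

fn harmonic_block(regs: SIMD[DType.float32, 16]) -> Float32:
    # 2^-reg for 16 registers in one go, then a horizontal sum
    return exp2(-regs).reduce_add()

fn main():
    var regs = SIMD[DType.float32, 16](2.0)  # pretend every register holds rank 2
    print(harmonic_block(regs))              # 16 * 2^-2 = 4.0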

So, all in all, here are the numbers on my Apple M1 chip.

Before optimisations, but already with Float32 instead of Float64:

| name                        | met (ms)             | iters |
| --------------------------- | -------------------- | ----- |
| benchmark_cardinality_dense | 0.032504722316480036 | 35364 |
| benchmark_cardinality_dense | 0.03001104703506339  | 41097 |
| benchmark_cardinality_dense | 0.02934371618741598  | 30499 |
| benchmark_cardinality_dense | 0.03166067746686303  | 37345 |
| benchmark_cardinality_dense | 0.030180023398654582 | 41028 |

With branch elimination:

| name                        | met (ms)             | iters |
| --------------------------- | -------------------- | ----- |
| benchmark_cardinality_dense | 0.023303263947348284 | 52268 |
| benchmark_cardinality_dense | 0.02517179833737572  | 54011 |
| benchmark_cardinality_dense | 0.02423603127368168  | 54103 |
| benchmark_cardinality_dense | 0.02228566186411217  | 51778 |
| benchmark_cardinality_dense | 0.02306314483847325  | 35474 |

With vectorisation and reduction to Float16 while computing exp2 for reg:

| name                        | met (ms)             | iters  |
| --------------------------- | -------------------- | ------ |
| benchmark_cardinality_dense | 0.0096209            | 100000 |
| benchmark_cardinality_dense | 0.007627987633937971 | 180494 |
| benchmark_cardinality_dense | 0.007857564241366358 | 181036 |
| benchmark_cardinality_dense | 0.007399178592552056 | 180909 |
| benchmark_cardinality_dense | 0.007546237119402241 | 180407 |

I also ran the test for 100K elements; the result is:
Estimated: 101094 Actual: 100000 Error margin: 0.01094

So it is still under the 2% error margin.

Hey Maxim,

Thanks so much, and sorry for the late reply! I picked up your changes and merged them. I wasn't sure about the Float16, but as long as it doesn't push the error rate to >= 2%, I think we are good.

Thanks again

1 Like

Oh, I see you just incorporated the changes I described here. I also created a PR (isolated benchmarks and improved dense cardinality computation by about 10x by mzaks · Pull Request #1 · axiomhq/mojo-hyperloglog · GitHub), which had a bit of refactoring, specifically in the benchmark code.

Anyways, I have one more thing which could be improved. In the vectorised cardinality algorithm I hard-coded the upper bound of vector_width to 64, which works fine on Mac M-series chips, but when I ran the benchmark on an AMD Zen 5 laptop the best width was 4096, so I think we need to introduce max_width as a parameter which defaults to `simd_byte_width() * 4`.
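If it helps, the rough shape of that default - assuming the stdlib helper is sys.info.simdbytewidth(); the exact spelling may differ between Mojo versions:

from sys.info import simdbytewidth

fn main():
    # derive the cap from the hardware instead of hard-coding 64
    var max_width = simdbytewidth() * 4
    print("default max vector width:", max_width)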

Should I just create a PR on top of the current main and close my previous PR?

Yes please! I’d love a PR :heart: