Mojo vs MAX Mandelbrot performance

I decided to compare Mandelbrot performance in plain Mojo (modular/examples/mojo/mandelbrot.mojo in the modular/modular repo on GitHub)
and in MAX (modular/examples/custom_ops/mandelbrot.py in the same repo).

I modified both to use:

WIDTH = 1024
HEIGHT = 1024
MAX_ITERATIONS = 64
MIN_X = -2.0
MAX_X = 1.0
MIN_Y = -1.5
MAX_Y = 1.5

I modified the MAX version to run inference 1000 times to amortize graph compilation time:

execution_times = []
for i in range(1000):
    start_time = time.perf_counter()
    result = model.execute()[0]
    assert isinstance(result, Tensor)
    result = result.to(CPU())
    execution_time = time.perf_counter() - start_time
    execution_times.append(execution_time)
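For reference, the summary statistics reported below can be computed from `execution_times` with Python's standard `statistics` module. A minimal sketch with stand-in timings (the real list comes from the loop above):

```python
import statistics

# Stand-in timings in milliseconds; in the real script this is the
# execution_times list filled by the measurement loop.
execution_times = [0.002, 0.002, 0.003, 0.002, 0.010]

print(f"Min time: {min(execution_times):.3f} ms")
print(f"Max time: {max(execution_times):.3f} ms")
print(f"Average time: {statistics.mean(execution_times):.3f} ms")
print(f"Median time: {statistics.median(execution_times):.3f} ms")
```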

I am running on a MacBook with an M3 chip.

mojo mandelbrot.mojo
Number of physical cores: 12
Vectorized: 4.213 ms
Parallelized: 0.885 ms

magic run mandelbrot
Execution Time Statistics (1000 runs):
Min time: 0.002 ms
Max time: 0.010 ms
Average time: 0.002 ms
Median time: 0.002 ms

How is it that fast? Is it optimized out?

Generally speaking, the custom-ops approach involves the graph compiler, which takes care of memory planning, kernel fusion, etc. For the Mandelbrot case it uses a special foreach function that can optimize element-wise algorithms to their fullest extent.
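For context on why this is element-wise: the Mandelbrot escape-time computation is a pure per-pixel map, which is exactly the shape of algorithm foreach targets. A rough Python sketch of the per-element work, using the config values from above (this is an illustration of the algorithm, not the MAX kernel):

```python
WIDTH, HEIGHT = 1024, 1024
MAX_ITERATIONS = 64
MIN_X, MAX_X = -2.0, 1.0
MIN_Y, MAX_Y = -1.5, 1.5

def escape_time(row: int, col: int) -> int:
    """Iteration count for one pixel; independent of every other pixel."""
    cx = MIN_X + col * (MAX_X - MIN_X) / WIDTH
    cy = MIN_Y + row * (MAX_Y - MIN_Y) / HEIGHT
    x = y = 0.0
    for i in range(MAX_ITERATIONS):
        if x * x + y * y > 4.0:
            return i
        x, y = x * x - y * y + cx, 2.0 * x * y + cy
    return MAX_ITERATIONS

# The whole image is just this function applied at every (row, col):
# out[row][col] = escape_time(row, col)
```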

The difference is about a factor of 1000, which coincides with the number of repetitions in the MAX case, if I understood you correctly. Just to be on the safe side, could you change the number of repetitions to 10,000 and confirm the difference remains a factor of 1000?

Using device: Device(type=cpu,id=0)

Execution Time Statistics (10000 runs):
Min time: 0.002 ms
Max time: 0.010 ms
Average time: 0.002 ms
Median time: 0.002 ms

Then I changed cols and rows to 64.
The difference is now only 10-fold.

Mojo times:
Vectorized: 0.0218 ms
Parallelized: 0.0114 ms

MAX times are the same:
Min time: 0.002 ms
Max time: 0.013 ms
Average time: 0.002 ms
Median time: 0.002 ms
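One way to compare these numbers across the two image sizes is to normalize by pixel count. A back-of-the-envelope sketch using the timings reported above (the figures are copied from the posts, not re-measured):

```python
# Reported timings (ms) from the posts above
mojo_parallel_1024 = 0.885   # 1024 x 1024, Mojo parallelized
max_median_1024 = 0.002      # 1024 x 1024, MAX median
mojo_parallel_64 = 0.0114    # 64 x 64, Mojo parallelized
max_median_64 = 0.002        # 64 x 64, MAX median

def pixels_per_ms(side: int, ms: float) -> float:
    """Throughput in pixels per millisecond for a side x side image."""
    return side * side / ms

print(f"Mojo 1024x1024: {pixels_per_ms(1024, mojo_parallel_1024):,.0f} px/ms")
print(f"MAX  1024x1024: {pixels_per_ms(1024, max_median_1024):,.0f} px/ms")
print(f"Mojo     64x64: {pixels_per_ms(64, mojo_parallel_64):,.0f} px/ms")
print(f"MAX      64x64: {pixels_per_ms(64, max_median_64):,.0f} px/ms")
```

The MAX median being flat across a 256x change in pixel count is worth keeping in mind when interpreting the ratios.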

@Ehsan why can't the Mojo compiler do the same optimizations as MAX does?

To be clear: both of these paths use the Mojo compiler, because the code for the Mandelbrot set in each is defined in Mojo and compiled either for direct execution in a standalone file or as a node in a MAX computational graph.

The two calculations here are performed in slightly different ways. As Ehsan indicated, the example that runs this as a graph node uses our highly efficient foreach abstraction, which uses a number of tricks to fully parallelize across available hardware on CPU or GPU. I wouldn’t be surprised if it balances the load better than a manually vectorized implementation like the first.
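As a rough illustration of the load-balancing point (not the actual foreach implementation): an element-wise kernel can be split into many more tasks than there are cores, so a work-stealing scheduler can rebalance rows that take longer — in the Mandelbrot case, interior pixels iterate to MAX_ITERATIONS while exterior pixels escape quickly. A Python sketch of the scheduling idea only (Python threads don't speed up CPU-bound work, so `row_work` here is a cheap stand-in):

```python
from concurrent.futures import ThreadPoolExecutor

HEIGHT, WIDTH = 64, 64

def row_work(row: int) -> list:
    # Stand-in for per-pixel Mandelbrot work; real cost varies per pixel.
    return [(row * WIDTH + col) % 7 for col in range(WIDTH)]

# One task per row gives the pool many more tasks than workers, so a slow
# row does not stall a statically assigned chunk of the image.
with ThreadPoolExecutor(max_workers=4) as pool:
    image = list(pool.map(row_work, range(HEIGHT)))
```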

