I modified the MAX version to run inference 1000 times, to amortize the graph compilation time:
import time

from max.driver import CPU, Tensor  # assuming these come from MAX's Python driver API

execution_times = []
for i in range(1000):
    start_time = time.perf_counter()
    # Run the compiled graph and copy the result back to the host.
    result = model.execute()[0]
    assert isinstance(result, Tensor)
    result = result.to(CPU())
    execution_time = time.perf_counter() - start_time
    execution_times.append(execution_time)
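(The reporting code isn't shown above; a minimal sketch of how the statistics below could be computed from execution_times, assuming the list holds per-run times in seconds and using Python's statistics module, might look like this. The report helper name is just illustrative.)

import statistics

def report(execution_times):
    # Illustrative only: convert per-run seconds to milliseconds and summarize.
    times_ms = [t * 1000 for t in execution_times]
    print(f"Execution Time Statistics ({len(times_ms)} runs):")
    print(f"Min time: {min(times_ms):.3f} ms")
    print(f"Max time: {max(times_ms):.3f} ms")
    print(f"Average time: {statistics.mean(times_ms):.3f} ms")
    print(f"Median time: {statistics.median(times_ms):.3f} ms")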
I am running on a MacBook M3.
mojo mandelbrot.mojo
Number of physical cores: 12
Vectorized: 4.213 ms
Parallelized: 0.885 ms
magic run mandelbrot
Execution Time Statistics (1000 runs):
Min time: 0.002 ms
Max time: 0.010 ms
Average time: 0.002 ms
Median time: 0.002 ms
Generally speaking, the custom ops approach involves the graph compiler, which takes care of memory planning, kernel fusion, etc. For the Mandelbrot case it uses a special foreach function, which can optimize element-wise algorithms to their fullest extent.
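(To illustrate why Mandelbrot fits an element-wise abstraction like foreach: every pixel's escape count depends only on its own coordinate, so the work can be split freely across SIMD lanes and cores. The NumPy sketch below is not the MAX foreach API; it just shows that per-pixel structure, with illustrative bounds and iteration limit.)

import numpy as np

def mandelbrot_counts(rows, cols, min_x=-2.0, max_x=0.6,
                      min_y=-1.5, max_y=1.5, max_iters=200):
    # One complex coordinate per output pixel; each pixel is independent.
    xs = np.linspace(min_x, max_x, cols)
    ys = np.linspace(min_y, max_y, rows)
    c = xs[np.newaxis, :] + 1j * ys[:, np.newaxis]

    z = np.zeros_like(c)
    counts = np.full(c.shape, max_iters, dtype=np.int32)
    active = np.ones(c.shape, dtype=bool)  # pixels that have not escaped yet

    for i in range(max_iters):
        z[active] = z[active] * z[active] + c[active]
        newly_escaped = active & (np.abs(z) > 2.0)
        counts[newly_escaped] = i  # record the escape iteration
        active &= ~newly_escaped
    return counts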
The difference is about a factor of 1000, which coincides with your number of repetitions in the MAX case, if I understood you correctly. Just to be on the safe side, could you change the number of repetitions to 10000 and confirm the difference remains a factor of 1000?
Then I changed cols and rows to 64.
The difference is now only about 10-fold.
Mojo times:
Vectorized: 0.0218 ms
Parallelized: 0.0114 ms
MAX times are the same:
Min time: 0.002 ms
Max time: 0.013 ms
Average time: 0.002 ms
Median time: 0.002 ms
@Ehsan why can't the Mojo compiler do the same optimizations as MAX does?
To be clear: both of these paths use the Mojo compiler, because the code for the Mandelbrot set in each is defined in Mojo and compiled either for direct execution in a standalone file or as a node in a MAX computational graph.
The two calculations here are performed in slightly different ways. As Ehsan indicated, the example that runs this as a graph node uses our highly efficient foreach abstraction, which uses a number of tricks to fully parallelize across available hardware on CPU or GPU. I wouldn't be surprised if it balances the load better than a manually vectorized implementation like the first.
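(If you want to dig further, one option is to sweep the grid size and watch how the per-run time of the graph version scales; a well-balanced element-wise kernel should scale roughly with the pixel count once the grid is big enough to keep all cores busy. This is only a sketch: median_run_ms and build_model are hypothetical helpers, and build_model stands in for however the Mandelbrot graph is constructed and loaded at a given size.)

import statistics
import time

def median_run_ms(model, runs=1000):
    # Time only execute(), so graph build/compile cost is excluded.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        model.execute()
        times.append(time.perf_counter() - start)
    return statistics.median(times) * 1000.0

# Hypothetical usage, with build_model(rows, cols) standing in for
# the graph construction + session loading step:
# for size in (64, 256, 1024, 4096):
#     print(f"{size}x{size}: {median_run_ms(build_model(size, size)):.3f} ms")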