Democratizing AI Compute, Part 5: What about CUDA C++ alternatives?

GenAI may be new, but GPUs aren’t! Over the years, many have tried to create portable GPU programming models using C++, from OpenCL to SYCL to oneAPI and beyond. These were the most plausible CUDA alternatives that aimed to democratize AI compute, but you may never have heard of them, because they failed to become relevant for AI.

These projects have all contributed meaningfully to compute, but if we are serious about unlocking AI compute for the future, we must critically examine the mistakes that held them back—not just celebrate the wins. At a high level, the problems stem from the challenges of “open coopetition”—where industry players both collaborate and compete—as well as specific management missteps along the way.

Let’s dive in. :rocket:

This is Part 5 of Modular’s “Democratizing AI Compute” series. For more, see:

CUDA C++ Alternatives: OpenCL, SYCL, and More

There are many projects that aimed to unlock GPU programming, but the one I know best is OpenCL. Like CUDA, OpenCL aimed to give programmers a C++-like experience for writing code that ran on the GPU. The history is personal: in 2008, I was one of the lead engineers implementing OpenCL at Apple (it was the first production use of the Clang compiler I was building). After we shipped it, we made the pivotal decision to contribute it to the Khronos Group so it could get adopted and standardized across the industry.
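To make that concrete, here is a minimal sketch of what an OpenCL kernel looks like. The kernel and parameter names are illustrative (not from any real codebase), and the host-side boilerplate (platform, device, context, and queue setup) is omitted:

```c
/* Illustrative OpenCL C kernel (names are hypothetical).
 * Like CUDA, each work-item computes one element of the result;
 * unlike CUDA C++, classic OpenCL kernels are written in a C dialect. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *out,
                      const unsigned int n)
{
    size_t i = get_global_id(0);   /* this work-item's global index */
    if (i < n)                     /* guard against padded launch sizes */
        out[i] = a[i] + b[i];
}
```

The model is close enough to CUDA that porting simple kernels is mechanical — which is exactly why it looked like such a plausible alternative.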

That decision led to broad industry adoption of OpenCL (see the logos), particularly in mobile and embedded devices. Today, it remains hugely successful, powering GPU compute on platforms like Android, as well as in specialized applications such as DSPs. Unlike CUDA, OpenCL was designed for portability from the outset, aiming to support heterogeneous compute across CPUs, GPUs, and other accelerators. OpenCL also inspired other systems like SYCL, Vulkan, SPIR-V, oneAPI, WebCL, and many others.

However, despite its technical strengths and broad adoption, OpenCL never became the dominant AI compute platform. There are several major reasons for this: the inherent tensions of open coopetition, technical problems that flowed from that, the evolving requirements of AI, and NVIDIA’s unified strategy with TensorFlow and PyTorch.

“Coopetition” at Committee Speed

In 2008, Apple was a small player in the PC space and thought that industry standardization would help it reach more developers. However, while OpenCL did gain broad adoption among hardware makers, its evolution quickly ran into a major obstacle: the speed of committee-driven development. For Apple, this slow-moving, consensus-driven process was a dealbreaker: we wanted to move the platform rapidly, add new features (e.g., C++ templates), and express the differentiation of the Apple platform. We faced a stark reality: the downside of a committee standard is that things suddenly moved at committee-consensus speed… which felt glacial.

Hardware vendors recognized the long-term benefits of a unified software ecosystem, but in the short term they were fierce competitors. This led to subtle but significant problems: instead of telling the committee about the hardware features they were working on (and giving competitors a head start), participants kept innovations secret until after the hardware shipped, exposed them through vendor-specific extensions, and only discussed standardization once those features had become commoditized.

Coopetition: “cooperation” amongst competitors

This became a huge problem for Apple, a company that wanted to move fast in secret to make a big splash with product launches. As such, Apple decided to abandon OpenCL: it introduced Metal instead, never brought OpenCL to iOS, and later deprecated it on macOS. Other companies stuck with OpenCL, but these structural challenges continued to limit its ability to evolve at the pace of cutting-edge AI and GPU innovation.

Technical Problems with OpenCL

While Apple boldly decided to contribute the OpenCL standard to Khronos, it wasn’t all-in: it contributed OpenCL as a technical specification—but without a full reference implementation. Though parts of the compiler front-end (Clang) were open source, there was no shared OpenCL runtime, so each vendor had to fork the code, complete the compiler, and maintain its own implementation. Without a shared, evolving reference, OpenCL became a patchwork of vendor-specific forks and extensions. This fragmentation ultimately weakened its portability—the very thing it was designed to enable.

Furthermore, vendors held back differentiated features or isolated them in vendor-specific extensions, which exploded in number and fragmented OpenCL (and its derivatives), eroding its ability to be a unifying, vendor-agnostic platform. These problems were exacerbated by weaknesses in OpenCL’s compatibility and conformance tests. On top of that, it inherited all the “C++ problems” that we discussed before.

Developers want stable, well-supported tools—but OpenCL’s fragmentation, weak conformance tests, and inconsistent vendor support made it an exercise in frustration. One developer summed it up by saying that using OpenCL is “about as comfortable as hugging a cactus”! Ouch.

While OpenCL was struggling with fragmentation and slow committee-driven evolution, AI was rapidly advancing—both in software frameworks and hardware capabilities. This created an even bigger gap between what OpenCL offered and what modern AI workloads needed.

The Evolving Needs of AI Research and AI GPU Hardware

The introduction of TensorFlow and PyTorch kicked off a revolution in AI research, powered by improved infrastructure and a massive influx of BigCo funding. This posed a major challenge for OpenCL. While it enabled GPU compute, it lacked the high-level AI libraries and optimizations necessary for training and inference at scale. Unlike CUDA, it had no built-in support for key operations like matrix multiplication or Flash Attention, and no story for datacenter-scale training.

Cross-industry efforts to extend TensorFlow and PyTorch to use OpenCL quickly ran into fundamental roadblocks, despite the obvious need and incredible demand. The developers who kept hugging the cactus soon discovered a harsh reality: portability to new hardware is meaningless if you can’t unlock its full performance. Without a way to express hardware-specific enhancements portably—and with coopetition crushing collaboration—progress stalled.

One glaring example? OpenCL still doesn’t provide standardized support for Tensor Cores—the specialized hardware units that power efficient matrix multiplication in modern GPUs and AI accelerators. As a result, using OpenCL often means a 5x to 10x slowdown compared to CUDA or other vendor-native software. For GenAI, where compute costs are already astronomical, a 5x to 10x slowdown isn’t just inconvenient—it’s a complete dealbreaker.

NVIDIA’s Strategic Approach with TensorFlow and PyTorch

While OpenCL struggled under the weight of fragmented governance, NVIDIA took a radically different approach—one that was tightly controlled, highly strategic, and ruthlessly effective, as we discussed earlier. It actively co-designed CUDA’s high-level libraries alongside TensorFlow and PyTorch, ensuring they always ran best on NVIDIA hardware. Since these frameworks were natively built on CUDA, NVIDIA had a massive head start—and it doubled down by optimizing performance out of the box.

NVIDIA maintained a token OpenCL implementation—but it was strategically hobbled (e.g., unable to use Tensor Cores)—ensuring that a CUDA implementation would always be necessary. NVIDIA’s continued and rising dominance ensured that its CUDA implementations would always be the most heavily invested in. Over time, OpenCL support faded, then vanished—while CUDA cemented its position as the undisputed standard.

What Can We Learn From These C++ GPU Projects?

The history above is well understood by those of us who lived through it, but the real value comes from learning from the past. Based on this, I believe successful systems must:

  • Provide a reference implementation, not just a paper specification and “compatibility” tests. A working, adoptable, and scalable implementation should define compatibility—not a PDF.
  • Have strong leadership and vision driven by whoever maintains the reference implementation.
  • Run with top performance on the industry leader’s hardware—otherwise, it will always be a second-class alternative, not something that can unify the industry.
  • Evolve rapidly to meet changing requirements, because AI research isn’t stagnant, and AI hardware innovation is still accelerating.
  • Cultivate developer love, by providing great usability, tools and fast compile times. Also, “C++ like” isn’t exactly a selling point in AI!
  • Build an open community, because without widespread adoption, technical prowess doesn’t matter.
  • Avoid fragmentation—a standard that splinters into incompatible forks can’t provide an effective unification layer for software developers.

These are the fundamental reasons why I don’t believe that committee efforts like OpenCL can ever succeed. It’s also why I’m even more skeptical of projects like Intel’s oneAPI (now UXL Foundation) that are notionally open, but in practice, controlled by a single hardware vendor competing with all the others.

What About AI Compilers?

At the same time that C++ approaches failed to unify AI compute for hardware makers, the AI industry faced a bigger challenge, even when using CUDA on NVIDIA hardware: how can we scale AI compute if humans have to write all the code manually? There are too many chips, too many AI algorithms, and too many workload permutations to optimize by hand.

As AI’s dominance grew, it inevitably attracted interest from systems developers and compiler engineers—including myself. In the next post, we’ll dive into widely known “AI compiler” stacks like TVM, OpenXLA, and MLIR—examining what worked, what didn’t, and what lessons we can take forward. Unfortunately, the lessons are not wildly different from the ones above:

History may not repeat itself, but it does rhyme. - Mark Twain

See you next time—until then, may the FLOPS be with you! :man_technologist:

-Chris

Very informative and fun read. Thank you for sharing! :folded_hands:t4:
Coincidentally, I had posted these questions for the community meeting on 03/10 before you posted part 5: Democratizing AI Compute: Q&A (Community Meeting) - #3 by nlaanait :slightly_smiling_face:
Part 5 puts the role that hardware vendors’ coopetition plays in fragmenting the accelerated computing software stack front and center.
Really looking forward to your take on how “new” AI hardware vendors would interact with Modular’s vision of unifying AI software+infrastructure!

Backstory about OpenCL at Apple, and the move to Metal, was fascinating to learn about. Core ML offers some much-needed abstraction for heterogeneous compute on top of Accelerate and Metal, but being tied to Apple’s platforms and APIs is not compelling for general numerical computing.

I’m happy to chat more next week about any of these details. It’s a bit challenging to fit big and thorny issues into bite-sized morsels that are understandable by a range of folks, but I’m happy to share more of course :slight_smile:

Yes, don’t miss the upcoming community meeting on Monday where @clattner will answer questions about the Democratizing AI Compute series!
