I have most of that set up, including a harness derived from formal verification tooling (TLA+) that acts as ground truth for externally visible behavior. As for context, I have a dataset that I convert into a RAG db for new models, which contains most of the information I personally reference when trying to write high performance code, as well as all of my notes from undergrad and research. It’s enough information that I think I could reasonably expect a human to learn to write high performance code from it.
It’s possible that I’m mixing multiple areas that are poorly represented in the training data (network drivers, a custom network stack, large-scale distributed systems, very tight integration with relatively new hardware, and taking advantage of hardware capabilities that most OSes don’t actually expose), combined with a few other things that make the problem trickier, and that’s what throws LLMs off. However, even getting LLMs to do something simple like handle endianness properly is like pulling teeth at times, because most examples of the task in the training data are actually incorrect, and I have a feeling “network stack development” doesn’t see as much fine-tuning from closed models as ReactJS does. I’ve even gone to the extent of fine-tuning my own models, which, despite running on consumer hardware, at least subjectively seem to perform much better, although the brute-force approach I use tends to ruin the models’ ability to write JS and Python code.
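To make the endianness complaint concrete, here’s a minimal sketch of the kind of thing that keeps going wrong. The names (`parse_len`, a 4-byte big-endian length field at offset 0) are illustrative, not from my actual stack; the point is that the correct version converts from network byte order explicitly, where the pointer-cast idiom models tend to emit is both endian-wrong on little-endian hosts and an unaligned-read hazard.

```rust
// Sketch: pulling a big-endian (network order) u32 length field out of
// a packet buffer. The field name and offset are made up for the example.
fn parse_len(buf: &[u8]) -> Option<u32> {
    // from_be_bytes does the network-to-host conversion explicitly.
    // The common incorrect pattern in training data is something like
    // `*(buf.as_ptr() as *const u32)`, which reads the field in host
    // (little-endian) order on x86 and is UB if the buffer is unaligned.
    let bytes: [u8; 4] = buf.get(0..4)?.try_into().ok()?;
    Some(u32::from_be_bytes(bytes))
}

fn main() {
    let wire = [0x00, 0x00, 0x00, 0x2a, 0xff]; // length field = 42 on the wire
    assert_eq!(parse_len(&wire), Some(42));
    assert_eq!(parse_len(&[0x00, 0x2a]), None); // too short: no panic, just None
}
```

The `Option` return is a design choice here so a truncated packet fails closed instead of panicking mid-datapath.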
This isn’t a Mojo-specific complaint; I find most models have problems writing Rust and C++ at what I consider “speed of the hardware” too. In particular, they really like not doing null pointer checks, and when they do null-check a buffer they almost never vectorize it.
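Reading that last point as a scan for null/zero bytes in a buffer (my interpretation, not a quote from the original), here’s a hedged sketch of the shape that autovectorizers handle well: a branch-free OR-reduction within fixed-size chunks, rather than the element-at-a-time early-exit loop models usually emit. Whether LLVM actually emits SIMD for it depends on target and optimization level.

```rust
// Sketch: checking a buffer for any zero byte in a form the LLVM
// autovectorizer can typically turn into SIMD compares. Chunking keeps
// early exit at chunk granularity while the inner fold stays branch-free.
fn contains_zero(buf: &[u8]) -> bool {
    buf.chunks(64)
        // (b == 0) as u8 is 1 for a zero byte; OR-reduce across the chunk.
        .any(|chunk| chunk.iter().fold(0u8, |acc, &b| acc | (b == 0) as u8) != 0)
}

fn main() {
    assert!(contains_zero(&[1, 2, 0, 3]));
    assert!(!contains_zero(&[1, 2, 3]));
    assert!(!contains_zero(&[])); // empty buffer: no chunks, no match
}
```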
I probably could fix a lot of these with LLMs, but I can write the correct implementation faster than I can prod the LLM into fixing it.