ZLUDA lets you run unmodified CUDA applications on non-NVIDIA GPUs
Welcome to the newest ZLUDA update. This quarter we doubled the size of our development team (ZLUDA team doubles in size), resolved a critical regression in the AMD driver (comgr ABI break), improved the correctness of the code emitted by the compiler (Road to bit-accurate execution), set up automated builds (Automated builds on GitHub), implemented actually useful logging (Improved logging), made tiny progress on PhysX (32 bit PhysX update) and made much bigger progress on llm.c (llm.c: 0 to 552).
I am pleased to announce that ZLUDA has doubled the size of its full-time development team and now has two developers working on the project.
Welcome, Violet. They joined the project less than a month ago and have already made significant contributions. For more details, feel free to check out the llm.c: 0 to 552 section.
GPU runtimes (CUDA, ROCm/HIP, ZLUDA, OpenCL, etc.) must be able to compile GPU code at application run time. This is necessary to ensure forward compatibility: GPU code written in the past should still compile and run on new GPU architectures.
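To make the forward-compatibility point concrete, here is a minimal sketch (plain CUDA Driver API, not ZLUDA-specific code): the application ships architecture-independent PTX and the driver compiles it at run time for whatever GPU it finds. The trivial noop kernel is just a placeholder.

```c
#include <cuda.h>
#include <stdio.h>

/* PTX written years ago still loads on today's GPUs, because the driver
   (or ZLUDA) compiles it at run time for the current architecture. */
static const char *ptx =
    ".version 6.0\n"
    ".target sm_50\n"
    ".address_size 64\n"
    ".visible .entry noop() { ret; }\n";

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    /* Run-time compilation happens here, inside the driver. */
    CUresult err = cuModuleLoadData(&mod, ptx);
    printf("cuModuleLoadData: %d\n", (int)err);
    return 0;
}
```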
ROCm/HIP's runtime compilation library is comgr (ROCm-CompilerSupport). It's a small, focused library with a verbose but well-designed interface. It ships on both Linux and Windows and has been unproblematic. Until now.
ROCm/HIP 6.4 shipped with a subtle ABI break in comgr. comgr's interface is fairly generic: it consists of a handful of functions, the most important being amd_comgr_do_action. This function takes a kind parameter: the kind of LLVM action you want to perform (compilation, linking, disassembly).
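For a feel of how this looks in practice, here is a hedged sketch of a comgr compilation request (input data set setup and error handling omitted; the exact header path can differ between ROCm versions):

```c
#include <amd_comgr/amd_comgr.h>  /* amd_comgr.h on older ROCm releases */

/* Sketch: compile a source data set to LLVM bitcode. */
void compile_to_bitcode(amd_comgr_data_set_t input) {
    amd_comgr_data_set_t output;
    amd_comgr_action_info_t info;
    amd_comgr_create_data_set(&output);
    amd_comgr_create_action_info(&info);
    amd_comgr_action_info_set_language(info, AMD_COMGR_LANGUAGE_HIP);
    /* The first argument is the "kind": it alone decides whether comgr
       compiles, links or disassembles the inputs. */
    amd_comgr_do_action(AMD_COMGR_ACTION_COMPILE_SOURCE_TO_BC,
                        info, input, output);
}
```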
The problem is that in ROCm/HIP 6.4, comgr ships with a new ABI (v3) which reordered the integer values assigned to each of the actions. This meant that on the new ABI ZLUDA suddenly started asking comgr to e.g. do linking instead of compilation, which led to silent, inexplicable failures.
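To illustrate why a reorder like this is so nasty (the values below are made up for illustration, not comgr's real ones): the enumerator names stay the same, so everything still compiles, but the raw integers crossing the shared-library boundary now mean different things.

```c
/* Action values the caller was built against (illustrative numbers only). */
typedef enum { V2_ACTION_COMPILE = 0, V2_ACTION_LINK = 1 } v2_action_kind_t;

/* Values the new library actually uses after the reorder. */
typedef enum { V3_ACTION_LINK = 0, V3_ACTION_COMPILE = 1 } v3_action_kind_t;

/* The caller passes the raw integer 0 ("compile" in its headers) across the
   shared-library boundary; the new library reads 0 as "link". Nothing fails
   at the call site: the wrong action simply runs, which is why the breakage
   was silent. */
```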
On Windows the problem is even worse: AMD somehow shipped a mixture of v2 and v3. The library advertises itself as version v2.9, but actually uses ABI v3.
This was fixed in #364 and #366. If you are using another project that uses ROCm/HIP and that suddenly started failing on Linux with ROCm 6.4 or on Windows with Adrenalin 25.5.1, that's why.
The stated goal of ZLUDA is to execute unmodified CUDA binaries on non-NVIDIA GPUs. A consequence of that is that we must execute every on-GPU instruction bit-exactly or, if bit-exact execution is not possible, within the error bounds of NVIDIA cards.
Old, pre-rollback ZLUDA code cut a lot of corners here, ignoring certain instruction modifiers or not executing them with full precision.
New ZLUDA is doing a lot better on that front. We verify correctness with PTX "sweep" tests: for every relevant instruction, with every possible instruction modifier, we check that for every possible input ZLUDA produces the correct output. The project lives here, outside the main repo.
This test suite was verified against NVIDIA's original CUDA but was never actually used with ZLUDA. We have now run ZLUDA under this test suite and uncovered some bugs in the compiler, fixed here: #379. Not every instruction has gone through this process yet, but some of the trickiest cases (like the cvt instruction) are already bit-accurate.
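As a rough illustration of the sweep idea (this is not the real harness, and reference.bin is a hypothetical dump of results captured on an NVIDIA card): pick one conversion, feed it every possible input bit pattern, and compare the outputs bit-for-bit.

```c
#include <cuda_fp16.h>
#include <cstdio>
#include <cstring>

/* Apply one conversion (f16 -> f32, which lowers to a cvt instruction)
   to every input the test harness hands us. */
__global__ void cvt_f16_to_f32(const unsigned short *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __half2float(__ushort_as_half(in[i]));
}

int main() {
    const int n = 1 << 16;  /* every possible f16 bit pattern */
    unsigned short *in;
    float *out;
    cudaMallocManaged(&in, n * sizeof(*in));
    cudaMallocManaged(&out, n * sizeof(*out));
    for (int i = 0; i < n; ++i) in[i] = (unsigned short)i;

    cvt_f16_to_f32<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    /* Compare bit-for-bit against results recorded on NVIDIA hardware. */
    FILE *ref = fopen("reference.bin", "rb");  /* hypothetical golden file */
    if (!ref) return 1;
    for (int i = 0; i < n; ++i) {
        float expected;
        if (fread(&expected, sizeof(expected), 1, ref) != 1) break;
        if (memcmp(&expected, &out[i], sizeof(float)) != 0)
            printf("mismatch at input 0x%04x\n", i);
    }
    fclose(ref);
    return 0;
}
```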
Special thanks to our friends at SCALE lang, who contributed a major refactor to the test suite here.
As of #358, we now post automatic builds to GitHub. You don't have to build from source anymore if you want to try the freshest code. This is not completely finished yet, as we still want to post those builds to the "prerelease" section of the GitHub releases page for better discoverability.
The first step to enabling any CUDA application on ZLUDA (whether it's a game, a 3D suite, or an ML library) is to precisely log all the ways the application interacts with CUDA, including calls to the Dark API and calls to the performance libraries.
As of the (giant) PR #372, we now have a much better logging implementation that records interactions that were not collected previously. It can even handle intermediate interactions (e.g. it can show us when and how cuBLAS uses cuBLASLt, or when and how cuDNN uses the Driver API).
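As a toy illustration of what "logging every interaction" means (this LD_PRELOAD interposer is not how ZLUDA's logger is implemented, since ZLUDA provides the driver entry points itself; it only shows the shape of the information being collected):

```c
#define _GNU_SOURCE
#include <cuda.h>
#include <dlfcn.h>
#include <stdio.h>

/* Record every cuMemAlloc an application makes, then forward the call
   to the real driver. Build as a shared library and LD_PRELOAD it. */
CUresult cuMemAlloc_v2(CUdeviceptr *dptr, size_t bytesize) {
    static CUresult (*real)(CUdeviceptr *, size_t) = NULL;
    if (!real)
        real = (CUresult (*)(CUdeviceptr *, size_t))dlsym(RTLD_NEXT, "cuMemAlloc_v2");
    CUresult result = real(dptr, bytesize);
    fprintf(stderr, "cuMemAlloc(bytesize: %zu) -> %d\n", bytesize, (int)result);
    return result;
}
```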
A minor update on 32 bit PhysX. @Groowy from the ZLUDA Discord started poking at the first step of 32 bit PhysX support: collecting CUDA logs. This quickly uncovered bugs in ZLUDA, tracked in #374. Because some of them might affect 64 bit CUDA, this step was pulled into the official roadmap. Only this step, though: full 32 bit PhysX support will still require open source contributions.
The extensive efforts outlined in the previous paragraphs may appear random at first glance; however, they serve as stepping stones toward our first milestone: llm.c. Groundwork alone is not enough: someone must work on the workload directly.
We are focusing on the llm.c test program test_gpt2fp32cu. It's fairly small, but for ZLUDA it's the first project using the CUDA Runtime and the first project using CUDA performance libraries (cuBLAS). It makes 8186 CUDA calls to 44 different functions.
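For a sense of the pattern involved (a minimal sketch, not code taken from llm.c itself): buffers allocated through the CUDA Runtime are handed straight to cuBLAS, so ZLUDA's Runtime, Driver and cuBLAS layers all have to agree with each other.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main(void) {
    const int n = 4;
    float *a, *b, *c;
    /* Runtime API allocations... */
    cudaMalloc((void **)&a, n * n * sizeof(float));
    cudaMalloc((void **)&b, n * n * sizeof(float));
    cudaMalloc((void **)&c, n * n * sizeof(float));
    cudaMemset(a, 0, n * n * sizeof(float));
    cudaMemset(b, 0, n * n * sizeof(float));

    /* ...passed directly to a cuBLAS FP32 matmul, the kind of call the
       fp32 GPT-2 path leans on. */
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, a, n, b, n, &beta, c, n);

    cublasDestroy(handle);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```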
Violet landed a flurry of commits (#377, #380, #381, #382, #383, #386, #387, #388, #389, #390, #391, #394) for llm.c. Before, ZLUDA failed at the very first CUDA call; now it fails at the 552nd. With 16 of the 44 functions implemented, we hope to have llm.c running soon. All that work will, of course, carry over to other, more complex projects like PyTorch.
We would love to hear your thoughts in the comments below. Until next time!