ZLUDA

ZLUDA update Q1&Q2 2026 - back to the roots

2026-06-29T00:00:00+00:00

Hi, and welcome to the latest ZLUDA update! Since I skipped the last update, this special issue covers all the developments made in ZLUDA since the start of the year. We now have two new major workloads: PhysX (PhysX pre-alpha</a>) and Blender (Textures support</a>). Much of this overlaps with Much improved Windows support</a>. Additionally, there has been a steady stream of minor features and improvements to existing workloads (Better ML support</a>). These culminated in a new major release (Version 6</a>). Some of you may also be interested in The new direction of the project</a>.

Version 6</h3>
I am finally marking a new major release. As a reminder, ZLUDA follows a continuous development model. A major release does not represent the addition of any particular single feature or a compatibility break, but rather signals that significant progress has been made and that it is worth trying it out again. Version 6 is identical to the latest preview build (6-preview.79).
PhysX pre-alpha</h3>
With PC component prices as high as they are, we're all being compelled to revisit gaming classics.
ZLUDA has got you covered.
This long-running PR (#651</a>) is not yet complete, but it adds support for 32-bit PhysX. This means that, in certain older games that relied on PhysX, you will be able to achieve a higher frame rate with an AMD GPU. In some games, AMD GPU owners will also be able to enjoy additional visual effects such as debris and flame for the first time.
Various PhysX samples running on an AMD GPU:
</video> </video> </video> 
Even more interestingly, here's a screenshot from Mafia II (original 2010 version) built-in benchmark running on an AMD GPU. All settings are maxed out and PhysX is enabled:
</a> ZLUDA OFF (click image to view the full screen)
</a> ZLUDA ON (click image to view the full screen)
Support is not yet complete: fluid simulations can be glitchy, and the current method of loading ZLUDA into Steam games is poor. I have only tried it on my own PC, which has an unusual GPU setup. Nevertheless, if you are comfortable with editing the source code and building ZLUDA yourself, you can give it a try. For everyone else, I recommend watching the PR and waiting for it to be merged and included in the preview builds. Please leave your feedback in the PR or on Discord.
PCGamingWiki maintains a list of PhysX games</a>. Just be aware that the list combines 32-bit PhysX and 64-bit GameWorks. These are two completely different technologies.
Textures support</h3>
ZLUDA now has texture support (#625</a>). It's very basic and covers only a few use cases, but it's complete enough to support whatever is used by PhysX and Blender. This also means that Blender is now working on ZLUDA.
Much improved Windows support</h3>
Historically, ZLUDA's support for Windows has lagged behind its Linux support. The biggest issue has been the performance libraries (cuBLAS, cuDNN, etc). When you install ROCm on Linux you get everything in one compatible version at once (unless you explicitly opt out): userspace driver, performance libraries, monitoring libraries, and so on. On Windows, you only get the runtime driver with your GPU driver (Adrenalin). As for the rest of ROCm, well, you have to find it yourself. You can either use the outdated, officially supported ROCm SDK, or the fresh but buggy nightly builds. While ZLUDA does not solve the problem for you, it is now much more user-friendly: it explicitly tells you if you are missing a library and instructs you how to install it (#612</a>). The ZLUDA Windows loader (zluda.exe</code>) has been made more robust and now handles the loading of performance libraries automatically (instead of expecting the user to pass the right flags).
Better ML support</h3>
There's been a trickle of PRs that were driven by ZLUDA traces we've received from users of PyTorch. New instructions in #599</a>, #605</a>, #607</a>, #609</a>, #642</a>, #644</a>, #629</a>. Compiler bugfixes in #583</a>, #588</a>, #585</a>, #596</a>, #610</a>, #601</a>, #603</a>. Improvements to performance libraries in #587</a>, #615</a>, #619</a>, #620</a>, #621</a>, #624</a>. I can't analyze every trace I receive, but I try to look at as many as possible.
The new direction of the project</h3>
Some of the newly added features may come as a surprise to those of you who keep a close track of ZLUDA development. Most of them were previously explicitly outside ZLUDA's roadmap. There has been a change of plans. ZLUDA development is no longer commercially funded, so it's back to being my weekend project. This means that the priority is no longer what makes commercial sense, but what I find the most entertaining. That explains the sudden addition of textures, PhysX and better Windows support. It also means that Violet has become our first Developer Emeritus. All of this happened roughly three months ago, and ZLUDA has been my fun side project ever since. I still find it interesting, and the development continues. However, I just can't spend as much time on it, so I will probably post updates less frequently than every quarter. Still, I hope you will enjoy new versions of ZLUDA, even if they arrive less often.

ZLUDA update Q4 2025 - ROCm7, Windows, full llama.cpp and more

2026-01-13T00:00:00+00:00

Hi, and welcome to a new ZLUDA update! It's been a busy quarter, and while we didn't quite reach our goal of providing robust PyTorch support by the end of the year, we now have complete llama.cpp support (Full llama.cpp support</a>) and significantly improved Windows support (Better Windows support</a>). We've also made several other improvements in preparation for PyTorch (ZLUDA now ships with a bundled LLVM</a>, Compiler performance improvements</a>, ROCm 7 works now</a>, PyTorch support underway</a>).
ZLUDA now ships with a bundled LLVM.</h1>
Historically, ZLUDA has used the AMD-provided comgr library ("Code Object Manager API"), which is installed on your system as part of ROCm. This library is a wrapper around LLVM. It works well for the most part, but there are two caveats:
AMD LLVM is often buggy and AMD does not backport LLVM fixes. If a user has an older, "stable" version of LLVM bundled with ROCm, we cannot fix bugs in it. This is especially important for PyTorch because there are patterns in PyTorch GPU code that cause crashes in AMD GPU LLVM purely due to bugs in the AMDGPU target.</li>
We can't perform all the optimizations we want. Although we strive to emit the best possible LLVM bitcode, the ZLUDA compiler simply is not an optimizing, SSA-based compiler. There are certain optimizations relevant to machine learning workloads that are beyond our reach without custom LLVM optimization passes.</li> </ul>
We have wanted to ship a ZLUDA-patched LLVM for a long time. As part of the work on llama.cpp and PyTorch we finally did so and started shipping LLVM with ZLUDA. The ZLUDA side was merged in #555</a> with required LLVM work done here</a> and here</a>. There's one downside to our new approach: LLVM is a massive project, so building it is time-consuming. Our automatic builds are mostly unaffected because we use sccache.
However, if you are compiling ZLUDA yourself, expect much longer build times. We recommend that users who want to try work-in-progress builds download prerelease binaries from GitHub. We build binaries for every merged pull request.
ROCm 7 works now</h1>
ROCm 7 has been out for several months already, but we did not start working on it right away. This is partially because we've been busy addressing the issues described in this update and partially because ROCm version updates can introduce hard breaks that take a long time to resolve. Our goal is for ZLUDA to work equally well with both ROCm 6 and ROCm 7, so that we don't have to maintain two separate builds. Upgrading from ROCm 5 to ROCm 6 was difficult and involved many breaking changes. Granted, most of the changes were necessary and made sense, but it was still a lot of extra work. Thankfully, ROCm 7 is mostly backward compatible. Only two functions broke the ABI, but ZLUDA did not use them. They were also clearly marked as experimental. This time, the problem stems from some unfortunate packaging choices and interactions on Windows. The AMD graphics driver ships with the ROCm 5, ROCm 6 and ROCm 7 runtimes simultaneously (respectively amdhip64.dll</code>, amdhip64_6.dll</code> and amdhip64_7.dll</code>). That is fine; backwards compatibility is good. The problem is with the performance libraries. For some reason, the latest official version of the performance libraries on Windows is 6.4. This version is slightly outdated and relies on amdhip64_6.dll. This can lead to some unfortunate interactions. ZLUDA works with either amdhip64_6.dll</code> or amdhip64_7.dll</code> and prefers to load amdhip64_7.dll</code>. It's possible for ZLUDA to launch, load amdhip64_7.dll</code>, but then load a performance library that loads amdhip64_6.dll</code>. Having both ROCm 7 and ROCm 6 runtimes loaded in a process at the same time and trying to interoperate between them simply does not work and leads to mysterious crashes.
Since we can't time-travel, we settled on the next best solution: preloading performance libraries into the process and then scanning it for the presence of either amdhip64_6.dll</code> or amdhip64_7.dll</code>. See the details in #579</a>. This makes ZLUDA startup slightly slower, but should be fine. In the future, we might consider shipping the ROCm performance libraries with ZLUDA. What do you think? Too slow for katago</h1> As mentioned in the last update, we now have a mechanism for collecting execution traces that doesn't require you to be a developer. Happily, some of you provided traces of your favorite application. One of those apps was Katago, an ML-based library for playing Go. The trace revealed two things. First, the majority of the required CUDA support for Katago is already in place. Second, the missing component, cuDNN support, is on our priority list. Katago uses CUDA in a fairly straightforward manner. However, it utilizes some cuDNN features for which there is no direct ROCm support. After adding missing functionality and various workarounds, Katago finally ran! However, it ran extremely slowly, thousands of times slower than on NVIDIA GPUs. A quick investigation revealed the source of the problem: certain Katago operations have a slow, naive implementation in MIOpen and an optimized implementation in cuDNN. Unfortunately, there's nothing we can do about it. Hopefully, AMD will one day optimize those convolutions and make AMD GPUs a viable target for Katago. You can see the whole story here #541</a>. Please keep sending us traces. While we might not be able to work on each of them right away, they are important for prioritization and debugging. Better Windows support</h1> For most of modern ZLUDA's history, Windows support was an afterthought. Although we ensured that every change compiled on Windows, the loader that injects ZLUDA into an executable was in poor condition. 
In the last update, we asked you to share traces from your favorite CUDA application, and most of the traces you sent us were from Windows. Or, at least, you attempted to do so given how poorly the ZLUDA loader functioned. This forced us to face reality: we must fix the loader, or we will receive no traces. This led to a major (19100 lines added, 2377 lines removed) PR #550</a> which brought zluda.exe</code> on Windows to a more acceptable quality and allows ZLUDA to work as well as on Linux. Full llama.cpp support</h1> In the previous update, we announced the initial release of Llama.cpp support. Since then, we have started benchmarking ZLUDA's Llama.cpp support against the top models from Hugging Face. This process revealed several issues, but we are pleased to announce that we currently have full llama.cpp support. Our performance is nearly identical to that of the native ROCm backend. See the documentation here</a> for details on how to build Llama.cpp with the CUDA backend so that it works at full performance with ZLUDA. Compiler performance improvements</h1> One way PyTorch differs from other CUDA projects is that it ships with large PTX modules, which can be several megabytes in size. This poses a challenge for the ZLUDA compiler, which is tuned to produce high-quality code but not to run quickly. In particular, there was one compiler pass: instruction_mode_to_global_mode</code>, which, on large PTX modules, could account for 60% of the total compilation time. After pull request #552</a> was merged, this time decreased to less than 1%, with the majority of the time now being spent in LLVM. We have also added an experimental precompilation tool that can scan a directory, detect all CUDA binaries, and precompile all PTX modules. This tool uses all the cores on your machine, making it much faster than the CUDA runtime's one-by-one compilation. 
The effect varies by application: some applications will try to load (and compile) every possible GPU kernel, even if it's not actually being used. There are also applications that compile only the subset of kernels strictly required by a given workload. In our experience, the former category is much larger than the latter.
PyTorch support underway</h1>
We are working hard on PyTorch. It's impossible to support every PyTorch configuration and project, so we are focusing on something specific. Our primary objective is vLLM, and to that end, we have been landing multiple PRs that make ZLUDA run more and more of vLLM. See recent PRs: #580</a>, #583</a>, #585</a>, #590</a>, #596</a> and more. We don't have much PyTorch support to show yet, and we've made many significant changes, such as bundling LLVM and adding ROCm 7 support. Therefore, there is no new major release. However, we encourage you to try out the prerelease builds and report any problems you encounter. Until next time

ZLUDA update Q3 2025 - ZLUDA 5 is here

2025-10-02T00:00:00+00:00

We're happy to announce the release of ZLUDA version 5. This release brings with it new debugging tools, better correctness and preliminary support for llama.cpp. ZLUDA version 5 includes a tool called zluda_trace</code>. Community members often ask us what they can do to help with the project. One of the most impactful things you can do without needing programming skills is to run your favorite workload using zluda_trace</code> and create a bug report issue</a> with the trace attached. Just make sure you collect logs on Linux; we are not yet ready to accept logs from Windows (more here</a>). You can find more information on our Troubleshooting</a> page. And if you are interested in writing code, we have a list of issues labeled help wanted</a> on our repository.
zoc (ZLUDA offline compiler)</h1>
ZLUDA includes an NVIDIA PTX to AMD RDNA compiler.
Previously, this compiler was only accessible from the ZLUDA library - it runs when cuModuleLoadData</code> or cuLibraryLoadData</code> are called. However, for ZLUDA developers, it is useful for debugging purposes to have a command line interface as well, similar to NVIDIA's ptxas</code> tool. For this purpose, JoelleJS</a> has contributed zoc</code> in #344</a>, the ZLUDA offline compiler. This compiler takes a PTX file as input, and will output the LLVM IR generated by ZLUDA before and after linking and the RDNA assembly for your GPU generated by the ROCm compiler. We've been using this tool intensively and merged minor ergonomics improvements in #491</a> and #504</a>. Machine Learning Workloads</h1> Our focus on running machine learning inference workloads continues. In this release, we have prioritized correctness over performance, which will be an area of focus for future updates. llm.c</h2> We hit our first ML milestone. llm.c's test_gpt2fp32cu</code> and test_gptc2cu</code> now both run on ZLUDA, when built without Multi-GPU and without Flash Attention. Support for Flash Attention is right now blocked by missing APIs in MIOpen. We plan to backfill them in the future. This took a large number of commits across our host API implementation and compiler, including #402</a>, #406</a>, #412</a>, #417</a>, #409</a>, #421</a>, #427</a>, #454</a>, #463</a>, #468</a>, #496</a>, #500</a>, #501</a>, #503</a>, and #511</a>, in addition to the performance library work mentioned below. llama.cpp</h2> We hit our second ML milestone. The CUDA backend for llama.cpp can now run on ZLUDA. We've done some preliminary measurements and found the performance to be within range of the results measured by Phoronix on ROCm (Latest Open-Source AMD Improvements Allowing For Better Llama.cpp AI Performance Against Windows 11 - Phoronix</a>). We're interested in your feedback, if it doesn't work or you are getting worse performance than with ROCm, please share in the issues</a>. 
Much of the required functionality in the host API and compiler was implemented at this point, so this took relatively few commits to enable, including #509</a>, #515</a>, and #518</a>.
Initial PyTorch work</h2>
We did not yet hit our third ML milestone. We've been continuing to work on PyTorch support, which is our next big milestone. Other than work on host functions and instruction support in the compiler, we have added the zluda_ld</code> library (#447</a> and #508</a>). PyTorch uses the DT_RPATH</code> attribute in its executables to hard-code the path to the CUDA library and so ignores the LD_LIBRARY_PATH</code> environment variable we normally use to nudge applications to load ZLUDA on Unix. zluda_ld</code> can be used with the little-known LD_AUDIT</code> environment variable to get around this problem, and force loading ZLUDA. PyTorch is far from being ready: we are blocked by the slowness of our compiler, missing coverage of performance libraries (cuBLAS, cuDNN, etc) and bugs (or missing features) in the LLVM AMDGPU target. This quarter we will be focusing on all those problems - please check our prerelease builds from time to time.
Other improvements</h1>
Kernel cache</h2>
We have added a kernel caching mechanism to our PTX module loader (#465</a>). When a CUDA application loads a GPU code module, we need to extract the PTX from the fat binary provided and then compile it to machine code for the specific GPU being used. This can be a costly operation and significantly slow down runtime for a workload with many modules. We now avoid this by locally caching the machine code for kernels.
Performance libraries</h2>
We have added initial support for running applications that use cuBLAS, cuBLASLt, and nvml (#440</a>, #444</a>, #449</a>, #452</a>, #455</a>, #457</a>, #481</a>). The number of supported operations is still small, but it's set up for rapid additions. Expect to see the list of supported functions grow (and the addition of cuDNN).
More testing</h2>
We've set up CI, including running unit tests for every PR (#401</a>) and running our PTX sweep test suite nightly. This will help us prevent regressions and measure our conformance to CUDA behavior. We also have set up some initial scaffolding for testing the host API.
Prerelease builds</h2>
We've started publishing preview (prerelease) builds. Now, after every code change we automatically compile and publish the binary into the Release section on Github</a>. Try them out. There's no longer any reason to build from source yourself (unless you are a developer).
Final bit of correctness</h2>
CI improvements unlocked a flurry of compiler fixes in #416</a>, #467</a> and more. According to our testing we are now bit-accurate (we return results within CUDA-documented precision) with NVIDIA GPUs across almost all supported operations and their variants with all floating-point subnormal and rounding control modes. There are two exceptions:
32-bit floating-point square root with non-default rounding modes (which is very rare)</li>
64-bit floating-point transcendentals (division, square root, etc.)</li> </ul>
Some widely used instructions are still not supported by the compiler, but that number gets smaller every day. Overall, this is a major improvement over pre-rollback ZLUDA, which would cut corners and not always be bit-accurate.

ZLUDA update Q2 2025 - bigger team, more groundwork, less bugs

2025-07-02T00:00:00+00:00

Welcome to the newest ZLUDA update. This quarter we doubled the size of our development team (ZLUDA team doubles in size</a>), resolved a critical regression in the AMD driver (comgr ABI break</a>), improved correctness of the code emitted by the compiler (Road to bit-accurate execution</a>), set up automated builds (Automated builds on GitHub</a>), implemented actually useful logging (Improved logging</a>), made tiny progress on PhysX (32 bit PhysX update</a>) and made much bigger progress on llm.c (llm.c: 0 to 552</a>).
ZLUDA team doubles in size</h3>
I am pleased to announce that ZLUDA has doubled the size of its full-time development team and now has two developers working on the project. Welcome Violet. They joined the project less than a month ago and already made significant contributions. For more details, feel free to check out the llm.c: 0 to 552</a> section.
comgr ABI break</h3>
GPU runtimes (CUDA, ROCm/HIP, ZLUDA, OpenCL, etc.) must be able to compile GPU code at application run time. This is necessary to ensure forward compatibility: GPU code developed in the past should be able to compile on new GPU architectures. The ROCm/HIP run-time compilation library is comgr (ROCm-CompilerSupport). It's a small, focused library with a verbose, but well-designed interface. It ships on both Linux and Windows and has been unproblematic. Until now. ROCm/HIP 6.4 shipped with a subtle ABI break in comgr. comgr's interface is fairly generic: it consists of a handful of functions, with the most important being amd_comgr_do_action. This function takes a kind parameter: the kind of LLVM action you want to perform (compilation, linking, disassembly). The problem is that in ROCm/HIP 6.4, comgr ships with a new ABI (v3) which reordered the integer values assigned to each action. This meant that on the new ABI ZLUDA suddenly started requesting comgr to e.g. do linking instead of compilation, and this led to silent, unexplainable failures. On Windows the problem is even worse: AMD somehow shipped a mixture of v2 and v3. The library advertises itself as version v2.9, but actually uses ABI v3. This was fixed in #364</a> and #366</a>. If you are using some other project which uses ROCm/HIP and which suddenly started failing on Linux with ROCm 6.4 and on Windows with Adrenalin 25.5.1, that's why.
Road to bit-accurate execution</h3>
The stated goal of ZLUDA is to execute unmodified CUDA binaries on non-NVIDIA GPUs.
A consequence of that is that we must execute every on-GPU instruction bit-exactly or, if bit-exact execution is not possible, within error bounds of the NVIDIA cards. Old, pre-rollback ZLUDA code would cut a lot of corners here: it would ignore certain instruction modifiers or not execute them with full precision. New ZLUDA is doing a lot better on that front. We verify correctness with PTX "sweep" tests: for every relevant instruction with every possible instruction modifier we check that for every possible input ZLUDA produces correct output. The project lives here</a>, outside of the main repo. This test suite was verified against NVIDIA's original CUDA but was never actually used with ZLUDA. We have now run ZLUDA under this test suite and uncovered some bugs in the compiler, fixed here: #379</a>. Not every instruction went through this process yet, but already some of the trickiest cases (like the cvt instruction) are bit-accurate. Special thanks to our friends at SCALE lang</a>, who contributed a major refactor to the test suite here</a>.
Automated builds on GitHub</h3>
As of #358</a>, we now post automatic builds to GitHub. You don't have to build from source anymore if you want to try the freshest code. This is not completely finished yet, as we want to post those builds into the "prerelease" section on the GitHub page for better discoverability.
Improved logging</h3>
The first step to enabling any CUDA application on ZLUDA (no matter if it's a game, 3D suite, ML library) is to precisely log all the ways an application interacts with CUDA, including calls to Dark API and calls to performance libraries. As of the (giant) PR #372</a> we now have a much better logging implementation that logs interactions that were not collected previously. It can even handle intermediate interactions (e.g. can show us when and how cuBLAS uses cuBLASLt or when and how cuDNN uses Driver API).
32 bit PhysX update</h3>
Minor update on the 32 bit PhysX.
@Groowy from ZLUDA Discord started poking at the first step of 32 bit PhysX support: collecting CUDA logs. This quickly uncovered bugs in ZLUDA, tracked in #374. Because some of them might affect 64 bit CUDA, this step was pulled into the official roadmap. Only this step, though: full 32 bit PhysX support will still require open source contributions.
llm.c: 0 to 552</h3>
The extensive effort outlined in the previous paragraphs may appear random at first glance; however, it serves as a stepping stone toward our first milestone: llm.c. Just the groundwork is not enough; someone must work on the workload directly. We are focusing on the llm.c test project: test_gpt2fp32cu. It's fairly small, but for ZLUDA it's the first project using CUDA Runtime and the first project using CUDA performance libraries (cuBLAS). It does 8186 CUDA calls to 44 different functions. Violet landed a flurry of commits (#377</a>, #380</a>, #381</a>, #382</a>, #383</a>, #386</a>, #387</a>, #388</a>, #389</a>, #390</a>, #391</a>, #394</a>) for llm.c. Before, ZLUDA failed at the very first CUDA call and now it fails at the 552nd. With 16 of 44 functions implemented we hope to have llm.c running soon. All that work of course will apply to other, more complex projects, like PyTorch. We would love to hear your thoughts in the comments below. Until next time!

ZLUDA update Q1 2025 - roadmap update, LLVM tests, denormals

2025-04-03T00:00:00+00:00

Welcome to the new ZLUDA update. Read about our plans for the nearest future (that include PyTorch and PhysX) in Roadmap update</a> and about progress made this quarter in LLVM bitcode unit tests</a> and Correct rounding and denormal modes on AMD GPUs</a>.
Roadmap update</h3>
PyTorch</h4>
PyTorch remains my top priority and I still aim to have PyTorch running on ZLUDA in Q3/Q4 this year. Before PyTorch is up and running I am aiming for an intermediate goal: llm.c. You can see the progress towards getting llm.c up and running here</a>.
PhysX</h4> As you might have read here</a>, here</a> and on multiple other sites, NVIDIA dropped support for 32-bit PhysX in their latest generation of GPUs, leaving a number of older games stranded. This reignited the debate about ZLUDA’s PhysX support</a>. After reading through it several times, it’s clear to me that there is a path in ZLUDA to rescuing those games and getting them to run on both AMD and NVIDIA GPUs. I broke down the implementation into tasks here</a>. If you can program Rust and want to make a lot of people happy, I encourage you to contribute. I won't be able to work on it myself because I'll be busy with PyTorch support, but I'll help in any way I can. LLVM bitcode unit tests</h3> The ZLUDA compiler is the cornerstone of the project. It processes PTX modules by applying a series of transformations, ultimately generating LLVM bitcode. This LLVM bitcode is subsequently fed into the installed ROCm/HIP driver, which compiles it into a binary suitable for the currently installed GPU. The compiler codebase includes multiple unit tests. Each test asserts that for: given PTX source code</li> given input data</li> given output data</li> </ul> It can compile successfully and execute compiled binary with input data and produce the output data. While this covers the entire end-to-end flow, there is a valuable sub-flow hiding here that could be tested too: the compilation from PTX to the LLVM bitcode. For each PTX source module, we could commit the compiled LLVM bitcode in a textual format and implement tests to ensure it remains unchanged. This approach is particularly useful for newly written complex compiler transformations that modify the emitted LLVM across the board. By using LLVM bitcode tests, you can observe how your modifications impact LLVM generation across various use cases, even those you might assume are unrelated. 
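The idea of committing the emitted LLVM text and checking that it stays unchanged can be sketched as a small "golden file" comparison. This is only an illustration in C++ and not ZLUDA's actual test harness (ZLUDA itself is written in Rust); the function name `first_mismatch_line` is hypothetical, and loading the committed file and running the compiler are left out.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Compares the committed (expected) compiler output with freshly emitted text.
// Returns the 1-based number of the first line that differs, or std::nullopt
// when every line matches. A trailing newline difference is ignored.
std::optional<int> first_mismatch_line(const std::string& expected,
                                       const std::string& actual) {
    std::istringstream expected_stream(expected);
    std::istringstream actual_stream(actual);
    std::string expected_line;
    std::string actual_line;
    for (int line = 1;; ++line) {
        const bool has_expected =
            static_cast<bool>(std::getline(expected_stream, expected_line));
        const bool has_actual =
            static_cast<bool>(std::getline(actual_stream, actual_line));
        if (!has_expected && !has_actual) {
            return std::nullopt;  // both texts ended without a difference
        }
        if (has_expected != has_actual || expected_line != actual_line) {
            return line;  // one text is shorter, or the line content differs
        }
    }
}
```

A test would fail whenever this returns a value, and the reported line number points the developer at the first place where a compiler change altered the emitted code.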
This feature sat on the "help wanted" list for quite some time and I’m happy to see the first external contributor address this issue. JoelleJS merged it in #324</a>. Just in time for a significant feature that will use these tests.
Correct rounding and denormal modes on AMD GPUs</h3>
This is an important feature that I have wanted to do for years. It is not present even in the old (pre-rollback) ZLUDA. The priority was always given to enabling new workloads, instead of making everything perfectly correct. Now we are out of proof-of-concept mode and can spend some time on correctness. As you will read below, it is a complex feature that is quite often invisible to the end user. It was acceptable for old ZLUDA to do things incorrectly.
Warning
The remainder of this article assumes you know what PTX, floating-point numbers, control flow graphs, and basic blocks are. You don't need to be an expert, but a lack of familiarity with these concepts will make everything below incomprehensible. </blockquote>
If you know what floating-point denormals and rounding modes are you can skip to the next section (Previously on "ZLUDA" ...</a>). First, some definitions. What exactly is denormal mode, and what are denormal numbers? Denormals (subnormals) represent a category of very small floating-point values. For the most common floating point size (32 bit), these values fall within the range of roughly -1.18×10^-38 to 1.18×10^-38 (excluding 0). Due to the encoding of floating-point numbers, this category necessitates additional processing and has historically been either unsupported or supported with reduced performance. When we say "unsupported," it means that denormal values are treated as zeros. In the context of PTX, denormal mode refers to a flag (.ftz</code>) on floating-point instructions that determines whether they process denormal values or treat them as zeros, "flushing to zero."
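To make these definitions concrete, here is a small host-side C++ sketch of what flushing to zero means for 32-bit floats. It is not ZLUDA code: the helper names `is_denormal`, `flush_to_zero` and `add_ftz` are made up for illustration, and the model assumes the PTX `.ftz` behaviour of flushing both denormal inputs and denormal results to a sign-preserving zero. It also assumes the host itself runs with denormals enabled, which is the usual default.

```cpp
#include <cmath>
#include <limits>

// True when x is a nonzero value too small to be a normal 32-bit float.
bool is_denormal(float x) {
    return std::fpclassify(x) == FP_SUBNORMAL;
}

// Replaces a denormal with a zero of the same sign; other values pass through.
float flush_to_zero(float x) {
    return is_denormal(x) ? std::copysign(0.0f, x) : x;
}

// Software model of a flush-to-zero addition:
// flush both inputs, add, then flush the result.
float add_ftz(float a, float b) {
    return flush_to_zero(flush_to_zero(a) + flush_to_zero(b));
}
```

For example, with `m = std::numeric_limits<float>::min()` (the smallest normal float), the plain sum `1.5f * m + (-m)` is the denormal `0.5f * m`, while `add_ftz(1.5f * m, -m)` yields zero.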
In general, modern, mainstream hardware architectures can handle basic operations - add, multiply, fused multiply add, etc. - with denormal values at full speed. Now rounding mode. Most of the "simple" floating-point operations are formally defined as "performs the operation with infinite precision and then rounds the infinitely precise value to a finite value using the chosen mode". The usual rounding modes are "round to nearest even", "round to zero", "round to positive infinity", "round to negative infinity". Rounding mode effectively controls the least-significant bit of the mantissa of the floating-point result. Although a single least-significant bit may seem insignificant, it can have a noticeable impact. For instance, consider two values that differ only by the least significant bit: 1.0000000 and 1.0000001. In certain contexts, the difference of 0.0000001 can be substantial. Now that we understand the denormal and rounding part, let's focus on the mode part. Typically, CPUs will do some mix of integer calculations and floating-point calculations, with the specific proportions varying based on the workload. In contrast, GPUs, regardless of whether they are tailored for gaming, high-performance computing (HPC), or machine learning, primarily dedicate their processing cycles to floating-point operations. This focus prompts GPU architects to prioritise floating-point support in their hardware designs. One notable feature found in NVIDIA hardware, and consequently in PTX, is the per-instruction control for denormal and rounding operations. In a CPU, a common approach to managing this issue is to implement a global control (as seen in x86 and ARM architectures) or to forgo denormal control altogether (as in RISC-V). While per-instruction control is beneficial for programmers, it presents unique challenges for ZLUDA when translating to an AMD GPU which uses global control (like a CPU).
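The "global control" style is easy to try on a CPU with the standard `<cfenv>` interface. The sketch below is a host-side illustration only, not GPU or ZLUDA code; `add_with_rounding` is a hypothetical helper that emulates a per-instruction rounding mode by switching the global mode, adding, and switching back. The `volatile` variables are there so the compiler performs the addition at run time, between the two mode changes.

```cpp
#include <cfenv>

// Emulates a per-instruction rounding mode on hardware with a global mode:
// save the current mode, switch, perform one addition, then restore.
// `mode` is one of FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD.
float add_with_rounding(float a, float b, int mode) {
    const int saved_mode = std::fegetround();
    std::fesetround(mode);
    volatile float lhs = a;             // volatile: no compile-time folding
    volatile float rhs = b;
    volatile float result = lhs + rhs;  // stored before the mode is restored
    std::fesetround(saved_mode);
    return result;
}
```

The exact sum of `1.0f` and `0x1p-25f` is not representable as a 32-bit float. With `FE_TONEAREST`, `FE_TOWARDZERO` or `FE_DOWNWARD` the helper returns `1.0f`, while `FE_UPWARD` returns the next float above one (`0x1.000002p0f`), which is precisely a difference in the least-significant bit of the mantissa.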
Previously on "ZLUDA" ...</h4>
Pre-rollback ZLUDA used the simplest possible approach that almost works:
For denormal mode (which is either "flush-to-zero" or "preserve denormals") hold a "vote" for each function. Count the number of instructions using each mode and then just use the more prolific mode across the whole function</li>
For rounding mode, ignore it completely and always use "round to nearest even"</li>
</ul>
A PTX module compiled from C++ CUDA sources will usually use the same denormal mode across the whole module, with the particular mode depending on the compiler flags. Rounding mode use is somewhat uncommon. Sure, this approach is not correct, but it worked okay-ish and it led to only a single major bug (that I’ve noticed). Still, ZLUDA is now out of proof-of-concept mode and we are now doing things correctly.

Dead end #1: LLVM & HIP/ROCm</h4>
When implementing a new compiler feature in ZLUDA, the first step is to check if it's implemented by the baseline LLVM. Perfect LLVM support would allow ZLUDA to do a trivial per-instruction transformation like this:

from (PTX pseudocode):
z = add.ftz x, y
a = add.ftz b, c</code></pre>
to (LLVM pseudocode):
old_fpstate1 = llvm.get_fpstate()
llvm.set_ftz(true)
z = add x, y
llvm.set_fpstate(old_fpstate1)
old_fpstate2 = llvm.get_fpstate()
llvm.set_ftz(true)
a = add b, c
llvm.set_fpstate(old_fpstate2)</code></pre>
and have LLVM optimize that to (AMD GPU assembler pseudocode):
S_DENORM_MODE flush, flush
V_ADD_F32 z, x, y
V_ADD_F32 a, b, c</code></pre>
The initial research on LLVM floating-point builtins appeared promising, as this collection of intrinsics seemed to address our specific use case:
llvm.get.fpenv/llvm.set.fpenv</code></li>
llvm.get.fpmode/llvm.set.fpmode</code></li>
llvm.experimental.*</code> family</li>
</ul>
Sadly, they are all deficient in some way. They either compile down to poor, unoptimized AMD GPU code or do not work at all.
Granted, llvm.experimental.*</code> support is being worked on by AMD and should appear in future ROCm versions, but this does not help us today.

This raises the question: in CUDA C++ you have a bunch of builtins to do operations with a specified rounding mode, e.g. __fadd_rz</code> for floating-point addition with "round-to-zero" mode. What happens on ROCm? Further exploration revealed (source</a>):

Only the nearest-even rounding mode is supported by default on AMD GPUs. The _rz</code>, _ru</code>, and _rd</code> suffixed intrinsic functions exist in the HIP AMD backend if the OCML_BASIC_ROUNDED_OPERATIONS</code> macro is defined. </blockquote>

Ok. You can use those functions, but they are hidden behind a define. That’s weird. Time to try it! The HIP/ROCm source code:
#include <hip/hip_runtime.h>

__global__ void foobar(int* array, int n) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    array[tid] = __fadd_rz(array[tid], array[tid]);
    array[tid+1] = __fadd_rz(array[tid+1], array[tid+1]);
}</code></pre>
When compiled with ROCm 6.3, it comes out as this (some output omitted for clarity):
0000000000001900 <__ocml_add_rtz_f32>:
    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
    s_setreg_imm32_b32 hwreg(HW_REG_MODE, 0, 2), 3
    v_add_f32_e32 v0, v0, v1
    s_setreg_imm32_b32 hwreg(HW_REG_MODE, 0, 2), 0
    s_setpc_b64 s[30:31]
...
0000000000001a00 <_Z6foobarPii>:
    ...
    s_getpc_b64 s[0:1]
    s_add_u32 s0, s0, 0xfffffeac
    s_addc_u32 s1, s1, -1
    ...
    s_swappc_b64 s[30:31], s[0:1]
    v_cvt_f32_i32_e32 v1, v5
    v_cvt_i32_f32_e32 v4, v0
    v_mov_b32_e32 v0, v1
    s_swappc_b64 s[30:31], s[0:1]
    s_delay_alu instid0(VALU_DEP_1)
    v_cvt_i32_f32_e32 v5, v0
    global_store_b64 v[2:3], v[4:5], off
    s_endpgm</code></pre>
Ok, mystery solved. I, too, would like to hide this compiler output. For those of us who are not proficient in AMD GPU assembly: every use of __fadd_rz</code> requires a function call (s_swappc_b64</code>, expensive) and two instructions to set and then reset the rounding mode (s_setreg_imm32_b32</code>, also expensive).
This is simply too much overhead to be acceptable. We are going to build our own support. Our goal, for the code above, is a single instruction to set the rounding mode (or even zero instructions, as we will see later).

Building support in ZLUDA</h4>
Our new goal is to write a complete transformation (compiler pass) in ZLUDA that will insert instructions that set the global modes (rounding and denormal). We want to insert as few instructions as possible for the best possible performance - there’s no LLVM pass that is going to optimize the insertions for us.

Let’s take half a step back. We know that the trivial (and slow) approach is to simply set the global mode before every instruction that makes use of a mode. It can be improved by omitting the mode-setting instruction if we know that the previous instruction uses the same mode. We can always track this in straight-line code, but what happens if there are branches? What happens if there are multiple branches from different sources into the same target? It seems to be sufficient to figure out which branches require a mode change and which do not.

This leads us to a reformulation. We can express this problem as a control flow graph augmented with a little bit of extra information: for each mode (denormal, rounding) each node (basic block) will have an "entry" mode and an "exit" mode. The entry mode of a basic block is the mode of the first mode-using instruction in the basic block. Similarly, the exit mode is the mode of the last mode-using instruction. This simplifies the problem quite a bit. We must now compute which edges (jumps) in the control flow graph require the insertion of a mode change. For illustrative purposes we will only consider a mode that takes two values: true (green) and false (red).
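The straight-line part of this is easy to sketch. The Python below is a simplified illustration, not ZLUDA's implementation: the instruction representation (a name plus an optional boolean mode) and the function names are assumptions made for this example. It computes the entry and exit mode of a basic block and emits a mode change only when the tracked mode actually changes.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (instruction text, mode it needs).
# The mode is None when the instruction does not depend on the global mode.
Instr = Tuple[str, Optional[bool]]


def entry_exit(block: List[Instr]) -> Tuple[Optional[bool], Optional[bool]]:
    """Entry mode = mode of the first mode-using instruction, exit = last."""
    modes = [mode for _, mode in block if mode is not None]
    if not modes:
        return None, None  # an "empty" block: it has neither mode
    return modes[0], modes[-1]


def insert_in_block(block: List[Instr], incoming: Optional[bool]) -> List[str]:
    """Emit the block, adding set_mode only when the mode really changes.

    `incoming` is the mode known to be active on entry (None if unknown).
    """
    out: List[str] = []
    current = incoming
    for name, mode in block:
        if mode is not None and mode != current:
            out.append(f"set_mode {mode}")
            current = mode
        out.append(name)
    return out


block = [("add.ftz", True), ("mov", None), ("mul.ftz", True), ("add", False)]
print(entry_exit(block))             # (True, False)
print(insert_in_block(block, True))  # a single set_mode, before the last add
```

Inside a block one pass with a "current mode" variable is enough. The hard part, covered next, is knowing what `incoming` is when several branches lead into the same block.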
The picture below shows node A, which has a "true" entry mode and a "false" exit mode, jumping to node B, which has a "true" entry mode and a "false" exit mode:

Dead end #2: mode forward propagation</h4>
Something I did not mention explicitly, but which is important: some nodes lack both entry and exit modes. Consider the following example: There’s no need to insert a mode-setting instruction - node B will propagate the "false" value. But in this example: we need to insert a mode change from "false" to "true" somewhere between nodes A and C.

My first instinct was to propagate modes forward: for each node, propagate its exit mode to all its successor nodes. While it is intuitively correct and solves the two examples above, there are two problems:
It’s relatively awkward to implement. Remember, a node can have more than one predecessor node. What happens if there is a node with an empty incoming edge and a "true" incoming edge? Should we do post-processing? </li>
More concretely, this does not really handle codependence patterns like this: In this example node A can’t propagate its mode to B or C outright because they have more incoming edges. B and C can’t propagate their mode either because they have no mode - they depend on A. </li>
</ul>

Better approach: backward propagation</h4>
Dependency problems from the previous solution hint at a better approach: backward propagation. Instead of propagating the exit mode, we can compute the set of incoming modes. This set is the set of all possible values a given mode can have on the first instruction of the basic block. It sounds complex, but it can be computed easily if you have our augmented control flow graph. Take all incoming nodes: if an incoming node’s exit mode is non-empty then add that value to the set, and if the incoming node’s exit mode is empty then recursively check its incoming nodes.

We now have the core of our algorithm, but it’s not a complete solution yet: the realities of AMD GPU hardware make it far more complex.
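The backward walk described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the code from ZLUDA: the graph is given as plain dictionaries (`preds` maps a node to its predecessors, `exit_mode` maps a node to its exit mode or None), and the example graph is made up to resemble the codependence pattern, with mode-less nodes B and C jumping to each other.

```python
from typing import Dict, List, Optional, Set


def incoming_modes(
    node: str,
    preds: Dict[str, List[str]],
    exit_mode: Dict[str, Optional[bool]],
) -> Set[bool]:
    """All values the mode can have on entry to `node`.

    Predecessors with an exit mode contribute it directly. Predecessors
    without one are looked through, to their own predecessors.
    """
    result: Set[bool] = set()
    visited: Set[str] = set()  # guards against cycles of mode-less nodes
    stack = list(preds.get(node, []))
    while stack:
        pred = stack.pop()
        if pred in visited:
            continue
        visited.add(pred)
        mode = exit_mode.get(pred)
        if mode is not None:
            result.add(mode)
        else:
            stack.extend(preds.get(pred, []))
    return result


# Hypothetical graph: A -> B, A -> C, B <-> C, B -> D.
# A exits with "false", B and C use no mode at all, D needs "true".
preds = {"B": ["A", "C"], "C": ["A", "B"], "D": ["B"]}
exit_mode = {"A": False, "B": None, "C": None, "D": True}

print(incoming_modes("D", preds, exit_mode))  # {False}
```

Because D's only possible incoming value is "false" while its first mode-using instruction needs "true", a mode change is required on the way into D. The B/C cycle that blocks forward propagation is handled here by the `visited` set.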
Hardware quirks</h4>
When targeting AMD GPUs, there are several hardware properties that we should take into account:
A kernel, on startup, has a certain initial state that is controlled by the programmer (or the compiler in our case). Part of the initial state is the initial state of the denormal and rounding registers (global modes). We get this initial mode for free, no extra instructions needed </li>
Each mode (denormal and rounding) is actually split into two registers (global modes). One for f32 and one for joint f16 and f64. In total there are four registers: denormal f32, denormal f16+f64, rounding f32, rounding f16+f64 </li>
</ul>
Registers (global modes) of the same kind (denormal, rounding), but with different width (f32, f16+f64), are for our purposes twin registers. One quirk of AMD GPUs is that there are three instructions for setting a global mode: S_SETREG</code> to set any hardware (non-general-purpose) register, and S_ROUND_MODE</code>, S_DENORM_MODE</code> to set just the rounding or denormal mode. S_ROUND_MODE</code>, S_DENORM_MODE</code> are much cheaper than S_SETREG</code>. The annoying limitation of S_ROUND_MODE</code>, S_DENORM_MODE</code> is that they can only set both twin registers (f32 and f16+f64) at the same time. For this reason we will only do mode insertions for both f32 and f16+f64 together.

Final algorithm</h4>
If you made it this far, congratulations, you made it through the introduction. Now we can start implementing our algorithm.

Create control flow graph</h5>
Our first step is to compute the control flow graph. Every basic block contains an entry and an exit mode. For efficiency each node actually contains four entry modes and four exit modes, one for each AMD GPU mode: denormal f32, denormal f16+f64, rounding f32, rounding f16+f64. We handle function calls by including them in the graph. A call from function "foo" to function "bar" is expressed as an edge from the calling basic block of "foo" to the first basic block of "bar".
We don’t support virtual calls in the current ZLUDA, because they are extremely rare. They can be easily added later. During this step we compute both the entry and the exit mode for each basic block.

Additionally, each kernel starts with an artificial starting node. This node gets a special "entry" and "exit" value: the numeric identifier of the kernel. This numeric identifier is used across the whole ZLUDA compiler. It is already present (generated by previous compiler passes) and unique for a kernel. For example: while the denormal register can take one of two values, true or false, in our CFG the values that represent the denormal mode can be true, false, or the arbitrary numeric id of a kernel. While going from a bounded to an unbounded set does not intuitively sound like a good decision, it’s temporary. We will optimize it back to the bounded set soon.

Compute minimal insertions</h5>
Our next goal is, for each of the four modes, to compute the minimal set of insertions. In other words: figure out which basic blocks can be reached with a different mode than the one expected by their first instruction. We do this computation for each of the four modes separately. We start by computing two sets: required insertions and potential insertions. We choose nodes which have an entry mode (we skip the nodes with an empty entry mode and kernel nodes with numeric ids). Then, for each node, we compute the set of incoming modes:
If the set contains a value that is different from the node’s entry mode then we add the node to required insertions</li>
If the set of incoming modes is purely a set of kernel numeric ids (with no conflicting specific mode values) then we add the node id along with its mode and kernel ids to the potential insertions</li>
</ul>
Required insertions are set in stone: if we jump from another node with a different mode then we must insert a mode-setting instruction.
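As a rough illustration of this classification, here is a small Python sketch. It is not ZLUDA's code: the `KernelId` wrapper, the dictionary-based inputs, and the `classify` function are all assumptions made for this example, and it reads "no conflicting specific mode values" as "every concrete incoming value equals the node's entry mode".

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple, Union


@dataclass(frozen=True)
class KernelId:
    """Placeholder value for 'whatever initial mode this kernel starts with'."""
    id: int


Mode = Union[bool, KernelId]


def classify(
    entry_mode: Dict[str, Optional[bool]],
    incoming: Dict[str, Set[Mode]],
) -> Tuple[Set[str], Dict[str, Tuple[bool, Set[KernelId]]]]:
    """Split nodes into required and potential insertions for one mode."""
    required: Set[str] = set()
    potential: Dict[str, Tuple[bool, Set[KernelId]]] = {}
    for node, entry in entry_mode.items():
        if entry is None:
            continue  # the node does not care about this mode
        seen = incoming.get(node, set())
        concrete = {m for m in seen if not isinstance(m, KernelId)}
        kernels = {m for m in seen if isinstance(m, KernelId)}
        if any(m != entry for m in concrete):
            required.add(node)  # some path arrives with the wrong mode
        elif kernels:
            # Only kernel start-up state can disagree with the entry mode.
            potential[node] = (entry, kernels)
    return required, potential


entry_mode = {"asdf": True, "loop": False, "tail": True, "nop": None}
incoming = {
    "asdf": {KernelId(1), KernelId(2)},  # reached straight from two kernels
    "loop": {True, KernelId(1)},         # reached with a conflicting "true"
    "tail": {True},                      # always reached with the right mode
}
required, potential = classify(entry_mode, incoming)
print(required)   # {'loop'}
print(potential)  # 'asdf' with its mode and the two kernel ids
```

Nodes such as `tail`, whose incoming set already agrees with their entry mode, end up in neither set and need no instruction at all.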
Potential insertions, on the other hand, can be omitted: for a given node, if all the related kernels have the same initial value as the node then we can skip the mode-setting instruction. E.g. if we have kernels "foo" and "bar" that both call function "asdf" and the entry mode of "asdf" is "true", then we should set the initial mode for "foo" and "bar" to "true" and avoid inserting additional mode-setting instructions. The problem is easy to solve in the example above, but the general case is not trivial. I could not come up with a non-brute-force algorithm and opted to encode the problem as an integer linear programming problem and use an external solver. This excellent post</a> helped encode my constraints. As for the solver, I went with microlp</a>, mainly because it’s a relatively small dependency. I wanted to avoid dragging something big like SCIP or even Z3 into the project. Our problem sizes are not going to be big. PTX modules tend to have a handful of kernels and simple control flow.

Compute full insertions</h5>
Now we have:
Provisional control flow graph (with some nodes empty and kernel starting nodes containing numeric ids instead of specific values)</li>
List of nodes that require a mode change on entry (if the incoming mode is different - there might be multiple nodes incoming, each with their own mode)</li>
For kernels that were subject to optimization in the previous step: their initial state</li>
</ul>
We are almost ready to start inserting S_ROUND_MODE</code> and S_DENORM_MODE</code>. We have all the necessary information, we just need to do some more preprocessing. Specifically, we need to know two things:
What is the effective entry mode for each block. Note that even though mode instructions are inserted along edges in the CFG (jumps in code), we don’t explicitly store edges. That’s because when inserting mode-setting instructions in a basic block we will implicitly calculate the exit mode anyway.
And since we know the identifier of the block we jump into, as long as we know the modes of our jump target we know if they are different and, in consequence, if the jump requires a mode change </li>
What is the exit mode for a function. This is necessary because function calls are mechanically different from normal jumps. Function calls terminate a basic block and we need to know if the new basic block starting from the first post-call instruction requires a mode change. Since a function can be called from many places, it is the responsibility of the caller to do post-call mode adjustments (if necessary) </li>
</ul>
Computing both of those is relatively straightforward. First, we take our incomplete control flow graph and resolve all empty nodes and special kernel nodes. For empty nodes we compute the incoming set - if the set contains more than a single value, we use a special value "conflict". For special starting kernel nodes we have a list of kernels with their initial values from the previous optimization pass. Lastly, we join the four separate logical CFGs (one for each AMD GPU mode) into two lookup tables. One lookup table contains all the necessary information to support mode changes for branches, the other contains all the necessary information to support mode changes for function calls.

Apply mode control</h5>
In this stage we walk through every function (kernel and non-kernel) and modify it accordingly:
If necessary, insert a mode change "prelude" basic block before each basic block</li>
If necessary, redirect branches to go into the mode change "prelude"</li>
Insert all mode changes inside a basic block. We fold twin registers together.
For example, pseudocode like this:
add.ftz.f32 a, b, c;
add.no_ftz.f16 x, y, z;</code></pre>
gets converted into this pseudocode:
set_denormal.f32.f16 ftz, no_ftz;
add.f32 a, b, c;
add.f16 x, y, z;</code></pre></li>
</ul>
After all this hard work we now get a new module with a small number of freshly inserted mode change instructions. It’s not optimal in the absolute sense, but it’s much better than the alternatives. The AMD GPU code is now as correct as we can make it. Unfortunately, even after all this hard work, the final result can still be miscomputed in some cases. Read below for more.

LLVM sadness</h4>
Sadly, there are still some issues outside of our control.

Firstly, a minor issue. As mentioned previously, for each AMD GPU kernel we can set the initial denormal mode and the initial rounding mode. This is true in the general sense, but for some reason the LLVM AMDGPU backend exposes the control for the initial denormal mode, but not for the initial rounding mode. Right now, we set the initial rounding mode by inserting the instruction for it at the start of the kernel. We could skip this single instruction with better LLVM AMD GPU support.

Secondly, a bigger issue. Hardware-agnostic LLVM passes don’t understand AMD GPU instructions that set global state. So this pseudocode:
set_denormal.f32.f16 ftz, ftz;
add.f32 x, b, c;
set_denormal.f32.f16 no_ftz, no_ftz;
add.f32 y, b, c;</code></pre>
after LLVM optimizations ends up as:
set_denormal.f32.f16 ftz, ftz;
add.f32 x, b, c;
mov.f32 y, x;</code></pre>
This gives an incorrect result. While it’s rare to see the same input being computed twice with different modes, it’s concerning. Fixing this would require deeper changes in LLVM (making the mode part of the instruction, like in llvm.experimental.constrained.*) and probably porting this pass to LLVM. We might eventually do it, but that’s enough effort for now.

If you made it this far, let me know in the comments what you think. See you next time.
ZLUDA update Q4 2024

2024-12-31T00:00:00+00:00

Hello everyone, it's the first of many ZLUDA updates. I've been working hard and I'm happy to announce that we reached the first milestone: we have a new version of ZLUDA with an actual working application. ZLUDA can run Geekbench 5. This update also includes a few words on how to contribute (Contributing to ZLUDA</a>) and changes in the internals of the "new" ZLUDA (New parser</a>, Atomics modulo</a>).

Geekbench 5</h3>
While Geekbench is far from being the most requested application, it's important for ZLUDA's development:
It uses a relatively small CUDA API surface, which makes it easy for ZLUDA to support (at least easy when compared to Blender or PyTorch).</li>
It's closed-source, so it's not possible to port it to HIP (via HIPIFY or other means).</li>
It has both a generic OpenCL backend and an NVIDIA-specific CUDA backend, so we can measure the performance gain when using ZLUDA.</li>
</ul>
The "old" ZLUDA was about 1% faster than the native OpenCL. I was worried that the fresh new code would be slow, but the "new" ZLUDA turned out to be even better than the "old" one and is approximately 10% faster than the native OpenCL. Note that this performance improvement is Geekbench specific and not generalizable. Still, I'm happy with how things turned out. If you are interested in the technical details, read the Atomics modulo</a> section down below.

(The graphs below show slightly inconsistent results because the top graph uses previously collected numbers for OpenCL and ZLUDA 3, while the bottom graph uses freshly collected numbers for OpenCL)

Next on the roadmap is llm.c.

Contributing to ZLUDA</h3>
I regularly get questions about how to contribute to ZLUDA. Here's how (this information is now also in the project's README): The ZLUDA project has commercial backing and does not accept donations. The ZLUDA project accepts pull requests and other non-monetary contributions.
If you want to contribute a code fix or documentation update, feel free to open a Pull Request. There's no architecture document (yet). The two most important crates in ZLUDA are ptx</code> (PTX compiler) and zluda</code> (AMD GPU runtime). A good starting point for tinkering with the project is to run one of the ptx unit tests under a debugger and understand what it is doing. cargo test -p ptx -- ::add_hip</code> is a simple test that adds two numbers. GitHub issues tagged with "help wanted"</a> are tasks that are self-contained. Their level of difficulty varies and they are not always good beginner tasks, but they are defined unambiguously. If you have questions, feel free to ask on the #devtalk channel on Discord</a>.

New parser</h3>
This is the first time I've written an extensive write-up about an issue like this and I'm curious to know what you think. Is this too detailed? Not detailed enough? Should all issues be broken down like this? Leave a comment.

Commit 193eb29</a> finally brought a major feature that solves one of the least visible and hardest to fix problems in ZLUDA. First, you need to understand what PTX is. PTX is the NVIDIA GPU intermediate language. Intermediate languages work like this:
Programmer writes source code</li>
Programmer compiles their source code into an intermediate language X and sends it to the user</li>
User runs the application. At some point, the intermediate code X is compiled (finalized) into binary for the user's particular hardware</li>
</ul>
Intermediate languages are a fairly common solution: Java has JVM bytecode, .NET has CIL, gaming GPUs have SPIR-V, LLVM has LLVM IR. They all solve slightly different problems, but in the GPU context they are used to avoid the forward compatibility problem. That's why GPU code written ten years ago works just fine on modern GPUs even though your GPU vendor has made major changes to its GPU architecture. What if your software stack does not have an intermediate language?
Then either:
You declare your hardware to be strictly forward-compatible. All changes are strictly additive: code compiled for older hardware will work on the newer hardware, but will not be able to take advantage of the new hardware features. This is what the x86 CPU family does</li>
You simply ignore forward compatibility and compile from scratch for each new hardware target. This is the AMD GPU way</li>
</ul>
The CUDA driver ships with a compiler that compiles (finalizes) from PTX to the particular NVIDIA GPU architecture and of course ZLUDA does the same, but for AMD GPUs. The compilation itself is divided into several steps and the first step is parsing: converting from the textual representation (PTX is a text format) to an in-memory representation. PTX, being a language, follows certain grammatical rules. For example, this line:
ld.global.cs.b32 r1, [addr1];</code></pre>
means "load (ld</code>) from global address space (.global</code>) with streaming cache behavior (cs</code>) 32-bit integer (.b32</code>) into variable r1</code> from address stored in variable addr1</code>". You don't need to understand what all this means, just that there is an order to the words in an instruction: operation, operands, registers. If the same instruction were written this way, it would violate the grammar rules and result in an error:
ld r1, [addr1] .global.cs.b32;</code></pre>
Writing a PTX parser is not hard. As long as you are familiar with a parser generator you can get a high-quality parser working relatively quickly and painlessly. ZLUDA used lalrpop</a> for this task.

It turns out that there is an important undocumented "feature" of the PTX language. Although the documentation lays out a certain language grammar and the NVIDIA PTX-generating compiler follows it, the NVIDIA PTX-consuming (finalizing) compiler is more permissive.
The NVIDIA PTX-consuming (finalizing) compiler allows some (but not all) words in an instruction to be passed out-of-order, so both ld.global.cs.b32 r1, [addr1];</code> and ld.cs.global.b32 r1, [addr1];</code> are accepted. For 99.99% of the code out there, it's not a problem: the compiler will correctly generate all the instructions in the documented form. The problem is "inline assembly". The CUDA programming language (a dialect of C++) allows programmers to write PTX instructions directly. And programmers get the PTX grammar wrong all the time. NVIDIA's PTX parser is tolerant of the mistakes, but ZLUDA's old parser was strict and had to be special-cased for every new project that got its PTX instructions out-of-order.

ZLUDA's parser is strict because we want to have a strongly-typed representation of instructions as soon as possible and carry the same representation through all stages of compilation. Strongly-typed means that invalid combinations of operands are not only rejected by the parser but impossible to even express in the code. I can only speculate about NVIDIA's PTX parser, but its tolerance for out-of-order operands is probably an artifact of a more weakly typed internal representation or a two-stage parsing strategy (first do a simple parse to a weakly-typed representation and then validate and convert weakly-typed to strongly-typed).

Back to ZLUDA's parser: it's easy enough to support the previous example: just have one rule for ld.<address_space>.<cache_hint>.<type></code> and one for ld.<cache_hint>.<address_space>.<type></code>. The problem is that the ld operation can be very long. Its full form is:
ld{.weak}{.ss}{.cop}{.level::cache_hint}{.level::prefetch_size}{.vec}.type</code></pre>
With 5 possible operands that can be reordered (ld</code> is always at the start, .vec</code> and .type</code> are always at the end), there are up to 120 (5!) separate rules. And this does not even take into account optionality (every segment in {</code> }</code> brackets is optional).
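To show why "one rule per ordering" does not scale and what the order-independent alternative looks like, here is a toy Python sketch. It is not ZLUDA's winnow-based parser: `parse_ld` and its tables are hypothetical, and it covers only a small subset of `ld` (state space, cache operator, and a few types). Each word is sorted into a named slot, so any order is accepted while duplicates and unknown words are still rejected.

```python
import math
from typing import Dict

# Small illustrative subset of the words that can follow `ld`.
SLOTS = {
    "space": {"global", "shared", "local", "const", "param"},
    "cache": {"ca", "cg", "cs", "lu", "cv"},
}
TYPES = {"b32", "b64", "u32", "s32", "f32", "f64"}


def parse_ld(opcode: str) -> Dict[str, str]:
    """Parse e.g. 'ld.cs.global.b32' into named slots, in any word order."""
    head, *words = opcode.split(".")
    if head != "ld" or not words:
        raise ValueError(f"not an ld instruction: {opcode!r}")
    *modifiers, ty = words  # the type is always the last word
    if ty not in TYPES:
        raise ValueError(f"unknown type .{ty}")
    parsed = {"type": ty}
    for word in modifiers:
        slot = next((name for name, ws in SLOTS.items() if word in ws), None)
        if slot is None:
            raise ValueError(f"unknown modifier .{word}")
        if slot in parsed:
            raise ValueError(f"duplicate {slot} modifier .{word}")
        parsed[slot] = word
    return parsed


# Both orderings produce the same typed result.
print(parse_ld("ld.global.cs.b32") == parse_ld("ld.cs.global.b32"))  # True
# Number of orderings of 5 freely reorderable words.
print(math.factorial(5))  # 120
```

The slot table grows linearly with the number of modifier kinds, whereas enumerating every permitted ordering as a separate grammar rule grows factorially.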
"Out-of-orderness" is difficult to express well in a lalrpop-style parser (very few grammars want this "feature"). I replaced our old parser with one based on winnow</a>. Since ZLUDA tries to be strongly-typed, this had knock-on changes across all the compiler passes. But we now support all the broken PTX in the wild (which funnily enough comes mostly from NVIDIA's own libraries).

Atomics modulo</h3>
NVIDIA hardware supports a weird little atomic modulo increment/decrement instruction (atom.inc</code>/atom.dec</code>) with semantics like this:
unsigned atomic_inc(unsigned volatile* p, unsigned modulo) {
    unsigned result;
    atomic {
        result = *p;
        *p = (result >= modulo) ? 0 : result+1;
    }
    return result;
}</code></pre>
For the longest time, I simply did not realize that AMD hardware natively supports this instruction, and ZLUDA emulated it with a cmpxchg</code> loop. Now that it is natively supported in ZLUDA, code using it is much faster. Unfortunately, other than Geekbench, there really aren't that many users of this instruction, so it won't have much performance impact overall.

To my knowledge, this instruction is not commonly available on CPUs. Do you know of any algorithms or data structures that benefit from this instruction? If so, let us know in the comments, I've been wondering about this for a few years now.

Bonus content: interview</h3>
I was interviewed about ZLUDA for the YouTube channel "Tech over Tea". Watch it here</a>.

ZLUDA's third life

2024-10-04T00:00:00+00:00

ZLUDA is back. For the last few months, I've been trying to find a commercial organization that would guarantee the continued development of the project. I am happy to announce that I have found one that is not only willing to fund further development, but also has an excellent vision for the future of ZLUDA. I share their long-term vision and I can't wait to talk more about it. We don’t want to disclose everything just yet, but for now, we know that we want to make ZLUDA better.
If you think ZLUDA is a cool project, we have even cooler projects in the works. Development has begun, and as soon as we have something to share, we will. What I can talk about now is the current state and the direction of ZLUDA itself.

Where we are now:</h3>
The code has been rolled back to the pre-AMD state and I've been working furiously on improving the codebase. I’ve been writing the improved PTX parser I always wanted and laid the groundwork for the rebuild. Currently, some very simple synthetic GPU test programs can be run on an AMD GPU, but we are not yet at the point where ZLUDA can support a full application.

Where we are going:</h3>
The year of rebuild</h5>
The ultimate goal is to bring "new" ZLUDA to a similar state as before the rollback in one year (Q3 2025). "Similar state" is very subjective here. I don't have precise criteria, but an application of similar complexity should work just as well. Not every pre-rollback application will be supported again due to new priorities (more below). </li>
Focus on machine learning</h5>
In the past, ZLUDA focused mainly on professional creator workloads. This meant focusing on applications like Arnold Render, Blender, 3DF Zephyr, etc. We even had a working prototype of GameWorks. While all of these workloads are important and extremely satisfying to have running, machine learning workloads are in much higher demand. We are targeting llm.c, llama.cpp, PyTorch, TensorFlow and others. Additionally, HIP support for anything image-related is disappointing. The time saved by skipping layers of workarounds can be spent more productively writing more tests and enabling more applications. </li>
Raytracing is gone</h5>
This is related to the previous point. Not many people realized it, but ZLUDA had an OptiX implementation. While ZLUDA-OptiX supported only a handful of OptiX demos and simple Arnold scenes, it required a lot of code and broke all the time.
Considering how underpowered it was and how much maintenance it required, it is a feature that is unlikely to ever come back. </li>
GPU support</h5>
The new ZLUDA will be built to support multiple GPU architectures. The mainline development will happen on AMD GPUs as that's what most of our users have. Still, I do realize there is a lot of interest in other GPUs (e.g. Intel) and hopefully this will lead to more code contributions and new backends. Pre-rollback ZLUDA stayed on ROCm 5 mainly because I did not want to re-test all the version-specific workarounds. Since we are starting with a clean slate, the AMD backend will target ROCm 6.1+. </li>
More modest set of supported AMD GPUs</h5>
We will only support RDNA1 and newer non-server AMD GPUs. Supporting pre-RDNA1 and server GPU architectures was an additional support burden and never worked as well as RDNA1+ GPUs due to the wavefront 64 configuration they use. Note that this applies to the current architectures. AMD recently announced the merging of RDNA and CDNA into a single architecture (UDNA). I have high hopes for this new architecture and expect it to simplify porting CUDA → HIP and to bring ZLUDA to server GPUs. </li>
Downgraded Windows support</h5>
Windows will still work and be supported, it will just be less user-friendly. zluda.exe</code> will be gone. Windows developers have invented several imaginative ways to load CUDA into a process. zluda.exe</code> has tried to support all of them, and even succeeded. Most of the time. As a user, you will have to fashion some other way to load ZLUDA into the target process. Usually copying the ZLUDA binaries to the application directory is sufficient. We will provide ready-to-download Windows binaries. </li>
Code improvements</h5>
The current ZLUDA code is not the worst, but there is clearly room for improvement. During its second life, ZLUDA was written as a proof-of-concept solution for closed-source graphics applications (Arnold, 3DF Zephyr, etc.).
This had two important consequences. First, since we were only concerned with one-time proof, it was enough to enable an application once and move on to the next without worrying about regressions. Second, some floating-point operations were handled with too little (or too much) precision - if you are rendering a scene, you can probably live with some pixels being imperceptibly different shades of red. Now that the concept has been thoroughly proven, ZLUDA will maintain application-level testing and more rigorously test for floating-point correctness (and document differences where strict compatibility is not possible). </li> Let’s talk</h5> If you think there's not enough ZLUDA in your life, there's now a ZLUDA Discord channel here</a>. Feel free to drop by and say hello. I will also try to post development updates from time to time. I hope that learning about all the creative ways developers are abusing CUDA APIs will be as exciting as it is for me to implement them. </li> </ul> Of course, ZLUDA will remain open source. This means that any features that are not part of the plan are fair game if someone steps up and submits a pull request. Personally, I think there is no better way to express your undying love for your Radeon VII than to add support for it in ZLUDA.

More testing</h2>
We've set up CI, including running unit tests for every PR (#401</a>) and running our PTX sweep test suite nightly. This will help us prevent regressions and measure our conformance to CUDA behavior. We also have set up some initial scaffolding for testing the host API.</p>

PyTorch</h4>
PyTorch remains my top priority and I still aim to have PyTorch running on ZLUDA in Q3/Q4 this year. Before PyTorch is up and running I am aiming for an intermediate goal: llm.c. You can see the progress towards getting llm.c up and running here</a>.</p>

ZLUDA

ZLUDA update Q1&Q2 2026 - back to the roots

ZLUDA update Q4 2025 - ROCm7, Windows, full llama.cpp and more

ZLUDA update Q3 2025 - ZLUDA 5 is here

ZLUDA update Q2 2025 - bigger team, more groundwork, less bugs

ZLUDA update Q1 2025 - roadmap update, LLVM tests, denormals

ZLUDA update Q4 2024

ZLUDA's third life
