ZLUDA

ZLUDA lets you run unmodified CUDA applications on non-NVIDIA GPUs

ZLUDA update Q4 2025 - ROCm7, Windows, full llama.cpp and more
2026-01-13

Hi, and welcome to a new ZLUDA update! It's been a busy quarter, and while we didn't quite reach our goal of providing robust PyTorch support by the end of the year, we now have complete llama.cpp support (Full llama.cpp support) and significantly improved Windows support (Better Windows support). We've also made several other improvements in preparation for PyTorch (ZLUDA now ships with a bundled LLVM, Compiler performance improvements, ROCm 7 works now, PyTorch support underway).

ZLUDA now ships with a bundled LLVM

Historically, ZLUDA has used the AMD-provided comgr library ("Code Object Manager API"), which is installed on your system as part of ROCm. This library is a wrapper around LLVM. It works well for the most part, but there are two caveats:

We have wanted to ship a ZLUDA-patched LLVM for a long time. As part of the work on llama.cpp and PyTorch we finally did so and started shipping LLVM with ZLUDA. The ZLUDA side was merged in #555 with required LLVM work done here and here.

There's one downside to our new approach: LLVM is a massive project, so building it is time-consuming. Our automatic builds are mostly unaffected because we use sccache. However, if you are compiling ZLUDA yourself, expect much longer build times. We recommend that users who want to try work-in-progress builds download prerelease binaries from GitHub. We build binaries for every merged pull request.

ROCm 7 works now

ROCm 7 has been out for several months already, but we did not start working on it right away. This is partially because we've been busy addressing the issues described in this update and partially because ROCm version updates can introduce hard breaks that take a long time to resolve. Our goal is for ZLUDA to work equally well with both ROCm 6 and ROCm 7, so that we don't have to maintain two separate builds. Upgrading from ROCm 5 to ROCm 6 was difficult and involved many breaking changes. Granted, most of the changes were necessary and made sense, but it was still a lot of extra work. Thankfully, ROCm 7 is mostly backward compatible. Only two functions broke the ABI, but ZLUDA did not use them, and they were clearly marked as experimental anyway.

This time, the problem stems from some unfortunate packaging choices and interactions on Windows. The AMD graphics driver ships with the ROCm 5, ROCm 6 and ROCm 7 runtimes simultaneously (amdhip64.dll, amdhip64_6.dll and amdhip64_7.dll, respectively), which is fine; backwards compatibility is good. The problem is with the performance libraries. For some reason, the latest official version of the performance libraries on Windows is 6.4. This version is slightly outdated and relies on amdhip64_6.dll. This can lead to some unfortunate interactions. ZLUDA works with either amdhip64_6.dll or amdhip64_7.dll and prefers to load amdhip64_7.dll. It's possible for ZLUDA to launch, load amdhip64_7.dll, but then load a performance library that loads amdhip64_6.dll. Having both the ROCm 7 and ROCm 6 runtimes loaded in a process at the same time and trying to interoperate between them simply does not work and leads to mysterious crashes.

Since we can't time-travel, we settled on the next best solution: preloading the performance libraries into the process and then scanning the process for the presence of either amdhip64_6.dll or amdhip64_7.dll. See the details in #579. This makes ZLUDA startup slightly slower, but should be fine. In the future, we might consider shipping the ROCm performance libraries with ZLUDA. What do you think?
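To illustrate the idea, here is a minimal Rust sketch (hypothetical code, not ZLUDA's actual implementation): after the performance libraries have been preloaded, the Win32 call GetModuleHandleA tells us which HIP runtime they pulled into the process, and only if neither is present do we load one ourselves, preferring ROCm 7.

```rust
// Minimal sketch, not ZLUDA's actual code: pick the HIP runtime that is
// already present in the process, so we never mix ROCm 6 and ROCm 7.
#[cfg(windows)]
mod hip_runtime {
    use std::ffi::c_void;

    // Win32 imports: GetModuleHandleA returns a handle only if the DLL is
    // already loaded in this process, LoadLibraryA actually loads it.
    #[link(name = "kernel32")]
    extern "system" {
        fn GetModuleHandleA(name: *const u8) -> *mut c_void;
        fn LoadLibraryA(name: *const u8) -> *mut c_void;
    }

    /// Returns a handle to the HIP runtime the process should use.
    pub fn pick_hip_runtime() -> *mut c_void {
        unsafe {
            // If a preloaded performance library already pulled in a HIP
            // runtime, reuse that one instead of loading the other version.
            for dll in [b"amdhip64_7.dll\0".as_ptr(), b"amdhip64_6.dll\0".as_ptr()] {
                let handle = GetModuleHandleA(dll);
                if !handle.is_null() {
                    return handle;
                }
            }
            // Nothing loaded yet: prefer the ROCm 7 runtime, fall back to ROCm 6.
            let handle = LoadLibraryA(b"amdhip64_7.dll\0".as_ptr());
            if !handle.is_null() {
                return handle;
            }
            LoadLibraryA(b"amdhip64_6.dll\0".as_ptr())
        }
    }
}
```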

Too slow for Katago

As mentioned in the last update, we now have a mechanism for collecting execution traces that doesn't require you to be a developer. Happily, some of you provided traces of your favorite application. One of those apps was Katago, an ML-based engine for playing Go. The trace revealed two things. First, the majority of the required CUDA support for Katago is already in place. Second, the missing component, cuDNN support, is on our priority list.

Katago uses CUDA in a fairly straightforward manner. However, it utilizes some cuDNN features for which there is no direct ROCm support. After adding missing functionality and various workarounds, Katago finally ran! However, it ran extremely slowly, thousands of times slower than on NVIDIA GPUs. A quick investigation revealed the source of the problem: certain Katago operations have a slow, naive implementation in MIOpen and an optimized implementation in cuDNN. Unfortunately, there's nothing we can do about it. Hopefully, AMD will one day optimize those convolutions and make AMD GPUs a viable target for Katago.

You can see the whole story in #541. Please keep sending us traces. While we might not be able to work on each of them right away, they are important for prioritization and debugging.

Better Windows support

For most of modern ZLUDA's history, Windows support was an afterthought. Although we ensured that every change compiled on Windows, the loader that injects ZLUDA into an executable was in poor condition. In the last update, we asked you to share traces from your favorite CUDA applications, and most of the traces you sent us were from Windows, or at least you tried to send them, given how poorly the ZLUDA loader functioned. This forced us to face reality: we had to fix the loader, or we would receive no traces.

This led to a major PR, #550 (19100 lines added, 2377 lines removed), which brought zluda.exe up to a more acceptable quality and allows ZLUDA on Windows to work as well as it does on Linux.

Full llama.cpp support

In the previous update, we announced the initial release of llama.cpp support. Since then, we have been benchmarking ZLUDA's llama.cpp support against the top models from Hugging Face. This process revealed several issues, but we are pleased to announce that we now have full llama.cpp support. Our performance is nearly identical to that of the native ROCm backend. See the documentation here for details on how to build llama.cpp with the CUDA backend so that it works at full performance with ZLUDA.

Compiler performance improvements

One way PyTorch differs from other CUDA projects is that it ships with large PTX modules, which can be several megabytes in size. This poses a challenge for the ZLUDA compiler, which is tuned to produce high-quality code, not to run quickly. In particular, one compiler pass, instruction_mode_to_global_mode, could account for 60% of the total compilation time on large PTX modules. After pull request #552 was merged, this time decreased to less than 1%, with the majority of the time now being spent in LLVM.

We have also added an experimental precompilation tool that can scan a directory, detect all CUDA binaries, and precompile all PTX modules. This tool uses all the cores on your machine, making it much faster than the CUDA runtime's one-by-one compilation. The effect varies by application: some applications will try to load (and compile) every possible GPU kernel, even if it's not actually being used. There are also applications that compile only the subset of kernels strictly required by a given workload. In our experience, the former category is much larger than the latter.
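As a rough sketch of how such a tool can use every core (the helper names below are hypothetical, rayon is assumed as the parallelism library, and PTX discovery is simplified to scanning for .ptx files rather than extracting PTX from CUDA binaries), the heart of the tool is just a parallel loop over the discovered modules:

```rust
// Hypothetical sketch of a precompilation tool, not the actual ZLUDA tool.
use std::fs;
use std::path::{Path, PathBuf};

use rayon::prelude::*; // work-stealing parallel iterators (external crate)

/// Recursively collect candidate modules. For simplicity this sketch looks for
/// files with a `.ptx` extension; the real tool inspects CUDA binaries and
/// extracts the PTX modules embedded in them.
fn find_ptx_modules(dir: &Path, out: &mut Vec<PathBuf>) {
    let Ok(entries) = fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            find_ptx_modules(&path, out);
        } else if path.extension().is_some_and(|ext| ext == "ptx") {
            out.push(path);
        }
    }
}

/// Placeholder for the compilation step: a real tool would invoke the ZLUDA
/// compiler here and store the result in its on-disk cache.
fn compile_ptx(module: &Path) -> Result<(), String> {
    println!("precompiling {}", module.display());
    Ok(())
}

fn main() {
    let mut modules = Vec::new();
    find_ptx_modules(Path::new("."), &mut modules);
    // Compile every module in parallel across all available cores, instead of
    // one-by-one as the CUDA runtime would when the application starts.
    modules.par_iter().for_each(|module| {
        if let Err(err) = compile_ptx(module) {
            eprintln!("failed to precompile {}: {}", module.display(), err);
        }
    });
}
```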

PyTorch support underway

We are working hard on PyTorch. It's impossible to support every PyTorch configuration and project, so we are focusing on something specific. Our primary objective is vLLM, and to that end, we have been landing multiple PRs that make ZLUDA run more and more of vLLM. See recent PRs: #580, #583, #585, #590, #596 and more.

We don't have much PyTorch support to show yet, even though we've made many significant changes, such as bundling LLVM and adding ROCm 7 support, so there is no new major release this time. However, we encourage you to try out the prerelease builds and report any problems you encounter.

Until next time