ZLUDA

ZLUDA lets you run unmodified CUDA applications on non-NVIDIA GPUs

ZLUDA update Q3 2025 – ZLUDA 5 is here
2025-10-02

We're happy to announce the release of ZLUDA version 5. This release brings with it new debugging tools, better correctness and preliminary support for llama.cpp.

ZLUDA version 5 includes a tool called zluda_trace. Community members often ask us what they can do to help with the project. One of the most impactful things you can do without needing programming skills is to run your favorite workload under zluda_trace and create a bug report issue with the trace attached. Just make sure you collect the logs on Linux; we are not yet ready to accept logs from Windows (more here). You can find more information on our Troubleshooting page.

And if you are interested in writing code, we have a list of issues labeled help wanted on our repository.

zoc (ZLUDA offline compiler)

ZLUDA includes an NVIDIA PTX to AMD RDNA compiler. Previously, this compiler was only accessible from the ZLUDA library – it runs when cuModuleLoadData or cuLibraryLoadData are called. However, for ZLUDA developers, it is useful for debugging purposes to have a command line interface as well, similar to NVIDIA's ptxas tool.
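To make the trigger point concrete, here is a minimal sketch of a driver API program whose cuModuleLoadData call is where ZLUDA compiles the embedded PTX for the GPU in use. It is illustrative only (error handling omitted) and is not taken from ZLUDA's code or tests:

    #include <cuda.h>
    #include <stddef.h>

    /* A trivial PTX module with one empty kernel; loading it is what
       triggers the PTX -> LLVM IR -> RDNA compilation inside ZLUDA. */
    static const char ptx[] =
        ".version 7.0\n"
        ".target sm_70\n"
        ".address_size 64\n"
        ".visible .entry noop() { ret; }\n";

    int main(void) {
        CUdevice dev;
        CUcontext ctx;
        CUmodule mod;
        CUfunction fn;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);
        /* With ZLUDA, this call is where the PTX above is compiled to
           machine code for the GPU being used. */
        cuModuleLoadData(&mod, ptx);
        cuModuleGetFunction(&fn, mod, "noop");
        cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL);
        cuCtxSynchronize();
        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }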

To provide that command line interface, JoelleJS has contributed zoc, the ZLUDA offline compiler, in #344. It takes a PTX file as input and outputs the LLVM IR that ZLUDA generates, both before and after linking, as well as the RDNA assembly for your GPU produced by the ROCm compiler.

We've been using this tool intensively and merged minor ergonomics improvements in #491 and #504.

Machine Learning Workloads

Our focus on running machine learning inference workloads continues. In this release we have prioritized correctness over performance; performance will be a focus of future updates.

llm.c

We hit our first ML milestone.

llm.c's test_gpt2fp32cu and test_gpt2cu now both run on ZLUDA when built without multi-GPU support and without Flash Attention. Flash Attention support is currently blocked by missing APIs in MIOpen; we plan to backfill them in the future.

This took a large number of commits across our host API implementation and compiler, including #402, #406, #412, #417, #409, #421, #427, #454, #463, #468, #496, #500, #501, #503, and #511, in addition to the performance library work mentioned below.

llama.cpp

We hit our second ML milestone.

The CUDA backend for llama.cpp can now run on ZLUDA. We've done some preliminary measurements and found the performance to be in the same range as the ROCm results measured by Phoronix (Latest Open-Source AMD Improvements Allowing For Better Llama.cpp AI Performance Against Windows 11 - Phoronix). We're interested in your feedback: if it doesn't work, or you are getting worse performance than with ROCm, please share your results in the issues.

Much of the required functionality in the host API and compiler was already in place by this point, so enabling this took relatively few commits, including #509, #515, and #518.

Initial PyTorch work

We have not yet hit our third ML milestone.

We've been continuing to work on PyTorch support, which is our next big milestone. Other than work on host functions and instruction support in the compiler, we have added the zluda_ld library (#447 and #508). PyTorch uses the DT_RPATH entry in its binaries to hard-code the path to the CUDA libraries, so it ignores the LD_LIBRARY_PATH environment variable we normally use to nudge applications into loading ZLUDA on Unix. zluda_ld can be used with the little-known LD_AUDIT environment variable to get around this problem and force ZLUDA to be loaded.
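For readers curious how that works, here is a minimal sketch of the LD_AUDIT mechanism itself, using the dynamic linker's rtld-audit interface. This is not zluda_ld's actual code, and the install path below is purely hypothetical:

    #include <link.h>
    #include <stdint.h>
    #include <string.h>

    /* Build as a shared library (gcc -shared -fPIC redirect.c -o
       libredirect.so) and run the workload with LD_AUDIT=./libredirect.so.
       The dynamic linker then calls la_objsearch for every library it
       tries to resolve, even when the binary carries DT_RPATH. */

    unsigned int la_version(unsigned int version) {
        (void)version;
        return LAV_CURRENT;
    }

    char *la_objsearch(const char *name, uintptr_t *cookie, unsigned int flag) {
        (void)cookie;
        (void)flag;
        /* Redirect lookups of the CUDA driver library to a replacement.
           The path below is illustrative, not a real ZLUDA install path. */
        if (strstr(name, "libcuda.so") != NULL)
            return (char *)"/opt/zluda/libcuda.so";
        return (char *)name;
    }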

PyTorch is far from ready: we are blocked by the slowness of our compiler, by missing coverage in the performance libraries (cuBLAS, cuDNN, etc.), and by bugs (or missing features) in the LLVM AMDGPU target. This quarter we will be focusing on all of those problems – please check our prerelease builds from time to time.

Other improvements

Kernel cache

We have added a kernel caching mechanism to our PTX module loader (#465). When a CUDA application loads a GPU code module, we need to extract the PTX from the fat binary provided and then compile it to machine code for the specific GPU being used. This can be a costly operation and significantly slow down runtime for a workload with many modules. We now avoid this by locally caching the machine code for kernels.
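As an illustration of the general idea, a cache like this keys the compiled machine code on a hash of the PTX plus the target GPU, so repeated module loads become a file read instead of a recompile. The following is a conceptual sketch only, not ZLUDA's implementation; the cache directory and helper names are made up:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* FNV-1a hash of the PTX source, used as the cache key. */
    static uint64_t fnv1a(const char *data, size_t len) {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < len; i++) {
            h ^= (unsigned char)data[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* Returns compiled machine code from the cache, or NULL on a miss
       (in which case the caller compiles the PTX and stores the result). */
    static void *load_cached_binary(const char *ptx, const char *gpu_arch,
                                    size_t *out_len) {
        char path[256];
        snprintf(path, sizeof path, "/tmp/kernel-cache/%s-%016llx.bin",
                 gpu_arch, (unsigned long long)fnv1a(ptx, strlen(ptx)));
        FILE *f = fopen(path, "rb");
        if (!f)
            return NULL;              /* cache miss: compile and store */
        fseek(f, 0, SEEK_END);
        *out_len = (size_t)ftell(f);
        rewind(f);
        void *buf = malloc(*out_len);
        fread(buf, 1, *out_len, f);
        fclose(f);
        return buf;                   /* cache hit: skip recompilation */
    }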

Performance libraries

We have added initial support for running applications that use cuBLAS, cuBLASLt, and nvml (#440, #444, #449, #452, #455, #457, #481). The number of supported operations is still small, but it's set up for rapid additions. Expect to see the list of supported functions grow (and the addition of cuDNN).
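For context, this is the kind of call path such applications exercise. Below is a minimal cuBLAS SGEMM along these lines; it is illustrative only and not a statement about which specific operations ZLUDA currently covers (error handling omitted):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        const int n = 2;
        float a[] = {1, 2, 3, 4}, b[] = {5, 6, 7, 8}, c[4] = {0};
        float *da, *db, *dc;
        const float alpha = 1.0f, beta = 0.0f;

        cudaMalloc((void **)&da, sizeof a);
        cudaMalloc((void **)&db, sizeof b);
        cudaMalloc((void **)&dc, sizeof c);
        cudaMemcpy(da, a, sizeof a, cudaMemcpyHostToDevice);
        cudaMemcpy(db, b, sizeof b, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        /* C = alpha * A * B + beta * C, column-major, no transposition. */
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, da, n, db, n, &beta, dc, n);
        cublasDestroy(handle);

        cudaMemcpy(c, dc, sizeof c, cudaMemcpyDeviceToHost);
        printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
        cudaFree(da);
        cudaFree(db);
        cudaFree(dc);
        return 0;
    }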

More testing

We've set up CI, including running unit tests for every PR (#401) and running our PTX sweep test suite nightly. This will help us prevent regressions and measure our conformance to CUDA behavior. We also have set up some initial scaffolding for testing the host API.

Prerelease builds

We've started publishing preview (prerelease) builds. Now, after every code change, we automatically compile and publish binaries to the Releases section on GitHub. Try them out. There's no longer any reason to build from source yourself (unless you are a developer).

Final bit of correctness

CI improvements unlocked a flurry of compiler fixes in #416, #467 and more. According to our testing, we are now bit-accurate (we return results within CUDA-documented precision) with NVIDIA GPUs across almost all supported operations and their variants with all floating-point subnormal and rounding control modes. There are two exceptions:

Some widely used instructions are still not supported by the compiler, but that number gets smaller every day.

Overall, this is a major improvement over pre-rollback ZLUDA, which would cut corners and not always be bit-accurate.