Run AI on Your AMD GPU on Windows (No ZLUDA)


If you own a Radeon and have ever wanted to do serious AI work, you know the wall: the whole ecosystem runs on CUDA, which means NVIDIA. Search “PyTorch + AMD + Windows” and you get three answers: use Linux, use ZLUDA (a CUDA translation layer), or pay for the cloud. The implicit conclusion is always the same: for AI, your AMD card is worthless.

We refused to accept that and went digging. Spoiler: it works, on Windows, natively, and without emulating anything. No tutorial here — just the idea and the path, so you know it exists.

What we proved

We got native PyTorch running on ROCm on an AMD Radeon RX 6800 XT under Windows, running real AI models locally, for free, no cloud and no APIs.

To stress-test it we picked a demanding case: a text-to-speech model with voice cloning running in real time. It worked — we gave our assistant, Axon Terminal, a voice of its own. But the voice isn’t the point: once you have PyTorch running on your AMD, it runs any model — language, images, transcription, whatever.

The path (the idea, not the step-by-step)

Native ROCm, not ZLUDA. This is the key that changes everything. ZLUDA translates CUDA calls for AMD hardware: it is a compatibility layer, with the overhead that implies, and the app still “thinks” it’s talking to NVIDIA. What we used is different: PyTorch compiled directly for ROCm, AMD’s compute stack. The GPU runs in its native language, no interpreter. What makes it possible are the wheels from AMD’s own TheRock project, which already ship builds for RDNA2 cards that “official” Windows support doesn’t list yet.
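A quick way to tell which path you are on is to ask PyTorch itself. ROCm builds of PyTorch expose the GPU through the usual torch.cuda namespace and set torch.version.hip to a version string; on CUDA and CPU-only builds it is None. The sketch below is only a sanity check, not the setup we used. The helper name describe_backend is ours, and it assumes a ROCm build of PyTorch is already installed.

```python
import importlib.util


def describe_backend():
    """Report which GPU backend the installed PyTorch build targets."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    # Version string on ROCm builds, None on CUDA or CPU-only builds.
    hip = getattr(torch.version, "hip", None)
    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "hip_version": hip,
        "is_rocm_build": hip is not None,
        # ROCm builds reuse the torch.cuda API, so this call covers AMD too.
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_backend().items():
        print(f"{key}: {value}")
```

If is_rocm_build is true and gpu_available is true, the card is being driven natively. Existing code that says device="cuda" keeps working unchanged on such a build.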

And then, the walls. Because there are some, and nobody mentions them (almost nobody comes this way). They’re details, but they stop you dead if you don’t know they exist:

  • AMD’s kernel library (MIOpen) wouldn’t compile because of a few missing C++ headers in the environment. They were right there (they ship with Visual Studio); you just had to point it at them.
  • Everything stuttered the first time each task showed up. It wasn’t a lack of power: MIOpen was exhaustively searching for the best kernel for each new case (tens of seconds frozen). One environment variable (MIOPEN_FIND_MODE in fast mode) crushed it. From unusable to smooth.
  • Plus the usual Windows tax: encodings that mangle accents, caches that don’t persist between boots, orphan processes fighting over the GPU…

None of them unsolvable. All of them invisible until you hit them — and that, precisely, is NVIDIA’s only real advantage here.
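The second wall above, the kernel search, comes down to setting one environment variable before PyTorch loads. A minimal sketch, assuming the value FAST is the spelling your MIOpen version accepts (its documentation lists the find modes; check it for your build). The helper name configure_miopen is hypothetical.

```python
import os


def configure_miopen(find_mode="FAST"):
    """Ask MIOpen to skip the exhaustive kernel search.

    setdefault keeps any value already exported in the shell, so a
    manual override still wins over this default.
    """
    os.environ.setdefault("MIOPEN_FIND_MODE", find_mode)
    return os.environ["MIOPEN_FIND_MODE"]


# Run this before importing torch, so the setting is already in the
# process environment when MIOpen initialises.
active_mode = configure_miopen()
print(f"MIOPEN_FIND_MODE={active_mode}")

# import torch  # import only after the environment is configured
```

Setting the variable system-wide works just as well. Doing it in code keeps the fix next to the program that needs it.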

So what do you lose with AMD? (the honest part)

Here’s what we expected to find and didn’t: that the AMD would be slower. It isn’t. In our case, the Radeon generates speech at the same rate as an equivalent NVIDIA RTX — we measured practically identical times. The bottleneck was never AMD’s silicon; it was the setup. With the walls above cleared, performance is on par.
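If you want to run the same comparison on your own cards, here is one hedged way to time it. This is a sketch, not our measurement harness. It discards warm-up runs (the first call pays for the one-time kernel search), takes the median of several runs, and expresses the result as a real-time factor: generation time divided by the length of the audio produced. The function names and the fake_synthesize stand-in are hypothetical.

```python
import statistics
import time


def median_latency(fn, warmup=1, runs=5, sync=None):
    """Median wall-clock seconds for fn(), after discarding warm-up runs.

    sync is an optional callable (for example torch.cuda.synchronize)
    that waits for queued GPU work, so the timer does not stop early.
    """
    def timed():
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        return time.perf_counter() - start

    for _ in range(warmup):  # first runs include one-time setup costs
        timed()
    return statistics.median(timed() for _ in range(runs))


def real_time_factor(generation_seconds, audio_seconds):
    """Below 1.0 means speech is produced faster than it plays back."""
    return generation_seconds / audio_seconds


if __name__ == "__main__":
    def fake_synthesize():
        time.sleep(0.01)  # stand-in for a real model call

    latency = median_latency(fake_synthesize)
    print(f"RTF: {real_time_factor(latency, audio_seconds=2.0):.3f}")
```

Run it with the same model, text and precision on both cards. Otherwise the numbers are not comparable.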

What NVIDIA gives you on top isn’t speed: it’s convenience. With CUDA, you install and it works. With AMD on Windows, you have to fight through what we describe here. That’s the honest difference — hours of tinkering, not frames per second.

The argument nobody puts on the table: VRAM per dollar

And here’s what actually matters for local AI. The factor that decides which models you can run isn’t speed: it’s VRAM. A model fits on the card or it doesn’t. And that’s where AMD charges you less for the same gigabytes:

  • 16 GB on AMD: an RX 9060 XT 16GB runs around $560.
  • 16 GB on NVIDIA: the RTX 5060 Ti 16GB jumps to $680-750. And here’s the detail that seals it: it’s one of the weakest cards in its lineup — an RTX 5070 performs 30-40% better for almost the same price. On NVIDIA, affordable 16GB only comes on an entry-level card: you pay the premium for the memory, not the performance. (And with the 2026 VRAM crisis that inflated everything, NVIDIA even cut production of that 16GB version.)
  • Second-hand? An RX 6800 XT like ours (16 GB) goes for a lot less than any NVIDIA card with that much memory.
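The price gap in the list above is simple arithmetic. A small sketch using only the new-card prices quoted there:

```python
VRAM_GB = 16
amd_price = 560             # RX 9060 XT 16GB, price quoted above
nvidia_prices = (680, 750)  # RTX 5060 Ti 16GB, price range quoted above


def dollars_per_gb(price, vram_gb=VRAM_GB):
    """Price of each gigabyte of VRAM on a given card."""
    return price / vram_gb


print(f"AMD: ${dollars_per_gb(amd_price):.2f} per GB")
for price in nvidia_prices:
    extra = price - amd_price
    print(f"NVIDIA at ${price}: ${dollars_per_gb(price):.2f} per GB, ${extra} more")
```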

Same amount of VRAM, $120-190 less. The question stops being “how fast is it?” (just as fast) and becomes “what does it cost me to get 16 GB for AI?” — and there AMD wins. To run models locally, those gigabytes are what decide whether the model loads or not.
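To make “fits or doesn’t” concrete, here is a back-of-the-envelope estimate. It is a rough sketch, not a guarantee: it multiplies the parameter count by the bytes per parameter and adds a flat 20% allowance for activations and caches. That 20% is our assumption; real overhead varies with the model, the context length and the runtime.

```python
def estimate_vram_gb(params_billions, bytes_per_param=2.0, overhead=0.2):
    """Rough VRAM need in GiB: weights plus a flat allowance on top.

    bytes_per_param: 4.0 for fp32, 2.0 for fp16/bf16, about 0.5 for 4-bit.
    overhead: assumed fraction for activations and caches (a guess, tune it).
    """
    weights_gb = params_billions * 1e9 * bytes_per_param / 1024**3
    return weights_gb * (1 + overhead)


def fits(params_billions, vram_gb=16, **kwargs):
    """True if the estimate is within the card's VRAM."""
    return estimate_vram_gb(params_billions, **kwargs) <= vram_gb


for size in (1, 7, 13):
    need = estimate_vram_gb(size)
    print(f"{size}B params at fp16: ~{need:.1f} GB, fits in 16 GB: {fits(size)}")
```

Under these assumptions a 7B model at fp16 fits in 16 GB. A 13B model does not at fp16, but does once it is quantized to 4 bits (bytes_per_param=0.5).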

Why it matters

The wall was never the hardware — it was the software and the lack of a documented path. Once that’s torn down, your AMD GPU stops being “just for gaming” and becomes a local AI platform: private, free, dependent on no one. With the same performance as the expensive alternative, and more VRAM for your money.

You don’t need to switch to NVIDIA or move to the cloud to get started. You just need to know that it’s possible — and now you do.


At Neuralflow we walked this path to give Axon Terminal a voice of its own, but it works for anyone with a Radeon. If you want the full how-to, write to us.
