According to reports on September 7, NVIDIA’s RTX 5090 and RTX PRO 6000 graphics cards suffer from a reproducible virtualization reset bug. The issue leaves the card completely unresponsive, and only a physical reboot of the host system restores it.
CloudRift, a GPU cloud service provider, published a detailed analysis of the problem after encountering it on multiple Blackwell-based systems in its production environment. The company has publicly offered a $1,000 reward to anyone who can identify a fix or the root cause of the bug.
CloudRift’s logs indicate that the failure occurs after the GPU has been passed through to a virtual machine using KVM and VFIO. When the virtual machine is shut down or the GPU is reassigned, the host system attempts a PCIe Function Level Reset (FLR).
Normally, the GPU returns to a healthy state after the reset. In this case it does not: the card stops responding, and the kernel reports “not ready 65535ms after FLR; giving up.”
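For operators who want to spot this condition across many hosts, the kernel message can be picked out of the log automatically. The following is a minimal Python sketch, not CloudRift’s tooling: the function name `find_flr_timeouts` is hypothetical, and the pattern is deliberately loose because the exact wording of the message is an assumption and may vary between kernel versions.

```python
import re
from typing import Iterable, List, Tuple

# Loose pattern: a PCI address (domain:bus:device.function) followed by a
# "not ready ... <N>ms ... FLR ... giving up" message. The exact wording is
# an assumption, so the match tolerates small variations and letter case.
FLR_TIMEOUT = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]).*?"
    r"not ready\D*(?P<ms>\d+)\s*ms\b.*?\bFLR\b.*?giving up",
    re.IGNORECASE,
)


def find_flr_timeouts(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (pci_address, waited_ms) for each FLR timeout line found."""
    hits = []
    for line in lines:
        match = FLR_TIMEOUT.search(line)
        if match:
            hits.append((match.group("bdf").lower(), int(match.group("ms"))))
    return hits
```

The function accepts any iterable of log lines, for example the captured output of `dmesg` split on newlines, and returns the PCI addresses of devices that failed to come back after the reset.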
At this point, the graphics card can no longer be read by tools like `lspci`, which report “unknown header type 7f.” CloudRift noted that the only way to restore normal operation is a full power cycle of the machine.
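The “7f” value is what one would expect when reads of the device’s PCI configuration space come back as all ones: the header type byte reads 0xFF, and masking off its top bit (the multi-function flag) leaves 0x7F. The Python sketch below checks for that state. It assumes a Linux host that exposes configuration space at `/sys/bus/pci/devices/<address>/config`; the helper names `looks_dead` and `read_config` are hypothetical.

```python
from pathlib import Path

HEADER_TYPE_OFFSET = 0x0E  # header type field in the standard PCI config header


def looks_dead(config: bytes) -> bool:
    """True if config space looks like all ones, i.e. the device is not answering."""
    if len(config) <= HEADER_TYPE_OFFSET:
        raise ValueError("need at least the first 15 bytes of config space")
    vendor_id = int.from_bytes(config[0:2], "little")
    # Bit 7 of the header type is the multi-function flag; the low 7 bits
    # hold the layout type. All-ones reads therefore show up as 0x7F.
    header_type = config[HEADER_TYPE_OFFSET] & 0x7F
    return vendor_id == 0xFFFF or header_type == 0x7F


def read_config(bdf: str) -> bytes:
    """Read raw config space for a PCI address such as '0000:c1:00.0' (Linux sysfs)."""
    return Path(f"/sys/bus/pci/devices/{bdf}/config").read_bytes()
```

On an affected host, `looks_dead(read_config("0000:c1:00.0"))` would be expected to return `True` once the card has dropped off the bus; the address here is only an example.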
AI startup Tiny Corp has also replicated CloudRift’s findings and questioned whether there might be a hardware defect in the RTX 5090 and RTX PRO 6000. They stated they have investigated but have been unable to find a solution so far.
Discussions within the community reveal that many home users and early adopters of the RTX 5090 have reported similar issues. One user described their entire host system freezing after closing a Windows virtual machine, with the GPU failing to reinitialize even after an operating system-level restart.
Users have confirmed that adjusting PCIe ASPM or ACS settings does not alleviate the problem. Notably, there have been no reports of similar issues with older-generation graphics cards such as the RTX 4090, which suggests that the bug may be specific to NVIDIA’s Blackwell architecture.
