Anyone else fighting Blackwell GSP timeout in production passthrough? How are you handling recovery without a host reboot?

The post describes NVIDIA RTX Pro 5000 (Blackwell) GPUs passed through to Linux VMs, where the GSP (GPU System Processor) firmware hits heartbeat timeouts during initialization or driver reloads. The GPU then enters an unrecoverable “bad state”: the driver reports assertion failures and can no longer initialize the GSP firmware.
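One way to catch this state early is to scan the kernel log for the symptoms described above. The sketch below is a minimal illustration, not the poster's tooling. It relies on the `NVRM` prefix the NVIDIA kernel module uses for its messages. The keyword list is an assumption derived from the symptoms named in the post, not a list of actual driver message strings.

```python
from typing import Iterable, List, Sequence

# Assumption: these keywords are guesses based on the symptoms in the post
# (heartbeat timeouts, assertion failures, failed GSP initialization).
# They are NOT copied from real driver output; adjust them to your logs.
SYMPTOM_KEYWORDS: Sequence[str] = ("timeout", "assert", "failed")


def find_gsp_symptoms(
    lines: Iterable[str], keywords: Sequence[str] = SYMPTOM_KEYWORDS
) -> List[str]:
    """Return kernel-log lines that mention NVRM, GSP, and a symptom keyword."""
    hits = []
    for line in lines:
        lower = line.lower()
        # Require all three: the driver prefix, the GSP subsystem, a symptom.
        if "nvrm" in lower and "gsp" in lower and any(k in lower for k in keywords):
            hits.append(line.rstrip("\n"))
    return hits
```

The function takes any iterable of lines, so it can be fed the output of `dmesg` or `journalctl -k` split on newlines. Matching is case-insensitive and deliberately loose, so expect some false positives.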

Attempted recoveries include a secondary bus reset (SBR) on the upstream bridges, forcing D3cold through root port power management, and reloading the driver, all without success. The only reliable “fix” is a full cold reboot of the host. The problem reportedly persists even on Blackwell cards for which other users have reported these methods working.
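For readers unfamiliar with these recovery attempts, the sketch below shows roughly what they look like through the Linux PCI sysfs interface. It is an illustration under stated assumptions, not a fix: per the post, none of these steps recovered the GPU. The PCI addresses are hypothetical placeholders. The `reset_method` file requires a reasonably recent kernel, the `bus` method is only offered when the kernel considers a bus reset safe for that device, and `d3cold_allowed` permits D3cold rather than forcing it. The code defaults to a dry run that only returns the equivalent shell commands.

```python
from dataclasses import dataclass
from typing import List

SYSFS_PCI = "/sys/bus/pci"


@dataclass(frozen=True)
class SysfsWrite:
    path: str   # sysfs file to write
    value: str  # value to write into it
    note: str   # what the write is meant to achieve


def plan_recovery(gpu_bdf: str, root_port_bdf: str) -> List[SysfsWrite]:
    """Build the ordered sysfs writes for the recovery attempts named above.

    gpu_bdf and root_port_bdf are PCI addresses such as "0000:01:00.0";
    the values used in examples are hypothetical.
    """
    gpu = f"{SYSFS_PCI}/devices/{gpu_bdf}"
    port = f"{SYSFS_PCI}/devices/{root_port_bdf}"
    return [
        SysfsWrite(f"{gpu}/reset_method", "bus", "restrict resets to a bus reset (SBR)"),
        SysfsWrite(f"{gpu}/reset", "1", "trigger the reset"),
        SysfsWrite(f"{port}/d3cold_allowed", "1", "permit D3cold on the root port"),
        SysfsWrite(f"{port}/power/control", "auto", "enable runtime PM on the root port"),
        SysfsWrite(f"{gpu}/remove", "1", "detach the GPU from the PCI tree"),
        SysfsWrite(f"{SYSFS_PCI}/rescan", "1", "re-enumerate the bus"),
    ]


def apply(plan: List[SysfsWrite], dry_run: bool = True) -> List[str]:
    """Return the equivalent shell commands; perform the writes if dry_run is False."""
    commands = []
    for step in plan:
        commands.append(f"echo {step.value} > {step.path}  # {step.note}")
        if not dry_run:
            # Requires root and a driver that is not holding the device.
            with open(step.path, "w") as f:
                f.write(step.value)
    return commands
```

Calling `apply(plan_recovery("0000:01:00.0", "0000:00:01.0"))` returns the command list without touching the system, which makes it easy to review the sequence before running anything as root.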

The issue shows how fragile GPU passthrough on Linux can still be with newer architectures such as Blackwell, and it underscores the need for more robust firmware and driver support in production environments. The lack of a recovery path short of a host reboot is a serious concern for anyone running production workloads on this hardware, since rebooting the host interrupts every VM on it.
