This appears to work for now. The only change I made was to UNCOMMENT the "GRUB_TERMINAL=console" line.
GRUB_CMDLINE_LINUX_DEFAULT="ixgbe.allow_unsupported_sfp=1 iommu=pt intel_iommu=on nomodeset video=vesafb:off video=efifb:off"
GRUB_GFXMODE=800x600
GRUB_GFXPAYLOAD_LINUX=keep...
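A quick way to confirm that the "GRUB_TERMINAL=console" line really is uncommented is a small check like the one below. This is only a sketch: the function name grub_terminal_is_console is made up for illustration, and the default path /etc/default/grub is an assumption based on the usual Debian/Proxmox layout.

```shell
# Hypothetical helper (not from the original post): succeeds only if
# GRUB_TERMINAL=console is active, i.e. not commented out, in the given file.
# The default path is an assumption (the usual Debian/Proxmox location).
grub_terminal_is_console() {
    file="${1:-/etc/default/grub}"
    # Anchor at the start of the line so "#GRUB_TERMINAL=console" does not match.
    # Optional double quotes around the value are accepted.
    grep -Eq '^[[:space:]]*GRUB_TERMINAL="?console"?[[:space:]]*$' "$file"
}
```

Typical use would be `grub_terminal_is_console && echo "console terminal active"`. On a GRUB-booted Debian-based host, an edit to this file normally only takes effect after running update-grub and rebooting.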
pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-12-pve)
Dual Intel X5690 CPUs (also tested with Intel 26xx v1/v2 series)
Nvidia GT720 VGA output (also tested with motherboard VGA (Matrox) output)
Using VFIO to pass through cards to several VMs.
Problem: As soon as the console booting...
If you're wanting to single-box it, manage the local disks with Proxmox/ZFS, then share folders with LXC containers via Bind Mounts. Beautiful and quick.
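For anyone who has not set up a bind mount before, here is a rough sketch of what that looks like. It only prints the command (a dry run) rather than running it. The helper name bind_mount_cmd, the container ID 101, and both paths are made-up placeholders; the printed `pct set ... -mp0 host_dir,mp=ct_dir` form is the standard Proxmox mount point syntax.

```shell
# Hypothetical dry-run helper (illustration only): prints the pct command that
# would bind-mount a host directory into an LXC container as mount point mp0.
# Arguments: container ID, host directory, path inside the container.
bind_mount_cmd() {
    vmid="$1"
    host_dir="$2"
    ct_dir="$3"
    printf 'pct set %s -mp0 %s,mp=%s\n' "$vmid" "$host_dir" "$ct_dir"
}

# Example with placeholder values (ID and paths are assumptions):
bind_mount_cmd 101 /tank/media /mnt/media
# prints: pct set 101 -mp0 /tank/media,mp=/mnt/media
```

Review the printed line, then run it on the Proxmox host. A second shared folder would need a different mount point index (mp1, mp2, and so on).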
Another vote for the E5-2650L V2 -- 10C/20T, low power, perfect fit for a "lots of VMs, virtualization" environment. Just bought a couple last week - got the pair for $19 shipped.
GOT IT!
Looking at the kernel messages for wx-4100/rx550/rx560 in Ubuntu guest, I only see one primary thing different.
550:
[ 10.409863] amdgpu 0000:06:10.0: amdgpu: Using BACO for runtime pm
Maybe onto something?
*...
Status Update:
It appears to be some sort of interaction between the kernel, KVM, Ubuntu, and the AMD drivers.
Pulled spare hardware
* installed fresh 7.4 pmx
* tried both 5.15 and 5.19 kernel
* installed both the rx560 & rx550 in same server, vendor-reset, etc
* Ubuntu 22 guest, AMD 5.5...
Have done that as well - hoping it made a difference. No change in behavior. (tried several of the hookscripts - no change)
REFERENCE: https://www.nicksherlock.com/2020/11/working-around-the-amd-gpu-reset-bug-on-proxmox/
@aaron : Thank you for providing a specific work around. However, this feels like one of those "old" ideas in desperate need of an update.
I challenge you : When is rebooting the entire cluster on the loss of a network element the preferred behavior? You're taking a communication issue and...
Here's the complaint INSIDE the VM when I run ffmpeg, for example, with AMD support compiled in (the build that comes with Jellyfin). The VM then has to be powered off, because it hangs on PCI when trying to shut it down.
[ 74.958805] BUG: kernel NULL pointer dereference, address: 00000000000000d8
[...
Almost identical configs to the OP, except "AMD-Vi: Interrupt remapping enabled". Same blacklists, same VFIO, same kernel switches. (proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve))
Same card(s), same issue. RX560 (1002:67ef) works but RX550 (1002:699f) does not. Same physical machine(s) -...
I did another lab cluster - 5 nodes, still on 7.4, but upgrading to Quincy as a prep to upgrade to 8.0. Ran into a serious blocker following the directions above. (https://pve.proxmox.com/wiki/Ceph_Pacific_to_Quincy)
The Manager daemons wouldn't restart, got the "masked" message. Turns out...