I've been experiencing this issue as well, just had my 4th hang in the past two weeks:
- Proxmox 8.2.2 on kernel 6.8.4-3-pve (latest at time of writing)
- Ceph 18.2.2 on NVMe (Kioxia KCM5DRUG3T84)
- AMD CPU (Ryzen 7900) with 64GB (non-ECC) RAM
- 3 node HA cluster
- Intel 82599 10GE NIC (10Gtek)
The cluster has been running stable since early Feb, which admittedly is not a lot of time, but it's been solid up until now.
What's noteworthy is that only one of my nodes is hanging, and this particular node is unique in that I am using a VM with PCI passthrough of the host's SATA controller. I am using cputype=host with most of my VMs as well, but that applies to all VMs on all three hosts, whereas the only one that's hung so far (now 4 times) is the only one doing host PCI passthrough.
Is anyone else hitting this problem using hostpci devices in guests?
Edit: somehow missed the last page of comments. If the fix is only present in Linus' 6.10-rc branches, I wonder if Proxmox could cherrypick it into the pve kernel?