Sudden Bulk stop of all VMs ?

On our side:
-Nodes with Supermicro boards have been stable so far (32 days uptime).
-Nodes with B650D4U boards and a suspicious serial number (M80-GC025XXXXXX) always end up rebooting, even with the latest BIOS, microcode, and kernel parameters (max uptime 24 days).
-Nodes with B650D4U boards older than 6 months and an unsuspicious serial number are stable (130+ days uptime).
-Nodes with recent B650D4U boards and an unsuspicious serial number may be stable (23 days uptime) with the latest microcode and kernel parameters, but 23 days is not enough to validate.

We have not had time to do any CPU-related testing yet.
 
It's difficult to claim the issue is fixed until you see that it's not. When the reboot only happens after 12-14 days (as in my case), you have to wait at least that long to know it's really fixed :)

Anyway, I changed the CPU type of all my VMs to "qemu64" instead of "host" and also updated both of my nodes to the latest version. My problematic node now has an uptime of 7 days.
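
For anyone who wants to try the same CPU type change on several VMs, here is a minimal sketch that only builds the `qm set <vmid> --cpu <type>` command lines and prints them for review. The VM IDs 100-102 and the helper name `build_cpu_type_commands` are placeholders of mine, not something from this thread. As far as I know, the new CPU type only applies after the VM has been fully stopped and started again.

```python
import shlex


def build_cpu_type_commands(vmids, cpu_type="qemu64"):
    """Return one `qm set` command (as an argv list) per VM ID.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    return [["qm", "set", str(vmid), "--cpu", cpu_type] for vmid in vmids]


if __name__ == "__main__":
    # VM IDs below are placeholders; replace them with your own.
    for argv in build_cpu_type_commands([100, 101, 102]):
        # Print a shell-safe command line so it can be reviewed before running.
        print(shlex.join(argv))
```

Printing instead of executing keeps this safe to run anywhere; on a PVE node the printed lines could be pasted into a root shell once they look right.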
 
Thanks for sharing all those details!

One of our new Hetzner servers uses almost the same mainboard (ASRockRack B665D4U-1L) and has the same problem.
Our serial is M80-G4007900353, so kinda below your highest good one, but that doesn't really say too much especially with the slightly different model.

Hetzner performed hardware tests yesterday with no result (i.e. the hardware is considered OK). We upgraded to kernel 6.8.8-4-pve and have already had another reboot. I'll ask to be transferred to an ASUSTeK "Pro WS 665-ACE" or the like, which runs our other nodes smoothly.

Hetzner was like "we don't usually do this, but we'll make an exception". The move did solve the reboots on that machine.
With some rescue system magic (they've integrated on-the-fly installation of ZFS support) we were able to avoid a reinstallation and make the PVE installation on the moved hard disks bootable.

Sorry for the late reply, I thought I had already posted that.


We also had similar problems on a 6-year-old NUC, but the verdict there, as of yesterday, is broken hardware. A few hours later that machine stopped even trying to boot.

On a third machine we seem to have solved the problem by switching the network model of the Debian and PBS VMs from virtio to e1000 (and no, network performance did not suffer: it is still at full gigabit speed, around 930 Mbit/s).
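
In case it helps, here is a rough sketch of how the model swap can be done on the `net0` value while keeping the MAC address and the other options. It assumes the common `virtio=<MAC>,bridge=...` form of the value; the function name `switch_nic_model` and the MAC address in the example are made up for illustration.

```python
def switch_nic_model(net_config, new_model="e1000", old_model="virtio"):
    """Swap the NIC model in a netX value such as
    'virtio=BC:24:11:00:00:01,bridge=vmbr0,firewall=1'.

    The MAC address and all remaining options are kept unchanged.
    Values that do not start with `old_model` are returned as-is.
    """
    parts = net_config.split(",")
    model, sep, mac = parts[0].partition("=")
    if model != old_model:
        return net_config
    parts[0] = f"{new_model}{sep}{mac}"
    return ",".join(parts)


if __name__ == "__main__":
    # Example value with a made-up MAC address.
    current = "virtio=BC:24:11:00:00:01,bridge=vmbr0,firewall=1"
    print(switch_nic_model(current))
```

The resulting string would then be applied with something like `qm set <vmid> --net0 '<value>'`. Keeping the MAC avoids the guest seeing a brand-new interface identity, although the interface name inside the guest may still change with the different driver.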
 
