Sudden bulk stop of all VMs?

Thank you fiona.
I installed kernel 6.8.8-3-pve to match mrpops2ko's setup.
I updated the microcode according to the docs but could not get the latest version, so I used the "random third party script" (a sketch of the documented route is below).
I adapted the grub cmdline to GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_acs_override=downstream,multifunction amd_iommu=on iommu=pt initcall_blacklist=sysfb_init amd_iommu=force_enable"
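For reference, this is roughly what the documented route looks like (a sketch assuming Debian Bookworm sources; the third-party script essentially ships a newer build of the same package):

Code:
# /etc/apt/sources.list needs the non-free-firmware component, e.g.:
#   deb http://deb.debian.org/debian bookworm main contrib non-free-firmware
apt update
apt install amd64-microcode   # AMD hosts; use intel-microcode on Intel
# the new microcode is loaded early on the next boot; verify with:
journalctl -k | grep -i microcode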


This leads to the configuration and versions below.

Now let's put some load on it and see if it can run stable...

proxmox-ve: 8.2.0 (running kernel: 6.8.8-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8.8-3-pve: 6.8.8-3
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-3-pve-signed: 6.5.13-3
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
amd64-microcode: 3.20240116.2+nmu1
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.2-1
proxmox-backup-file-restore: 3.2.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2
 
Unfortunately, it did not do the trick for us; we had a crash after 8 hours of uptime with the configuration quoted in the previous post...
Open to any suggestions anyone might have.
 
Did you do update-grub and update-initramfs -u after making the changes?
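For reference (the proxmox-boot-tool note is an assumption about the boot setup; it only applies to hosts booted via proxmox-boot-tool, e.g. ZFS on root with systemd-boot):

Code:
update-grub           # regenerate /boot/grub/grub.cfg from /etc/default/grub
update-initramfs -u   # rebuild the initramfs for the current kernel
# on proxmox-boot-tool managed systems the cmdline lives in
# /etc/kernel/cmdline instead; edit it and then run:
#   proxmox-boot-tool refresh
# after rebooting, confirm the parameters actually took effect:
cat /proc/cmdline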
 
I believe so, but I can do it again; it won't hurt.
But I have found something else: out of all our boards (ASRock Rack B650D4U), I checked the serial numbers of a few of them:
M80-GC025700831 => random reboot
M80-GC025700764 => random reboot
M80-GC025700102 => random reboot
M80-GB010200215 => stable
M8P-FC000500019 => stable
M8P-FC000500021 => stable
M8P-FC000500037 => stable
So on our side this might be a hardware problem with motherboards whose serials begin with M80-GC025700XXX.
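For anyone who wants to compare serials without opening the case, the board serial can be read from the OS (assuming dmidecode is installed):

Code:
dmidecode -s baseboard-serial-number   # just the serial
dmidecode -t baseboard                 # full baseboard record (vendor, model, serial)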

Just ordered a replacement board and will do a motherboard swap this week in an unstable node.
 
Thanks for sharing all those details!

One of our new Hetzner servers uses almost the same mainboard (ASRock Rack B665D4U-1L) and has the same problem.
Our serial is M80-G4007900353, so kinda below your highest good one, but that doesn't really say too much, especially with the slightly different model.

Hetzner performed hardware tests yesterday with no result (i.e. the hardware is considered OK), and we upgraded to kernel 6.8.8-4-pve and already had another reboot. I'll ask to be transferred to an ASUSTeK "Pro WS 665-ACE" or the like, which runs our other nodes smoothly.
 
Updates from our side:
We ordered and received a new B650D4U and a Supermicro H13SAE-MF.
The new MB serial is M80-H3015802308, and it is also affected (it held 6 hours with our test load on a machine with the software configuration pasted previously).
We migrated the test workload to the Supermicro board for the weekend and will see if it remains stable.
There is another post about this issue on the ServeTheHome forum, where I gathered 2 other serial numbers for the B650D4U, one stable and one unstable.
Summary:
M80-GC025700831 => random reboot
M80-GC025700764 => random reboot
M80-GC025700102 => random reboot
M80-GC006700000 => random reboot
M80-H3015802308 => random reboot (ordered 10/08/2024)

M80-GB010200215 => stable
M8P-FC000500019 => stable
M8P-FC000500021 => stable
M8P-FC000500037 => stable
M80-G5005000000 => stable

Tested kernels: 6.5.13-1, 6.8.8-3 (and a few others). Proxmox versions 8.1 and 8.2.
 
I hope I'm not writing this too early, but I believe my one node that had reboots every couple of days is now fixed, hopefully.
It's been running for 10 days already.

I have 2 Hetzner nodes with the AMD Ryzen 9 5950X. Both are practically identical, but the crashing one (n1) has NEWER microcode than the one that has been running stable with 39 days of uptime.

microcode, stable node: 0xa20102b
microcode, crashing node: 0xa20120e

So the crashing node actually had the newer microcode; I'm afraid to update the working node to this version to see if it will also crash.
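For anyone checking their own nodes, the revision can be read like this (the journalctl line assumes early microcode loading is enabled):

Code:
grep -m1 microcode /proc/cpuinfo    # revision the kernel is currently running
journalctl -k | grep -i microcode   # what was (re)loaded during this boot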

What fixed it (hopefully) for me is adding this grub cmdline from https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/post-691945:
GRUB_CMDLINE_LINUX_DEFAULT="ro quiet pcie_acs_override=downstream,multifunction amd_iommu=on iommu=pt initcall_blacklist=sysfb_init amd_iommu=force_enable"

and then running update-grub.

My cmdline looks like this now:
root@n1:~# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.8.8-4-pve root=ZFS=/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs ro quiet pcie_acs_override=downstream,multifunction amd_iommu=on iommu=pt initcall_blacklist=sysfb_init amd_iommu=force_enable crashkernel=384M-:512M

whereas the stable node is still at the defaults, like this:
root@n2:~# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.8.8-2-pve root=ZFS=/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
 
Had the same behavior with one of our servers, which has the B665D4U-1L (M80-GC015102252).
For now we just set the CPU type to x86-64-v2-AES instead of host to avoid crashes (3 days of uptime since then instead of the couple of hours we had before). The kernel command line parameters did not work (at least the ones I found a couple of days ago). A minimal example of the change is below.
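For anyone wanting to try the same, it's a one-liner (<vmid> is a placeholder; the VM needs a full stop/start to pick up the new CPU type):

Code:
qm set <vmid> --cpu x86-64-v2-AES
# equivalent to this line in /etc/pve/qemu-server/<vmid>.conf:
#   cpu: x86-64-v2-AES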
 
Looks like I wrote too early: after 12 days my system rebooted again. And it's not a crash; it looks like a proper reboot, I just don't know what caused it. o_O

Aug 21 10:45:32 n1 systemd[1]: Stopping user@0.service - User Manager for UID 0...
Aug 21 10:45:32 n1 systemd[2491568]: Activating special unit exit.target...
Aug 21 10:45:32 n1 systemd[2491568]: Stopped target default.target - Main User Target.
Aug 21 10:45:32 n1 systemd[2491568]: Stopped target basic.target - Basic System.
Aug 21 10:45:32 n1 systemd[2491568]: Stopped target paths.target - Paths.
Aug 21 10:45:32 n1 systemd[2491568]: Stopped target sockets.target - Sockets.
Aug 21 10:45:32 n1 systemd[2491568]: Stopped target timers.target - Timers.
Aug 21 10:45:32 n1 systemd[2491568]: Closed dirmngr.socket - GnuPG network certificate management daemon.
Aug 21 10:45:32 n1 systemd[2491568]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Aug 21 10:45:32 n1 systemd[2491568]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Aug 21 10:45:32 n1 systemd[2491568]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Aug 21 10:45:32 n1 systemd[2491568]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Aug 21 10:45:32 n1 systemd[2491568]: Removed slice app.slice - User Application Slice.
Aug 21 10:45:32 n1 systemd[2491568]: Reached target shutdown.target - Shutdown.
Aug 21 10:45:32 n1 systemd[2491568]: Finished systemd-exit.service - Exit the Session.
Aug 21 10:45:32 n1 systemd[2491568]: Reached target exit.target - Exit the Session.
Aug 21 10:45:32 n1 systemd[1]: user@0.service: Deactivated successfully.
Aug 21 10:45:32 n1 systemd[1]: Stopped user@0.service - User Manager for UID 0.
Aug 21 10:45:32 n1 systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Aug 21 10:45:32 n1 systemd[1]: run-user-0.mount: Deactivated successfully.
Aug 21 10:45:32 n1 systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
Aug 21 10:45:32 n1 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Aug 21 10:45:32 n1 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
Aug 21 10:45:32 n1 systemd[1]: user-0.slice: Consumed 7.795s CPU time.
-- Boot 479325ae04fb458f96a20dda9eae4b26 --
Aug 21 10:47:07 n1.srvx.ch kernel: Linux version 6.8.8-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-4 (2024-07-26T11:15Z) ()

journalctl -b -1 -p 3 shows a lot of these errors, which I don't see on the working node:
Aug 21 10:06:22 n1 pveproxy[2463072]: got inotify poll request in wrong process - disabling inotify
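For anyone comparing their nodes the same way, these are the commands I'm using (-b -1 addresses the previous boot):

Code:
journalctl --list-boots            # enumerate recorded boots
journalctl -b -1 -p 3 --no-pager   # priority err and worse from the previous boot
journalctl -b -1 -e                # tail of the previous boot, right before the reboot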
 
We have been having this problem on 2 hypervisors running 7950X3D and 7950X CPUs; we run about 50+ in total, 8 of which are 7950X3Ds and one of which is a 7950X. I am confident this is not a hardware problem: we ran memtests, and after migrating all VMs to another server this kept reoccurring. We use the ASRock Rack B650D4U for the 7950X, and the SM H13SAE-MF for the 7950X3Ds.

This issue initially presented as random reboots every 3-4 days. A few days back, however, while deploying a Windows VM manually, I found I could replicate it fairly consistently. It pretty much always crashes after starting the installation, at the "Getting Files Ready" step, right at about 10-12%. It doesn't always crash the host, but I would say it crashes about 80% of the time.

I have downgraded the kernel to 6.5 and 6.2 without any improvement. Temporarily, the server seemed to be fine after I changed the Windows VMs on this hypervisor to Epyc-IBPB instead of host; however, in order to avoid manually setting arguments or custom CPU models for nested virtualization (see the sketch below), I set them to max, and that also resulted in a crash after about 29 hours of uptime.
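For reference, the custom CPU model route I wanted to avoid looks roughly like this (per the custom CPU model feature in the Proxmox docs; the name epyc-nested is made up here, and +svm is what exposes nested virtualization on AMD):

Code:
# /etc/pve/virtual-guest/cpu-models.conf (cluster-wide)
cpu-model: epyc-nested
    reported-model EPYC-IBPB
    flags +svm

# then reference it in a VM config with the custom- prefix:
#   qm set <vmid> --cpu custom-epyc-nested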

I am attaching a snippet of the kernel logs right as it reboots:

Code:
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-e820: [mem 0x000000000a200000-0x000000000a211fff] ACPI NVS
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-e820: [mem 0x0000000009aff000-0x0000000009ffffff] reserved
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009afefff] usable
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-provided physical RAM map:
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel:   zhaoxin   Shanghai
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel:   Centaur CentaurHauls
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel:   Hygon HygonGenuine
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel:   AMD AuthenticAMD
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel:   Intel GenuineIntel
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: KERNEL supported cpus:
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: Command line: BOOT_IMAGE=/vmlinuz-6.8.12-1-pve root=UUID=d90b4fc8-f902-4684-8373-c016b9391ece ro quiet
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) ()
-- Boot 00fad2572155416a9b059adf6e113050 --
Aug 23 08:17:01 ds-hv-kvmcompute-30 CRON[579120]: pam_unix(cron:session): session closed for user root
Aug 23 08:17:01 ds-hv-kvmcompute-30 CRON[579121]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Aug 23 08:17:01 ds-hv-kvmcompute-30 CRON[579120]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 23 08:15:12 ds-hv-kvmcompute-30 sshd[578103]: Disconnected from authenticating user root 218.92.0.22 port 13925 [preauth]
Aug 23 08:15:12 ds-hv-kvmcompute-30 sshd[578103]: Received disconnect from 218.92.0.22 port 13925:11:  [preauth]
Aug 23 08:13:50 ds-hv-kvmcompute-30 sshd[577393]: Disconnected from authenticating user root 61.177.172.140 port 13133 [preauth]
Aug 23 08:13:50 ds-hv-kvmcompute-30 sshd[577393]: Received disconnect from 61.177.172.140 port 13133:11:  [preauth]
 
Could you tell us more about the history of
M80-GC025700831 => random reboot
M80-GC025700764 => random reboot
M80-GC025700102 => random reboot

Were these purchased at launch? Did they have the original BIOS for a long period?

I'm wondering if there's any relationship to this.
 
Chiming in here. I'm not using enterprise gear, just a lowly mini PC. The system is a Minisforum UM780 XTX, which has the Ryzen 7840HS CPU. The problem seems to be directly related to Zen 4 CPUs and the Linux kernel. We've been discussing this same problem for over a month now in the unofficial Proxmox Discord. What we've tracked it down to is running Windows VMs with CPU Type = host. My system has rebooted randomly from day one with this issue. There is no difference in behavior between kernels 6.5 and 6.8. We've tried all kinds of kernel parameters to no avail.

This is the initial thread that I communicated with other folks regarding the issue:

https://forum.proxmox.com/threads/win11-vm-opening-many-tabs-at-once-crashes-proxmox-host.140670/

I have been successful in getting the system to stop randomly rebooting by NOT using CPU Type = host. It is the only fix I know of at this time. This is my CPU config in the VM conf file:

cpu: x86-64-v4,hidden=1,flags=+virt-ssbd;+amd-ssbd;+aes


There is an open ticket on the Linux kernel bugtracker, opened on 7/6/2024, that has had zero traction. I think we need to start making some noise so the kernel developers resolve this issue, because all signs point to the kernel being the problem based on my testing and that of other people.

https://bugzilla.kernel.org/show_bug.cgi?id=219009


Also, a Reddit thread that describes the same exact issue: https://old.reddit.com/r/Proxmox/comments/1cym3pl/nested_virtualization_crashing_ryzen_7000_series/


Swapping out hardware isn't going to fix the problem. It is software related.
 
On our side we have a Proxmox node randomly rebooting even when there is no load at all (no VMs configured, just Proxmox booted and connected to the cluster and NFS shares).
The server just reboots after a few hours, or a few days.

We memtested it for 48 hours without issue, so I would be keen on trusting the hardware.
Especially as we have multiple configurations with the exact same hardware, some stable for 100 days, BUT maybe there is a combination of factors.

@mrpops2ko I'll check the dates we ordered the boards, but I would think the whole batch had a BIOS update before we put them into production.

@SagnikS on our side, the server where we swapped the board from B650D4U to H13SAE has now been running for 9 days, but with only KVM Linux guests.
We have the issue with the 7950X, the 7950X3D, and even the 7900X.
 
Glad to hear it's not just me! I had initially (wrongly?) suspected that Windows VMs were at fault, but we have other hypervisors on the same hardware which seem to be just fine; only two are affected. I deployed a spare one recently, and it also started to reboot randomly after I live migrated the VMs over to it. Both servers are fully memtested, etc. One is a B650D4U and the other is an H13SAE-MF. However, the node without VMs appears to be somewhat stable (for at least a few days; the one with VMs on it reboots consistently every few hours and has never crossed 3-4 days of uptime).
 
Especially as we have multiple configurations with the exact same hardware, some stable for 100 days, BUT maybe there is a combination of factors.
And yes, exactly the same here, some have uptime upwards of 3 months! However, I think the common factor is that only the AMD Ryzen 7000 series appears to be affected.

I have been successful in getting the system to stop randomly rebooting by NOT using CPU Type = host.
I was having crashes even with the CPU set to max, so I assume it must be a problem with nested virtualization. One host-wide way to test that is sketched below.
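A sketch of that test (the kvm_amd nested parameter defaults to on; the file name under /etc/modprobe.d/ is my own choice):

Code:
echo 'options kvm_amd nested=0' > /etc/modprobe.d/kvm-amd-nested.conf
update-initramfs -u   # in case the module is loaded from the initramfs
reboot
# verify after reboot (0 = nested virtualization disabled):
cat /sys/module/kvm_amd/parameters/nested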
 
I was betting on the motherboard, so I had great hopes for the Supermicro H13SAE (which has been stable since it was put into production 9 days ago).
Now if you have the same issue, that would leave three options:
- the processor
- a software bug
- a mix of both

The stability of some nodes would point to the CPU. I will put a B650D4U with a new CPU and test; I don't think we tried a CPU swap already.
 
