For those who don't like running random third-party scripts, here is the official documentation regarding CPU firmware (as already mentioned in comment #6): https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu
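For anyone who just wants the gist, here is a minimal sketch of what that documentation describes for an AMD host, assuming a standard Debian 12 / PVE 8 setup (check the linked page for the exact repository line for your release):

# Sketch only; follow the linked documentation for specifics.
# 1) Make sure the non-free-firmware component is enabled, e.g. in /etc/apt/sources.list:
#    deb http://deb.debian.org/debian bookworm main contrib non-free-firmware
apt update
apt install amd64-microcode        # intel-microcode on Intel hosts
reboot
# 2) After the reboot, confirm the microcode actually loaded:
journalctl -k | grep -i microcode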
proxmox-ve: 8.2.0 (running kernel: 6.8.8-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8.8-3-pve: 6.8.8-3
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-3-pve-signed: 6.5.13-3
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
amd64-microcode: 3.20240116.2+nmu1
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.2-1
proxmox-backup-file-restore: 3.2.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2
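For reference, a version listing like the one above is produced on the node itself with:

pveversion -v        # verbose list of the PVE stack and related package versions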
> did you do update-grub and update-initramfs -u after making the changes?

Unfortunately it did not do the trick for us; we had a crash after 8 hours of uptime with the configuration quoted in the previous post...
Open to any suggestions anyone might have.
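For completeness, a sketch of the sequence that question refers to, assuming the change was a kernel command-line parameter; the parameter shown is purely illustrative and not a recommendation from this thread:

# Illustrative only: apply a kernel command-line change and regenerate the boot files.
nano /etc/default/grub       # e.g. GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=5"
update-grub                  # rewrites /boot/grub/grub.cfg
update-initramfs -u          # rebuilds the initramfs for the current kernel
# Hosts booting via proxmox-boot-tool (e.g. ZFS root on UEFI) keep the command line
# in /etc/kernel/cmdline and refresh with: proxmox-boot-tool refresh
reboot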
> Had the same behavior with one of our servers which has the B665D4U-1L (M80-GC015102252)

Thanks for sharing all those details!
One of our new Hetzner servers uses almost the same mainboard (ASRockRack B665D4U-1L) and has the same problem.
Our serial is M80-G4007900353, so kinda below your highest good one, but that doesn't really say too much especially with the slightly different model.
Hetzner did perform hardware tests yesterday with no result (i.e. the hardware is considered OK), and we upgraded to kernel 6.8.8-4-pve and already had another reboot. I'll ask to be transferred to an ASUSTeK "Pro WS 665-ACE" or the like, which runs our other nodes smoothly.
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-e820: [mem 0x000000000a200000-0x000000000a211fff] ACPI NVS
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-e820: [mem 0x0000000009aff000-0x0000000009ffffff] reserved
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009afefff] usable
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: BIOS-provided physical RAM map:
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: zhaoxin Shanghai
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: Centaur CentaurHauls
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: Hygon HygonGenuine
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: AMD AuthenticAMD
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: Intel GenuineIntel
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: KERNEL supported cpus:
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: Command line: BOOT_IMAGE=/vmlinuz-6.8.12-1-pve root=UUID=d90b4fc8-f902-4684-8373-c016b9391ece ro quiet
Aug 23 08:19:38 ds-hv-kvmcompute-30 kernel: Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) ()
-- Boot 00fad2572155416a9b059adf6e113050 --
Aug 23 08:17:01 ds-hv-kvmcompute-30 CRON[579120]: pam_unix(cron:session): session closed for user root
Aug 23 08:17:01 ds-hv-kvmcompute-30 CRON[579121]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Aug 23 08:17:01 ds-hv-kvmcompute-30 CRON[579120]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 23 08:15:12 ds-hv-kvmcompute-30 sshd[578103]: Disconnected from authenticating user root 218.92.0.22 port 13925 [preauth]
Aug 23 08:15:12 ds-hv-kvmcompute-30 sshd[578103]: Received disconnect from 218.92.0.22 port 13925:11: [preauth]
Aug 23 08:13:50 ds-hv-kvmcompute-30 sshd[577393]: Disconnected from authenticating user root 61.177.172.140 port 13133 [preauth]
Aug 23 08:13:50 ds-hv-kvmcompute-30 sshd[577393]: Received disconnect from 61.177.172.140 port 13133:11: [preauth]
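An excerpt like the one above, reading backwards across the reboot, can be pulled straight from the journal; a minimal sketch, assuming persistent journaling is enabled:

journalctl --list-boots      # enumerate recorded boots and their IDs
journalctl -b -1 -e          # jump to the end of the previous boot's log
journalctl -r | less         # everything newest-first, with "-- Boot ... --" separators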
> could you tell us more about the history of

I believe so, but I can do it again, won't hurt.
But I have found something else: out of all our boards (ASRockRack B650D4U) I checked the serial numbers of a few of them:
M80-GC025700831 => random reboot
M80-GC025700764 => random reboot
M80-GC025700102 => random reboot
M80-GB010200215 => stable
M8P-FC000500019 => stable
M8P-FC000500021 => stable
M8P-FC000500037 => stable
So on our side this might be a hardware problem with motherboards whose serials begin with M80-GC025700XXX.
Just ordered a replacement board and will do a motherboard swap this week on an unstable node.
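If it helps anyone comparing their own boards, the baseboard model and serial number can be read from the running system; a minimal sketch using dmidecode:

dmidecode -s baseboard-manufacturer
dmidecode -s baseboard-product-name
dmidecode -s baseboard-serial-number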
cpu: x86-64-v4,hidden=1,flags=+virt-ssbd;+amd-ssbd;+aes
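That line is the cpu: option from the VM config in /etc/pve/qemu-server/<vmid>.conf; for reference, a sketch of setting it from the CLI, with 999 as a placeholder VMID:

# 999 is a placeholder VMID; the quotes keep the ';'-separated flags together.
qm set 999 --cpu 'x86-64-v4,hidden=1,flags=+virt-ssbd;+amd-ssbd;+aes'
qm config 999 | grep '^cpu:'     # verify the resulting config line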
Glad to hear it's not just me! I had initially (wrongly?) suspected that Windows VMs were at fault, but we have other hypervisors on the same hardware which seem to be just fine; only two are affected. I deployed a spare one recently, and it also starts to reboot randomly after I live migrate the VMs from one to the other. Both servers are fully memtested, etc. One is a B650D4U and the other is an H13SAE-MF. However, the node without VMs appears to be somewhat stable (at least for a few days), while the one with VMs on it reboots consistently every few hours and has never crossed 3-4 days of uptime.

On our side we have the Proxmox node randomly rebooting even when there is no load at all (no VM configured, just Proxmox booted and connected to the cluster and NFS shares).
The server just reboots after a few hours, a few days...
We memtested it for 48 hours without issue, so I would be keen on trusting the hardware.
Especially as we have multiple configurations with the exact same hardware, some stable for 100 days, BUT maybe there is a combination of factors.
@mrpops2ko I'll check the dates we ordered the boards, but I would think the whole batch had a BIOS update before we put them into production.
@SagnikS on our side, the server where we swapped the board from a B650D4U to an H13SAE has now been running for 9 days, but with only KVM Linux guests...
We have the issue with the 7950X, the 7950X3D, and even the 7900X.
> Especially as we have multiple configurations with the exact same hardware, some stable for 100 days, BUT maybe there is a combination of factors.

And yes, exactly the same here, some have uptime upwards of 3 months! However, I think the common factor here is that only AMD Ryzen 7000 series CPUs appear to be affected.
> I have been successful with getting the system to stop randomly rebooting by NOT using CPU Type = host.

I was having crashes even with the CPU set to max, so I assume it must be a problem with nested virtualization then.
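For anyone who wants to probe that nested-virtualization theory, a minimal sketch of checking and disabling nesting for kvm_amd; this is only one way to test the hypothesis, not a confirmed fix from this thread (the modprobe.d filename is arbitrary):

cat /sys/module/kvm_amd/parameters/nested            # 1 / Y means nested virtualization is enabled
echo "options kvm_amd nested=0" > /etc/modprobe.d/kvm-amd-nested.conf
# takes effect the next time the module loads; reload it (with no VMs running) or reboot:
modprobe -r kvm_amd && modprobe kvm_amd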