Sudden bulk stop of all VMs?

We at least considered a faulty PSU and swapped it for a fully functional 1000W beQuiet quite early during troubleshooting.
I also booted "Windows To Go" on more than one machine, as we originally had the issue on 2 of 3 nodes in a hardware-identical cluster.
I ran memory-heavy tasks and a CPU stress test for more than 3 hours without any issues.
 
My gut feeling tells me we are looking at an AMD/Linux kernel bug here.

Or maybe an AMD+KVM+Windows-guest-only bug.
 
I can at least confirm that, for me, with vCPUs < physical CPUs the problem node now survived the night, which wasn't the case before.
Sadly, this leaves me with 2 nodes in my 3-node cluster, which basically can't scale any services or anything anymore.
This isn't a homelab but company infrastructure, so it would be super cool if there were a fix sometime, or at least a root cause to troubleshoot with.
For now I will probably end up scaling the Proxmox cluster horizontally.
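
For anyone who wants to check the same condition on their own node, here is a minimal sketch of the "vCPUs < physical CPUs" comparison. The VM list and its "sockets"/"cores" keys are hypothetical stand-ins for whatever your inventory provides, and whether "physical CPUs" should mean cores or threads (16 vs. 32 on a 7950X) is an assumption you need to pick yourself.

```python
def total_vcpus(vms):
    """Sum sockets * cores over all VMs.

    Each VM is a dict; the "sockets" and "cores" keys are hypothetical
    names used for illustration, both defaulting to 1.
    """
    return sum(vm.get("sockets", 1) * vm.get("cores", 1) for vm in vms)


def stays_below_host_cpus(vms, host_cpus):
    """Return True if the assigned vCPUs are strictly below host_cpus.

    host_cpus is whatever you count as "physical CPUs" (cores or threads);
    that choice is an assumption, the thread does not settle it.
    """
    return total_vcpus(vms) < host_cpus


# Hypothetical example inventory for one node.
example_vms = [
    {"name": "win-a", "sockets": 1, "cores": 8},
    {"name": "win-b", "sockets": 2, "cores": 4},
]
```

With the example inventory, `stays_below_host_cpus(example_vms, host_cpus=32)` is true, while counting only 16 physical cores makes it false, because 16 vCPUs is not strictly below 16.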
 
Hi,
we also use the 7950X.
We had these crashes without any traces up to last year, and they stopped as soon as we stopped using the "host" CPU type for our Windows VMs. But now that we need nested virtualization again, we went back to using "host", and since yesterday the crashes have reappeared. Is there any experience with the "svm" flag for other CPU types and/or custom CPU types (like here)?
Or is the solution really to limit my hypervisor to not use more than its physically available cores when using CPU type "host"? That seems like a really weird solution.
Additionally, it's only Windows VMs that cause the crashes; another host, which is also overprovisioned, has always run smoothly.
Did you already try the 6.8 opt-in kernel? Do you have the latest BIOS updates and CPU microcode installed? See: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu

Is there anything in the system logs/journal (if not you could still try to run journalctl -f from another system via SSH as the logs might not make it to disk)?
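
The remote-logging idea can be sketched roughly like this: run `journalctl -f` on the node through SSH and append every line to a file on the machine you are watching from, so the last messages before a crash survive even if they never reach the node's disk. The host name and log file name in the usage note are placeholders, and key-based SSH login is assumed.

```python
import subprocess


def build_follow_command(host):
    """Return the argv for following the journal of `host` over SSH."""
    return ["ssh", host, "journalctl", "-f"]


def stream_journal(host, logfile):
    """Append the remote journal to a local file until SSH exits.

    Each line is flushed immediately so that the local copy is as
    current as possible when the remote node goes down.
    """
    with open(logfile, "ab") as out:
        proc = subprocess.Popen(build_follow_command(host), stdout=subprocess.PIPE)
        for line in proc.stdout:
            out.write(line)
            out.flush()
        return proc.wait()
```

Usage would be something like `stream_journal("root@pve-node1", "pve-node1-journal.log")`, left running on a second machine overnight.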
 
Hi,

Did you already try the 6.8 opt-in kernel? Do you have the latest BIOS updates and CPU microcode installed? See: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu

Is there anything in the system logs/journal (if not you could still try to run journalctl -f from another system via SSH as the logs might not make it to disk)?
Thanks for reaching out.
I will try to find time to lab that stuff. For now I don't want sudden crashes in our production environment.
But no, I did not try the new kernel yet. Yes, I have the latest BIOS installed, so the microcode should be patched as well.
Logs are empty except for random weird temperatures from lm-sensors (180C on an HDD, which is basically not true).
As of today the problem is mitigated with the above workaround. I ran journalctl -f via SSH overnight, but there was no crash, so no help.
 
For CPU microcode, there is a dedicated package (amd64-microcode or intel-microcode) you need to install via APT: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu
I already read that resource.
What is the correct interpretation of „besides“?
Or do I misunderstand something?
Besides the recommended microcode update via persistent BIOS/UEFI updates, there is also an independent method via Early OS Microcode Updates. It is convenient to use and also quite helpful when the motherboard vendor no longer provides BIOS/UEFI updates.
 
what is the correct interpretation of „besides“?
You can update the CPU microcode with an update of the BIOS/UEFI or with the installation of the amd64-microcode/intel-microcode package.

The former microcode will be loaded with BIOS POST, the latter in the (early) boot process of the Linux kernel.
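
Whichever of the two paths loaded it, the revision that is actually running can be read back from the `microcode` field in `/proc/cpuinfo` on an x86 Linux host. The following is a minimal sketch of that check, not an official Proxmox tool; the function names are made up for illustration.

```python
def microcode_revisions(cpuinfo_text):
    """Collect the distinct values of the "microcode" field.

    cpuinfo_text is the content of /proc/cpuinfo; the field appears
    once per logical CPU, so a healthy system yields a single value.
    """
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


def read_running_microcode(path="/proc/cpuinfo"):
    """Read the running microcode revision(s) from the local host."""
    with open(path) as f:
        return microcode_revisions(f.read())
```

Comparing the result of `read_running_microcode()` before and after installing the package (and rebooting) shows whether the package brought anything newer than the BIOS already provides.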
 
You can update the CPU microcode with an update of the BIOS/UEFI or with the installation of the amd64-microcode/intel-microcode package.

The former microcode will be loaded with BIOS POST, the latter in the (early) boot process of the Linux kernel.
Then I don't understand Fiona's comment.
Because the BIOS is up to date, the microcode should already be patched persistently, so there is no need to patch it in the kernel.
 
Then I don't understand Fiona's comment.
Because the BIOS is up to date, the microcode should already be patched persistently, so there is no need to patch it in the kernel.
Right, sorry. It does depend on which versions are available. If the BIOS update already includes a newer (or same) version, you don't need the package.
 