Sudden Bulk stop of all VMs ?

intelliIT · Apr 18, 2024

We at least considered a faulty PSU and swapped out a fully functional 1000W beQuiet quite early during troubleshooting.
I also booted "Windows To Go" on more than one machine, as we originally had the issue on 2 of 3 nodes in a hardware-identical cluster.
I ran memory-heavy tasks and a CPU stress test for more than 3 hours without any issues.

zzz09700 · Apr 18, 2024

My gut feeling tells me we are looking at an AMD/Linux kernel bug here.

Or maybe an AMD+KVM+Windows guest only bug.

intelliIT · Apr 19, 2024

I can at least confirm that with vCPU < phys. CPU the problem node survived the night, which wasn't the case before.
Sadly, this leaves me with 2 nodes in my 3-node cluster, which basically can't scale any services or anything else anymore.
This isn't a homelab but company infrastructure, so it would be great if there were a fix at some point, or at least a root cause to troubleshoot with.
For now I will probably end up scaling the Proxmox cluster horizontally.
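
To check how far a node is overprovisioned, something like the following could help. It is only a sketch: it assumes it runs as root on the Proxmox node, counts sockets × cores from `qm config` for running VMs only, ignores the separate `vcpus` hotplug option, and compares against `nproc` (logical threads). The thread does not say whether "phys. CPU" means cores or threads, and the function names are made up for illustration.

```shell
#!/bin/sh
# Read "qm config" output on stdin and print sockets * cores.
# Both options default to 1 when they are not set in the config.
vcpus_from_config() {
    awk -F': ' '
        $1 == "cores"   { cores = $2 }
        $1 == "sockets" { sockets = $2 }
        END { print (cores ? cores : 1) * (sockets ? sockets : 1) }
    '
}

# Sum the vCPUs of all VMs that "qm list" reports as running.
total_running_vcpus() {
    total=0
    for vmid in $(qm list | awk '$3 == "running" { print $1 }'); do
        n=$(qm config "$vmid" | vcpus_from_config)
        total=$((total + n))
    done
    echo "$total"
}

# Example, on the node:
#   echo "running vCPUs: $(total_running_vcpus), host threads: $(nproc)"
```

If the first number is larger than the second, the node is overprovisioned in the sense discussed here.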

fiona · Apr 19, 2024

Hi,

intelliIT said:
We also use the 7950X.
We had these crashes without any traces up to last year, and they stopped as soon as we stopped using the "host" CPU type for our Windows VMs. But now that we need nested virtualization again, we went back to using "host", and since yesterday the crashes have reappeared. Is there any experience with the "svm" flag for other CPU types and/or custom CPU types (like here)?
Or is the solution really to limit my hypervisor to not use more than its physically available cores when using CPU type "host"? That seems like a really weird solution.
Additionally, only Windows VMs cause the crashes; another host, which is also overprovisioned, has always run smoothly.

Did you already try the 6.8 opt-in kernel? Do you have the latest BIOS updates and CPU microcode installed? See: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu

Is there anything in the system logs/journal? If not, you could still run journalctl -f from another system via SSH, as the logs might not make it to disk.
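
A minimal sketch of that remote capture, assuming key-based SSH access to the node. The host name, log file path, and function name are placeholders, not something from this thread.

```shell
#!/bin/sh
# Follow a node's journal over SSH and keep a local copy, so the last
# lines before a hard crash survive even if the node never writes them
# to its own disk.
follow_remote_journal() {
    host="$1"
    logfile="$2"
    # -f follows new entries; tee appends them to the local file while
    # still printing them to the terminal.
    ssh "$host" journalctl -f | tee -a "$logfile"
}

# Example (run from another machine, not from the node itself):
#   follow_remote_journal root@pve-node1 ./pve-node1-journal.log
```

After a crash, the tail of the local file holds whatever the node managed to log last.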

intelliIT · Apr 19, 2024

fiona said:
Hi,

Did you already try the 6.8 opt-in kernel? Do you have the latest BIOS updates and CPU microcode installed? See: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu

Is there anything in the system logs/journal? If not, you could still run journalctl -f from another system via SSH, as the logs might not make it to disk.

Thanks for reaching out.
I will try to find time to test that in a lab. For now I don't want sudden crashes in our production environment.
But no, I have not tried the new kernel yet. Yes, I have the latest BIOS installed, so the microcode should be patched as well.
The logs are empty except for random implausible temperatures from lm-sensors (180 °C on an HDD, which is clearly not accurate).
As of today the problem is mitigated by the above workaround. I ran journalctl -f via SSH overnight, but there was no crash, so it was no help.

fiona · Monday at 09:52

intelliIT said:
Yes, I have the latest BIOS installed, so the microcode should be patched as well.

For CPU microcode, there is a dedicated package (amd64-microcode or intel-microcode) you need to install via APT: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu
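
As a rough sketch of that install step, assuming a Debian-based Proxmox VE node, root access, and that the `non-free-firmware` repository component is already enabled as the linked documentation describes. The helper names are hypothetical.

```shell
#!/bin/sh
# Map the CPU vendor string from /proc/cpuinfo to the microcode package.
microcode_package_for() {
    case "$1" in
        AuthenticAMD) echo "amd64-microcode" ;;
        GenuineIntel) echo "intel-microcode" ;;
        *) echo "unknown vendor: $1" >&2; return 1 ;;
    esac
}

# Print the vendor of the first CPU listed in /proc/cpuinfo.
cpu_vendor() {
    awk -F': ' '/^vendor_id/ { print $2; exit }' /proc/cpuinfo
}

# On the node, as root:
#   apt update
#   apt install "$(microcode_package_for "$(cpu_vendor)")"
#   reboot   # the early microcode update is applied during boot
```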

intelliIT · Monday at 16:29

fiona said:
For CPU microcode, there is a dedicated package (amd64-microcode or intel-microcode) you need to install via APT: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu

I already read that resource.
What is the correct interpretation of „besides“?
Or do I misunderstand something?

Besides the recommended microcode update via persistent BIOS/UEFI updates, there is also an independent method via Early OS Microcode Updates. It is convenient to use and also quite helpful when the motherboard vendor no longer provides BIOS/UEFI updates.

Azunai333 · Monday at 16:39

intelliIT said:
What is the correct interpretation of „besides“?

You can update the CPU microcode with an update of the BIOS/UEFI or with the installation of the amd64-microcode/intel-microcode package.

The former microcode will be loaded with BIOS POST, the latter in the (early) boot process of the Linux kernel.

intelliIT · Monday at 18:17

Azunai333 said:
You can update the CPU microcode with an update of the BIOS/UEFI or with the installation of the amd64-microcode/intel-microcode package.

The former microcode will be loaded with BIOS POST, the latter in the (early) boot process of the Linux kernel.

Then I don't understand fiona's comment.
The BIOS is up to date, so the microcode should be patched persistently; there is no need to patch it in the kernel.

fiona · Tuesday at 15:10

intelliIT said:
Then I don't understand fiona's comment.
The BIOS is up to date, so the microcode should be patched persistently; there is no need to patch it in the kernel.

Right, sorry. It does depend on which versions are available. If the BIOS update already includes a newer (or same) version, you don't need the package.
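
One way to find out which case applies, as a sketch only: note the running microcode revision with just the BIOS update in place, install the package, reboot, and compare. The function names are made up for illustration, and the revision numbers in the example are invented values, not taken from any system in this thread.

```shell
#!/bin/sh
# Print the microcode revision the first CPU is currently running.
running_microcode_rev() {
    awk -F': ' '/^microcode/ { print $2; exit }' /proc/cpuinfo
}

# Compare two hex revisions, e.g. before and after installing the package.
# Prints "newer", "same" or "older" for the second value relative to the first.
microcode_rev_changed() {
    before=$(( $1 ))
    after=$(( $2 ))
    if [ "$after" -gt "$before" ]; then
        echo "newer"
    elif [ "$after" -eq "$before" ]; then
        echo "same"
    else
        echo "older"
    fi
}

# Example with invented revisions:
#   microcode_rev_changed 0x0a601203 0x0a601206
# The kernel also logs microcode information at boot:
#   journalctl -k | grep -i microcode
```

If the result is "same", the package did not supply anything newer than what the BIOS already loads.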
