Sudden bulk stop of all VMs?

We at least considered a faulty PSU and swapped the fully functional 1000W beQuiet out quite early during troubleshooting.
I also booted with "Windows To Go" on more than one machine, as we originally had the issue on 2 of 3 nodes in a hardware-identical cluster.
I ran memory-heavy tasks and a CPU stress test for more than 3 hours without any issues.
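For context, the load was along these lines (illustrative only, not the exact invocation we used):

Code:
# roughly 3 hours of combined CPU and memory load, stress-ng as an example tool
stress-ng --cpu $(nproc) --vm 4 --vm-bytes 80% --timeout 3h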
 
My gut feeling tells me we are looking at an AMD/Linux kernel bug here.

Or maybe an AMD+KVM+Windows-guest-only bug.
 
I can at least confirm that, with vCPU < physical CPUs, the problem node survived the night, which wasn't the case before.
Sadly, this leaves me with 2 nodes in my 3-node cluster, which basically can't scale any services anymore.
This isn't a homelab but company infrastructure, so it would be super cool if there were a fix at some point, or at least a root cause to troubleshoot with.
For now I will probably end up scaling the Proxmox cluster horizontally.
 
Hi,
we also use the 7950X.
We had these crashes without any traces up until last year, and they stopped as soon as we stopped using the "host" CPU type for our Windows VMs. But now that we need nested virtualization again, we went back to using "host", and since yesterday the crashes have reappeared. Is there any experience with the "svm" flag on other CPU types and/or custom CPU types (like here; see the sketch below)?
Or is the solution really to limit my hypervisor to no more vCPUs than physically available cores when using CPU type "host"? That seems like a really weird solution.
Additionally, it's only Windows VMs that cause the crashes; another host, which is also overprovisioned, has been running smoothly the whole time.
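For illustration, this is the kind of custom CPU model I mean, following the custom CPU model syntax from the PVE docs as I understand it (the model name here is made up):

Code:
# /etc/pve/virtual-guest/cpu-models.conf
cpu-model: x86-64-v2-aes-svm
    flags +svm
    reported-model kvm64

# then referenced in the VM config as:
# cpu: custom-x86-64-v2-aes-svm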
Did you already try the 6.8 opt-in kernel? Do you have the latest BIOS updates and CPU microcode installed? See: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu

Is there anything in the system logs/journal? (If not, you could still try to run journalctl -f from another system via SSH, as the logs might not make it to disk.)
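A minimal sketch of what that could look like (the hostname is a placeholder, adjust to your node):

Code:
# from another machine, stream the affected node's journal and keep a local copy
ssh root@pve-node1 'journalctl -f' | tee pve-node1-journal.log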
 
Hi,

Did you already try the 6.8 opt-in kernel? Do you have the latest BIOS updates and CPU microcode installed? See: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu

Is there anything in the system logs/journal? (If not, you could still try to run journalctl -f from another system via SSH, as the logs might not make it to disk.)
Thanks for reaching out.
I will try to find time to test that in a lab; for now I don't want sudden crashes in our production environment.
But no, I did not try the new kernel yet. Yes, I have the latest BIOS installed, so the microcode should be patched as well.
The logs are empty except for random weird temperatures from lm-sensors (180°C on an HDD, which is basically not true).
As of today the problem was mitigated with the above workaround. I ran journalctl -f via SSH overnight, but there was no crash, so no help.
 
For CPU microcode, there is a dedicated package (amd64-microcode or intel-microcode) you need to install via APT: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu
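A quick sketch of what that looks like in practice (the package lives in Debian's non-free-firmware component, so that needs to be enabled in the APT sources):

Code:
apt update
apt install amd64-microcode
# after a reboot, check whether an early microcode update was applied:
journalctl -k | grep -i microcode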
I already read that resource.
What is the correct interpretation of "besides"?
Or do I misunderstand something?
Besides the recommended microcode update via persistent BIOS/UEFI updates, there is also an independent method via Early OS Microcode Updates. It is convenient to use and also quite helpful when the motherboard vendor no longer provides BIOS/UEFI updates.
 
What is the correct interpretation of "besides"?
You can update the CPU microcode either with a BIOS/UEFI update or by installing the amd64-microcode/intel-microcode package.

The former is loaded during BIOS POST, the latter during the (early) boot process of the Linux kernel.
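If in doubt, the revision actually in use can be checked on the running system; a short sketch:

Code:
# microcode revision currently loaded on the CPU
grep -m1 microcode /proc/cpuinfo
# whether the kernel applied an early microcode update at boot
dmesg | grep -i microcode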
 
You can update the CPU microcode either with a BIOS/UEFI update or by installing the amd64-microcode/intel-microcode package.

The former is loaded during BIOS POST, the latter during the (early) boot process of the Linux kernel.
Then I don't understand Fiona's comment,
because the BIOS is up to date, so the microcode should be patched persistently and there should be no need for patching in the kernel.
 
Then I don't understand Fiona's comment,
because the BIOS is up to date, so the microcode should be patched persistently and there should be no need for patching in the kernel.
Right, sorry. It does depend on which versions are available. If the BIOS update already includes a newer (or same) version, you don't need the package.
 
OK, somehow we are back to sudden restarts.
We had a migration of 2 VMs onto the two affected hosts over the weekend; the HA config was wrong.
I corrected that and re-migrated, so the hosts are back to vCPU < cores.
Since then, the reboots are back, with one to two unexpected power cycles daily.
I had the chance to do SSH + journalctl -f, but there was basically no additional info.
Is there any news on this?
Should I try to get my hands on some kind of kernel crash dump, or would that not help?
Or is there any other info I can provide that would help?
I'm trying to get my hands on a minimal-config testing rig, but this will still take a couple of days/weeks (for the opt-in kernel etc.).
As said, this runs in production, so it would be very cool to get this sorted and stop the daily power cycles.
 
OK, somehow we are back to sudden restarts.
We had a migration of 2 VMs onto the two affected hosts over the weekend; the HA config was wrong.
I corrected that and re-migrated, so the hosts are back to vCPU < cores.
Since then, the reboots are back, with one to two unexpected power cycles daily.
I had the chance to do SSH + journalctl -f, but there was basically no additional info.
Is there any news on this?
Unfortunately not, as far as I'm aware. We also haven't had any other reports about such an issue, besides the few people in the forum here.

Should I try to get my hands on some kind of kernel crash dump, or would that not help?
If we are lucky, it would help. If you are using ZFS as the root FS, see here: https://forum.proxmox.com/threads/103293/post-445172 and the following post.
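Roughly, such a setup could look like the following; this is only a sketch, the ZFS-root specifics are in the linked post:

Code:
apt install kdump-tools
# reserve memory for the crash kernel by appending e.g. crashkernel=384M-:512M
# to the kernel command line (on a ZFS-root/systemd-boot install: /etc/kernel/cmdline)
proxmox-boot-tool refresh
# after the reboot, dumps of a later crash end up under /var/crash by default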

Or is there any other info I can provide that would help?
I'm trying to get my hands on a minimal-config testing rig, but this will still take a couple of days/weeks (for the opt-in kernel etc.).
As said, this runs in production, so it would be very cool to get this sorted and stop the daily power cycles.
I understand, but without any concrete logs/traces or being able to reproduce the issue ourselves, we can't hope to identify the issue, unfortunately.
 
I understand, but without any concrete logs/traces or being able to reproduce the issue ourselves, we can't hope to identify the issue, unfortunately.
Do you want to reproduce it?
I can give you my exact configuration. For us to recreate it (after giving up on our workaround), all it took was enabling nested virtualization on Windows so that Docker Desktop would work.
This might be, or probably is, related to our hardware combination, though.
I understand resources are limited and other things take precedence, but help would be really appreciated :)
If we are lucky, it would help. If you are using ZFS as the root FS, see here: https://forum.proxmox.com/threads/103293/post-445172 and the following post.
We do use ZFS. If it doesn't miraculously heal itself, I will try to get that working on one of the affected hosts. If they do stop cycling, I will wait for my test rig.
 
Hello all,

I have a similar problem (I believe connected to the start of a PBS job): 3 of 5 VMs go down.

May 14 07:05:03 pve3 qm[1266127]: VM 140 qmp command failed - VM 140 qmp command 'savevm-end' failed - client closed connection
May 14 07:05:03 pve3 qm[1266127]: VM 140 qmp command failed - VM 140 not running
May 14 07:05:10 pve3 pvescheduler[1266236]: ERROR: Backup of VM 104 failed - VM is locked (snapshot)
May 14 07:05:11 pve3 QEMU[593299]: kvm: ../block/graph-lock.c:300: bdrv_graph_rdlock_main_loop: Assertion `!qemu_in_coroutine()' failed.
May 14 07:05:12 pve3 qm[1266231]: VM 104 qmp command failed - VM 104 qmp command 'savevm-end' failed - client closed connection
May 14 07:05:12 pve3 qm[1266231]: VM 104 qmp command failed - VM 104 not running
May 14 07:05:12 pve3 pvescheduler[1266234]: VM 104 qmp command failed - VM 104 not running
May 14 07:05:13 pve3 pvescheduler[1266234]: VM 104 qmp command failed - VM 104 not running
May 14 07:05:14 pve3 qm[1266461]: VM is locked (backup)
May 14 07:05:14 pve3 qm[1266456]: <root@pam> end task UPID:pve3:0013531D:005B66BA:6642F10A:qmsnapshot:110:root@pam: VM is locked (backup)
May 14 07:05:16 pve3 qm[1266547]: VM is locked (backup)
May 14 07:05:16 pve3 qm[1266527]: <root@pam> end task UPID:pve3:00135373:005B6791:6642F10C:qmdelsnapshot:110:root@pam: VM is locked (backup)
May 14 07:10:19 pve3 pvescheduler[1273811]: VM 140 qmp command failed - VM 140 qmp command 'guest-ping' failed - got timeout
May 14 07:15:19 pve3 pvescheduler[1285415]: VM 140 qmp command failed - VM 140 qmp command 'guest-ping' failed - got timeout
May 14 08:04:26 pve3 pvescheduler[1266236]: job errors
May 14 08:05:10 pve3 QEMU[585975]: kvm: ../block/graph-lock.c:300: bdrv_graph_rdlock_main_loop: Assertion `!qemu_in_coroutine()' failed.
May 14 08:05:10 pve3 qm[1345263]: VM 110 qmp command failed - VM 110 qmp command 'savevm-end' failed - client closed connection
May 14 08:05:10 pve3 pvescheduler[1345008]: VM 110 qmp command failed - VM 110 qmp command 'guest-fsfreeze-thaw' failed - client closed connection
May 14 08:05:10 pve3 qm[1345263]: VM 110 qmp command failed - VM 110 not running

Can anyone help on this?
 
@intelliIT Out of curiosity, on the hosts with this problem are you doing any kind of PCIe sharing?

Also, and separate to that, are the systems configured to use an IOMMU?

Also, for completeness, what's the kernel command line they're booting with?

For example on a 5950X system here:

Code:
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.5.13-5-pve root=ZFS=/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction

That info might help figure out what's going wrong. :)
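If it helps, whether the IOMMU is actually active should show up in the kernel log, something like this (just a sketch):

Code:
# on an AMD system the IOMMU driver logs as "AMD-Vi"
dmesg | grep -i -e iommu -e amd-vi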
 
Hi,
Hello all,

I have a similar problem (I believe connected to the start of a PBS job): 3 of 5 VMs go down.

May 14 07:05:03 pve3 qm[1266127]: VM 140 qmp command failed - VM 140 qmp command 'savevm-end' failed - client closed connection
May 14 07:05:03 pve3 qm[1266127]: VM 140 qmp command failed - VM 140 not running
May 14 07:05:10 pve3 pvescheduler[1266236]: ERROR: Backup of VM 104 failed - VM is locked (snapshot)
May 14 07:05:11 pve3 QEMU[593299]: kvm: ../block/graph-lock.c:300: bdrv_graph_rdlock_main_loop: Assertion `!qemu_in_coroutine()' failed.
That sounds like a completely different issue. It's not your host that crashes, but a VM, with a clear assertion failure in the logs. And it doesn't seem to be because of backup, but because of snapshotting (backup doesn't use the savevm-end API). Please post the output of pveversion -v and qm config 140, qm config 110, qm config 104. Please also install the debugger and debug symbols with apt install pve-qemu-kvm-dbgsym gdb as well as apt install systemd-coredump. After the next crash, you can then get a backtrace by running coredumpctl gdb -1 and then, in GDB, running thread apply all backtrace.
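Put together as a copy-paste sketch (the same commands as above, nothing new):

Code:
pveversion -v
qm config 140; qm config 110; qm config 104
apt install pve-qemu-kvm-dbgsym gdb systemd-coredump
# after the next crash:
coredumpctl gdb -1
# then, inside GDB:
thread apply all backtrace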
 
Right, thank you Fiona. As you say, a problem with snapshotting (nothing to do with backup in itself). I am now updating the QEMU tools to see if that helps. The strange thing is that I have 2 nodes which were installed from an older ISO and are on 8.2.2 now (where no problems occur), and a new node installed with the new 8.2.2 ISO, which runs into trouble when a snapshot is taken.

One VM that was created on that new host doesn't have a problem with snapshotting; the VMs that were created previously show this behaviour.

Coming back with dumps once taken.
 
Here is the pveversion -v output:

proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.2-1
proxmox-backup-file-restore: 3.2.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.2.2-1
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2
root@pve3:~#
 
And the config of one of the affected VMs:

cat /etc/pve/qemu-server/104.conf
agent: 1,fstrim_cloned_disks=1
bios: ovmf
boot: order=virtio0
cores: 4
cpu: x86-64-v2-AES
efidisk0: zfs:vm-104-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
machine: pc-q35-8.1
memory: 8192
meta: creation-qemu=8.1.2,ctime=1707254210
name: roooms
net0: virtio=00:50:56:A8:58:C6,bridge=vmbr1,firewall=1
numa: 0
ostype: win10
parent: autohourly240514150503
protection: 1
scsihw: virtio-scsi-pci
smbios1: uuid=c8e030c8-c07b-4e39-a85f-f7947d06daed
sockets: 1
tags: vm_winsrv
virtio0: zfs:vm-104-disk-1,discard=on,iothread=1,size=80G
vmgenid: 5a87a72c-b2fc-46a3-be78-a4b4d73331cc
vmstatestorage: zfs
 
Code:
pve-qemu-kvm: 8.2.2-1
That version was only released on the testing repository and it's not production-ready. The plan from our side is to go straight to QEMU 9.0 (currently held up by a migration bug that should get fixed soonish), because 8.2 has some issues in combination with iothread that can't easily be fixed.

I'd suggest downgrading to the latest version from the enterprise or no-subscription repository, i.e. pve-qemu-kvm=8.1.5-6, or alternatively disabling iothread on your drive.
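A rough sketch of both options, using the VM 104 config posted above as the example (adjust to your setup):

Code:
# option 1: downgrade to the QEMU version from the stable repositories
apt install pve-qemu-kvm=8.1.5-6
# the downgraded binary is only used after a full stop/start of each VM

# option 2: disable iothread on the affected drive of VM 104
qm set 104 --virtio0 zfs:vm-104-disk-1,discard=on,iothread=0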
 
