[SOLVED] Temp and random VM freeze after upgrade to Proxmox 8.2

Hi,

we are experiencing random VM freezes after upgrading to Proxmox 8.2, which we didn't see prior to the upgrade. VMs seem to freeze randomly, hanging for 3-30 seconds, mostly in the 3-15 second range. On the worst affected VMs this can happen every 2-3 minutes.

When this occurs, everything for that specific VM is unresponsive: console, network, 'qm' commands from the hypervisor towards the VM, and so on. When it unfreezes, everything seems to continue as normal; nothing is killed or dies. This affects single VMs on a hypervisor running several VMs, and the other VMs are unaffected while the problem occurs. When a hang exceeds ~21 seconds we also get the infamous "watchdog: BUG: soft lockup - CPU#XXX stuck for XXXs!" messages.

The VMs affected by this problem seem to be our larger VMs, typically 1-2.8 TB of memory, running on hypervisors with 1.5-3 TB of memory. We have not been able to detect anything special happening or any scheduled jobs (backups, batch jobs etc.) when the problem occurs, so there is no obvious pattern.

The problem is temporarily solved by restarting the affected VM. The time it takes for the problem to reoccur after a reboot is typically 1-3 days, but we have also seen several weeks. Once the problem starts, it recurs quite frequently until the VM is restarted.

When we first detected this issue, it also seemed like online migration to another hypervisor had the same effect as restarting the VM (a fix for a few days), until our last test the other day, where even the migration job hung while the VM had the problem, and the end result after the migration was a VM running with the same problem as before the live migration. So the behaviour is not consistent with regard to live migration helping temporarily either.

What has been tried:
- changed all drives to use 'VirtIO SCSI single', iothread=1, aio=threads and cache='Default (No cache)' (see the example 'qm set' command after this list)
- all of the above have been tried individually and in combination, without any effect
- prior to the problem we used the same settings except with aio=io_uring and cache=writeback
- Proxmox patched to the latest available as of Aug 26th, kernel 6.8.x (tried various kernel versions)
- also tried pinning the Proxmox kernel to 6.5.x
- VM OSes are up to date with patches; the VMs are running SLES 15.4 and 15.5
- 'qm status <vmid> --verbose' on the hypervisor does not respond while a VM hangs, only once the hang is resolved, and then gives output like at the bottom of this post
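
For reference, a change like the first item can be applied per disk with 'qm set' (a sketch using the VM ID and disk from the config further down in this post; the cache/aio change only takes effect after a full stop/start of the VM):

# qm set 2427 --scsihw virtio-scsi-single
# qm set 2427 --scsi0 ark:vm-2427-disk-0,iothread=1,aio=threads,cache=none,discard=on,ssd=1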

We have set up smokeping to detect/log the hangs, pinging from the firewall towards the VMs' VLAN, and we see packet drops whenever a VM freezes.
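
A simpler ad-hoc check in the same spirit (a hedged sketch; the target IP and log path are placeholders) is to log every missed ICMP reply with a timestamp from another host:

# ping -O <vm-ip> | grep --line-buffered 'no answer' | while read line; do echo "$(date -Is) $line"; done >> /var/log/vm-freeze.log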

Google and Proxmox forum searches seem to hint at AIO and QEMU locking issues such as https://bugzilla.kernel.org/show_bug.cgi?id=199727 or https://forum.proxmox.com/threads/proxmox-freezes-randomly.83102/page-2 and others, which appear to show the same symptoms we experience.

There is nothing in the OS logs of the hanging VMs when the problem occurs, other than 'dmesg' showing the 'CPU stuck' messages when a hang takes too long to resolve. There is also nothing in the hypervisor logs.

Any ideas for other things to try, or are we barking up the wrong tree and this is not the well-known QEMU/AIO issue?
Any help is gratefully received, as we are running out of ideas to test/debug.


Our environment:
- We are using the enterprise Proxmox repos, running on Dell R640/R650 (dual Intel Xeon Gold) and R840 (quad-socket Intel Xeon Gold) servers connected to an external Ceph storage cluster.
- BIOS and microcode are up to date.


### pveversion --verbose ###
proxmox-ve: 8.2.0 (running kernel: 6.5.13-6-pve)
pve-manager: 8.2.3 (running version: 8.2.3/b4648c690591095f)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-13
proxmox-kernel-6.8: 6.8.8-4
proxmox-kernel-6.8.8-4-pve-signed: 6.8.8-4
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.5.13-6-pve: 6.5.13-6
pve-kernel-5.4: 6.4-5
pve-kernel-5.15.152-1-pve: 5.15.152-1
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 18.2.2-pve1
ceph-fuse: 18.2.2-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx9
intel-microcode: 3.20240514.1~deb12u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.3
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.2.3-1
proxmox-backup-file-restore: 3.2.3-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.13-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1


### typical config for an affected VM ###
agent: 1
boot: order=scsi0;ide2;net0
ciuser: root
cores: 14
cpu: custom-custom-model-host-bits
hotplug: disk,network,usb,cpu
ide2: none,media=cdrom
memory: 1433600
meta: creation-qemu=6.2.0,ctime=1660632446
name: ark-abpdb-p
net0: virtio=FA:81:28:6B:BB:F5,bridge=vmbr0,firewall=1,tag=49
numa: 1
ostype: l26
scsi0: ark:vm-2427-disk-0,cache=writeback,discard=on,iothread=1,size=20G,ssd=1
scsi1: ark:vm-2427-program,cache=writeback,discard=on,iothread=1,serial=drive-scsi1_program,size=200G,ssd=1
scsi2: ark:vm-2427-dbdata,cache=writeback,discard=on,iothread=1,serial=drive-scsi2_dbdata,size=2325G,ssd=1
scsi3: ark:vm-2427-dblog,backup=0,cache=writeback,discard=on,iothread=1,serial=drive-scsi3_dblog,size=500G,ssd=1
scsihw: virtio-scsi-single
searchdomain: xxx.yyy.zz
serial0: socket
smbios1: uuid=651759bf-b4cb-4e2b-9b3b-58f495fxe5db
sockets: 2
vcpus: 20
vmgenid: 47c4b65a-5728-4395-bcc9-f518cc190f47



### qm status of a hanging VM (taken just after unfreeze, as the command hangs while the VM is frozen) ###

# qm status 2427 --verbose
balloon: 1503238553600
ballooninfo:
actual: 1503238553600
free_mem: 59277422592
last_update: 1724674499
major_page_faults: 1761
max_mem: 1503238553600
mem_swapped_in: 0
mem_swapped_out: 0
minor_page_faults: 4464968404
total_mem: 1479614443520
blockstat:
ide2:
account_failed: 1
account_invalid: 1
failed_flush_operations: 0
failed_rd_operations: 0
failed_unmap_operations: 0
failed_wr_operations: 0
failed_zone_append_operations: 0
flush_operations: 0
flush_total_time_ns: 0
invalid_flush_operations: 0
invalid_rd_operations: 0
invalid_unmap_operations: 0
invalid_wr_operations: 0
invalid_zone_append_operations: 0
rd_bytes: 0
rd_merged: 0
rd_operations: 0
rd_total_time_ns: 0
timed_stats:
unmap_bytes: 0
unmap_merged: 0
unmap_operations: 0
unmap_total_time_ns: 0
wr_bytes: 0
wr_highest_offset: 0
wr_merged: 0
wr_operations: 0
wr_total_time_ns: 0
zone_append_bytes: 0
zone_append_merged: 0
zone_append_operations: 0
zone_append_total_time_ns: 0
scsi0:
account_failed: 1
account_invalid: 1
failed_flush_operations: 0
failed_rd_operations: 0
failed_unmap_operations: 0
failed_wr_operations: 0
failed_zone_append_operations: 0
flush_operations: 17728
flush_total_time_ns: 79115966259
idle_time_ns: 3195394
invalid_flush_operations: 0
invalid_rd_operations: 0
invalid_unmap_operations: 0
invalid_wr_operations: 0
invalid_zone_append_operations: 0
rd_bytes: 111669248
rd_merged: 0
rd_operations: 12010
rd_total_time_ns: 18912834931
timed_stats:
unmap_bytes: 2235260928
unmap_merged: 0
unmap_operations: 8184
unmap_total_time_ns: 432059196
wr_bytes: 11224195072
wr_highest_offset: 5702189056
wr_merged: 0
wr_operations: 261666
wr_total_time_ns: 437550976027
zone_append_bytes: 0
zone_append_merged: 0
zone_append_operations: 0
zone_append_total_time_ns: 0
scsi1:
account_failed: 1
account_invalid: 1
failed_flush_operations: 0
failed_rd_operations: 0
failed_unmap_operations: 0
failed_wr_operations: 0
failed_zone_append_operations: 0
flush_operations: 108840
flush_total_time_ns: 252493584008
idle_time_ns: 19411454172
invalid_flush_operations: 0
invalid_rd_operations: 0
invalid_unmap_operations: 0
invalid_wr_operations: 0
invalid_zone_append_operations: 0
rd_bytes: 55840768
rd_merged: 0
rd_operations: 11533
rd_total_time_ns: 9125153430
timed_stats:
unmap_bytes: 3761864704
unmap_merged: 0
unmap_operations: 25507
unmap_total_time_ns: 809735134
wr_bytes: 19560767488
wr_highest_offset: 11362951168
wr_merged: 0
wr_operations: 432124
wr_total_time_ns: 265708271228
zone_append_bytes: 0
zone_append_merged: 0
zone_append_operations: 0
zone_append_total_time_ns: 0
scsi2:
account_failed: 1
account_invalid: 1
failed_flush_operations: 0
failed_rd_operations: 0
failed_unmap_operations: 0
failed_wr_operations: 0
failed_zone_append_operations: 0
flush_operations: 68203
flush_total_time_ns: 465511605462
idle_time_ns: 2448384
invalid_flush_operations: 0
invalid_rd_operations: 0
invalid_unmap_operations: 0
invalid_wr_operations: 0
invalid_zone_append_operations: 0
rd_bytes: 72660616192
rd_merged: 0
rd_operations: 355232
rd_total_time_ns: 1378948902383
timed_stats:
unmap_bytes: 176446558208
unmap_merged: 0
unmap_operations: 413
unmap_total_time_ns: 1176328244
wr_bytes: 2567690533888
wr_highest_offset: 2415919116288
wr_merged: 0
wr_operations: 10636902
wr_total_time_ns: 219556708959884
zone_append_bytes: 0
zone_append_merged: 0
zone_append_operations: 0
zone_append_total_time_ns: 0
scsi3:
account_failed: 1
account_invalid: 1
failed_flush_operations: 0
failed_rd_operations: 0
failed_unmap_operations: 0
failed_wr_operations: 0
failed_zone_append_operations: 0
flush_operations: 25271
flush_total_time_ns: 118942806454
idle_time_ns: 746751
invalid_flush_operations: 0
invalid_rd_operations: 0
invalid_unmap_operations: 0
invalid_wr_operations: 0
invalid_zone_append_operations: 0
rd_bytes: 304172396544
rd_merged: 0
rd_operations: 381824
rd_total_time_ns: 14168076167200
timed_stats:
unmap_bytes: 134053797888
unmap_merged: 0
unmap_operations: 489
unmap_total_time_ns: 30992549527
wr_bytes: 612484426240
wr_highest_offset: 375809650688
wr_merged: 0
wr_operations: 7566949
wr_total_time_ns: 29879037319517
zone_append_bytes: 0
zone_append_merged: 0
zone_append_operations: 0
zone_append_total_time_ns: 0
cpus: 20
disk: 0
diskread: 377000522752
diskwrite: 3210959922688
freemem: 59277422592
maxdisk: 21474836480
maxmem: 1503238553600
mem: 1420337020928
name: ark-abpdb-p
netin: 382764923664
netout: 3187412585103
nics:
tap2427i0:
netin: 382764923664
netout: 3187412585103
pid: 3501812
proxmox-support:
backup-fleecing: 1
backup-max-workers: 1
pbs-dirty-bitmap: 1
pbs-dirty-bitmap-migration: 1
pbs-dirty-bitmap-savevm: 1
pbs-library-version: 1.4.1 (UNKNOWN)
pbs-masterkey: 1
query-bitmap-info: 1
qmpstatus: running
running-machine: pc-i440fx-8.1+pve0
running-qemu: 8.1.5
serial: 1
status: running
uptime: 257514
vmid: 2427


### virtual CPU emulation definition for the above VM, needed to boot with >1.3 TB memory ###

# cat /etc/pve/virtual-guest/cpu-models.conf
cpu-model: custom-model-host-bits
    reported-model host
    phys-bits host
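
For anyone wanting to reproduce this: custom models defined in cpu-models.conf are referenced from the VM config with a 'custom-' prefix, which is where the doubled name in the VM config above comes from, e.g.:

# qm set 2427 --cpu custom-custom-model-host-bits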



-Svein-
 
I would check your Ceph nodes for I/O/disk load (iostat -xm 1 on each Ceph node), CPU (top on each Ceph node) and network (sar -n DEV 1 on each Ceph node), and also the network on the PVE side towards your Ceph network (sar -n DEV 1 --iface=...).
 
Hi,

thank you for your reply. No obvious issues were detected there, but we have a new observation from yesterday's debugging. On the PVE node, running a ps command against the KVM pid of the affected VM gives output as expected, except when the VM is hanging: then any ps, strace or similar command against that KVM pid also hangs until the VM unfreezes, and only then gives output. This only happens for the KVM pid of the affected VM; commands/actions involving all other PIDs on the hypervisor seem unaffected. Even when running top or nmon before a hang, we see the KVM pid behaving normally, but once the VM hang starts, all other pids in the top/nmon window keep updating their values except the KVM pid, which does not report new values until the VM unfreezes.

When checking the process state of the KVM pid both before and during a freeze, it reports "sleeping".

In my head all this seems to point to the PID getting stuck in some kernel-mode/lock state during the freeze, but I could of course be totally wrong, so any help is still gratefully received.
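
To illustrate: a rough check (a sketch, using the KVM pid from the qm status output earlier in the thread) is simply to time a ps call against that pid; during a freeze it blocks for the duration of the hang, while against any other pid it returns immediately:

# time ps -fp 3501812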

-Svein-
 
Sleeping plus a lock state looks like I/O not responding, so I would keep a close watch on what is going on in your Ceph, or on whether it is doing nothing when it should be doing something.
 
The sleeping state for the KVM pid seems to be standard for all VM pids, good or bad, with only a very low percentage of time in state 'running', so this alone seems to be OK. We are still not able to find any problems related to our Ceph storage. I would also expect that since the Ceph storage on a hypervisor is shared by all its VMs, all VMs on that hypervisor would struggle if something were wrong with Ceph in general, not just a single VM, and not only after the VM has typically been running for 1-3 days after a reboot.
 
A tip came in to try lowering the QEMU machine version by editing "Machine -> Version" in the GUI for the VM; for Windows guests, lowering it to 5.1 had reportedly helped. We tried the same for our Linux environment, but saw no change. We also tried pinning the PVE kernel as far back as 5.15.152-1-pve, with no luck.
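
For reference, both of those changes can also be made from the CLI (a sketch using the VM ID and versions mentioned in this thread; the kernel pin only takes effect after a reboot):

# qm set 2427 --machine pc-i440fx-5.1
# proxmox-boot-tool kernel pin 5.15.152-1-pve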
 
After much debugging/digging, we finally solved our problem.

When a VM hangs, we saw that the file '/proc/<kvm-pid>/environ' on the hypervisor could not be read (cat hangs) while the VM was hanging, but was fine when the VM was not hanging; all other files in /proc/<kvm-pid>/ stayed accessible during hangs. Running strace on the 'ps' command and others showed that they rely on reading this environ file, which is why commands involving the KVM pid hang when the VM hangs.
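
A minimal way to reproduce that observation on the hypervisor (a sketch, again using the pid from the earlier qm status output): a timed read of environ blocks for as long as the VM freeze lasts, while e.g. the status file in the same directory stays readable:

# time cat /proc/3501812/environ > /dev/null
# time cat /proc/3501812/status > /dev/null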

We then ran a continuous loop of 'ps aux' on the hypervisor, watching the states of all PIDs. The ps command would also hang while the VM was hanging, but just as the VM unfroze we saw that the ksmtuned process was in state 'D', meaning the PID was stuck waiting on a resource; this was only seen when a VM froze. Since /proc is a memory-based filesystem, this led us to believe it was a KSM problem.
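
The loop was essentially something like this (a rough sketch; it logs any process in uninterruptible sleep, state 'D', once per second):

# while true; do date; ps -eo state,pid,comm | awk '$1 == "D"'; sleep 1; done >> /var/log/dstate.log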

Looking in the Zabbix history for the hypervisor, we could see that as soon as the hypervisor passed 80% memory utilisation and KSM kicked in, the trouble for the hanging VM started. This also explains the varying time after a restart before the problem reappeared (the hypervisor first has to reach 80% again).
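
KSM activity itself can be followed via sysfs, and the threshold at which ksmtuned starts merging is set in /etc/ksmtuned.conf (the 80% utilisation should correspond to the default KSM_THRES_COEF=20, i.e. KSM kicks in when less than 20% of memory is free):

# cat /sys/kernel/mm/ksm/run /sys/kernel/mm/ksm/pages_sharing
# grep KSM_THRES_COEF /etc/ksmtuned.conf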

What solved it for us was to either:
  1. shut down all VMs on the affected hypervisor, disable and stop KSM, release any existing KSM pages, and then restart the VMs; or
  2. have an empty hypervisor ready, make sure KSM is not enabled and its pages are released, and then simply live migrate the affected VM to that hypervisor.
(See the sketch of the KSM commands right after this list.)
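
For completeness, the "disable KSM and release pages" step boils down to something like this on the hypervisor (a sketch based on the standard KSM sysfs interface; writing 2 to ksm/run unmerges all shared pages, which can take a while and temporarily increases memory usage):

# systemctl disable --now ksmtuned
# echo 2 > /sys/kernel/mm/ksm/run
# cat /sys/kernel/mm/ksm/pages_sharing

pages_sharing should drop to 0 once all pages have been released.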
As mentioned earlier in the thread, this problem started after upgrading to Proxmox 8.2.x. We also noticed that KSM, when active on PVE 8.x, reported a much lower amount of shared memory for the same hypervisor/load/VMs than on 7.x, typically 100-200% better on PVE 7.x. I think I saw a post on the forum from someone else also seeing this "degradation" of KSM utilisation, but I have not been able to find it again.

Anyhow, there seem to have been some changes to KSM between 7.x and 8.x, so for now we have decided to disable KSM sharing on all clusters/PVE nodes. All systems have been running fine for the last week with KSM disabled.

-Svein-
 
