Proxmox 8.1.4 VMs freezing with one VM core at 100%

gzader

New Member
Feb 5, 2024
Update: upgraded pve-qemu-kvm 8.1.2-6 > 8.1.5-1 along with the other updates available at the time. Fully patched as of 2024-02-06.
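(Side note: as far as I understand, VMs that were already running keep using the old QEMU binary until they are stopped and started again; the version actually in use can be checked per VM with something like the line below, where 'running-qemu' shows the binary the guest is really on.)
qm status 110 --verbose | grep -E 'running-qemu|running-machine'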
--

I've been having VM freezes/lock-ups with Proxmox on multiple old Intel boxes running various VMs.
I'm new to Proxmox, but I think I have most things set up correctly. The VM image files were created in QEMU on older systems and imported into Proxmox.

Today I had two VMs lock up at once. I had to stop and start one of them (vm215) immediately, as it's running a website that I need up; it then locked up again just a few hours later, even though it's a low-traffic site. The other (vm110) is a load balancer running HAProxy and Keepalived.
I have replication set up on all the hosts (5 old, small boxes) and most VMs replicate to three different boxes. The VMs are running Ubuntu 20.04.

I don't think this part is involved, but just in case: before a lock-up I sometimes get errors like the one below. They can happen hours, or even a day or so, before the freeze:

"
Replication job '215-2' with target 'vmouthost2' and schedule '*/15' failed!

Last successful sync: 2024-02-02 01:45:27
Next sync try: 2024-02-02 02:05:00
Failure count: 1

Error:
command 'zfs snapshot king240zfs/vm-215-disk-0@__replicate_215-2_1706857236__' failed: got timeout
"

vm110 only replicates once a day, and vm215 had no replication warning between events.
These VMs are also set up for HA, but the system never recognizes them as down.
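(As far as I can tell, HA only watches the node and the QEMU process, not the guest OS, so a hung guest whose QEMU process is still alive keeps showing as started; for example:)
ha-manager status    # the stuck VMs still show as 'started' here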

Below is my system data; Ceph is not in use:

root@vmouthost6:~# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.7-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

(Update: I changed from the default LSI controller to VirtIO SCSI single.)
root@vmouthost6:~# qm config 110
boot: order=scsi0
cores: 2
description: [removed]
meta: creation-qemu=8.0.2,ctime=1692909357
name: LoadBalancer2004.1
net0: virtio=52:54:00:72:4f:6e,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: king240zfs:vm-110-disk-0,format=raw,size=10G
scsihw: virtio-scsi-single
smbios1: uuid=60fe2c4b-85a0-4b90-887e-428ef8b40be5
sockets: 1
vmgenid: 412a64a2-3239-4987-9fc0-b0bcc5e774a3


(I moved this machine to another host in hopes of keeping it alive longer.)
(Update: I changed from the default LSI controller to VirtIO SCSI single.)
root@vmouthost4:~# qm config 215
boot: order=scsi0
cores: 4
description: [removed]
memory: 640
meta: creation-qemu=8.0.2,ctime=1693328618
name: com2004web2
net0: virtio=66:D3:26:49:97:5C,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
scsi0: king240zfs:vm-215-disk-0,format=raw,size=10G
scsihw: virtio-scsi-single
smbios1: uuid=59deb694-cb02-4969-99b1-f00d97b4ceb0
sockets: 1
vmgenid: f0296322-d84d-4822-b762-7bc4956f914e
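(For reference, the controller swap is just the scsihw option; I believe the CLI equivalent of the GUI change is something like the line below, followed by a full stop/start of the guest so it takes effect.)
qm set 215 --scsihw virtio-scsi-single    # same change applied to vm110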

(Captured while vm110 was frozen)
root@vmouthost6:~# timeout 10 strace -c -p $(cat /var/run/qemu-server/110.pid)
strace: Process 3323 attached
strace: Process 3323 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0      4330           read
  0.00    0.000000           0     16834           write
  0.00    0.000000           0        16           close
  0.00    0.000000           0        80           sendmsg
  0.00    0.000000           0      4112           recvmsg
  0.00    0.000000           0        16           getsockname
  0.00    0.000000           0        32           fcntl
  0.00    0.000000           0     22694           ppoll
  0.00    0.000000           0        16           accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000           0     48130           total
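(One caveat I should note: strace -p on the main PID only attaches to QEMU's main I/O thread, not the vCPU threads, so the spinning core probably isn't captured above; something along these lines might show more.)
top -H -p $(cat /var/run/qemu-server/110.pid)                     # see which thread (vCPU) is pegged at 100%
timeout 10 strace -c -f -p $(cat /var/run/qemu-server/110.pid)    # -f attaches to all threads, including vCPUs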


vm110 is showing 53% CPU usage: one virtual core appears to be maxed out while the other runs at about 3%. I cannot restart it; I've had to stop and start the VM to get it back in service. Opening a console to it shows a frozen terminal, so there is no way to interact with the VM or pull any information from it.
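(It might also be worth poking QEMU from the host while the guest is hung; the main loop still seems alive in the strace above, so the monitor should probably still answer, e.g.:)
qm monitor 110    # then 'info status' and 'info cpus' at the qm> prompt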

These VMs were working fine on the old system, so I don't think the problem is in the guests; I think it's something on the host side.

Thank you in advance for any help and suggestions.
 
One extra note: the box that had two VMs lock up at once is my "best" box and is not resource-constrained like the other hosts. The only thing that might be an issue is that two VMs on it try to replicate the last 15 minutes of changes out to three other boxes. The replication time, though, tends to be around 6 seconds per task, for a total of 6 tasks.
 
I've researched a lot of threads and have made some changes.
I'm now using VirtIO SCSI single as the SCSI controller; before, it had been LSI 53C895A.
This seems more in line with how others are running their systems.

I moved my replication times around to reduce any possible data-contention issues. Replication runs were short already, and this doesn't seem to have changed their duration by more than a second or so (around 20%). Still, hopefully it narrows down possible issues.
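(The schedule change itself is just the job's calendar event; assuming I have the syntax right, the CLI version would be something like:)
pvesr list                              # show the replication jobs and their schedules
pvesr update 215-2 --schedule '5/15'    # e.g. minutes 5,20,35,50 instead of 0,15,30,45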

I grabbed the latest patches, including pve-qemu-kvm 8.1.2-6 > 8.1.5-1.
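(For completeness, the update itself is just the standard repository upgrade, roughly:)
apt update && apt dist-upgrade    # or the Updates panel in the GUI, which does the same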


The shortest run before a VM froze was just a few hours; that was on a machine with three VMs, which now only hosts two.
The longest run before a freeze was over a week. I don't recall whether that VM was the only one on its host. Not every VM has frozen, just several important ones.

For comparison, this is an strace capture while the VM is working normally.
root@vmouthost6:~# timeout 10 strace -c -p $(cat /var/run/qemu-server/110.pid)
strace: Process 1460193 attached
strace: Process 1460193 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.80    0.115047           5     19459         5 ppoll
  1.88    0.002236           1      2084           write
  0.59    0.000706           1       543           read
  0.59    0.000704           1       507           recvmsg
  0.03    0.000037           3        10           sendmsg
  0.03    0.000037           4         9           io_uring_enter
  0.02    0.000021          10         2           accept4
  0.02    0.000018           9         2           close
  0.01    0.000016           1        10           ioctl
  0.01    0.000010           2         5         1 futex
  0.01    0.000008           2         4           fcntl
  0.01    0.000006           3         2           getsockname
------ ----------- ----------- --------- --------- ----------------
100.00    0.118846           5     22637         6 total


I should know more in a few days, but any help anyone can offer would be greatly appreciated.
 
Hi,
are there any messages in the system logs/journal around the time the issue happens? What about logs inside the guest? What does the system load look like? What kind of hardware do you have? What does zpool status -v show? What about qm status 110 --verbose after the VM is stuck?
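For reference, something along these lines would do (adjust the journal window to the time of a freeze):
journalctl --since "2024-02-05 00:00" --until "2024-02-06 00:00"   # host journal around the freeze
zpool status -v                                                    # pool health and any detected errors
qm status 110 --verbose                                            # run while the VM is stuck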