Update: upgraded pve-qemu-kvm 8.1.2-6 -> 8.1.5-1 along with the other packages available at the time. Fully patched as of 2024-02-06.
--
I've been having VM freezes/lock-ups with Proxmox on multiple old Intel boxes running various VMs.
I'm new to Proxmox, but I think I have most things set up correctly. The VM image files were created in QEMU on older systems and imported into Proxmox.
Today I had two VMs lock up at once. I had to stop and start one of them (vm215) immediately, as it's running a website that I need up; it's a low-traffic site, but it locked up again just a few hours later. The other (vm110) is a load balancer running HAProxy and Keepalived.
I have replication set up on all the hosts (5 old, small boxes), and most replicate to three different boxes. The VMs are running Ubuntu 20.04.
I don't think this part is related, but just in case: before a lock-up I sometimes get errors like the one below. They can appear hours or even a day or so before a lock-up:
"
Replication job '215-2' with target 'vmouthost2' and schedule '*/15' failed!
Last successful sync: 2024-02-02 01:45:27
Next sync try: 2024-02-02 02:05:00
Failure count: 1
Error:
command 'zfs snapshot king240zfs/vm-215-disk-0@__replicate_215-2_1706857236__' failed: got timeout
"
vm110 only replicates once a day; vm215 had no warning between events.
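In case the snapshot timeout points at the pool stalling rather than the replication itself, these are the checks I plan to run the next time a job fails (standard ZFS/PVE commands; king240zfs is the pool from the error above):

# Pool health, pending scrub/resilver, and any device errors
zpool status -v king240zfs
# Per-device latency/throughput for a minute while the job retries
zpool iostat -v king240zfs 5 12
# How many replication snapshots have piled up for this disk
zfs list -t snapshot -o name,used,creation | grep vm-215-disk-0
# Replication job state as PVE sees it
pvesr status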
Also, these VMs are set up for HA, but the system never recognizes them as down.
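As far as I can tell, PVE HA only reacts when a node fails or the QEMU process itself dies, so a guest that hangs internally still looks "running" to it. As a stopgap I'm considering giving these guests a virtual watchdog so they get reset automatically when they stop responding. A rough sketch, assuming the default i6300esb device is fine for Ubuntu 20.04 (the VM needs a full stop/start to pick up the new device):

# On the PVE host: add a watchdog that hard-resets the VM when it expires
qm set 215 --watchdog i6300esb,action=reset
# Inside the guest: run something that pets /dev/watchdog
apt install watchdog
echo 'watchdog-device = /dev/watchdog' >> /etc/watchdog.conf
systemctl enable --now watchdog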
Below is my system data (Ceph is not in use):
root@vmouthost6:~# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.7-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
(Update: I changed from the default LSI controller to VirtIO SCSI single.)
root@vmouthost6:~# qm config 110
boot: order=scsi0
cores: 2
description: [removed]
meta: creation-qemu=8.0.2,ctime=1692909357
name: LoadBalancer2004.1
net0: virtio=52:54:00:72:4f:6e,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: king240zfs:vm-110-disk-0,format=raw,size=10G
scsihw: virtio-scsi-single
smbios1: uuid=60fe2c4b-85a0-4b90-887e-428ef8b40be5
sockets: 1
vmgenid: 412a64a2-3239-4987-9fc0-b0bcc5e774a3
(I moved the machine to another host in hopes of keeping it alive longer.)
(Update: I changed from the default LSI controller to VirtIO SCSI single.)
root@vmouthost4:~# qm config 215
boot: order=scsi0
cores: 4
description: [removed]
memory: 640
meta: creation-qemu=8.0.2,ctime=1693328618
name: com2004web2
net0: virtio=66:3:26:49:97:5C,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
scsi0: king240zfs:vm-215-disk-0,format=raw,size=10G
scsihw: virtio-scsi-single
smbios1: uuid=59deb694-cb02-4969-99b1-f00d97b4ceb0
sockets: 1
vmgenid: f0296322-d84d-4822-b762-7bc4956f914e
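Since both disks live on ZFS and the freezes look I/O-related, I'm also thinking about giving each disk its own I/O thread now that the controller is virtio-scsi-single. A sketch of the commands (they just re-reference the existing volumes with iothread added; a full stop/start is needed for the change to apply):

qm set 110 --scsi0 king240zfs:vm-110-disk-0,format=raw,size=10G,iothread=1
qm set 215 --scsi0 king240zfs:vm-215-disk-0,format=raw,size=10G,iothread=1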
(Captured while vm110 was frozen:)
root@vmouthost6:~# timeout 10 strace -c -p $(cat /var/run/qemu-server/110.pid)
strace: Process 3323 attached
strace: Process 3323 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
0.00 0.000000 0 4330 read
0.00 0.000000 0 16834 write
0.00 0.000000 0 16 close
0.00 0.000000 0 80 sendmsg
0.00 0.000000 0 4112 recvmsg
0.00 0.000000 0 16 getsockname
0.00 0.000000 0 32 fcntl
0.00 0.000000 0 22694 ppoll
0.00 0.000000 0 16 accept4
------ ----------- ----------- --------- --------- ----------------
100.00 0.000000 0 48130 total
While frozen, vm110 shows about 53% CPU usage: one virtual core maxed out and the other at around 3%. I cannot reboot it from inside; I've had to stop and start the VM to get it back in service. Opening a console shows a frozen terminal, and there is no way to interact with the VM or get any information out of it.
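The next time one freezes I'll try to pull more state out of the QEMU process before stopping it. This is only a sketch of what I understand should work on PVE 8 (the gdb step is read-only but briefly pauses the process, and is most useful with debug symbols installed):

# What QEMU/PVE thinks the VM state is
qm status 110 --verbose
# HMP monitor; 'info status' and 'info cpus' usually still answer if only the guest is hung
qm monitor 110
# Backtrace of all QEMU threads, from another shell
gdb -p $(cat /var/run/qemu-server/110.pid) -batch -ex 'thread apply all bt'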
These VMs were working fine on the old system. I don't think the problem is inside the VMs; I think it's something on the host side.
Thank you in advance for any help and suggestions.