VM hanging / freeze after update from PVE 6 to 7

Hi,

after the update from PVE 6 to 7 at the beginning of this week, we have daily hanging/frozen VMs on our Proxmox node.
the VM cannot be accessed via the console; shutdown or reboot does not work either.
the only way to fix this is a full server reboot, which is bad ;)

I have found old threads with the same 'query-proxmox-support' failure, but no fix or workaround to apply.

the journal shows the following:

Code:
Jul 22 12:47:10 s24 pvedaemon[3281804]: starting vnc proxy UPID:s24:0032138C:006DBE61:60F94CAE:vncproxy:200:root@pam:
Jul 22 12:47:10 s24 pvedaemon[3211439]: <root@pam> starting task UPID:s24:0032138C:006DBE61:60F94CAE:vncproxy:200:root@pam:
Jul 22 12:47:13 s24 qm[3281806]: VM 200 qmp command failed - VM 200 qmp command 'set_password' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries
Jul 22 12:47:13 s24 pvedaemon[3281804]: Failed to run vncproxy.
Jul 22 12:47:13 s24 pvedaemon[3211439]: <root@pam> end task UPID:s24:0032138C:006DBE61:60F94CAE:vncproxy:200:root@pam: Failed to run vncproxy.
Jul 22 12:47:14 s24 pvestatd[1856]: VM 200 qmp command failed - VM 200 qmp command 'query-proxmox-support' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries
Jul 22 12:47:16 s24 pvestatd[1856]: got timeout
Jul 22 12:47:16 s24 pvedaemon[3220601]: VM 200 qmp command failed - VM 200 qmp command 'query-proxmox-support' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries


Code:
Jul 22 12:48:09 s24 pvestatd[1856]: status update time (12.362 seconds)
Jul 22 12:48:09 s24 pvedaemon[3211439]: VM 200 qmp command failed - VM 200 qmp command 'query-status' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries
Jul 22 12:48:09 s24 pvedaemon[3211439]: VM 200 qmp command 'query-status' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries
Jul 22 12:48:09 s24 pvedaemon[3284333]: shutdown VM 200: UPID:s24:00321D6D:006DD5A2:60F94CE9:qmshutdown:200:root@pam:
Jul 22 12:48:09 s24 pvedaemon[3211439]: <root@pam> starting task UPID:s24:00321D6D:006DD5A2:60F94CE9:qmshutdown:200:root@pam:
Jul 22 12:48:12 s24 pvedaemon[3284333]: VM 200 qmp command failed - VM 200 qmp command 'system_powerdown' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries
Jul 22 12:48:12 s24 pvedaemon[3284333]: VM quit/powerdown failed
Jul 22 12:48:12 s24 pvedaemon[3211439]: <root@pam> end task UPID:s24:00321D6D:006DD5A2:60F94CE9:qmshutdown:200:root@pam: VM quit/powerdown failed
Jul 22 12:48:13 s24 pvedaemon[3205746]: VM 200 qmp command failed - VM 200 qmp command 'query-proxmox-support' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries
Jul 22 12:48:15 s24 pvestatd[1856]: VM 200 qmp command failed - VM 200 qmp command 'query-proxmox-support' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries

VM config:
Code:
boot: c
bootdisk: sata0
cores: 6
hostpci0: 29:00,pcie=1
keyboard: de
machine: q35
memory: 8192
name: omv
net0: virtio=56:0F:B4:D2:16:0C,bridge=vmbr1
net1: virtio=7A:32:96:CD:32:99,bridge=vmbr99
net2: virtio=22:7F:ED:B4:D3:AD,bridge=vmbr100
numa: 0
onboot: 1
ostype: l26
rng0: max_bytes=10240,source=/dev/urandom
sata0: ceph-disk:vm-200-disk-0,discard=on,size=8G,ssd=1
sata1: /dev/disk/by-id/ata-Samsung_SSD_850_EVO_1TB_S21DNXAG602063P,backup=0,discard=on,ssd=1,size=976762584K
sata2: /dev/disk/by-id/ata-Samsung_SSD_850_EVO_500GB_S2RBNX0J345835P-part6,backup=0,cache=writeback,discard=on,size=238418M,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=a89b35f0-7758-45bd-80d2-ec74762eeacf
sockets: 1
startup: order=3,up=240


pveversion -v
Code:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-2-pve)
pve-manager: 7.0-10 (running version: 7.0-10/d2f465d3)
pve-kernel-5.11: 7.0-5
pve-kernel-helper: 7.0-5
pve-kernel-5.4: 6.4-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph: 15.2.13-pve1
ceph-fuse: 15.2.13-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx2
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.2.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-5
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-9
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.5-2
proxmox-backup-file-restore: 2.0.5-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-5
pve-cluster: 7.0-3
pve-container: 4.0-8
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-10
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1
 
Is there anything in the syslog of the guest?
 
no way to check, the VM is fully frozen. ssh, serial and vnc are not working.
and after the reboot the last hours are lost due to a corrupted file system.
will open an ssh session with journalctl -f to write the logs to the local system, but not sure if this will help.
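something like this is what I have in mind (a rough sketch; <vm-ip> is a placeholder):
Code:
# stream the guest's journal over ssh and keep a local copy on the host
ssh root@<vm-ip> 'journalctl -f' | tee vm200-journal.log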
 
Hi,
the journalctl output has not shown any useful logs, only the typical crontab runs and pam_unix entries.
we tried to move some OSDs off the node to reduce some load, but no real change yet.
is there anything else we can test, or info that may help?
 
Could you try adding the `aio=threads` parameter to the disks?
Seems there are some issues with io_uring which is the current default.
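A minimal sketch of how to apply this on the CLI, using the sata0 line from your config above (keep your existing disk options and just append the parameter):
Code:
# example for VM 200 / sata0: repeat the current disk spec and add aio=threads
qm set 200 --sata0 ceph-disk:vm-200-disk-0,discard=on,size=8G,ssd=1,aio=threads
The change takes effect the next time the VM process starts fresh (stop/start, not just a reboot from inside the guest).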
 
with
Code:
aio=threads
set, the host and VMs are 99% stable.
we had only one hanging VM since, which may not be related to this exact problem;
it was resolved by restarting that one VM (no restart of the host needed).

Code:
io_uring
sounds like a good fit for VMs to reduce overhead, but may need some more work in handling VM workloads.
 
Hi,

got the same problem. But I don't know where to put aio=threads?
Is it not possible via the GUI?
nvm... but my VMs won't start :(
 
Please provide the output of pveversion -v as well as the VM config from a VM that does not start (qm config <VMID>).
 
Code:
root@proxmox:~# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: 6.3-2 (running version: 6.3-2/22f57405)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.4: 6.3-1
pve-kernel-5.13.19-1-pve: 5.13.19-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2
criu: 3.11-3
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 6.2-6
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.14-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.4-2
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-1
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.12.0-1
qemu-server: 6.3-1
smartmontools: 7.2-1
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: residual config

Code:
root@proxmox:~# qm config 101

boot: order=scsi0;ide2;net0
cores: 1
cpu: host
ide2: none,media=cdrom
localtime: 1
memory: 4096
name: enedlia-utm
net0: virtio=52:57:60:7B:40:8C,bridge=vmbr2
net1: virtio=AA:87:AD:70:57:D0,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-101-disk-0,backup=0,size=80G,aio=threads
smbios1: uuid=7c531094-4620-49b9-b244-75852cc41216
sockets: 1
vmgenid: 5a531140-64bf-44d6-a579-533d699e57bf
 
Does the `Start` task finish, or does it run into a timeout? Do you get an error?
You should be able to see the task status in the GUI on the bottom. If you double click a task, you should see the log of the task.

Could you provide the journal from 5 minutes before until after the start of a VM?
Use the command journalctl --since "2021-11-18 16:55:00" > journal.txt, modify the time to match ~5 minutes before the `start` task, and attach the resulting file here.
 
From what I see right now, the SCSI controller seems to be corrupted on all machines.

The "Start" task finishes fine. No errors to see in the GUI.
The problem is that my VMs won't boot up. They freeze during startup. If I set up another controller, they will be "stable" but won't boot into the system; on my Linux VMs clearly because they can't find the root partition.
 

Attachments

  • journal.txt (30.6 KB)
These errors seem strange:
Code:
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pveproxy' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pvedaemon' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show spiceproxy' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pvestatd' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pve-cluster' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show corosync' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pve-firewall' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pvefw-logger' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pve-ha-crm' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pve-ha-lrm' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show sshd' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show syslog' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show cron' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show postfix@-' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show ksmtuned' failed: exit code 1
Could you reboot and check again? This time please provide the complete log of the boot (journalctl -b > journal.txt)
 
Ok... now it is not possible anymore to get onto it via ssh... Right now I have to deal with that. networking.service is marked as active (exited). And a reboot results in a watchdog timeout and another reboot.
 
So after a bit of time I still wasn't able to get the network working... I'm very clueless right now. I will screenshot the bootlog and attach it.
EDIT: the workaround with a hardcoded MAC works very well. At least I can continue my research right now and I'm able to ssh into it again :D
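For reference, the hardcoded-MAC workaround is just a short stanza in /etc/network/interfaces (a sketch; the bridge name, addresses and MAC below are placeholders, not my real values):
Code:
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10/24
        gateway 192.168.1.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        # pin the bridge MAC so it stays stable across reboots
        hwaddress aa:bb:cc:dd:ee:ff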

journal attached.
 

Attachments

  • journal.txt (84.1 KB)
My VMs are freezing. I don't get it. No errors shown. I experimented a bit with the hardware settings for the VMs, but no luck at all. Is there a logfile / log folder for VMs?
 
Same for me. When I reduce the number of vCPUs to 1 (it was 3), the VM does not freeze anymore and runs. I would like to believe it's the new kernel, but I don't know how to test the previous/latest 7.0-supplied kernel (5.11?).
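Maybe picking it from the GRUB boot menu would work? An untested sketch, assuming the 5.11 kernel packages are still installed:
Code:
# check which PVE kernels are still installed
dpkg -l 'pve-kernel-*' | grep '^ii'
# then reboot and choose the 5.11.x entry under
# "Advanced options for Proxmox VE GNU/Linux" in the GRUB menu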
 
