VM hanging / freeze after update from PVE 6 to 7

Hi,

after the update from PVE 6 to 7 at the beginning of this week, we have daily hanging/frozen VMs on our Proxmox node.
the VM cannot be accessed via the console; shutdown or reboot does not work either.
the only way to fix this is a full server reboot, which is bad ;)

I have found old threads with the same 'query-proxmox-support' failure, but no fix or workaround to apply.

the journal shows the following:

Code:
Jul 22 12:47:10 s24 pvedaemon[3281804]: starting vnc proxy UPID:s24:0032138C:006DBE61:60F94CAE:vncproxy:200:root@pam:
Jul 22 12:47:10 s24 pvedaemon[3211439]: <root@pam> starting task UPID:s24:0032138C:006DBE61:60F94CAE:vncproxy:200:root@pam:
Jul 22 12:47:13 s24 qm[3281806]: VM 200 qmp command failed - VM 200 qmp command 'set_password' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries
Jul 22 12:47:13 s24 pvedaemon[3281804]: Failed to run vncproxy.
Jul 22 12:47:13 s24 pvedaemon[3211439]: <root@pam> end task UPID:s24:0032138C:006DBE61:60F94CAE:vncproxy:200:root@pam: Failed to run vncproxy.
Jul 22 12:47:14 s24 pvestatd[1856]: VM 200 qmp command failed - VM 200 qmp command 'query-proxmox-support' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries
Jul 22 12:47:16 s24 pvestatd[1856]: got timeout
Jul 22 12:47:16 s24 pvedaemon[3220601]: VM 200 qmp command failed - VM 200 qmp command 'query-proxmox-support' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries


Code:
Jul 22 12:48:09 s24 pvestatd[1856]: status update time (12.362 seconds)
Jul 22 12:48:09 s24 pvedaemon[3211439]: VM 200 qmp command failed - VM 200 qmp command 'query-status' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries
Jul 22 12:48:09 s24 pvedaemon[3211439]: VM 200 qmp command 'query-status' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries
Jul 22 12:48:09 s24 pvedaemon[3284333]: shutdown VM 200: UPID:s24:00321D6D:006DD5A2:60F94CE9:qmshutdown:200:root@pam:
Jul 22 12:48:09 s24 pvedaemon[3211439]: <root@pam> starting task UPID:s24:00321D6D:006DD5A2:60F94CE9:qmshutdown:200:root@pam:
Jul 22 12:48:12 s24 pvedaemon[3284333]: VM 200 qmp command failed - VM 200 qmp command 'system_powerdown' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries
Jul 22 12:48:12 s24 pvedaemon[3284333]: VM quit/powerdown failed
Jul 22 12:48:12 s24 pvedaemon[3211439]: <root@pam> end task UPID:s24:00321D6D:006DD5A2:60F94CE9:qmshutdown:200:root@pam: VM quit/powerdown failed
Jul 22 12:48:13 s24 pvedaemon[3205746]: VM 200 qmp command failed - VM 200 qmp command 'query-proxmox-support' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries
Jul 22 12:48:15 s24 pvestatd[1856]: VM 200 qmp command failed - VM 200 qmp command 'query-proxmox-support' failed - unable to connect to VM 200 qmp socket - timeout after 31 retries

VM config:
Code:
boot: c
bootdisk: sata0
cores: 6
hostpci0: 29:00,pcie=1
keyboard: de
machine: q35
memory: 8192
name: omv
net0: virtio=56:0F:B4:D2:16:0C,bridge=vmbr1
net1: virtio=7A:32:96:CD:32:99,bridge=vmbr99
net2: virtio=22:7F:ED:B4:D3:AD,bridge=vmbr100
numa: 0
onboot: 1
ostype: l26
rng0: max_bytes=10240,source=/dev/urandom
sata0: ceph-disk:vm-200-disk-0,discard=on,size=8G,ssd=1
sata1: /dev/disk/by-id/ata-Samsung_SSD_850_EVO_1TB_S21DNXAG602063P,backup=0,discard=on,ssd=1,size=976762584K
sata2: /dev/disk/by-id/ata-Samsung_SSD_850_EVO_500GB_S2RBNX0J345835P-part6,backup=0,cache=writeback,discard=on,size=238418M,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=a89b35f0-7758-45bd-80d2-ec74762eeacf
sockets: 1
startup: order=3,up=240


pveversion -v
Code:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-2-pve)
pve-manager: 7.0-10 (running version: 7.0-10/d2f465d3)
pve-kernel-5.11: 7.0-5
pve-kernel-helper: 7.0-5
pve-kernel-5.4: 6.4-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.11.22-2-pve: 5.11.22-4
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph: 15.2.13-pve1
ceph-fuse: 15.2.13-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx2
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.2.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-5
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-9
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.5-2
proxmox-backup-file-restore: 2.0.5-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-5
pve-cluster: 7.0-3
pve-container: 4.0-8
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-10
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1
 
Is there anything in the syslog of the guest?
 
no way to check, the VM is fully frozen. ssh, serial and vnc are not working.
and after the reboot the last hours are lost due to a corrupted file system.
will open an ssh session with journalctl -f to write the logs to the local system, but not sure if this will help.
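something like this is what I have in mind (a rough sketch; <vm-ip> is a placeholder):
Code:
# stream the guest's journal over ssh and keep a local copy on the host
ssh root@<vm-ip> 'journalctl -f' | tee vm200-journal.log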
 
Hi,
the journalctl output has not shown any useful logs, only the typical crontab runs and pam_unix entries.
we tried to move some OSDs off the node to reduce some load, but no real change yet.
is there anything else we can test, or info that may help?
 
Could you try adding the `aio=threads` parameter to the disks?
Seems there are some issues with io_uring which is the current default.
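A minimal sketch of how to apply this on the CLI, using the sata0 line from your config above (keep your existing disk options and just append the parameter):
Code:
# example for VM 200 / sata0: repeat the current disk spec and add aio=threads
qm set 200 --sata0 ceph-disk:vm-200-disk-0,discard=on,size=8G,ssd=1,aio=threads
The change takes effect the next time the VM process starts fresh (stop/start, not just a reboot from inside the guest).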
 
with
Code:
aio=threads
set, the host and VMs are 99% stable.
we had only one hanging VM since, which may not be related to this exact problem;
it was resolved by restarting that one VM (no restart of the host needed).

Code:
io_uring
sounds like a good fit for VMs to reduce overhead, but may need some more work in handling VM workloads.
 
Hi,

got the same problem. But I don't know where to put aio=threads?
Is it not possible via the GUI?
nvm... but my VMs won't start :(
 
Please provide the output of pveversion -v as well as the VM config from a VM that does not start (qm config <VMID>).
 
Code:
root@proxmox:~# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: 6.3-2 (running version: 6.3-2/22f57405)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.4: 6.3-1
pve-kernel-5.13.19-1-pve: 5.13.19-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2
criu: 3.11-3
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 6.2-6
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.14-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.4-2
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-1
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.12.0-1
qemu-server: 6.3-1
smartmontools: 7.2-1
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: residual config

Code:
root@proxmox:~# qm config 101

boot: order=scsi0;ide2;net0
cores: 1
cpu: host
ide2: none,media=cdrom
localtime: 1
memory: 4096
name: enedlia-utm
net0: virtio=52:57:60:7B:40:8C,bridge=vmbr2
net1: virtio=AA:87:AD:70:57:D0,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-101-disk-0,backup=0,size=80G,aio=threads
smbios1: uuid=7c531094-4620-49b9-b244-75852cc41216
sockets: 1
vmgenid: 5a531140-64bf-44d6-a579-533d699e57bf
 
Does the `Start` task finish, or does it run into a timeout? Do you get an error?
You should be able to see the task status in the GUI on the bottom. If you double click a task, you should see the log of the task.

Could you provide the journal from 5 minutes before until after the start of a VM?
Use the command journalctl --since "2021-11-18 16:55:00" > journal.txt, modify the time to match ~5 minutes before the `start` task, and attach the resulting file here.
 
From what I see right now, the SCSI controller seems to be corrupted on all machines.

The "Start" task finishes fine. No errors to see in the GUI.
The problem is that my VMs won't boot up. They freeze during startup. If I set up another controller, they will be "stable" but won't boot into the system; on my Linux VMs clearly because they can't find the root partition.
 

Attachments

  • journal.txt (30.6 KB)
These errors seem strange:
Code:
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pveproxy' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pvedaemon' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show spiceproxy' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pvestatd' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pve-cluster' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show corosync' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pve-firewall' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pvefw-logger' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pve-ha-crm' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show pve-ha-lrm' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show sshd' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show syslog' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show cron' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show postfix@-' failed: exit code 1
Nov 18 17:04:16 proxmox pvedaemon[25942]: command 'systemctl show ksmtuned' failed: exit code 1
Could you reboot and check again? This time please provide the complete log of the boot (journalctl -b > journal.txt)
 
Ok... now it is not possible anymore to get onto it via ssh... Right now I have to deal with that. networking.service is marked as active (exited). And a reboot results in a watchdog timeout and another reboot.
 
So after a bit of time I still wasn't able to get the network working... I'm very clueless right now. I will screenshot the bootlog and attach it.
EDIT: the workaround with a hardcoded MAC works very well. At least I can continue my research right now and I'm able to ssh into it again :D
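For reference, the hardcoded-MAC workaround is just a short stanza in /etc/network/interfaces (a sketch; the bridge name, addresses and MAC below are placeholders, not my real values):
Code:
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10/24
        gateway 192.168.1.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        # pin the bridge MAC so it stays stable across reboots
        hwaddress aa:bb:cc:dd:ee:ff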

journal attached.
 

Attachments

  • journal.txt (84.1 KB)
My VMs are freezing. I don't get it. No errors shown. I experimented a bit with the hardware settings for the VMs, but no luck at all. Is there a logfile / log folder for VMs?
 
Same for me. When I reduce the number of vCPUs to 1 (it was 3), the VM does not freeze anymore and runs. I would like to believe it's the new kernel, but I don't know how to test the previous/latest 7.0-supplied kernel (5.11?).
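Maybe picking it from the GRUB boot menu would work? An untested sketch, assuming the 5.11 kernel packages are still installed:
Code:
# check which PVE kernels are still installed
dpkg -l 'pve-kernel-*' | grep '^ii'
# then reboot and choose the 5.11.x entry under
# "Advanced options for Proxmox VE GNU/Linux" in the GRUB menu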
 
