VM goes read only when starting local backup

uFx

We upgraded one of our nodes to the latest Proxmox version (pve-manager/6.2-4/9824574a, running kernel 5.4.34-1-pve) and now run into a problem when backing up one specific VM (a Linux guest). The backup starts:

INFO: Starting Backup of VM 123 (qemu)
INFO: Backup started at 2020-05-19 03:01:41
INFO: status = running
INFO: VM Name: test-VM
INFO: include disk 'virtio0' 'local:123/vm-123-disk-1.raw' 220G
INFO: backup mode: snapshot
INFO: bandwidth limit: 250000 KB/s
INFO: ionice priority: 7
INFO: creating archive '/var/lib/vz/dump/vzdump-qemu-123-2020_05_19-03_01_41.vma.lzo'
INFO: started backup task 'bfa59d15-0af8-4726-b21d-6faa2e4dcb33'
INFO: resuming VM again
ERROR: VM 123 qmp command 'cont' failed - got timeout
INFO: aborting backup job
ERROR: Backup of VM 123 failed - VM 123 qmp command 'cont' failed - got timeout

It fails after a few seconds. The guest goes into read-only mode, and we have to reboot the VM and run a filesystem check to fix it. This issue does not occur on any other VM on this server, and there is enough free space in /var/lib/vz/dump.
 
We are now seeing this with more VMs on this node. Some extra info:

All VMs use local directory storage with raw as the disk format. The problem started when we upgraded to Proxmox 6.2. The VMs are not using the QEMU guest agent. Not all VMs on this node have this problem.
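To see which VMs on the node are affected, the vzdump logs can be scanned for the final failure line shown in the backup log above. A minimal Python sketch, assuming the log text has already been read into a string; the function name `failed_backups` and the sample text are only for illustration and are not part of any Proxmox tool:

```python
import re

# Matches the summary line vzdump prints when a VM backup fails,
# e.g. "ERROR: Backup of VM 123 failed - <reason>".
FAILED_LINE = re.compile(r"^ERROR: Backup of VM (\d+) failed - (.*)$")


def failed_backups(log_text):
    """Return a dict {vmid: reason} for every failed VM found in the log text."""
    failures = {}
    for line in log_text.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            failures[int(match.group(1))] = match.group(2)
    return failures


if __name__ == "__main__":
    # Sample lines copied from the backup log in this post.
    sample = (
        "INFO: resuming VM again\n"
        "ERROR: VM 123 qmp command 'cont' failed - got timeout\n"
        "INFO: aborting backup job\n"
        "ERROR: Backup of VM 123 failed - VM 123 qmp command 'cont' failed - got timeout\n"
    )
    for vmid, reason in failed_backups(sample).items():
        print(vmid, reason)
```

Only the "Backup of VM ... failed" summary line is matched, so the earlier per-command ERROR line for the same VM is not counted twice.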

pveversion -v:
Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.34-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-1
pve-kernel-helper: 6.2-1
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-4.15: 5.4-9
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-4-pve: 5.0.21-9
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.4.134-1-pve: 4.4.134-112
pve-kernel-4.4.98-3-pve: 4.4.98-103
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.62-1-pve: 4.4.62-88
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.19-1-pve: 4.4.19-66
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.3
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-5
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
Another VM on this host hangs for a few moments when the backup starts but does not go read-only:

Code:
May 22 05:12:35 vps1 kernel: [442137.812027] INFO: rcu_sched detected stalls on CPUs/tasks:
May 22 05:12:35 vps1 kernel: [442137.812171]     0-...: (1 GPs behind) idle=a73/2/0 softirq=65007472/65007472 fqs=15000
May 22 05:12:35 vps1 kernel: [442137.814520]     (detected by 1, t=15002 jiffies, g=43606771, c=43606770, q=97)
May 22 05:12:35 vps1 kernel: [442137.815982] Task dump for CPU 0:
May 22 05:12:35 vps1 kernel: [442137.815985] swapper/0       R  running task        0     0      0 0x00000008
May 22 05:12:35 vps1 kernel: [442137.815990]  ffffffff81067af2 0000000000000010 0000000000000246 ffffffff81e03e98
May 22 05:12:35 vps1 kernel: [442137.815995]  0000000000000018 ffffffff81f43800 ffffffff81e03eb8 ffffffff8103914e
May 22 05:12:35 vps1 kernel: [442137.815998]  ffffffff81f43800 ffffffff81e04000 ffffffff81e03ec8 ffffffff81039ff5
May 22 05:12:35 vps1 kernel: [442137.816002] Call Trace:

What's the best way to downgrade to the previous kernel? Manually select an older one during boot?
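For reference, the older kernels still installed on this node can be read off the pveversion output above. A rough Python sketch, assuming the output has been captured as text; the helper names `installed_kernels`, `kernel_key` and `older_than` are made up for this example. It only lists candidates and does not touch the boot configuration:

```python
import re

# Matches concrete kernel image packages such as "pve-kernel-5.0.21-5-pve: 5.0.21-10",
# but not meta packages like "pve-kernel-5.4" or "pve-kernel-helper".
KERNEL_PKG = re.compile(r"^pve-kernel-(\d+\.\d+\.\d+-\d+-pve):")


def installed_kernels(pveversion_output):
    """Return the kernel release names listed in `pveversion -v` style output."""
    kernels = []
    for line in pveversion_output.splitlines():
        match = KERNEL_PKG.match(line.strip())
        if match:
            kernels.append(match.group(1))
    return kernels


def kernel_key(kernel):
    """Turn '5.0.21-5-pve' into (5, 0, 21, 5) so releases compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", kernel))


def older_than(kernels, running):
    """Return kernels older than the running one, newest first."""
    older = [k for k in kernels if kernel_key(k) < kernel_key(running)]
    return sorted(older, key=kernel_key, reverse=True)


if __name__ == "__main__":
    # A few lines copied from the pveversion -v output in this post.
    sample = (
        "pve-kernel-5.4: 6.2-1\n"
        "pve-kernel-helper: 6.2-1\n"
        "pve-kernel-5.4.34-1-pve: 5.4.34-2\n"
        "pve-kernel-5.0.21-5-pve: 5.0.21-10\n"
        "pve-kernel-5.0.21-4-pve: 5.0.21-9\n"
        "pve-kernel-4.15.18-21-pve: 4.15.18-48\n"
    )
    for kernel in older_than(installed_kernels(sample), "5.4.34-1-pve"):
        print(kernel)
```

The first entry printed is the newest kernel older than the running one, which would be the obvious candidate to pick from the boot menu.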