VMs hung after backup

Something seems to be not quite right with this patch.
In our cluster everything was fine until the installation of pve-qemu-kvm 8.1.5-2.
Since the update last Friday we have a VM experiencing high iowait (~60%) during backup (no freeze/crash, luckily).
After the backup completes, the iowait returns to normal (~4%).

The VM in question uses an HDD-backed disk on a Ceph pool with iothreads.

I have disabled iothreads for now and will observe the behaviour during the next backup cycle.
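To double-check which VMs still have a disk with iothread enabled, something like this should work on the node (a quick sketch, assuming the standard config location under /etc/pve):
Code:
# list VM configs that still contain a disk with iothread enabled
grep -l 'iothread=1' /etc/pve/qemu-server/*.conf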
 
Hi,
Something seems to be not quite right with this patch.
In our cluster everything was fine until the installation of pve-qemu-kvm 8.1.5-2.
Since the update last Friday we have a VM experiencing high iowait (~60%) during backup (no freeze/crash, luckily).
After the backup completes, the iowait returns to normal (~4%).

The VM in question uses an HDD-backed disk on a Ceph pool with iothreads.

I have disabled iothreads for now and will observe the behaviour during the next backup cycle.
From what previous version did you upgrade? You could install the previous version of the package to see if it's really a regression. As always, shutting down and starting the VM again, using Reboot in the UI (rebooting within the guest is not enough!), or migrating to a node with the older version is required for the VM to actually use the newly installed QEMU binary.

Was there also a kernel update?

Another thing you could try is limiting the amount of workers used for backup: https://forum.proxmox.com/threads/t...ad-behavior-during-backup.118430/#post-513106
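For example, this can be set node-wide in /etc/vzdump.conf (the value below is only an example; see the linked post for details):
Code:
# /etc/vzdump.conf
performance: max-workers=2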
 
Bash:
egrep "upgrade (pve-qemu-kvm|proxmox-kernel-)" /var/log/dpkg.log
2024-02-16 07:50:13 upgrade proxmox-kernel-6.5:all 6.5.11-7 6.5.11-8
2024-02-16 07:50:13 upgrade pve-qemu-kvm:amd64 8.1.2-6 8.1.5-2
(The whole cluster was rebooted node by node afterwards)

I'll observe today's backup run with iothreads off (VM rebooted after the change ✔) and will try a downgrade of pve-qemu-kvm tomorrow.
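For the downgrade I plan to pin the previously installed version from the dpkg log above, roughly like this (assuming the old package version is still available to apt):
Code:
apt install pve-qemu-kvm=8.1.2-6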
 
Hello, I have a similar problem since I migrated to version 8 (see https://forum.proxmox.com/threads/s...on-zfs-with-5-15-39-2-pve.123639/post-635590)
Our environment includes 2 clusters and 3 TrueNAS servers with NFS.
The problem is clearly visible when backing up or moving a disk: the VMs become unusable and unreachable. When a console is open, we see freezes of up to 3 seconds, mostly on I/O. My VMs running corosync are getting out of sync due to these freezes.
Only a few VMs are affected, and I've been experimenting for a week without finding a clear pattern on the possible causes (different versions of Linux, Windows; i440fx or q35 of different versions, etc.). I tested with and without iothreads and aiosync, without it changing anything.

Syslog is full of messages like that:
pvestatd[2173]: VM 233 qmp command failed - VM 233 qmp command 'query-proxmox-support' failed - unable to connect to VM 233 qmp socket - timeout after 51 retries

For now I have simply deactivated the backup procedure. Fortunately I also have BackupPC and TrueNAS replications as backup.

Currently my settings on all disks are:
aio=threads,cache=writeback,discard=on,iothread=1,ssd=1
And on each NFS share:
options noatime
preallocation off
Which gives:
rw,noatime,vers=4.2,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,local_lock=none,addr=y.y.y.y

To test, I activated CIFS sharing at the TrueNAS level and the difference is clear: the problem disappears, and backup or disk-move throughput is excellent without really disrupting the operation of the VMs. But I don't want to keep CIFS in production; I have the impression that it is less safe in terms of writes and less reliable: I have had disk corruption several times in the past.

The last thing I'm going to test today is to downgrade one of the TrueNAS servers to version 12 to see if that makes a difference. But I have to move important VMs to the other cluster before testing...
 
Hello, I have a similar problem since I migrated to version 8 (see https://forum.proxmox.com/threads/s...on-zfs-with-5-15-39-2-pve.123639/post-635590)
Our environment includes 2 clusters and 3 TrueNAS servers with NFS.
The problem is clearly visible when backing up or moving a disk: the VMs become unusable and unreachable. When a console is open, we see freezes of up to 3 seconds, mostly on I/O. My VMs running corosync are getting out of sync due to these freezes.
Only a few VMs are affected, and I've been experimenting for a week without finding a clear pattern on the possible causes (different versions of Linux, Windows; i440fx or q35 of different versions, etc.). I tested with and without iothreads and aiosync, without it changing anything.

Syslog is full of messages like that:
pvestatd[2173]: VM 233 qmp command failed - VM 233 qmp command 'query-proxmox-support' failed - unable to connect to VM 233 qmp socket - timeout after 51 retries
Do the VMs continue running normally after backup? If not, please use
Code:
apt install pve-qemu-kvm-dbgsym gdb
gdb --batch --ex 't a a bt' -p $(cat /var/run/qemu-server/233.pid)
with the correct VM ID to obtain a debugger backtrace.

To test, I activated CIFS sharing at the TrueNAS level and the difference is clear: the problem disappears, and backup or disk-move throughput is excellent without really disrupting the operation of the VMs. But I don't want to keep CIFS in production; I have the impression that it is less safe in terms of writes and less reliable: I have had disk corruption several times in the past.
Do you see anything interesting in your system logs/journal when using NFS?

You can try using a bandwidth limit for backup and disk move and see if it improves the situation.
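For example (values only illustrative, in KiB/s; 25600 KiB/s is 25 MiB/s):
Code:
# /etc/vzdump.conf - node-wide bandwidth limit for backups
bwlimit: 25600

# /etc/pve/datacenter.cfg - limits for disk move/restore operations
bwlimit: move=25600,restore=25600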
 
Do the VMs continue running normally after backup? If not, please use
During backup or disk move the VM is slowed down. Sometimes it is barely usable (ping increases to 1-4 seconds, the VNC console is choppy, VMs running corosync lose links, etc.).
After the backup (or disk move) everything is back to normal immediately.

Do you see anything interesting in your system logs/journal when using NFS?
Nothing, except in the TrueNAS log: every PVE host requests the exports every 10 seconds, but that seems normal: https://forum.proxmox.com/threads/constant-nfs-exports-requests.23893/

You can try using a bandwidth limit for backup and disk move and see if it improves the situation.
I am not sure it will change anything: there is no saturation of the bandwidth, yet the transfer is extremely slow. Perhaps there is saturation in the number of small operations?
However, I noticed two things:
  • the problem disappears when using CIFS instead of NFS.
  • the problem is much less severe when downgrading TrueNAS from version 13 to 12. But even in this case there are still slowdowns and locks that did not exist with PVE 7 before the update to PVE 8.
 
Do the VMs continue running normally after backup? If not, please use
Code:
apt install pve-qemu-kvm-dbgsym gdb
gdb --batch --ex 't a a bt' -p $(cat /var/run/qemu-server/233.pid)
with the correct VM ID to obtain a debugger backtrace.


Do you see anything interesting in your system logs/journal when using NFS?

You can try using a bandwidth limit for backup and disk move and see if it improves the situation.
I've been having problems with backups to an NFS share as well since upgrading to PVE 8. I didn't see this thread (or thought it was unrelated as of a week ago), so I started my own: https://forum.proxmox.com/threads/n...-in-freebsd-guest-after-pve-8-upgrade.141922/

I have tried the new QEMU patch, without success. GDB is also unable to do anything. I've copied my latest post on that thread below.

==================

I just upgraded my system to use qemu 8.1.5-3 that has the iothread patch discussed in the other thread:

Code:
# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.13-1-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-10
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5: 6.5.13-1
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
pve-kernel-5.15.136-1-pve: 5.15.136-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.1
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-4
pve-firewall: 5.0.3
pve-firmware: 3.9-2
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-3
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve2

I just manually ran the backup job, and the exact same problem happened. One VM and one container backed up to the NFS share, and now it's stuck trying to back up the second container. The TrueNAS VM appears to be freezing and is becoming nonfunctional and the Proxmox web UI is deteriorating as I'm typing this.

I have tried suspending the TrueNAS VM (which was a suggested workaround to the iothread issue), which doesn't work:
Code:
# qm suspend 102
VM 102 qmp command 'stop' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries

I have also tried the suggested GDB command, which simply hangs forever and does nothing.
Code:
# gdb --batch --ex 't a a bt' -p $(cat /var/run/qemu-server/102.pid)

iotop shows 0B/s disk IO, apart from a few kB/s spikes presumably to the local storage.
 
Hi,
I have tried suspending the TrueNAS VM (which was a suggested workaround to the iothread issue), which doesn't work:
Code:
# qm suspend 102
VM 102 qmp command 'stop' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries

I have also tried the suggested GDB command, which simply hangs forever and does nothing.
Code:
# gdb --batch --ex 't a a bt' -p $(cat /var/run/qemu-server/102.pid)

iotop shows 0B/s disk IO, apart from a few kB/s spikes presumably to the local storage.
that sounds like the process might be stuck in uninterruptible D state. You can check with e.g. ps or top.
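For example, using the PID file of the VM in question (state "D" means uninterruptible sleep, usually waiting on IO):
Code:
ps -o pid,stat,wchan,comm -p $(cat /var/run/qemu-server/102.pid)
grep State /proc/$(cat /var/run/qemu-server/102.pid)/status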

@cmeWNQAm @raceme are you running the backup over the same network used for the VM disks? I'd still try the bandwidth limit to make sure. The network not being saturated doesn't necessarily mean that there is no pile of IO requests that got stuck/delayed after some burst for example.
 
Hi,

that sounds like the process might be stuck in uninterruptible D state. You can check with e.g. ps or top.

@cmeWNQAm @raceme are you running the backup over the same network used for the VM disks? I'd still try the bandwidth limit to make sure. The network not being saturated doesn't necessarily mean that there is no pile of IO requests that got stuck/delayed after some burst for example.
Yes, although the NFS share between my TrueNAS VM and PVE host uses a dedicated virtual network interface, all NFS traffic goes over that, regardless of whether it's for the backups or the container boot disks. I could imagine that there is some kind of deadlock happening since, during backups, it's copying container disks from TrueNAS back to itself (I know this sounds silly for the purposes of "backups," but it's more for rollback ability). This shouldn't cause a deadlock, but maybe copying data from an NFS storage to itself is the trigger of the whole problem.

So far, the first backup of a VM from local storage, as well as a couple of smaller containers from TrueNAS, has always worked. It seems to usually fail when it hits a larger (tens of GB) image from the next container.

I can check for D state processes tomorrow, but I suspect you're right. I can also try limiting the bandwidth, but even if that does work (I'm inclined to say it will since this only happens during IO intensive operations), it seems like a bandaid solution and prevents utilizing the full performance of my disks.
 
@cmeWNQAm @raceme are you running the backup over the same network used for the VM disks? I'd still try the bandwidth limit to make sure. The network not being saturated doesn't necessarily mean that there is no pile of IO requests that got stuck/delayed after some burst for example.
Not in my case: there are two dedicated networks using different switches and different TrueNAS servers with NFS:
  • san: MTU 9000, tagged VLAN, 2 x 1 Gbps bond for storing VM disks
  • management: 2 x 1 Gbps bond for storing backups
I'll try the I/O limitation, but how do I set it? Do you mean the global configuration in datacenter.cfg (I don't see anything about backup bandwidth limitation in https://pve.proxmox.com/wiki/Manual:_datacenter.cfg) or on each VM via the conf file?
 
Not in my case: there are two dedicated networks using different switches and different TrueNAS servers with NFS:
  • san: MTU 9000, tagged VLAN, 2 x 1 Gbps bond for storing VM disks
  • management: 2 x 1 Gbps bond for storing backups
I'll try the I/O limitation, but how do I set it? Do you mean the global configuration in datacenter.cfg (I don't see anything about backup bandwidth limitation in https://pve.proxmox.com/wiki/Manual:_datacenter.cfg) or on each VM via the conf file?
For backups, you can set the limit node-wide in /etc/vzdump.conf. Setting it for a specific backup job requires doing it via the API. See the following post (which is about a different setting, but still applies): https://forum.proxmox.com/threads/t...ad-behavior-during-backup.118430/#post-513106
 
For backups, you can set the limit node-wide in /etc/vzdump.conf. Setting it for a specific backup job requires doing it via the API. See the following post (which is about a different setting, but still applies): https://forum.proxmox.com/threads/t...ad-behavior-during-backup.118430/#post-513106
Thank you for the ideas and suggestions for improvement, and some encouraging news:
First, on my first cluster I tried bwlimit: move=25600,restore=25600 in datacenter.cfg, and now I can live-move disks without disruption.
On the second cluster (the one with TrueNAS 12) I tried bwlimit: 25600 and performance: max-workers=2 in /etc/vzdump.conf, and the backup is indeed capped at 25 MiB/s and the VM seems to be usable. I will try to revert TrueNAS to v13 to confirm the behaviour.
I am not sure about the use of max-workers for vzdump?
 
I am not sure about the use of max-workers for vzdump?
If you get the full speed you configured via limit, there is no reason to increase max-workers further. It's essentially the max number of IO requests issued by QEMU for backup at the same time.
 
Hello, I face the same problems with backups: they generate high IO and hangups of the VM, with the error "watchdog: BUG: soft lockup - CPU stuck for" in Proxmox, which forces a restart of the server. This is with the latest packages from the enterprise repository on a clean install with ZFS.

Is there any news on fixing these problems? @fiona
 
Hi,
Hello, I face the same problems with backups: they generate high IO and hangups of the VM, with the error "watchdog: BUG: soft lockup - CPU stuck for" in Proxmox, which forces a restart of the server. This is with the latest packages from the enterprise repository on a clean install with ZFS.

Is there any news on fixing these problems? @fiona
Please post the output of pveversion -v and qm config <ID> and the full backup task log. Did you already try configuring a bandwidth limit? If you have high IO wait on the host during backup, see: https://forum.proxmox.com/threads/t...ad-behavior-during-backup.118430/#post-513106
 
Hi, no, I didn't try the bandwidth limit.

Code:
root@pve1:~# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-9
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.1
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

Code:
root@pve1:~# qm config 110
agent: 1
bios: ovmf
boot: order=virtio0;net0
cores: 6
cpu: host
description: qmdump#map%3Aefidisk0%3Adrive-efidisk0%3Alocal-zfs%3Araw%3A%0Aqmdump#map%3Avirtio0%3Adrive-virtio0%3Alocal-zfs%3Araw%3A%0Aqmdump#map%3Avirtio1%3Adrive-virtio1%3Alocal-zfs%3Araw%3A%0Aqmdump#map%3Avirtio2%3Adrive-virtio2%3ADATA%3Araw%3A
efidisk0: DATASSD:vm-110-disk-2,size=1M
machine: pc-i440fx-5.2
memory: 6144
name: WIN2022DATA
net0: virtio=FA:58:29:C3:DA:96,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win10
scsihw: virtio-scsi-single
smbios1: uuid=2679019c-2d63-43b3-a314-a272c5b893ea
sockets: 1
startup: order=1
tablet: 0
usb0: host=0665:5161,usb3=1
virtio0: DATASSD:vm-110-disk-0,aio=threads,cache=writeback,discard=on,size=50G
virtio1: DATASSD:vm-110-disk-3,aio=threads,cache=writeback,discard=on,size=200G
virtio2: DATA:vm-110-disk-0,aio=threads,cache=writeback,discard=on,size=200G
vmgenid: 0181eecc-e2a7-4dd9-bbd4-8bdc4453146d


Usually the backup can finish the task, but then if I try to stop the VM or do another action, it generates the soft lockup of the CPU. After the backup, everything can be fine for the rest of the day, but the moment I shut down or stop any VM, it can trigger the soft lockup too.
The same can also happen during the backup task.

Code:
Feb 28 17:08:58 pve1 qmeventd[1757]: cleanup failed, terminating pid '4223' with SIGKILL
Feb 28 17:09:23 pve1 kernel: watchdog: BUG: soft lockup - CPU#28 stuck for 82s! [kvm:4223]
Feb 28 17:09:23 pve1 kernel: Modules linked in: tcp_diag inet_diag nf_conntrack_netlink overlay xt_nat xfrm_user xfrm_algo ipt_REJECT nf_reject_ipv4 xt_L>
Feb 28 17:09:23 pve1 kernel:  input_leds cryptd snd i2c_algo_bit ecdh_generic usbkbd ecc libarc4 rapl video soundcore pcspkr gigabyte_wmi wmi_bmof k10tem>
Feb 28 17:09:23 pve1 kernel: CPU: 28 PID: 4223 Comm: kvm Tainted: P           O L     6.5.11-8-pve #1
Feb 28 17:09:23 pve1 kernel: Hardware name: Gigabyte Technology Co., Ltd. X570S AORUS ELITE AX/X570S AORUS ELITE AX, BIOS F7d 12/25/2023
Feb 28 17:09:23 pve1 kernel: RIP: 0010:find_lock_entries+0x7f/0x290
Feb 28 17:09:23 pve1 kernel: Code: 45 b8 00 00 00 00 48 c7 45 c0 00 00 00 00 48 c7 45 c8 00 00 00 00 e8 e0 a4 e4 ff 48 89 de 48 8d 7d 98 e8 c4 c1 c8 00 4>
Feb 28 17:09:23 pve1 kernel: RSP: 0018:ffffa829f4663b80 EFLAGS: 00000246
Feb 28 17:09:23 pve1 kernel: RAX: ffffed6a51fadf80 RBX: fffffffffffffffe RCX: 0000000000000000
Feb 28 17:09:23 pve1 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Feb 28 17:09:23 pve1 kernel: RBP: ffffa829f4663c08 R08: ffffed6a51fadf80 R09: 0000000000000000
Feb 28 17:09:23 pve1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffa829f4663cc8
Feb 28 17:09:23 pve1 kernel: R13: ffffa829f4663c48 R14: ffffa829f4663c50 R15: ffff9bfa9383c9e8
Feb 28 17:09:23 pve1 kernel: FS:  00007f10dfac1340(0000) GS:ffff9c18ff100000(0000) knlGS:0000000000000000
Feb 28 17:09:23 pve1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 28 17:09:23 pve1 kernel: CR2: 000055bdb608c530 CR3: 00000004f3904000 CR4: 0000000000750ee0
Feb 28 17:09:23 pve1 kernel: PKRU: 55555554
Feb 28 17:09:23 pve1 kernel: Call Trace:
Feb 28 17:09:23 pve1 kernel:  <IRQ>
Feb 28 17:09:23 pve1 kernel:  ? show_regs+0x6d/0x80
Feb 28 17:09:23 pve1 kernel:  ? watchdog_timer_fn+0x1d8/0x240
Feb 28 17:09:23 pve1 kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
Feb 28 17:09:23 pve1 kernel:  ? __hrtimer_run_queues+0x108/0x280
Feb 28 17:09:23 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 28 17:09:23 pve1 kernel:  ? hrtimer_interrupt+0xf6/0x250
Feb 28 17:09:23 pve1 kernel:  ? __sysvec_apic_timer_interrupt+0x62/0x140
Feb 28 17:09:23 pve1 kernel:  ? sysvec_apic_timer_interrupt+0x8d/0xd0
Feb 28 17:09:23 pve1 kernel:  </IRQ>
Feb 28 17:09:23 pve1 kernel:  <TASK>
Feb 28 17:09:23 pve1 kernel:  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
Feb 28 17:09:23 pve1 kernel:  ? find_lock_entries+0x7f/0x290
Feb 28 17:09:23 pve1 kernel:  ? truncate_folio_batch_exceptionals.part.0+0x1bb/0x200
Feb 28 17:09:23 pve1 kernel:  truncate_inode_pages_range+0xe9/0x4c0
Feb 28 17:09:23 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 28 17:09:23 pve1 kernel:  truncate_inode_pages+0x15/0x30
Feb 28 17:09:23 pve1 kernel:  blkdev_flush_mapping+0x5e/0xf0
Feb 28 17:09:23 pve1 kernel:  blkdev_put_whole+0x3c/0x40
Feb 28 17:09:23 pve1 kernel:  blkdev_put+0x116/0x1e0
Feb 28 17:09:23 pve1 kernel:  blkdev_release+0x2b/0x40
Feb 28 17:09:23 pve1 kernel:  __fput+0xfc/0x2c0
Feb 28 17:09:23 pve1 kernel:  ____fput+0xe/0x20
Feb 28 17:09:23 pve1 kernel:  task_work_run+0x61/0xa0
Feb 28 17:09:23 pve1 kernel:  exit_to_user_mode_prepare+0x170/0x190
Feb 28 17:09:23 pve1 kernel:  syscall_exit_to_user_mode+0x29/0x60
Feb 28 17:09:23 pve1 kernel:  do_syscall_64+0x67/0x90
Feb 28 17:09:23 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 28 17:09:23 pve1 kernel:  ? irqentry_exit_to_user_mode+0x17/0x20
Feb 28 17:09:23 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 28 17:09:23 pve1 kernel:  ? irqentry_exit+0x43/0x50
Feb 28 17:09:23 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 28 17:09:23 pve1 kernel:  ? exc_page_fault+0x94/0x1b0
Feb 28 17:09:23 pve1 kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Feb 28 17:09:23 pve1 kernel: RIP: 0033:0x7f10ea98790a
Feb 28 17:09:23 pve1 kernel: Code: 48 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c 24 0c e8 a3 ce f8 ff 8b 7c 24 0c 89 c2 b8 03 00 00 0>
Feb 28 17:09:23 pve1 kernel: RSP: 002b:00007ffc895c5e60 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
Feb 28 17:09:23 pve1 kernel: RAX: 0000000000000000 RBX: 000055bdb79b3190 RCX: 00007f10ea98790a
Feb 28 17:09:23 pve1 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000010
Feb 28 17:09:23 pve1 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Feb 28 17:09:23 pve1 kernel: R10: 5db6b1a630e97fcf R11: 0000000000000293 R12: 000055bdb79a3a80
Feb 28 17:09:23 pve1 kernel: R13: 00007ffc895c62a0 R14: 000055bdb643fd38 R15: 00007f10eccbc020
Feb 28 17:09:23 pve1 kernel:  </TASK>
 

Attachments

  • qemu-110.log
After it seemed fixed with the latest pve-qemu updates, I've got (seemingly this) problem again.

VM (screenshot; the shell does not come up)

VM (screenshot)

Host (screenshot)

VM's webserver is not reachable.

htop (screenshot; Proxmox Backup Server verify running, backup pool is not on the VM disk pool)

PVE cluster of three nodes. Other VMs on the affected node and on the other nodes work without any problems. The VMs run on an NVMe ZFS mirror.

vm config:
Code:
agent: 1
args: -vnc 0.0.0.0:107
boot: order=scsi0
cores: 4
cpu: x86-64-v3
ide2: none,media=cdrom
memory: 8192
name: nextcloud64
net0: virtio=36:6E:FF:87:4B:12,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: vmpool:vm-107-disk-0,aio=threads,discard=on,format=raw,iothread=1,size=20G,ssd=1
scsi1: vmpool:vm-107-disk-1,aio=threads,discard=on,format=raw,iothread=1,size=1T,ssd=1
scsi2: vmpool:vm-107-disk-2,aio=threads,backup=0,discard=on,iothread=1,size=3T,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=c353f2c8-5387-4f46-b005-d913ed41280b
sockets: 1
startup: order=20
tablet: 0
tags: nc
vmgenid: b2b1fa71-c010-4e16-b8af-8a23315febf0

pveversion:
Code:
root@vas:~# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.7-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
 
After it seemed fixed with the latest pve-qemu updates, I've got (seemingly this) problem again.
Should there have been issues with the connection to PBS, it could be: https://bugzilla.proxmox.com/show_bug.cgi?id=3231
Backup fleecing is currently being worked on to avoid such issues: https://bugzilla.proxmox.com/show_bug.cgi?id=4136

How did the load on PBS and the network look during the problematic backup? Were there other backups running at the same time?

PVE cluster of three nodes. Other VMs on the affected node and on the other nodes work without any problems. The VMs run on an NVMe ZFS mirror.
Is there anything special about this VM compared to other VMs? Did the same VM already fail multiple times?
 
