Super slow, timeout, and VM stuck while backing up, after updated to PVE 9.1.1 and PBS 4.0.20

... Does everyone who has the problem have LACP aggregation?
Me too. All my PVE nodes run with an 802.3ad (LACP) bond, layer 3+4 hashing, MTU 1500, across two 10G cards.
The PBS is a VM that runs on one of the nodes, with its virtual disks as qcow2 files on a NAS over NFS.
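For reference, an 802.3ad layer 3+4 bond like the one described would look roughly like this in /etc/network/interfaces on PVE (a sketch only; the NIC names enp129s0f0/enp129s0f1, addresses, and bridge name are placeholders, not taken from this thread):

```
auto bond0
iface bond0 inet manual
    bond-slaves enp129s0f0 enp129s0f1
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-miimon 100
    mtu 1500

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10/24
    gateway 192.0.2.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
```

The layer3+4 hash policy spreads flows across the bond members by IP and port, so a single backup stream still maxes out at one member link's speed.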
 
You could try to downgrade the kernel on the PVE host as well and report back; it might help, but it's not tied to a specific manufacturer's driver.
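Downgrading and pinning the older kernel on a PVE host can be sketched like this (assuming the proxmox-kernel-6.14 series is still available in your configured repositories; the exact version string 6.14.11-4-pve may differ on your system):

```shell
# Install the older kernel series alongside the current one
apt update
apt install proxmox-kernel-6.14

# Pin it so the host keeps booting 6.14 instead of the newest installed kernel;
# use the exact version shown by: proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.14.11-4-pve

# Reboot, then verify the running kernel
uname -r
```

Remove the pin later with `proxmox-boot-tool kernel unpin` once a fixed 6.17 build is out.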

Actually I'm running an Intel Corporation Ethernet Controller X710 for 10GbE SFP+ on the 8.4.14 hosts
and
BCM5719 on 9.1.1 test host

but no issues restoring VMs in either scenario.

On the other hand, I'm on an Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) on PBS 4.1, and every backup was a nightmare before I reverted to the good old 6.14 kernel.
So, with a 9.1.1 node and the 6.14.11-4-pve kernel, I can restore without any problem.
PS: I had no problems with mass live migration, and I always use the same bond of 4x10Gbit with VLANs and MTU 9000.
 
I've seen several mention the Intel 82599ES NIC, which is also what I'm running in my dev PBS server (and PVE hosts), with 20Gb LAGGs as well.

Given the drop in disk I/O, it certainly points towards a networking issue on the surface.
 
So, with a 9.1.1 node and the 6.14.11-4-pve kernel, I can restore without any problem.
PS: I had no problems with mass live migration, and I always use the same bond of 4x10Gbit with VLANs and MTU 9000.

I have downgraded the kernel on two clusters and on PBS, so they are on the latest software available (no-subscription) but with kernel 6.14.11-4-pve.
One backup was normal, another one is still running, very slowly; a Linux VM with the guest agent:
INFO: 32% (19.2 GiB of 60.0 GiB) in 23m 32s, read: 16.0 MiB/s, write: 15.8 MiB/s
Incremental, with a new dirty bitmap. Really far too slow...
So downgrading the kernel doesn't resolve the issue for me.
 
Can confirm downgrading the kernel worked for us: no more hanging backups or broken VMs.
Which PVE version are you using? I have the latest no-subscription with the downgraded kernel too, but some VMs are still very, very slow... (I am testing only a few, less important VMs.)
INFO: 67% (40.2 GiB of 60.0 GiB) in 49m 34s, read: 9.2 MiB/s, write: 0 B/s
 
Which PVE version are you using? I have the latest no-subscription with the downgraded kernel too, but some VMs are still very, very slow... (I am testing only a few, less important VMs.)
INFO: 67% (40.2 GiB of 60.0 GiB) in 49m 34s, read: 9.2 MiB/s, write: 0 B/s
It is in my earlier message, but on our main production clusters we have both version 9 and version 8, both fully updated as of last weekend. All PBS instances are on 4, though (4.0 with the 6.14 kernel).
 
It is in my earlier message, but on our main production clusters we have both version 9 and version 8, both fully updated as of last weekend. All PBS instances are on 4, though (4.0 with the 6.14 kernel).
I started to have problems after upgrading PBS from 4.0 to 4.1; I found no problems at all with PBS 4.0.
 
We were unfortunately not able to reproduce the issue yet. The ZFS kernel module is the same version, 2.3.4+pve1, in both kernel 6.14 and 6.17, so the likely cause lies in the rest of the kernel code. Unfortunately, the difference between 6.14 and 6.17 is very big. If anybody is not using ZFS and is still affected by the issue at hand, you could test mainline builds to help narrow it down:
https://kernel.ubuntu.com/mainline/v6.15/
https://kernel.ubuntu.com/mainline/v6.16/
(the amd64/linux-image... and amd64/linux-modules... packages need to be installed).
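Installing one of those mainline builds boils down to downloading the two .deb files and installing them with dpkg (a sketch; pick the actual file names from the amd64 directory of the build you want, as they include the full version string):

```shell
# Download the linux-image... and linux-modules... .deb files for the
# build under test (file names elided here; copy them from the page)
wget https://kernel.ubuntu.com/mainline/v6.15/amd64/<linux-image...deb>
wget https://kernel.ubuntu.com/mainline/v6.15/amd64/<linux-modules...deb>

# Install both packages together, then reboot into the new kernel
dpkg -i linux-image-*.deb linux-modules-*.deb
reboot
```

Note these mainline builds have no ZFS module, which is why this test only helps on non-ZFS setups.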
 
Did you disable it on the PBS host only, or also on the PVE hosts? Have you noticed performance issues with 10Gbit NICs, like decreasing transfer speeds?
No, we've had no issues, but it's hard to compare since we have a bandwidth limit on our backup jobs: with 802.3ad we hit the limits on our core switch. We have roughly 22 backup servers running at the same time, all of them with 10Gbit NICs, and the backbone is MLAG with 2x100Gbit.
 
Is the issue reproducible by stressing the network, e.g. with iperf? Run iperf -s on the PBS host and iperf -c <PBS-host-IP> -t 600 -i 10 on the backup source.

Edit: Also, do you see high memory pressure on the PBS while the issue appears?
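One quick way to check memory pressure on the PBS host is the kernel's pressure stall information (PSI), available since kernel 4.20:

```shell
# Show memory pressure; a rising "full" avg10 means tasks are fully
# stalled waiting on memory right now
cat /proc/pressure/memory

# Watch memory and I/O pressure together while a backup is running
watch -n 2 'cat /proc/pressure/memory /proc/pressure/io'
```

If memory pressure stays near zero while the backup crawls, that points back towards the network or block layer rather than memory.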
 