Super slow backups, timeouts, and VMs stuck during backup after updating to PVE 9.1.1 and PBS 4.0.20

We're in the same boat there. Unfortunately, we haven't managed to reproduce it at all so far, which likely means there is some additional network-side factor that makes this much more likely to trigger on your systems.
Absolutely. It might be the fact that PBS is running as a VM on PVE and the 9000 MTU network is bridged, but that's nothing more than a gut feeling.

What I find curious is that it's always the PBS traffic that gets halted, never the NFS traffic - neither the NFS traffic from the SAN to the PVE nodes nor the NFS traffic from PBS to the (different) SAN that carries the PBS storage.

If you have more commits for me to test, I'm happy to do so. Now that I've enabled fleecing to local-zfs, the VMs no longer get stuck when the backup gets stuck, which was a showstopper before.
 
Absolutely. It might be the fact that PBS is running as a VM on PVE and the 9000 MTU network is bridged, but that's nothing more than a gut feeling.

What I find curious is that it's always the PBS traffic that gets halted, never the NFS traffic - neither the NFS traffic from the SAN to the PVE nodes nor the NFS traffic from PBS to the (different) SAN that carries the PBS storage.

The backup and reader sessions use HTTP/2, which probably changes the traffic pattern sufficiently from other workloads. We haven't had any reports yet of other traffic/connections being affected. It might also be a bad interaction between the HTTP/2 client/server code and the new behaviour of the kernel.

If you have more commits for me to test, I'm happy to do so. Now that I've enabled fleecing to local-zfs, the VMs no longer get stuck when the backup gets stuck, which was a showstopper before.

The commit I mentioned earlier is actually already part of 6.7.11 :-/ Chris replied to the LKML thread with the patch series that contained the "problematic" commit; maybe the netdev people have more ideas on what to look at next.
 
There is a new Linux kernel with version 6.17.11-2-test-pve available for testing. You can get the Debian packages (including sha256 checksums for integrity verification) from http://download.proxmox.com/temp/kernel-6.17.11-tcp-stall-2/ and install them with apt install ./<package-name>.deb. Again, double-check that you booted into the correct version with uname -a after a reboot.
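As a rough sketch (the exact file and checksum names below are placeholders, use whatever the directory above actually contains), after downloading the .deb files and the checksum file into one directory:

# verify the downloaded packages against the published sha256 checksums
sha256sum -c <checksum-file>

# install all downloaded kernel packages from the current directory
# (the package name pattern is an assumption, adjust to the actual file names)
apt install ./proxmox-kernel-6.17.11-2-test-pve*.deb

# reboot, then confirm the running kernel version
reboot
uname -a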

Testing and feedback on this kernel build are highly appreciated!
Hi,
this weekend I also activated kernel 6.17.11-1-test-pve on the primary PBS (2 x 10G NICs with LACP and 9000 MTU), and so far it's working properly. I've now activated the latest test release, 6.17.11-2-test-pve, and the first backup was fine. I'll keep you updated if any issues arise.
 
Has the hypothesis been tested that this is not a kernel bug, but rather a problem in the Proxmox Rust data transfer application that has been triggered by a change in the kernel? FWIW, I also have been unable to hang PVE to PBS server network connections with the new kernel outside of this Rust application.
 
Has the hypothesis been tested that this is not a kernel bug, but rather a problem in the Proxmox Rust data transfer application that has been triggered by a change in the kernel? FWIW, I also have been unable to hang PVE to PBS server network connections with the new kernel outside of this Rust application.
To be fair, it shouldn't be able to trigger a TCP rcv_wnd collapse like that. That said, I did not see any calls in strace that would indicate an application-side problem or misuse of syscalls.
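In case others want to look at the same thing on their setup, this is roughly the kind of thing I checked; the interface name and PID are placeholders, and 8007 is the default PBS port:

# watch the TCP window that PBS advertises to the PVE node during a backup
# (values are unscaled unless the handshake is part of the capture)
tcpdump -ni <iface> 'tcp port 8007' | grep --line-buffered 'win '

# kernel-side view of the backup connection(s) on the PBS host
ss -tinp 'sport = :8007'

# trace network-related syscalls of the receiving process (e.g. a
# proxmox-backup-proxy worker) to spot application-side misuse
strace -f -tt -e trace=network,read,write -p <pid>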
 
To be fair, it shouldn't be able to trigger a TCP rcv_wnd collapse like that. That said, I did not see any calls in strace that would indicate an application-side problem or misuse of syscalls.
Even if the kernel shouldn't be doing what is observed, is there an application-level patch to consider to make PBS more robust, e.g., requesting a larger value if it gets too small?
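For context on the "request a larger value" part: whatever an application asks for via SO_RCVBUF is capped by the kernel's net.core.rmem_max, and TCP receive-buffer autotuning operates within net.ipv4.tcp_rmem, so an application-side change could only help within those limits. A rough sketch for inspecting (and, purely as an experiment, raising) them on the PBS side; the numbers are arbitrary examples, not a recommendation:

# current hard cap for SO_RCVBUF requests from userspace
sysctl net.core.rmem_max

# min/default/max buffer sizes used by TCP receive autotuning
sysctl net.ipv4.tcp_rmem

# temporarily raise both (example values, reverted on reboot)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 131072 16777216"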