Super slow backups, timeouts, and stuck VMs during backup after updating to PVE 9.1.1 and PBS 4.0.20

we're in the same boat there. unfortunately we haven't managed to reproduce it at all so far, which likely means there is some additional factor network-wise that makes this much more likely to trigger on your systems.
Absolutely. It might be the fact that PBS is running as a VM on PVE and that the 9000 MTU network is bridged, but that's nothing more than a gut feeling.

What I find curious is that it's always the PBS traffic that gets halted, never the NFS traffic - neither the NFS traffic from the SAN to the PVE nodes nor the NFS traffic from PBS to the (different) SAN that carries the PBS storage.

If you have more commits for me to test, I'm happy to. Now that I've enabled fleecing to local-zfs, the VMs no longer get stuck when the backup gets stuck, which was a showstopper before.
 
Absolutely. It might be the fact that PBS is running as a VM on PVE and that the 9000 MTU network is bridged, but that's nothing more than a gut feeling.

What I find curious is that it's always the PBS traffic that gets halted, never the NFS traffic - neither the NFS traffic from the SAN to the PVE nodes nor the NFS traffic from PBS to the (different) SAN that carries the PBS storage.

the backup and reader sessions use HTTP/2, which probably changes the traffic pattern sufficiently from other workloads. we haven't had any reports yet of other traffic/connections being affected. it might also be a bad interaction between the HTTP/2 client/server code and the new behaviour of the kernel.

If you have more commits for me to test, I'm happy to. Now that I've enabled fleecing to local-zfs, the VMs no longer get stuck when the backup gets stuck, which was a showstopper before.

the commit I mentioned earlier is actually already part of 6.17.11 :-/ Chris replied to the LKML thread with the patch series that contained the "problematic" commit, maybe the netdev people have more ideas about what to look at next.
 
There is a new Linux kernel with version 6.17.11-2-test-pve available for testing. You can get the Debian packages (including sha256 checksums for integrity verification) from http://download.proxmox.com/temp/kernel-6.17.11-tcp-stall-2/ and install them via apt install ./<package-name>.deb. Again, double-check that you booted into the correct version via uname -a after a reboot.

Testing and feedback on this kernel build is highly appreciated!
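For convenience, a minimal sketch of the whole procedure (the package file name is a placeholder, and the checksum has to be compared against the published value):
Code:
# download the package from the URL above (file name is a placeholder)
wget http://download.proxmox.com/temp/kernel-6.17.11-tcp-stall-2/<package-name>.deb
# verify integrity against the published sha256 checksum
sha256sum ./<package-name>.deb
# install, reboot, and confirm the running kernel
apt install ./<package-name>.deb
reboot
uname -a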
Hi,
this weekend I also activated Kernel 6.17.11-1-test-pve on the primary PBS (2 x 10G NICs with LACP and 9000 MTU), and so far it's working properly. I've now activated the latest test release, 6.17.11-2-test-pve, and the first backup was fine. I'll keep you updated if any issues arise.
 
Has the hypothesis been tested that this is not a kernel bug, but rather a problem in the Proxmox Rust data transfer application that has been triggered by a change in the kernel? FWIW, I also have been unable to hang PVE to PBS server network connections with the new kernel outside of this Rust application.
 
Has the hypothesis been tested that this is not a kernel bug, but rather a problem in the Proxmox Rust data transfer application that has been triggered by a change in the kernel? FWIW, I also have been unable to hang PVE to PBS server network connections with the new kernel outside of this Rust application.
tbf, the application shouldn't be able to trigger a TCP rcv_wnd collapse like that. That said, I did not see any calls in strace that would indicate an application-side problem/misuse of syscalls.
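For anyone who wants to run a similar check on their own setup, a rough sketch of a comparable strace invocation (the PID placeholder stands for whatever process is handling the stuck connection; the exact filter set is an assumption):
Code:
# attach to the running process; log socket syscalls plus plain read/write with timestamps and durations
strace -f -tt -T -e trace=network,read,write -p <pid> -o strace.log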
 
tbf, the application shouldn't be able to trigger a TCP rcv_wnd collapse like that. That said, I did not see any calls in strace that would indicate an application-side problem/misuse of syscalls.
Even if the kernel shouldn't be doing what is observed, is there an application-level patch to consider to make PBS more robust, e.g., requesting a larger value if it gets too small?
 
Has the hypothesis been tested that this is not a kernel bug, but rather a problem in the Proxmox Rust data transfer application that has been triggered by a change in the kernel? FWIW, I also have been unable to hang PVE to PBS server network connections with the new kernel outside of this Rust application.
this would require a reproducer first. but yes, it is possible the issue is triggered by a certain network workload/traffic pattern/.., which might very well be specific to our code (or otherwise very rare). we can't make code more robust if we don't know what the actual cause of the problem is.
 
Has the hypothesis been tested that this is not a kernel bug, but rather a problem in the Proxmox Rust data transfer application that has been triggered by a change in the kernel? FWIW, I also have been unable to hang PVE to PBS server network connections with the new kernel outside of this Rust application.
FYI, we pinned our kernel on PBS to 6.14.11-4 last night and went from a minimum of 5 freezes a night to 0. Nothing else was changed.
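For reference, a sketch of how such a pin can be done with proxmox-boot-tool (the exact version string 6.14.11-4-pve is an assumption; check the list output first):
Code:
# show the kernels known to the boot tool, then pin the desired one and reboot
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.14.11-4-pve
reboot
uname -a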
 
On 6.18, with higher [rw]mem_default, the backups went through tonight ([rw]mem_max is 4 MiB by default in 6.18). We'll see over the next few nights whether this was just chance.

Code:
sysctl -w net.core.rmem_default=1048576
sysctl -w net.core.wmem_default=1048576
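In case these values turn out to help, a sketch of making them persistent across reboots via a sysctl drop-in (the file name is just an example):
Code:
cat > /etc/sysctl.d/90-socket-buffers.conf <<'EOF'
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
EOF
# reload all sysctl configuration files
sysctl --system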
 
So just to clarify, the receive window being small (or even zero) is not an issue per se; it is controlled by the kernel's TCP stack and used to signal to the other side how much data it is allowed to send at a given moment. The issue could be that it is not being calculated/updated correctly.

There was this fix in kernel v6.18 [0], which could be the reason the issue is not so easy to trigger anymore. Some patches included in that series also expose new tracing metrics for tcp_rcvbuf_grow(). So it might be worth recording and inspecting these during a stuck state on kernel v6.18, which can be done via
Code:
perf record -a -e tcp:tcp_rcvbuf_grow sleep 20
perf script
according to the patch commit message.

Another thing of interest for further investigation would be to capture the traffic flowing over the socket. Ideally one would capture the whole TCP session from the start until the hang, but such captures grow in size rather quickly. So as a first step it would already be of interest to capture at least some of the traffic after the hang via tcpdump, e.g. with:
Code:
tcpdump -i <interface> -w dump.pcap port 8007
Be aware that this, too, can grow in size rather quickly.
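One way to keep the capture bounded is tcpdump's built-in file rotation, e.g. a sketch like this (file size and count are arbitrary):
Code:
# rotate through at most 10 files of ~100 MB each, overwriting the oldest
tcpdump -i <interface> -C 100 -W 10 -w dump.pcap port 8007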

Edit: yet another interesting aspect to check would be how tcp_adv_win_scale changes the behavior, by setting it to either 2 or -2. This changes how the advertised receive window scales with respect to the available socket buffer memory.
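A sketch of how to try the two suggested values (note the current value first so it can be restored afterwards):
Code:
sysctl net.ipv4.tcp_adv_win_scale
sysctl -w net.ipv4.tcp_adv_win_scale=2
# or, for the second suggested value:
sysctl -w net.ipv4.tcp_adv_win_scale=-2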

[0] https://git.kernel.org/pub/scm/linu.../?id=aa251c84636c326471ca9d53723816ba8fffe2bf
 
On 6.18, with higher [rw]mem_default, the backups went through tonight ([rw]mem_max is 4 MiB by default in 6.18). We'll see over the next few nights whether this was just chance.

Code:
sysctl -w net.core.rmem_default=1048576
sysctl -w net.core.wmem_default=1048576
I rejoiced too soon: one backup did in fact stall, but recovered (sort of) after two hours:

Code:
242: 2025-12-16 04:28:22 INFO: using fast incremental mode (dirty-bitmap), 908.0 MiB dirty of 50.0 GiB total
242: 2025-12-16 04:28:25 INFO:  64% (584.0 MiB of 908.0 MiB) in 3s, read: 194.7 MiB/s, write: 193.3 MiB/s
242: 2025-12-16 04:28:28 INFO:  81% (736.0 MiB of 908.0 MiB) in 6s, read: 50.7 MiB/s, write: 50.7 MiB/s
242: 2025-12-16 06:36:19 INFO:  92% (844.0 MiB of 908.0 MiB) in 2h 7m 57s, read: 14.4 KiB/s, write: 14.4 KiB/s
242: 2025-12-16 06:52:27 INFO:  93% (848.0 MiB of 908.0 MiB) in 2h 24m 5s, read: 4.2 KiB/s, write: 4.2 KiB/s
242: 2025-12-16 07:34:00 INFO: 100% (908.0 MiB of 908.0 MiB) in 3h 5m 38s, read: 24.6 KiB/s, write: 24.6 KiB/s
242: 2025-12-16 08:38:36 INFO: backup was done incrementally, reused 49.12 GiB (98%)
242: 2025-12-16 08:38:36 INFO: transferred 908.00 MiB in 15014 seconds (61.9 KiB/s)

When the stall reoccurs while I'm awake, I'll run the perf and tcpdump captures.
 
Has the hypothesis been tested that this is not a kernel bug, but rather a problem in the Proxmox Rust data transfer application that has been triggered by a change in the kernel? FWIW, I also have been unable to hang PVE to PBS server network connections with the new kernel outside of this Rust application.
While this was considered, what speaks against it being an application-level issue is that in the ss outputs @LKo provided, the receive queues on the server side are empty, so the application has no data to read from the socket and is waiting for new data. The sending side, however, has its send queues filled up and wants to transmit new data.
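For reference, a sketch of an ss invocation that shows the queues together with socket memory and internal TCP info for the backup connections (the port filter assumes the default PBS port 8007 used earlier in this thread):
Code:
# -t TCP, -n numeric, -m socket memory, -i internal TCP info (rcv_space, cwnd, ...)
ss -tnmi '( sport = :8007 or dport = :8007 )'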
 
Back on this thread:
does proxmox-kernel-6.17.4-1-pve-signed correct the bad behavior?
Is it safe to update?
Thanks for the help.
This version only fixes the issue we were able to reproduce so far; it does not contain a complete fix for the problem reported by others in this thread.
 
Ehm... after some days of trouble-free backups, my PBS with the latest test kernel installed (6.17.11-2-test-pve) crashed again, taking down our ERP database server... Same problem as logged in previous posts. :(
 
There is a new Linux kernel with version 6.17.11-3-test-pve available for testing (thanks a lot again to @fabian). You can get the Debian packages (including sha256 checksums for integrity verification) from http://download.proxmox.com/temp/kernel-6.17.11-tcp-stall-3-reverts/ and install them via apt install ./<package-name>.deb. Again, double-check that you booted into the correct version via uname -a after a reboot. This kernel now reverts commit 65c52878 ("tcp: fix sk_rcvbuf overshoot") and its follow-up patches.

Testing and feedback on this kernel build is highly appreciated!
 
There is a new Linux kernel with version 6.17.11-3-test-pve available for testing (thanks a lot again to @fabian). You can get the Debian packages (including sha256 checksums for integrity verification) from http://download.proxmox.com/temp/kernel-6.17.11-tcp-stall-3-reverts/ and install them via apt install ./<package-name>.deb. Again, double-check that you booted into the correct version via uname -a after a reboot. This kernel now reverts commit 65c52878 ("tcp: fix sk_rcvbuf overshoot") and its follow-up patches.

Testing and feedback on this kernel build is highly appreciated!
Thanks, Chris, for the info.
However, we're in the "feature freeze" phase leading up to the Christmas holidays, and I don't feel like testing a kernel that might work well for a few days and then, on December 25th, crash my system just as I'm popping the champagne. I've reverted to 6.14 for now in the hope of a peaceful holiday. I'll resume testing in 2026.
 