Super slow, timeouts, and VMs stuck while backing up after updating to PVE 9.1.1 and PBS 4.0.20

FYI, maybe not so important, but we are still running PVE 8.4.1 with kernel 6.8.12-9-pve (Linux pm01) and are also having these issues with a fully patched PBS
 
FYI, maybe not so important, but we are still running PVE 8.4.1 with kernel 6.8.12-9-pve (Linux pm01) and are also having these issues with a fully patched PBS
Yes, we have determined as a group that the problem is on the PBS (kernel) side, affecting all versions of PVE.

- Would also like to add that 6.17.2-2 has problems on the PVE side as well: we have noticed VM disks halting randomly, with 'watchers' being stuck on the Ceph side. This happens with live migrations (HA). Downgrading PVE to 6.14.x solves this, and the problem also does not seem to occur with 6.17.2-1. Very hesitant to upgrade to the latest 6.17.x now, as it does not look like the problems are actually solved there.
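
In case it helps anyone else chasing the stuck-watcher symptom: below is a rough Python sketch (the pool name is a placeholder, and it assumes `rbd status --format json` reports a watchers list, as on recent Ceph releases) that just dumps the current watchers per RBD image, which makes it easier to spot a watcher left behind on the old node after a live migration.

```python
#!/usr/bin/env python3
"""List the current watchers for every RBD image in a pool.

Sketch only: assumes the 'rbd' CLI is available on a Ceph client node
and that the pool name below matches your setup."""
import json
import subprocess

POOL = "vm-pool"  # placeholder: replace with your RBD pool name


def rbd_json(*args: str):
    """Run an rbd subcommand with JSON output and return the parsed result."""
    out = subprocess.run(
        ["rbd", *args, "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def main() -> None:
    for image in rbd_json("ls", "-p", POOL):
        watchers = rbd_json("status", f"{POOL}/{image}").get("watchers", [])
        print(f"{POOL}/{image}: {len(watchers)} watcher(s)")
        for w in watchers:
            # A watcher address that no longer matches the node the VM is
            # currently running on is a candidate for a stale/stuck watcher.
            print(f"  {w}")


if __name__ == "__main__":
    main()
```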
 
There are a few different scenarios that I think people are potentially hitting due to different deployment methods/configs. I had updated PVE to the new kernel a week back, but I had not updated PBS. I run PBS on PVE and not as a dedicated server. The new kernel installed on just PVE slowed everything down a lot for all of my VMs, and I thought I had some bad storage or a Ceph issue. I noticed at the time that my Docker containers were taking forever to load and that PBS backups were still running in the morning, which had never happened before. I had installed the new kernel on PVE the night before, so I reverted it to 6.14.11-4 and all issues were resolved.

Anyway, what I'm getting at is that I think some people might still be seeing issues with PBS even after upgrading to what might be a patched 6.17.2-2, due to having a 6.17.x kernel on PVE when the PBS VM runs on the same host. There is a mix of kernels between PBS and PVE, so I'd recommend trying to run the same kernel on both if you are in the same boat as I am. For now, the stable path is to stay on 6.14.11-4 on both; however, if you have time to test, update BOTH PVE and PBS to the latest "patched" kernel.
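
If it helps, here is a small Python sketch for that check (the host names are placeholders, and it assumes passwordless SSH as root to all nodes): it just compares `uname -r` across the PVE and PBS hosts so you can see at a glance whether they are actually booted on the same kernel.

```python
#!/usr/bin/env python3
"""Report the running kernel on each PVE/PBS host and flag mismatches.

Sketch only: assumes passwordless SSH as root to the hosts listed below."""
import subprocess

HOSTS = ["pve01", "pve02", "pbs01"]  # placeholders: your node hostnames


def running_kernel(host: str) -> str:
    """Return the output of `uname -r` on the given host."""
    return subprocess.run(
        ["ssh", f"root@{host}", "uname", "-r"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()


def main() -> None:
    kernels = {host: running_kernel(host) for host in HOSTS}
    for host, kernel in kernels.items():
        print(f"{host}: {kernel}")
    if len(set(kernels.values())) > 1:
        print("WARNING: hosts are booted on different kernels")
    else:
        print("All hosts are on the same kernel")


if __name__ == "__main__":
    main()
```

If a node turns out to be booted on the wrong kernel, pinning it (e.g. with `proxmox-boot-tool kernel pin`) is how I'd hold it on 6.14.11-4 until this is properly fixed.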

For what it's worth, I'm just a home lab user, with each node having a 10Gb connection to my network (MTU 9000) and a 40Gb Ceph ring.
 
There are a few different scenarios that I think people are potentially hitting due to different deployment methods/configs.
Yeah, I think it's entirely possible that the same root cause can cause PVE networking problems as well. Luckily, we haven't (yet) been hit by them, just the PBS slow/halt thing. But the shrinking rcv_wnd, whatever the reason it actually occurs, would certainly be a major problem for e.g. live migration or Ceph storage. I'm a bit flabbergasted that our PVE on 6.17 runs so well, to be honest, since 90% of our qcow2 disks live on a TrueNAS connected via NFSv4 over 25GbE fiber (MTU 9000).
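
For anyone who wants to watch that while a backup runs: a crude Python sketch like the one below (run it on the PBS side; the port is the default PBS port and still a placeholder, and which counters actually show up depends on your iproute2 version) just polls `ss -tin` and prints the window-related fields per connection, so you can see whether they keep shrinking over the course of a backup.

```python
#!/usr/bin/env python3
"""Poll `ss -tin` for connections on the PBS port and print window-related
TCP counters over time.

Rough sketch: the port is a placeholder and which fields appear in the
output depends on your iproute2/kernel version."""
import re
import subprocess
import time

PORT = 8007     # placeholder: PBS API/backup port in a default install
INTERVAL = 5    # seconds between samples
FIELDS = re.compile(r"\b(rcv_wnd|rcv_space|rcv_ssthresh|snd_wnd|cwnd):\S+")


def sample() -> None:
    out = subprocess.run(
        ["ss", "-tin", "sport", "=", f":{PORT}"],
        check=True, capture_output=True, text=True,
    ).stdout
    lines = out.splitlines()
    # With -i, ss prints one line with the addresses followed by one
    # indented line with the TCP internals for each connection.
    for addr_line, info_line in zip(lines[1::2], lines[2::2]):
        counters = " ".join(m.group(0) for m in FIELDS.finditer(info_line))
        if counters:
            peer = addr_line.split()[-1]
            print(f"{time.strftime('%H:%M:%S')} {peer} {counters}")


if __name__ == "__main__":
    while True:
        sample()
        time.sleep(INTERVAL)
```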
 
I think it's entirely possible that the same root cause can cause PVE networking problems as well.
I can confirm this is occurring for me. I am hoping that the Proxmox application-level code will be hardened, in addition to whatever tweaks are being made to the networking code in the kernel package.