Super slow backups, timeouts, and VMs stuck while backing up after updating to PVE 9.1.1 and PBS 4.0.20

FYI, maybe not so important, but we are still running PVE 8.4.1 with kernel 6.8.12-9-pve and are also seeing these issues with a fully patched PBS.
 
FYI, maybe not so important, but we are still running PVE 8.4.1 with kernel 6.8.12-9-pve and are also seeing these issues with a fully patched PBS.
Yes, we have determined as a group that the problem is on the PBS (kernel) side, affecting all versions of PVE.

- Would also like to add that 6.17.2-2 has problems on the PVE side: we have noticed VM disks halting randomly, with 'watchers' being stuck on the Ceph side. This happens with live migrations (HA). Downgrading PVE to 6.14.x solves this; that problem also does not seem to occur with 6.17.2-1. Very hesitant to upgrade to the latest 6.17.x now, as it does not look like the problems are actually solved there.
 
There are a few different scenarios that I think people are potentially hitting due to different deployment methods/configs. I had updated PVE to the new kernel a week back, but I had not updated PBS; I run PBS on PVE, not as a dedicated server. With the new kernel installed on just PVE, everything slowed down a lot for all of my VMs, and I thought I had bad storage or a Ceph issue. I noticed at the time that my Docker containers were taking forever to load, and that PBS backups were still running in the morning, which had never happened before. Since I had installed the new kernel on PVE only the night before, I reverted to 6.14.11-4 and all issues were resolved.
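For reference, rolling back as described here can be done with the standard Proxmox kernel pinning tools. A sketch; the exact package name for the 6.14.11-4 kernel is an assumption and may differ for your repository:

```shell
# Make sure the older kernel is installed (package name assumed; check
# `apt list --installed 'proxmox-kernel-6.14*'` for what you actually have).
apt install proxmox-kernel-6.14.11-4-pve-signed

# Pin it so the bootloader selects it by default, then reboot.
proxmox-boot-tool kernel pin 6.14.11-4-pve
reboot
```

`proxmox-boot-tool kernel list` shows the versions available to pin, and `proxmox-boot-tool kernel unpin` undoes the pin once a fixed kernel is out.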

Anyway, what I'm getting at is this: I think some might still see issues with PBS even after upgrading to what may be a patched 6.17.2-2, if the 6.17.x kernel is still running on PVE and the PBS VM is on the same host. With a mix of kernels between PBS and PVE, I'd recommend running the same kernel on both if you are in the same boat as I am. For now, the stable path is to stay on 6.14.11-4 on both; however, if you have time to test, update BOTH PVE and PBS to the latest "patched" kernel.

For what it's worth, I'm just a home lab user, with each node having a 10Gb connection to my network (MTU 9000) and a 40Gb Ceph ring.
 
There are a few different scenarios that I think people are potentially hitting due to different deployment methods/configs.
Yeah, I think it's entirely possible that the same root cause can hit PVE networking problems as well. Luckily, we haven't (yet) been hit by them, just the PBS slow/halt thing. But the shrinking rcv_wnd, whatever its actual cause, would certainly be a major problem for e.g. live migration or Ceph storage. I'm a bit flabbergasted that our PVE on 6.17 runs so well, to be honest, since 90% of our qcow2 disks live on a TrueNAS connected via NFSv4 over 25GbE fiber with MTU 9000.
 
I think it's entirely possible that the same root cause can hit PVE networking problems as well.
I can confirm this is occurring for me. I am hoping that the Proxmox application-level code will be hardened in addition to whatever tweaks are being made to the networking code in the kernel package.
 
Yes, we have determined as a group that the problem is on the PBS (kernel) side, affecting all versions of PVE.

- Would also like to add that 6.17.2-2 has problems on the PVE side: we have noticed VM disks halting randomly, with 'watchers' being stuck on the Ceph side. This happens with live migrations (HA). Downgrading PVE to 6.14.x solves this; that problem also does not seem to occur with 6.17.2-1. Very hesitant to upgrade to the latest 6.17.x now, as it does not look like the problems are actually solved there.

The difference between -2 and -1 for 6.17.2 is a tiny fix that avoids a kernel panic, so this only means that the issue at hand doesn't trigger quickly or deterministically on either version of that kernel.
 
Hi,
I have been able to reproduce hung backups to PBS that started after upgrading to 6.14.0-{1,2}-pve, as well as failed live VM migrations on a new PVE cluster with 2x400G LACP when running the same kernels, which log:

QEMU[949247]: kvm: ../util/bitmap.c:167: bitmap_set: Assertion `start >= 0 && nr >= 0' failed.

After upgrading to 6.17.4-1-pve I have not been able to reproduce either failure yet. The numbers are significant: 6.17.4-1-pve behaves much better on my systems than either of the 6.14.0-{1,2}-pve kernels. However, I will keep running tests with a set of large VMs (1 TB RAM + 2 TB local storage) to see if I can break it.
Please open a separate thread regarding the live migration issue and ping me there with @fiona. Assertion failures in QEMU should not happen even if the network is stuck. Please share the VM configuration, the output of pveversion -v from both nodes, the full migration task log, and excerpts from the system logs of both nodes around the time the issue happened.
 
There is a new Linux kernel with version 6.17.11-1-test-pve available for testing. You can get the Debian packages (including sha256 checksums for integrity verification) from http://download.proxmox.com/temp/kernel-6.17.11-tcp-stall/ and install them with apt install ./<package-name>.deb. Double-check that you booted into the correct version with uname -a after a reboot.

Testing and feedback on this kernel build is highly appreciated!
 
Since most reports state that a kernel downgrade on the PBS host fixes the issue, installing and booting the kernel on the PBS host should be enough.
 
Since most reports state that a kernel downgrade on the PBS host fixes the issue, installing and booting the kernel on the PBS host should be enough.
Test kernel installed successfully...

I will report the results as soon as I can test it!
 
There is a new Linux kernel with version 6.17.11-1-test-pve available for testing. You can get the Debian packages (including sha256 checksums for integrity verification) from http://download.proxmox.com/temp/kernel-6.17.11-tcp-stall/ and install them with apt install ./<package-name>.deb. Double-check that you booted into the correct version with uname -a after a reboot.

Testing and feedback on this kernel build is highly appreciated!
First test looks promising, though the rcv_wnd fluctuates quite a bit, as does the transfer speed; I haven't seen that before with 6.14 (though it may just be that I didn't pay much attention to it). I can say more in the morning, once the regular backups have run through overnight with no other load (I'll compare their runtime with the previous days).
 
Hello everyone,
the first round of tests and overnight backups completed successfully on my PBS (kernel 6.17.11-1-test-pve), with 2 x 1 Gbps NICs aggregated via LACP and MTU 1500. If tonight's round goes well too, I will put the second PBS back into production, with 10G interfaces and MTU 9000.

Best regards!
 
Hey, sorry to disappoint. Two out of five nodes got stuck again during the nightly backups. The other three nodes finished their backups within a normal timeframe.

Code:
root@prx-backup:~# uname -a
Linux prx-backup 6.17.11-1-test-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.11-1 (2025-12-09T09:02Z) x86_64 GNU/Linux
root@prx-backup:~# ss -ti sport 8007                                                                                                             
State           Recv-Q           Send-Q                             Local Address:Port                               Peer Address:Port                                                                                                                                                             
ESTAB           0                0                           [::ffff:10.x.y.a]:8007                       [::ffff:10.x.y.w]:54852           
         cubic wscale:10,10 rto:201 rtt:0.081/0.013 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 bytes_sent:1856768 bytes_acked:1856768 bytes_received:3871256091 segs_out:288905 segs_in:340570 data_segs_out:2030 data_segs_in:339498 send 8.84Gbps lastsnd:162916 lastrcv:45 lastack:45 pacing_rate 17.5Gbps delivery_rate 3.72Gbps delivered:2031 app_limited busy:4615ms rcv_rtt:206.125 rcv_space:154367 rcv_ssthresh:421228 minrtt:0.045 rcv_ooopack:569 snd_wnd:372736 rcv_wnd:4096
ESTAB           0                0                           [::ffff:10.x.y.a]:8007                       [::ffff:10.x.y.v]:39906           
         cubic wscale:7,10 rto:201 rtt:0.074/0.003 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 bytes_sent:1986249 bytes_retrans:123 bytes_acked:1986126 bytes_received:4422434801 segs_out:407824 segs_in:366754 data_segs_out:3255 data_segs_in:365433 send 9.67Gbps lastsnd:49496 lastrcv:204 lastack:204 pacing_rate 19.2Gbps delivery_rate 3.01Gbps delivered:3256 app_limited busy:8361ms retrans:0/1 dsack_dups:1 rcv_rtt:206.981 rcv_space:200091 rcv_ssthresh:599955 minrtt:0.045 rcv_ooopack:1299 snd_wnd:517120 rcv_wnd:4096


root@prx-gisela:~# uname -a                                                                                                                     
Linux prx-gisela 6.17.2-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.2-2 (2025-11-26T12:33Z) x86_64 GNU/Linux                                           
root@prx-gisela:~# ss -ti dport 8007                                                                                                             
State             Recv-Q             Send-Q                           Local Address:Port                            Peer Address:Port             
ESTAB             0                  1540696                            10.x.y.w:54852                            10.x.y.a:8007             
         cubic wscale:10,10 rto:201 rtt:0.267/0.033 ato:63 mss:8948 pmtu:9000 rcvmss:8948 advmss:8948 cwnd:2 ssthresh:2 bytes_sent:3878303601 bytes_retrans:6289750 bytes_acked:3872013852 bytes_received:1856914 segs_out:521860 segs_in:289091 data_segs_out:520787 data_segs_in:2031 send 536Mbps lastsnd:72 lastrcv:29818 lastack:72 pacing_rate 642Mbps delivery_rate 258Mbps delivered:520008 busy:28274170ms rwnd_limited:28266093ms(100.0%) retrans:0/824 dsack_dups:13 rcv_rtt:1 rcv_space:187548 rcv_ssthresh:372681 notsent:1540696 minrtt:0.046 snd_wnd:4096 rcv_wnd:372736 rehash:8


Linux prx-hanspeter 6.14.11-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.11-2 (2025-09-12T09:46Z) x86_64 GNU/Linux
root@prx-hanspeter:~# ss -ti dport 8007
State             Recv-Q             Send-Q                           Local Address:Port                            Peer Address:Port             
ESTAB             0                  1737472                            10.x.y.v:39906                            10.x.y.a:8007             
         cubic wscale:10,7 rto:201 rtt:0.29/0.046 ato:55 mss:8948 pmtu:9000 rcvmss:7199 advmss:8948 cwnd:3 ssthresh:3 bytes_sent:4439587038 bytes_retrans:16034029 bytes_acked:4423553010 bytes_received:1986307 segs_out:681869 segs_in:408100 data_segs_out:680546 data_segs_in:3257 send 741Mbps lastsnd:128 lastrcv:26751 lastack:128 pacing_rate 888Mbps delivery_rate 245Mbps delivered:678507 busy:21866678ms rwnd_limited:21863847ms(100.0%) retrans:0/2080 dsack_dups:5 rcv_rtt:0.823 rcv_space:108231 rcv_ssthresh:517077 notsent:1737472 minrtt:0.043 snd_wnd:4096 rcv_wnd:517120 rehash:12
root@prx-hanspeter:~#
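For anyone checking their own connections for this pattern: the signature in the dumps above is the sender spending ~100% of its busy time rwnd_limited while the peer advertises a tiny window (snd_wnd:4096 on the sender side). A rough awk filter for that, fed here with two fields copied from the output above; in practice you would pipe `ss -ti dport 8007 | tr ' ' '\n'` into the same program. The thresholds are arbitrary:

```shell
# Flag a window-stalled connection: tiny peer window plus rwnd-limited time.
printf '%s\n' 'snd_wnd:4096' 'rwnd_limited:28266093ms(100.0%)' |
awk -F'[:(%]' '
    $1 == "snd_wnd"      && $2 + 0 < 16384 { print "peer window tiny: " $2 " bytes" }
    $1 == "rwnd_limited" && $3 + 0 > 90    { print "rwnd-limited for " $3 "% of busy time" }
'
# prints:
#   peer window tiny: 4096 bytes
#   rwnd-limited for 100.0% of busy time
```

The point is the combination of the two fields, which matches the stalled senders above (snd_wnd:4096, rwnd_limited ~100%) but not the healthy connections.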
 
Could you check whether disabling TCP window scaling has an effect? sysctl -w net.ipv4.tcp_window_scaling=0 followed by a systemctl restart proxmox-backup-proxy.service proxmox-backup.service.

Further, are you able to trigger the stalls/slowdowns with e.g. iperf as well? It would be great to have a more stable reproducer for further investigation.
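The suggested experiment as a sketch. The sysctl and restart commands are taken from the post above; the revert step assumes the kernel default of 1 for net.ipv4.tcp_window_scaling:

```shell
# On the PBS host: disable TCP window scaling (RFC 7323) for new
# connections, then restart the PBS services so new backup sessions
# use sockets created with scaling off.
sysctl -w net.ipv4.tcp_window_scaling=0
systemctl restart proxmox-backup-proxy.service proxmox-backup.service

# ...run a backup round, then revert: without window scaling the receive
# window is capped at 64 KB, which limits throughput on fast links.
sysctl -w net.ipv4.tcp_window_scaling=1
systemctl restart proxmox-backup-proxy.service proxmox-backup.service
```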
 
OK, a short test (2 min) with iperf3 -c 10.x.y.a -P 8 -t 300 reached line speed (25GbE) with no slowdowns (rcv_wnd was consistently large); I didn't want to leave it running for too long during the day, as the disks are served via the same network.

I'll test the 6.15 and 6.16 mainline kernels over the next two nights, and will try without TCP window scaling on 6.17.11 after that. If you have an idea how I might trigger the bug with iperf without leaving it running at line speed for hours, that would be appreciated.
 
There is a new Linux kernel with version 6.17.11-2-test-pve available for testing. You can get the Debian packages (including sha256 checksums for integrity verification) from http://download.proxmox.com/temp/kernel-6.17.11-tcp-stall-2/ and install them with apt install ./<package-name>.deb. Again, double-check that you booted into the correct version with uname -a after a reboot.

Testing and feedback on this kernel build is highly appreciated!
 
I'll test the 6.15 and 6.16 mainline kernels over the next two nights, and will try without TCP window scaling on 6.17.11 after that. If you have an idea how I might trigger the bug with iperf without leaving it running at line speed for hours, that would be appreciated.
I was wondering if you might see differences in network speeds over short timespans as well. But may I suggest testing 6.17.11-2-test-pve first, as it now includes some cherry-picked bugfix commits from newer kernel versions on top of the 6.17.11 base.
 
Unfortunately, same thing again. What's curious is that it is the same two nodes that stalled again - but that could simply be because these two nodes mainly host Windows machines, which have a higher rate of change. The other three nodes went through without issue.

I'll let them finish the backups on 6.14 tonight and will try to reproduce the issue with iperf tomorrow after business hours.

Code:
root@prx-backup:~# uname -a                                                                                                                                           
Linux prx-backup 6.17.11-2-test-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.11-2 (2025-12-09T09:02Z) x86_64 GNU/Linux                                                         
root@prx-backup:~# ss -ti sport 8007                                                                                                                                 
State              Recv-Q              Send-Q                                  Local Address:Port                                     Peer Address:Port                                                                                                                                                                                     
ESTAB              0                   0                                [::ffff:10.x.y.a]:8007                             [::ffff:10.x.y.z]:45952             
         cubic wscale:7,10 rto:201 rtt:0.08/0.006 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:20 bytes_sent:2193408 bytes_acked:2193408 bytes_received:6721663128 segs_out:142602 segs_in:285078 data_segs_out:5058 data_segs_in:283239 send 8.95Gbps lastsnd:17863 lastrcv:187 lastack:187 pacing_rate 10.7Gbps delivery_rate 3.3Gbps delivered:5059 app_limited busy:13303ms rcv_rtt:207.596 rcv_space:155943 rcv_ssthresh:522519 minrtt:0.051 rcv_ooopack:17 snd_wnd:383872 rcv_wnd:4096
ESTAB              0                   0                                [::ffff:10.x.y.a]:8007                             [::ffff:10.x.y.b]:48152             
         cubic wscale:10,10 rto:201 rtt:0.081/0.01 ato:40 mss:8948 pmtu:9000 rcvmss:7168 advmss:8948 cwnd:10 ssthresh:16 bytes_sent:2238838 bytes_acked:2238838 bytes_received:2925506009 segs_out:119878 segs_in:160978 data_segs_out:1849 data_segs_in:160096 send 8.84Gbps lastsnd:99832 lastrcv:207 lastack:207 pacing_rate 10.6Gbps delivery_rate 2.63Gbps delivered:1850 app_limited busy:3629ms rcv_rtt:206.413 rcv_space:119381 rcv_ssthresh:754046 minrtt:0.042 rcv_ooopack:4 snd_wnd:176128 rcv_wnd:7168


root@prx-gisela:~# uname -a                                                                                                                                           
Linux prx-gisela 6.17.2-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.2-2 (2025-11-26T12:33Z) x86_64 GNU/Linux                                                               
root@prx-gisela:~# ss -ti dport 8007                                                                                                                                 
State                Recv-Q                Send-Q                                Local Address:Port                                  Peer Address:Port               
ESTAB                0                     3193316                                 10.x.y.b:48152                                  10.x.y.a:8007               
         cubic wscale:10,10 rto:201 rtt:0.178/0.011 ato:60 mss:8948 pmtu:9000 rcvmss:7868 advmss:8948 cwnd:2 ssthresh:2 bytes_sent:2926053465 bytes_retrans:375424 bytes_acked:2925678042 bytes_received:2239019 segs_out:344511 segs_in:119929 data_segs_out:343627 data_segs_in:1851 send 804Mbps lastsnd:109 lastrcv:79 lastack:79 pacing_rate 965Mbps delivery_rate 391Mbps delivered:343619 busy:3285983ms rwnd_limited:3285291ms(100.0%) retrans:0/60 dsack_dups:50 rcv_rtt:0.583 rcv_space:81409 rcv_ssthresh:175710 notsent:3193316 minrtt:0.048 snd_wnd:7168 rcv_wnd:176128 rehash:2                                                                                           

root@prx-hanspeter:~# uname -a                                                                                                                                       
Linux prx-hanspeter 6.14.11-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.11-2 (2025-09-12T09:46Z) x86_64 GNU/Linux                                                           
root@prx-hanspeter:~# ss -ti dport 8007                                                                                                                               
State                Recv-Q                Send-Q                                Local Address:Port                                  Peer Address:Port               
ESTAB                0                     3726048                                 10.x.y.z:45952                                  10.x.y.a:8007               
         cubic wscale:10,7 rto:201 rtt:0.208/0.032 ato:68 mss:8948 pmtu:9000 rcvmss:8948 advmss:8948 cwnd:774 ssthresh:2 bytes_sent:6722063594 bytes_retrans:466002 bytes_acked:6721597593 bytes_received:2193408 segs_out:797194 segs_in:142585 data_segs_out:795355 data_segs_in:5058 send 266Gbps lastsnd:103 lastrcv:14450 lastack:103 pacing_rate 318Gbps delivery_rate 284Mbps delivered:795324 busy:3230884ms rwnd_limited:3228404ms(99.9%) retrans:0/73 dsack_dups:38 rcv_rtt:0.438 rcv_space:89644 rcv_ssthresh:383845 notsent:3726048 minrtt:0.044 snd_wnd:4096 rcv_wnd:383872 rehash:2