Yes, we have determined as a group that the problem is on the PBS (kernel) side, affecting all versions of PVE.
FYI, maybe not so important, but we are still running PVE 8.4.1 with kernel Linux pm01 6.8.12-9-pve and are also having those issues with a fully patched PBS.
Yeah, I think it's entirely possible that the same root cause can lead to PVE networking problems as well. Luckily, we haven't (yet) been hit by them, just the PBS slow/halt thing. But the shrinking rcv_wnd, for whatever reason it actually occurs, would certainly be a major problem for e.g. live migration or Ceph storage. I'm a bit flabbergasted that our PVE on 6.17 runs so well, to be honest, since 90% of our qcow2 disks lie on a TrueNAS connected via NFS4 over 25GbE fiber/MTU 9000.
There are a few different scenarios that I think people are potentially hitting due to different deployment methods/configs.
I can confirm this is occurring for me. I am hoping that the Proxmox application-level code will be hardened in addition to whatever tweaks are being made to the networking code in the kernel package.
Would also like to add that 6.17.2-2 has problems on the PVE side: we have noticed VM disks halting randomly, with 'watchers' being stuck on the Ceph side. This happens with live migrations (HA). Downgrading PVE to 6.14.x solves this, and the problem also does not seem to occur with 6.17.2-1. Very hesitant to upgrade to the latest 6.17.x now, as it does not look like the problems are actually solved there.
I have been able to reproduce hung backups to PBS that started after upgrading to 6.14.0-{1,2}-pve, as well as failed live VM migrations on a new PVE cluster with 2x400G LACP when running the same kernels, which log:
QEMU[949247]: kvm: ../util/bitmap.c:167: bitmap_set: Assertion `start >= 0 && nr >= 0' failed.
After upgrading to 6.17.4-1-pve I have not been able to reproduce either failure yet. The difference is statistically significant: 6.17.4-1-pve is much better on my systems than either of 6.14.0-{1,2}-pve. However, I will keep running tests with a set of large VMs (1TB RAM + 2TB local storage) to see if I can break it.
Please open a separate thread regarding the live migration issue and ping me there with @fiona. Assertion failures in QEMU should not happen even if the network is stuck. Please share the VM configuration, the output of pveversion -v from both nodes, the full migration task log and excerpts from the system logs from both nodes around the time the issue happened.
There is a new Linux kernel with version 6.17.11-1-test-pve available for testing. You can get the Debian packages (including sha256 checksums for integrity verification) from http://download.proxmox.com/temp/kernel-6.17.11-tcp-stall/ and install them by using apt install ./[<package-name>].deb. Double check that you booted into the correct version by running uname -a after a reboot.
Testing and feedback on this kernel build is highly appreciated!
Hello Chris, do you suggest installing it on the PVE nodes, only on the PBS server, or both?
Since most reports state that a kernel downgrade on the PBS host fixes the issue, installing and booting the kernel on the PBS host should be enough.
Test kernel installed successfully...
First test looks promising, though the rcv_wnd does fluctuate quite a bit, as does the transfer speed. I haven't seen that before with 6.14 (though it might also be that I didn't pay too much attention to it). I can say more in the morning, once the regular backups have run through during the night when there is no other load (comparing their runtime with the previous days).
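For anyone who wants to watch how rcv_wnd fluctuates on the PBS side during a backup, here is a minimal sketch that pulls the value out of `ss -ti` output. The function name `extract_rcv_wnd` is made up for illustration, and the parsing assumes the two-lines-per-connection layout seen in the outputs posted in this thread (a state line, then a details line), so treat it as a starting point rather than a robust parser.

```shell
# Print "<peer address:port> <rcv_wnd>" for every connection in `ss -ti`
# output read from stdin.
# Assumption: each connection is a line starting with ESTAB followed by a
# details line that contains a rcv_wnd:<bytes> field.
extract_rcv_wnd() {
    awk '
        $1 == "ESTAB" { peer = $5; next }   # remember the peer of this connection
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^rcv_wnd:/) {
                    sub(/^rcv_wnd:/, "", $i)
                    print peer, $i
                }
        }
    '
}

# Hypothetical usage on the PBS host, sampling once per second:
# while sleep 1; do date +%T; ss -ti sport 8007 | extract_rcv_wnd; done
```

Logging the samples to a file during the nightly backups should make it easier to compare kernels than eyeballing single snapshots.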
root@prx-backup:~# uname -a
Linux prx-backup 6.17.11-1-test-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.11-1 (2025-12-09T09:02Z) x86_64 GNU/Linux
root@prx-backup:~# ss -ti sport 8007
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 [::ffff:10.x.y.a]:8007 [::ffff:10.x.y.w]:54852
cubic wscale:10,10 rto:201 rtt:0.081/0.013 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 bytes_sent:1856768 bytes_acked:1856768 bytes_received:3871256091 segs_out:288905 segs_in:340570 data_segs_out:2030 data_segs_in:339498 send 8.84Gbps lastsnd:162916 lastrcv:45 lastack:45 pacing_rate 17.5Gbps delivery_rate 3.72Gbps delivered:2031 app_limited busy:4615ms rcv_rtt:206.125 rcv_space:154367 rcv_ssthresh:421228 minrtt:0.045 rcv_ooopack:569 snd_wnd:372736 rcv_wnd:4096
ESTAB 0 0 [::ffff:10.x.y.a]:8007 [::ffff:10.x.y.v]:39906
cubic wscale:7,10 rto:201 rtt:0.074/0.003 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 bytes_sent:1986249 bytes_retrans:123 bytes_acked:1986126 bytes_received:4422434801 segs_out:407824 segs_in:366754 data_segs_out:3255 data_segs_in:365433 send 9.67Gbps lastsnd:49496 lastrcv:204 lastack:204 pacing_rate 19.2Gbps delivery_rate 3.01Gbps delivered:3256 app_limited busy:8361ms retrans:0/1 dsack_dups:1 rcv_rtt:206.981 rcv_space:200091 rcv_ssthresh:599955 minrtt:0.045 rcv_ooopack:1299 snd_wnd:517120 rcv_wnd:4096
root@prx-gisela:~# uname -a
Linux prx-gisela 6.17.2-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.2-2 (2025-11-26T12:33Z) x86_64 GNU/Linux
root@prx-gisela:~# ss -ti dport 8007
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 1540696 10.x.y.w:54852 10.x.y.a:8007
cubic wscale:10,10 rto:201 rtt:0.267/0.033 ato:63 mss:8948 pmtu:9000 rcvmss:8948 advmss:8948 cwnd:2 ssthresh:2 bytes_sent:3878303601 bytes_retrans:6289750 bytes_acked:3872013852 bytes_received:1856914 segs_out:521860 segs_in:289091 data_segs_out:520787 data_segs_in:2031 send 536Mbps lastsnd:72 lastrcv:29818 lastack:72 pacing_rate 642Mbps delivery_rate 258Mbps delivered:520008 busy:28274170ms rwnd_limited:28266093ms(100.0%) retrans:0/824 dsack_dups:13 rcv_rtt:1 rcv_space:187548 rcv_ssthresh:372681 notsent:1540696 minrtt:0.046 snd_wnd:4096 rcv_wnd:372736 rehash:8
root@prx-hanspeter:~# uname -a
Linux prx-hanspeter 6.14.11-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.11-2 (2025-09-12T09:46Z) x86_64 GNU/Linux
root@prx-hanspeter:~# ss -ti dport 8007
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 1737472 10.x.y.v:39906 10.x.y.a:8007
cubic wscale:10,7 rto:201 rtt:0.29/0.046 ato:55 mss:8948 pmtu:9000 rcvmss:7199 advmss:8948 cwnd:3 ssthresh:3 bytes_sent:4439587038 bytes_retrans:16034029 bytes_acked:4423553010 bytes_received:1986307 segs_out:681869 segs_in:408100 data_segs_out:680546 data_segs_in:3257 send 741Mbps lastsnd:128 lastrcv:26751 lastack:128 pacing_rate 888Mbps delivery_rate 245Mbps delivered:678507 busy:21866678ms rwnd_limited:21863847ms(100.0%) retrans:0/2080 dsack_dups:5 rcv_rtt:0.823 rcv_space:108231 rcv_ssthresh:517077 notsent:1737472 minrtt:0.043 snd_wnd:4096 rcv_wnd:517120 rehash:12
root@prx-hanspeter:~#
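On the sending (PVE) side, the stalled connections above stand out through rwnd_limited at roughly 100% together with a tiny snd_wnd. A rough sketch for spotting such connections automatically could look like the following. The helper name `find_rwnd_limited` and the default threshold of 90 percent are assumptions chosen for illustration, and the parsing relies on the `rwnd_limited:<time>ms(<pct>%)` format shown in the outputs above.

```shell
# Print "<local> <peer> <pct>" for connections that spent at least MIN_PCT
# percent of their busy time limited by the receive window of the peer.
# MIN_PCT (first argument) is a hypothetical threshold, defaulting to 90.
find_rwnd_limited() {
    awk -v min="${1:-90}" '
        $1 == "ESTAB" { local = $4; peer = $5; next }
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^rwnd_limited:/) {
                    pct = $i
                    sub(/^.*\(/, "", pct)    # drop everything up to the opening parenthesis
                    sub(/%\).*$/, "", pct)   # drop the trailing percent sign and parenthesis
                    if (pct + 0 >= min + 0) print local, peer, pct
                }
        }
    '
}

# Hypothetical usage on a PVE node while a backup is running:
# ss -ti dport 8007 | find_rwnd_limited 90
```

Connections without an rwnd_limited field (as on the PBS side) simply produce no output.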
sysctl -w net.ipv4.tcp_window_scaling=0 followed by a systemctl restart proxmox-backup-proxy.service proxmox-backup.service.
iperf? Would be great to have a more stable reproducer for further investigation.
There is a new Linux kernel with version 6.17.11-2-test-pve available for testing. You can get the Debian packages (including sha256 checksums for integrity verification) from http://download.proxmox.com/temp/kernel-6.17.11-tcp-stall-2/ and install them by using apt install ./[<package-name>].deb. Again, double check that you booted into the correct version by running uname -a after a reboot.
I was wondering if you might also see differences in network speeds for short timespans. But may I suggest testing 6.17.11-2-test-pve first, as this now includes some cherry-picked bugfix commits from newer kernel versions on top of the 6.17.11 kernel.
I'll test the 6.15 and 6.16 mainline kernels in the next two nights, and will try without TCP window scaling on 6.17.11 after that. If you have an idea how I might be able to trigger the bug with iperf without leaving it running at line speed for hours, that would be appreciated.
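The "double check that you booted into the correct version" step can also be scripted, which helps when several hosts are being tested. This is only a small sketch: `kernel_is` is a made-up helper name, and the optional second argument exists purely so the comparison can be tried without rebooting.

```shell
# Succeed only if the running kernel release equals the expected one.
# $1: expected release string, e.g. 6.17.11-2-test-pve
# $2: optional override of the running release (defaults to `uname -r`)
kernel_is() {
    expected=$1
    running=${2:-$(uname -r)}
    [ "$running" = "$expected" ]
}

# Hypothetical usage after rebooting into the test kernel:
# kernel_is 6.17.11-2-test-pve && echo "test kernel is running"
```

`uname -r` prints only the release string, which is the third field of the `uname -a` lines shown in this thread.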
root@prx-backup:~# uname -a
Linux prx-backup 6.17.11-2-test-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.11-2 (2025-12-09T09:02Z) x86_64 GNU/Linux
root@prx-backup:~# ss -ti sport 8007
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 [::ffff:10.x.y.a]:8007 [::ffff:10.x.y.z]:45952
cubic wscale:7,10 rto:201 rtt:0.08/0.006 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:20 bytes_sent:2193408 bytes_acked:2193408 bytes_received:6721663128 segs_out:142602 segs_in:285078 data_segs_out:5058 data_segs_in:283239 send 8.95Gbps lastsnd:17863 lastrcv:187 lastack:187 pacing_rate 10.7Gbps delivery_rate 3.3Gbps delivered:5059 app_limited busy:13303ms rcv_rtt:207.596 rcv_space:155943 rcv_ssthresh:522519 minrtt:0.051 rcv_ooopack:17 snd_wnd:383872 rcv_wnd:4096
ESTAB 0 0 [::ffff:10.x.y.a]:8007 [::ffff:10.x.y.b]:48152
cubic wscale:10,10 rto:201 rtt:0.081/0.01 ato:40 mss:8948 pmtu:9000 rcvmss:7168 advmss:8948 cwnd:10 ssthresh:16 bytes_sent:2238838 bytes_acked:2238838 bytes_received:2925506009 segs_out:119878 segs_in:160978 data_segs_out:1849 data_segs_in:160096 send 8.84Gbps lastsnd:99832 lastrcv:207 lastack:207 pacing_rate 10.6Gbps delivery_rate 2.63Gbps delivered:1850 app_limited busy:3629ms rcv_rtt:206.413 rcv_space:119381 rcv_ssthresh:754046 minrtt:0.042 rcv_ooopack:4 snd_wnd:176128 rcv_wnd:7168
root@prx-gisela:~# uname -a
Linux prx-gisela 6.17.2-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.2-2 (2025-11-26T12:33Z) x86_64 GNU/Linux
root@prx-gisela:~# ss -ti dport 8007
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 3193316 10.x.y.b:48152 10.x.y.a:8007
cubic wscale:10,10 rto:201 rtt:0.178/0.011 ato:60 mss:8948 pmtu:9000 rcvmss:7868 advmss:8948 cwnd:2 ssthresh:2 bytes_sent:2926053465 bytes_retrans:375424 bytes_acked:2925678042 bytes_received:2239019 segs_out:344511 segs_in:119929 data_segs_out:343627 data_segs_in:1851 send 804Mbps lastsnd:109 lastrcv:79 lastack:79 pacing_rate 965Mbps delivery_rate 391Mbps delivered:343619 busy:3285983ms rwnd_limited:3285291ms(100.0%) retrans:0/60 dsack_dups:50 rcv_rtt:0.583 rcv_space:81409 rcv_ssthresh:175710 notsent:3193316 minrtt:0.048 snd_wnd:7168 rcv_wnd:176128 rehash:2
root@prx-hanspeter:~# uname -a
Linux prx-hanspeter 6.14.11-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.11-2 (2025-09-12T09:46Z) x86_64 GNU/Linux
root@prx-hanspeter:~# ss -ti dport 8007
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 3726048 10.x.y.z:45952 10.x.y.a:8007
cubic wscale:10,7 rto:201 rtt:0.208/0.032 ato:68 mss:8948 pmtu:9000 rcvmss:8948 advmss:8948 cwnd:774 ssthresh:2 bytes_sent:6722063594 bytes_retrans:466002 bytes_acked:6721597593 bytes_received:2193408 segs_out:797194 segs_in:142585 data_segs_out:795355 data_segs_in:5058 send 266Gbps lastsnd:103 lastrcv:14450 lastack:103 pacing_rate 318Gbps delivery_rate 284Mbps delivered:795324 busy:3230884ms rwnd_limited:3228404ms(99.9%) retrans:0/73 dsack_dups:38 rcv_rtt:0.438 rcv_space:89644 rcv_ssthresh:383845 notsent:3726048 minrtt:0.044 snd_wnd:4096 rcv_wnd:383872 rehash:2