Thanks Lukas for this useful suggestion! I'm doing my testing in the off-hours anyway; 6.17.11-3-test-pve is up and running and we'll have a result in the morning.
For all others: if the stalled backup crashes your VMs, please look into enabling fleecing for the scheduled backups. Using fleecing on local storage changed the problem from "VMs freeze and lose data" down to "a few skipped backups" for our users.
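Fleecing is configured per backup job or globally; a minimal sketch of the global default, assuming PVE 8.2 or newer and a local storage named `local-zfs` (the storage name here is a placeholder for whatever fast local storage you have):

```
# /etc/vzdump.conf - defaults applied to all backup jobs
fleecing: enabled=1,storage=local-zfs
```

With fleecing enabled, writes from the guest go to the local fleecing image while the backup target stalls, so the VM itself keeps running even if the backup eventually times out.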
> Everyone in this thread is facing this... Roll back to kernel 6.14 to temporarily fix the problem.

I have a problem with PBS 4.1 and PVE 9.1: while I backed up a Windows Server 2016 VM, the VM got stuck/hung and I could not do anything until I rebooted it, which is a bad issue for a production VM. Has anyone else been facing this? What's weird is that the one VM with SQL Server running backs up fine, but all the other VMs, which don't have SQL Server, hang/get stuck and must be restarted.
Booted 6.17.11-3-test-pve, and the very first backed-up VM froze and hung after a minute. Went back to 6.14.11-4-pve and re-ran the backup, which completed without any problems and exactly as expected. When I saw that the VM backup on the first node had stalled, I ran tcpdump and captured just under 2 GB of data (of course only thanks to the second node continuing to back up). I can't say whether my dump will be of any use, but I'm happy to provide it.

INFO: 47% (15.3 GiB of 32.0 GiB) in 1m 34s, read: 333.3 MiB/s, write: 34.7 MiB/s
INFO: 48% (15.6 GiB of 32.0 GiB) in 1m 37s, read: 113.3 MiB/s, write: 2.7 MiB/s
INFO: 50% (16.0 GiB of 32.0 GiB) in 1m 40s, read: 134.7 MiB/s, write: 12.0 MiB/s
INFO: 51% (16.3 GiB of 32.0 GiB) in 3m 42s, read: 2.6 MiB/s, write: 537.2 KiB/s
INFO: 52% (16.7 GiB of 32.0 GiB) in 9m 47s, read: 1.2 MiB/s, write: 78.6 KiB/s
INFO: 53% (17.0 GiB of 32.0 GiB) in 10m 21s, read: 8.1 MiB/s, write: 361.4 KiB/s
INFO: 54% (17.3 GiB of 32.0 GiB) in 10m 24s, read: 102.7 MiB/s, write: 4.0 MiB/s
INFO: 55% (17.6 GiB of 32.0 GiB) in 11m 13s, read: 6.7 MiB/s, write: 167.2 KiB/s
Thanks for your efforts. Could you upload the dumps to some privately owned file sharing service at your disposal and send me a direct link to download the dumps via a PM?

> When I saw that the VM backup on the first node had stalled, I ran tcpdump and captured just under 2 GB of data. I can't say whether my dump will be of any use, but I'm happy to provide it. Is there any way to share it other than publicly here in the thread?
> I'm doing my testing in the off-hours anyway; 6.17.11-3-test-pve is up and running and we'll have a result in the morning.

No luck, all nodes now froze their backups:
Every 5.0s: ss -ti sport 8007 prx-backup: Wed Dec 17 11:57:23 2025
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 [::ffff:10.31.14.32]:8007 [::ffff:10.31.14.24]:44702
cubic wscale:7,10 rto:201 rtt:0.082/0.008 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:19 bytes_sent:2562889 bytes_retrans:123 bytes_acked:2562766 bytes_received:1180908594 segs_out:178584 segs_in:182290 data_segs_out:993 data_segs_in:181688 send 8.73Gbps lastsnd:97452 lastrcv:110 lastack:110 pacing_rate 10.5Gbps delivery_rate 2.57Gbps delivered:994 app_limited busy:486ms retrans:0/1 dsack_dups:1 rcv_rtt:207.234 rcv_space:111902 rcv_ssthresh:1160894 minrtt:0.042 rcv_ooopack:43 snd_wnd:3487488 rcv_wnd:4096
ESTAB 0 0 [::ffff:10.31.14.32]:8007 [::ffff:10.31.14.27]:57566
cubic wscale:10,10 rto:201 rtt:0.08/0.006 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:16 bytes_sent:2632483 bytes_acked:2632483 bytes_received:3201652887 segs_out:193920 segs_in:255877 data_segs_out:1327 data_segs_in:254555 send 8.95Gbps lastsnd:304897 lastrcv:179 lastack:179 pacing_rate 10.6Gbps delivery_rate 4.18Gbps delivered:1328 app_limited busy:2420ms rcv_rtt:206.776 rcv_space:109371 rcv_ssthresh:692321 minrtt:0.043 rcv_ooopack:8 snd_wnd:1446912 rcv_wnd:4096
ESTAB 0 0 [::ffff:10.31.14.32]:8007 [::ffff:10.31.14.25]:60382
cubic wscale:7,10 rto:201 rtt:0.083/0.008 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:20 bytes_sent:1785006 bytes_acked:1785006 bytes_received:1278331644 segs_out:174551 segs_in:183555 data_segs_out:1429 data_segs_in:182758 send 8.62Gbps lastsnd:39679 lastrcv:159 lastack:159 pacing_rate 10.2Gbps delivery_rate 2.76Gbps delivered:1430 app_limited busy:1263ms rcv_rtt:207.523 rcv_space:95072 rcv_ssthresh:463231 minrtt:0.053 rcv_ooopack:11 snd_wnd:516480 rcv_wnd:4096
ESTAB 0 0 [::ffff:10.31.14.32]:8007 [::ffff:10.31.14.31]:40856
cubic wscale:10,10 rto:201 rtt:0.057/0.01 ato:40 mss:8948 pmtu:9000 rcvmss:8192 advmss:8948 cwnd:10 bytes_sent:1832498 bytes_acked:1832498 bytes_received:1228981092 segs_out:150983 segs_in:149414 data_segs_out:1664 data_segs_in:147713 send 12.6Gbps lastsnd:145958 lastrcv:153 lastack:153 pacing_rate 25.1Gbps delivery_rate 4.77Gbps delivered:1665 app_limited busy:413ms rcv_rtt:207.307 rcv_space:49465 rcv_ssthresh:248909 minrtt:0.015 rcv_ooopack:14 snd_wnd:280576 rcv_wnd:8192
ESTAB 0 0 [::ffff:10.31.14.32]:8007 [::ffff:10.31.14.23]:55554
cubic wscale:10,10 rto:201 rtt:0.085/0.006 ato:40 mss:8948 pmtu:9000 rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:933314 bytes_acked:933314 bytes_received:1224661971 segs_out:281514 segs_in:152444 data_segs_out:1001 data_segs_in:151656 send 8.42Gbps lastsnd:26933 lastrcv:106 lastack:106 pacing_rate 16.8Gbps delivery_rate 2.74Gbps delivered:1002 app_limited busy:435ms rcv_rtt:206.999 rcv_space:96242 rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168
root@prx-backup:~# uname -a
Linux prx-backup 6.17.11-3-test-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.11-3 (2025-12-09T09:02Z) x86_64 GNU/Linux
Compression speed: 468.21 MB/s
Decompress speed: 596.21 MB/s
AES256/GCM speed: 3711.01 MB/s
Verify speed: 440.28 MB/s
┌───────────────────────────────────┬─────────────────────┐
│ Name │ Value │
├───────────────────────────────────┼─────────────────────┤
│ TLS (maximal backup upload speed) │ 395.65 MB/s (32%) │
├───────────────────────────────────┼─────────────────────┤
│ SHA256 checksum computation speed │ 1735.38 MB/s (86%) │
├───────────────────────────────────┼─────────────────────┤
│ ZStd level 1 compression speed │ 468.21 MB/s (62%) │
├───────────────────────────────────┼─────────────────────┤
│ ZStd level 1 decompression speed │ 596.21 MB/s (50%) │
├───────────────────────────────────┼─────────────────────┤
│ Chunk verification speed │ 440.28 MB/s (58%) │
├───────────────────────────────────┼─────────────────────┤
│ AES256 GCM encryption speed │ 3711.01 MB/s (102%) │
└───────────────────────────────────┴─────────────────────┘
INFO: scsi0: dirty-bitmap status: OK (17.7 GiB of 100.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 17.7 GiB dirty of 100.0 GiB total
INFO: 5% (912.0 MiB of 17.7 GiB) in 3s, read: 304.0 MiB/s, write: 262.7 MiB/s
INFO: 7% (1.3 GiB of 17.7 GiB) in 3m 58s, read: 1.8 MiB/s, write: 1.3 MiB/s
INFO: 9% (1.8 GiB of 17.7 GiB) in 20m 17s, read: 506.2 KiB/s, write: 418.4 KiB/s
INFO: 10% (1.8 GiB of 17.7 GiB) in 49m 43s, read: 4.6 KiB/s, write: 4.6 KiB/s
INFO: 11% (2.0 GiB of 17.7 GiB) in 1h 22m 3s, read: 118.2 KiB/s, write: 107.7 KiB/s
INFO: 12% (2.2 GiB of 17.7 GiB) in 1h 22m 6s, read: 74.7 MiB/s, write: 36.0 MiB/s
INFO: 13% (2.4 GiB of 17.7 GiB) in 2h 40m 47s, read: 32.1 KiB/s, write: 26.0 KiB/s
INFO: 15% (2.7 GiB of 17.7 GiB) in 3h 33m 15s, read: 109.3 KiB/s, write: 71.6 KiB/s
INFO: 16% (2.9 GiB of 17.7 GiB) in 4h 59m 47s, read: 41.8 KiB/s, write: 25.2 KiB/s
INFO: 18% (3.3 GiB of 17.7 GiB) in 5h 22m 10s, read: 286.7 KiB/s, write: 259.2 KiB/s
INFO: 19% (3.4 GiB of 17.7 GiB) in 7h 22m 47s, read: 19.2 KiB/s, write: 18.1 KiB/s
INFO: 20% (3.7 GiB of 17.7 GiB) in 8h 4m 54s, read: 123.2 KiB/s, write: 95.6 KiB/s
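For scale, the overall average of the stalled run can be computed from the last log line (a quick back-of-the-envelope check using integer arithmetic):

```shell
# Rough average for the stalled run above: 3.7 GiB in 8h 4m 54s.
secs=$((8*3600 + 4*60 + 54))       # 29094 s
kib=$((37 * 1024 * 1024 / 10))     # 3.7 GiB expressed in KiB
echo "$((kib / secs)) KiB/s"       # prints "133 KiB/s"
```

So a dirty-bitmap backup that started at 300 MiB/s collapses to an overall average in the hundreds of KiB/s.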
Can you also include the socket memory stats for ss in this state?

> No luck, all nodes now froze their backups:
@Chris dm with a short tcpdump incoming
Edit: what's interesting is that most backups that get stuck are either DB2 installations or Windows clients (both have a high rate of change). Also, 6.18 does seem to make a stall a lot less likely: with 6.18 I mostly get one or two nodes stalled, not all five of them, and some even recover after two hours of stalling.
> ss -tim sport 8007

root@prx-backup:~# uname -a
Linux prx-backup 6.17.11-3-test-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.11-3 (2025-12-09T09:02Z) x86_64 GNU/Linux
root@prx-backup:~# ss -tim sport 8007
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 [::ffff:10.x.y.a]:8007 [::ffff:10.x.y.c]:44702
skmem:(r0,rb4352,t0,tb332800,f0,w0,o0,bl0,d127) cubic wscale:7,10 rto:201 rtt:0.084/0.011 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:19 bytes_sent:2564308 bytes_retrans:123 bytes_acked:2564185 bytes_received:1221479474 segs_out:188501 segs_in:192207 data_segs_out:1005 data_segs_in:191593 send 8.52Gbps lastsnd:62217 lastrcv:26 lastack:26 pacing_rate 10.2Gbps delivery_rate 2.57Gbps delivered:1006 app_limited busy:486ms retrans:0/1 dsack_dups:1 rcv_rtt:207.232 rcv_space:111902 rcv_ssthresh:1160894 minrtt:0.042 rcv_ooopack:43 snd_wnd:3487488 rcv_wnd:4096
ESTAB 0 0 [::ffff:10.x.y.a]:8007 [::ffff:10.x.y.b]:57566
skmem:(r0,rb4352,t0,tb332800,f0,w0,o0,bl0,d1032) cubic wscale:10,10 rto:201 rtt:0.085/0.014 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:16 bytes_sent:2634947 bytes_acked:2634947 bytes_received:3242215575 segs_out:203842 segs_in:265799 data_segs_out:1346 data_segs_in:264458 send 8.42Gbps lastsnd:15706 lastrcv:111 lastack:111 pacing_rate 10.1Gbps delivery_rate 4.18Gbps delivered:1347 app_limited busy:2420ms rcv_rtt:206.987 rcv_space:109371 rcv_ssthresh:692321 minrtt:0.043 rcv_ooopack:8 snd_wnd:1446912 rcv_wnd:4096
ESTAB 0 0 [::ffff:10.x.y.a]:8007 [::ffff:10.x.y.z]:60382
skmem:(r0,rb4352,t0,tb332800,f0,w0,o0,bl0,d19) cubic wscale:7,10 rto:201 rtt:0.074/0.005 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:20 bytes_sent:1788323 bytes_acked:1788323 bytes_received:1318890236 segs_out:184478 segs_in:193482 data_segs_out:1454 data_segs_in:192660 send 9.67Gbps lastsnd:92287 lastrcv:141 lastack:141 pacing_rate 11.5Gbps delivery_rate 2.76Gbps delivered:1455 app_limited busy:1265ms rcv_rtt:207 rcv_space:95072 rcv_ssthresh:463231 minrtt:0.053 rcv_ooopack:11 snd_wnd:516480 rcv_wnd:4096
ESTAB 0 0 [::ffff:10.x.y.a]:8007 [::ffff:10.x.z.d]:40856
skmem:(r0,rb8432,t0,tb332800,f0,w0,o0,bl0,d306) cubic wscale:10,10 rto:201 rtt:0.053/0.016 ato:40 mss:8948 pmtu:9000 rcvmss:8192 advmss:8948 cwnd:10 bytes_sent:1857063 bytes_acked:1857063 bytes_received:1310073700 segs_out:161056 segs_in:159487 data_segs_out:1838 data_segs_in:157612 send 13.5Gbps lastsnd:2521 lastrcv:46 lastack:46 pacing_rate 26.6Gbps delivery_rate 4.77Gbps delivered:1839 app_limited busy:425ms rcv_rtt:207 rcv_space:49465 rcv_ssthresh:248909 minrtt:0.015 rcv_ooopack:14 snd_wnd:280576 rcv_wnd:8192
ESTAB 0 0 [::ffff:10.x.y.a]:8007 [::ffff:10.x.z.e]:55554
skmem:(r0,rb7488,t0,tb332800,f0,w0,o0,bl0,d20) cubic wscale:10,10 rto:201 rtt:0.085/0.015 ato:40 mss:8948 pmtu:9000 rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:937478 bytes_acked:937478 bytes_received:1295747055 segs_out:301010 segs_in:162410 data_segs_out:1035 data_segs_in:161588 send 8.42Gbps lastsnd:3308 lastrcv:191 lastack:191 pacing_rate 16.7Gbps delivery_rate 2.74Gbps delivered:1036 app_limited busy:437ms rcv_rtt:207.551 rcv_space:96242 rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168
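The striking values in this output are the tiny receive buffers (rb4352 in skmem) and receive windows (rcv_wnd:4096): the server has shrunk the window to roughly one page, so the clients can barely push any data. To watch just those two fields, the relevant tokens can be filtered out of the ss output; a sketch, using a sample line taken from the output above (in practice, pipe `ss -tim sport 8007` through the same filter):

```shell
# Pull the receive-buffer (rbNNN) and receive-window (rcv_wnd:NNN)
# fields out of a line of `ss -tim` output.
sample='skmem:(r0,rb4352,t0,tb332800,f0,w0,o0,bl0,d127) cubic rcv_wnd:4096'
echo "$sample" | grep -oE 'rb[0-9]+|rcv_wnd:[0-9]+'
# prints:
# rb4352
# rcv_wnd:4096
```

The high drop counters (d127, d1032, ...) in skmem also suggest the kernel is discarding packets that no longer fit the collapsed receive buffer.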
> nstat, tcpdump, ss output with socket buffer size, and try to trace calls to tcp_rcv_space_adjust() by running:
>
> perf record -a -e tcp:tcp_rcv_space_adjust
> perf script
>
> Please also try to get these traces once in a stuck state (always on the 6.17.11-3-test-pve kernel)

@Chris see dm
perf record -a -e tcp:tcp_rcv_space_adjust,tcp:tcp_rcvbuf_grow
perf script
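When capturing in the stuck state, it may help to time-box the recording so the trace stays small; a sketch of one way to do it (assuming perf from the matching linux-tools package is installed, and that the commands are run while a backup is stalled):

```shell
# Record ~30s of the two TCP tracepoints system-wide, then dump
# a readable trace for attaching to the thread.
perf record -a -e tcp:tcp_rcv_space_adjust,tcp:tcp_rcvbuf_grow -- sleep 30
perf script > trace.txt
```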
#!/bin/bash
# Re-run the PBS benchmark until one run stalls (takes 60s or more),
# leaving a packet capture of that stalled run in dump.pcap.
trap 'killall tcpdump' EXIT HUP INT TERM
export PBS_REPOSITORY='your-repo'
export PBS_PASSWORD='your-password'
iface=your-interface
elapsed=0
while [ "$elapsed" -lt 60 ]
do
    rm -f dump.pcap
    start=$(date +%s)
    tcpdump port 8007 -i "$iface" -w dump.pcap &
    proxmox-backup-client benchmark
    end=$(date +%s)
    elapsed=$((end - start))
    killall tcpdump
    echo "$elapsed"
done
INFO: Starting Backup of VM 265 (qemu)
INFO: Backup started at 2025-12-12 03:33:08
INFO: status = running
INFO: VM Name: anonymized
INFO: include disk 'scsi0' 'zfs:vm-265-disk-0' 161G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/265/2025-12-12T02:33:08Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '55ff0f38-3b86-4d42-a96d-c4abd0296dfa'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (9.3 GiB of 161.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 9.3 GiB dirty of 161.0 GiB total
INFO: 16% (1.6 GiB of 9.3 GiB) in 3s, read: 536.0 MiB/s, write: 536.0 MiB/s
INFO: 28% (2.6 GiB of 9.3 GiB) in 6s, read: 368.0 MiB/s, write: 368.0 MiB/s
INFO: 40% (3.8 GiB of 9.3 GiB) in 9s, read: 386.7 MiB/s, write: 386.7 MiB/s
INFO: 55% (5.1 GiB of 9.3 GiB) in 12s, read: 452.0 MiB/s, write: 450.7 MiB/s
INFO: 65% (6.1 GiB of 9.3 GiB) in 15s, read: 334.7 MiB/s, write: 334.7 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 265 failed - interrupted by signal
INFO: Failed at 2025-12-12 05:10:24
ERROR: Backup job failed - interrupted by signal
TASK ERROR: interrupted by signal
6.16/6.17? I don't know exactly), he found that many parameters and options had been modified, renamed, or moved, which made his old config unusable: there was no automatic "migration" of the old config to the new kernel parameters.