Super slow, timeouts, and VMs stuck while backing up after updating to PVE 9.1.1 and PBS 4.0.20

Hello,
I've been facing the same problem since I upgraded all my PBS instances to PBS 4. Running 5 PBS servers / 8 PVE clusters, we had freezes on different VMs on different PVE versions.
I've rolled back all my PBS hosts to kernel 6.14.11-4-pve. I'll test the latest "test-pve" kernel after the holidays, as @GE_Admin said :)
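In case it saves someone a search, this is roughly how I pin the older kernel (a minimal sketch, assuming the 6.14 kernel package is still installed):
Code:
# list installed kernels, then pin the known-good one and reboot
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.14.11-4-pve
reboot
# "proxmox-boot-tool kernel unpin" switches back to the default later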
 
I'm doing my testing in the off-hours anyway, 6.17.11-3-test-pve is up and running and we'll have a result in the morning.

For all others: if the stalled backup crashes your VMs, please look into enabling fleecing for the scheduled backups. Using fleecing on local storage changed the problem from "VMs freeze and lose data" down to "a few skipped backups" for our users.
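For reference, fleecing can be enabled per job in the GUI (Datacenter -> Backup -> edit job -> Advanced) or via vzdump; a minimal sketch, where the storage names are only examples:
Code:
# example only: back up VM 100 to a PBS storage, fleecing to a fast local storage
vzdump 100 --storage pbs-datastore --fleecing enabled=1,storage=local-lvm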
 
I'm doing my testing in the off-hours anyway, 6.17.11-3-test-pve is up and running and we'll have a result in the morning.

For all others: if the stalled backup crashes your VMs, please look into enabling fleecing for the scheduled backups. Using fleecing on local storage changed the problem from "VMs freeze and lose data" down to "a few skipped backups" for our users.
Thanks Lukas for this useful suggestion!
 
I have a problem with PBS 4.1 and PVE 9.1: while I was backing up a Windows Server 2016 VM, the VM got stuck/hung and I could not do anything until I rebooted it. That is a bad issue for a production VM. Has anyone else been facing this?
 
I have a problem with PBS 4.1 and PVE 9.1: while I was backing up a Windows Server 2016 VM, the VM got stuck/hung and I could not do anything until I rebooted it. That is a bad issue for a production VM. Has anyone else been facing this?
Everyone in this thread is facing this... Roll back to kernel 6.14 to temporarily fix the problem.
 
Everyone in this thread is facing this... Roll back to kernel 6.14 to temporarily fix the problem.
It's just weird that when I back up the VM that has SQL Server running on it, it's fine, but all the other VMs without SQL Server hang/get stuck and have to be restarted.
 
I tested the aforementioned kernel 6.17.11-3-test-pve, and the very first backed-up VM froze and hung after a minute.
My backups are performed on two Proxmox nodes, and while the backup on the first node froze immediately during the first VM and stopped, the second node successfully backed up several VMs before also slowing down or freezing completely.

I tried reverting to kernel 6.14.11-4-pve and running the backup, which now ran without any problems and exactly as expected.

When I saw that the VM backup on the first node had stalled, I ran tcpdump and captured just under 2GB of data - of course, thanks to the fact that the second node continued to back up. I can't say whether my dump will be of any use, but I'm happy to provide it.
Is there any way to share it other than publicly here in the thread?
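For anyone who wants to grab a smaller trace of a stalled backup themselves, something like this keeps the file size down (just a sketch; interface name, snap length and ring-buffer sizes are placeholders):
Code:
# ring buffer of 4 x 500 MB files, truncated packets, only PBS traffic on port 8007
tcpdump -i eno1 -s 128 -C 500 -W 4 -w stall.pcap port 8007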


Log for a frozen VM backup on node one:
Code:
INFO:  47% (15.3 GiB of 32.0 GiB) in 1m 34s, read: 333.3 MiB/s, write: 34.7 MiB/s
INFO:  48% (15.6 GiB of 32.0 GiB) in 1m 37s, read: 113.3 MiB/s, write: 2.7 MiB/s
INFO:  50% (16.0 GiB of 32.0 GiB) in 1m 40s, read: 134.7 MiB/s, write: 12.0 MiB/s
INFO:  51% (16.3 GiB of 32.0 GiB) in 3m 42s, read: 2.6 MiB/s, write: 537.2 KiB/s
INFO:  52% (16.7 GiB of 32.0 GiB) in 9m 47s, read: 1.2 MiB/s, write: 78.6 KiB/s
INFO:  53% (17.0 GiB of 32.0 GiB) in 10m 21s, read: 8.1 MiB/s, write: 361.4 KiB/s
INFO:  54% (17.3 GiB of 32.0 GiB) in 10m 24s, read: 102.7 MiB/s, write: 4.0 MiB/s
INFO:  55% (17.6 GiB of 32.0 GiB) in 11m 13s, read: 6.7 MiB/s, write: 167.2 KiB/s
 
When I saw that the VM backup on the first node had stalled, I ran tcpdump and captured just under 2GB of data - of course, thanks to the fact that the second node continued to back up. I can't say whether my dump will be of any use, but I'm happy to provide it.
Is there any way to share it other than publicly here in the thread?
Thanks for your efforts. Could you upload the dumps to some privately owned file sharing service at your disposal and send me a direct link to download the dumps via a PM?
 
I'm doing my testing in the off-hours anyway, 6.17.11-3-test-pve is up and running and we'll have a result in the morning.

For all others: if the stalled backup crashes your VMs, please look into enabling fleecing for the scheduled backups. Using fleecing on local storage changed the problem from "VMs freeze and lose data" down to "a few skipped backups" for our users.
No luck, all nodes now froze their backups:
Code:
Every 5.0s: ss -ti sport 8007                                                                                                                                                                                            prx-backup: Wed Dec 17 11:57:23 2025

State                          Recv-Q                           Send-Q                                                            Local Address:Port                                                             Peer Address:Port

ESTAB                          0                                0                                                          [::ffff:10.31.14.32]:8007                                                     [::ffff:10.31.14.24]:44702
         cubic wscale:7,10 rto:201 rtt:0.082/0.008 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:19 bytes_sent:2562889 bytes_retrans:123 bytes_acked:2562766 bytes_received:1180908594 segs_out:178584 segs_in:182290 data_segs_out:993
data_segs_in:181688 send 8.73Gbps lastsnd:97452 lastrcv:110 lastack:110 pacing_rate 10.5Gbps delivery_rate 2.57Gbps delivered:994 app_limited busy:486ms retrans:0/1 dsack_dups:1 rcv_rtt:207.234 rcv_space:111902 rcv_ssthresh:1160894 minrtt:0.042 rcv_ooop
ack:43 snd_wnd:3487488 rcv_wnd:4096
ESTAB                          0                                0                                                          [::ffff:10.31.14.32]:8007                                                     [::ffff:10.31.14.27]:57566
         cubic wscale:10,10 rto:201 rtt:0.08/0.006 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:16 bytes_sent:2632483 bytes_acked:2632483 bytes_received:3201652887 segs_out:193920 segs_in:255877 data_segs_out:1327 data_segs_in:2545
55 send 8.95Gbps lastsnd:304897 lastrcv:179 lastack:179 pacing_rate 10.6Gbps delivery_rate 4.18Gbps delivered:1328 app_limited busy:2420ms rcv_rtt:206.776 rcv_space:109371 rcv_ssthresh:692321 minrtt:0.043 rcv_ooopack:8 snd_wnd:1446912 rcv_wnd:4096
ESTAB                          0                                0                                                          [::ffff:10.31.14.32]:8007                                                     [::ffff:10.31.14.25]:60382
         cubic wscale:7,10 rto:201 rtt:0.083/0.008 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:20 bytes_sent:1785006 bytes_acked:1785006 bytes_received:1278331644 segs_out:174551 segs_in:183555 data_segs_out:1429 data_segs_in:1827
58 send 8.62Gbps lastsnd:39679 lastrcv:159 lastack:159 pacing_rate 10.2Gbps delivery_rate 2.76Gbps delivered:1430 app_limited busy:1263ms rcv_rtt:207.523 rcv_space:95072 rcv_ssthresh:463231 minrtt:0.053 rcv_ooopack:11 snd_wnd:516480 rcv_wnd:4096
ESTAB                          0                                0                                                          [::ffff:10.31.14.32]:8007                                                     [::ffff:10.31.14.31]:40856
         cubic wscale:10,10 rto:201 rtt:0.057/0.01 ato:40 mss:8948 pmtu:9000 rcvmss:8192 advmss:8948 cwnd:10 bytes_sent:1832498 bytes_acked:1832498 bytes_received:1228981092 segs_out:150983 segs_in:149414 data_segs_out:1664 data_segs_in:147713 send 12.6
Gbps lastsnd:145958 lastrcv:153 lastack:153 pacing_rate 25.1Gbps delivery_rate 4.77Gbps delivered:1665 app_limited busy:413ms rcv_rtt:207.307 rcv_space:49465 rcv_ssthresh:248909 minrtt:0.015 rcv_ooopack:14 snd_wnd:280576 rcv_wnd:8192
ESTAB                          0                                0                                                          [::ffff:10.31.14.32]:8007                                                     [::ffff:10.31.14.23]:55554
         cubic wscale:10,10 rto:201 rtt:0.085/0.006 ato:40 mss:8948 pmtu:9000 rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:933314 bytes_acked:933314 bytes_received:1224661971 segs_out:281514 segs_in:152444 data_segs_out:1001 data_segs_in:151656 send 8.42G
bps lastsnd:26933 lastrcv:106 lastack:106 pacing_rate 16.8Gbps delivery_rate 2.74Gbps delivered:1002 app_limited busy:435ms rcv_rtt:206.999 rcv_space:96242 rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168

root@prx-backup:~# uname -a
Linux prx-backup 6.17.11-3-test-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.11-3 (2025-12-09T09:02Z) x86_64 GNU/Linux

@Chris dm with a short tcpdump incoming

edit: what's interesting is that most backups that are stuck are either DB2 installations or Windows clients (both have a high rate of change). Also 6.18 does seem to make a stall a lot less likely, as with 6.18 I mostly get one or two nodes stalled, not all five of them, and some even do recover after two hours of stalling.
 
@LKo could you try to trigger the issue with "proxmox-backup-client benchmark --repository ..."? That will (after a little exercise for your CPU) upload a stream of fake chunks for 5s and wait until they are all processed. If you can reproduce the issue using that (e.g., in a loop until it hangs), we could build you a custom proxmox-backup-client binary that skips the CPU part and uploads chunks for a longer period of time, so you'd no longer need real backups to test kernel versions.
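Something along these lines should do for the loop (a sketch; the repository string and password are placeholders, and the 120s timeout is just an arbitrary "probably stuck" threshold):
Code:
export PBS_REPOSITORY='backup@pbs@10.0.0.1:datastore1'   # placeholder repository
export PBS_PASSWORD='secret'                             # placeholder password
# rerun the benchmark until one run does not finish within 120 seconds
while timeout 120 proxmox-backup-client benchmark --repository "$PBS_REPOSITORY"; do
    echo "benchmark finished, starting next run"
done
echo "benchmark did not finish within 120s - probably stalled"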
 
@fabian the benchmark is running right now in an endless loop, but I fear that 5 seconds of burst data followed by a new TCP connection each run likely won't trigger the stall.

Code:
Compression speed: 468.21 MB/s   
Decompress speed: 596.21 MB/s   
AES256/GCM speed: 3711.01 MB/s   
Verify speed: 440.28 MB/s   
┌───────────────────────────────────┬─────────────────────┐
│ Name                              │ Value               │
╞═══════════════════════════════════╪═════════════════════╡
│ TLS (maximal backup upload speed) │ 395.65 MB/s (32%)   │
├───────────────────────────────────┼─────────────────────┤
│ SHA256 checksum computation speed │ 1735.38 MB/s (86%)  │
├───────────────────────────────────┼─────────────────────┤
│ ZStd level 1 compression speed    │ 468.21 MB/s (62%)   │
├───────────────────────────────────┼─────────────────────┤
│ ZStd level 1 decompression speed  │ 596.21 MB/s (50%)   │
├───────────────────────────────────┼─────────────────────┤
│ Chunk verification speed          │ 440.28 MB/s (58%)   │
├───────────────────────────────────┼─────────────────────┤
│ AES256 GCM encryption speed       │ 3711.01 MB/s (102%) │
└───────────────────────────────────┴─────────────────────┘

For example, on one node the stall happened after 20 successful backups totalling >100 GB; the next backup then stalled after transferring 1.3 GB:

Code:
INFO: scsi0: dirty-bitmap status: OK (17.7 GiB of 100.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 17.7 GiB dirty of 100.0 GiB total
INFO:   5% (912.0 MiB of 17.7 GiB) in 3s, read: 304.0 MiB/s, write: 262.7 MiB/s
INFO:   7% (1.3 GiB of 17.7 GiB) in 3m 58s, read: 1.8 MiB/s, write: 1.3 MiB/s
INFO:   9% (1.8 GiB of 17.7 GiB) in 20m 17s, read: 506.2 KiB/s, write: 418.4 KiB/s
INFO:  10% (1.8 GiB of 17.7 GiB) in 49m 43s, read: 4.6 KiB/s, write: 4.6 KiB/s
INFO:  11% (2.0 GiB of 17.7 GiB) in 1h 22m 3s, read: 118.2 KiB/s, write: 107.7 KiB/s
INFO:  12% (2.2 GiB of 17.7 GiB) in 1h 22m 6s, read: 74.7 MiB/s, write: 36.0 MiB/s
INFO:  13% (2.4 GiB of 17.7 GiB) in 2h 40m 47s, read: 32.1 KiB/s, write: 26.0 KiB/s
INFO:  15% (2.7 GiB of 17.7 GiB) in 3h 33m 15s, read: 109.3 KiB/s, write: 71.6 KiB/s
INFO:  16% (2.9 GiB of 17.7 GiB) in 4h 59m 47s, read: 41.8 KiB/s, write: 25.2 KiB/s
INFO:  18% (3.3 GiB of 17.7 GiB) in 5h 22m 10s, read: 286.7 KiB/s, write: 259.2 KiB/s
INFO:  19% (3.4 GiB of 17.7 GiB) in 7h 22m 47s, read: 19.2 KiB/s, write: 18.1 KiB/s
INFO:  20% (3.7 GiB of 17.7 GiB) in 8h 4m 54s, read: 123.2 KiB/s, write: 95.6 KiB/s

It's still running, albeit at 1999 DSL speeds :D
 
No luck, all nodes now froze their backups:
Can you also include the socket memory stats for ss in this state (ss -tim sport 8007)?
 
Sure:
Code:
root@prx-backup:~# uname -a                                                                                                                                                                           
Linux prx-backup 6.17.11-3-test-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.11-3 (2025-12-09T09:02Z) x86_64 GNU/Linux                                                                                         
root@prx-backup:~# ss -tim sport 8007                                                                                                                                                                 
State                          Recv-Q                          Send-Q                                                            Local Address:Port                                                              Peer Address:Port                                                                                                                                                                                                                                                                                       
ESTAB                          0                               0                                                          [::ffff:10.x.y.a]:8007                                                      [::ffff:10.x.y.c]:44702                         
         skmem:(r0,rb4352,t0,tb332800,f0,w0,o0,bl0,d127) cubic wscale:7,10 rto:201 rtt:0.084/0.011 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:19 bytes_sent:2564308 bytes_retrans:123 bytes_acked:2564185 bytes_received:1221479474 segs_out:188501 segs_in:192207 data_segs_out:1005 data_segs_in:191593 send 8.52Gbps lastsnd:62217 lastrcv:26 lastack:26 pacing_rate 10.2Gbps delivery_rate 2.57Gbps delivered:1006 app_limited busy:486ms retrans:0/1 dsack_dups:1 rcv_rtt:207.232 rcv_space:111902 rcv_ssthresh:1160894 minrtt:0.042 rcv_ooopack:43 snd_wnd:3487488 rcv_wnd:4096
ESTAB                          0                               0                                                          [::ffff:10.x.y.a]:8007                                                      [::ffff:10.x.y.b]:57566                         
         skmem:(r0,rb4352,t0,tb332800,f0,w0,o0,bl0,d1032) cubic wscale:10,10 rto:201 rtt:0.085/0.014 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:16 bytes_sent:2634947 bytes_acked:2634947 bytes_received:3242215575 segs_out:203842 segs_in:265799 data_segs_out:1346 data_segs_in:264458 send 8.42Gbps lastsnd:15706 lastrcv:111 lastack:111 pacing_rate 10.1Gbps delivery_rate 4.18Gbps delivered:1347 app_limited busy:2420ms rcv_rtt:206.987 rcv_space:109371 rcv_ssthresh:692321 minrtt:0.043 rcv_ooopack:8 snd_wnd:1446912 rcv_wnd:4096
ESTAB                          0                               0                                                          [::ffff:10.x.y.a]:8007                                                      [::ffff:10.x.y.z]:60382                         
         skmem:(r0,rb4352,t0,tb332800,f0,w0,o0,bl0,d19) cubic wscale:7,10 rto:201 rtt:0.074/0.005 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 ssthresh:20 bytes_sent:1788323 bytes_acked:1788323 bytes_received:1318890236 segs_out:184478 segs_in:193482 data_segs_out:1454 data_segs_in:192660 send 9.67Gbps lastsnd:92287 lastrcv:141 lastack:141 pacing_rate 11.5Gbps delivery_rate 2.76Gbps delivered:1455 app_limited busy:1265ms rcv_rtt:207 rcv_space:95072 rcv_ssthresh:463231 minrtt:0.053 rcv_ooopack:11 snd_wnd:516480 rcv_wnd:4096
ESTAB                          0                               0                                                          [::ffff:10.x.y.a]:8007                                                      [::ffff:10.x.z.d]:40856                         
         skmem:(r0,rb8432,t0,tb332800,f0,w0,o0,bl0,d306) cubic wscale:10,10 rto:201 rtt:0.053/0.016 ato:40 mss:8948 pmtu:9000 rcvmss:8192 advmss:8948 cwnd:10 bytes_sent:1857063 bytes_acked:1857063 bytes_received:1310073700 segs_out:161056 segs_in:159487 data_segs_out:1838 data_segs_in:157612 send 13.5Gbps lastsnd:2521 lastrcv:46 lastack:46 pacing_rate 26.6Gbps delivery_rate 4.77Gbps delivered:1839 app_limited busy:425ms rcv_rtt:207 rcv_space:49465 rcv_ssthresh:248909 minrtt:0.015 rcv_ooopack:14 snd_wnd:280576 rcv_wnd:8192
ESTAB                          0                               0                                                          [::ffff:10.x.y.a]:8007                                                      [::ffff:10.x.z.e]:55554                         
         skmem:(r0,rb7488,t0,tb332800,f0,w0,o0,bl0,d20) cubic wscale:10,10 rto:201 rtt:0.085/0.015 ato:40 mss:8948 pmtu:9000 rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:937478 bytes_acked:937478 bytes_received:1295747055 segs_out:301010 segs_in:162410 data_segs_out:1035 data_segs_in:161588 send 8.42Gbps lastsnd:3308 lastrcv:191 lastack:191 pacing_rate 16.7Gbps delivery_rate 2.74Gbps delivered:1036 app_limited busy:437ms rcv_rtt:207.551 rcv_space:96242 rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168
 
Great, so this can hopefully be used for faster reproducers.

Gather the same info as above (nstat, tcpdump, ss output with socket buffer sizes) and try to trace calls to tcp_rcv_space_adjust() by running:
Code:
perf record -a -e tcp:tcp_rcv_space_adjust
perf script
Recording for 1-2 seconds should already be enough to see if this is ever called.
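If it helps, collecting everything in one go while a transfer is stalled could look roughly like this (a sketch; file names and the 2-second recording window are arbitrary):
Code:
# collect the requested data into files while the backup is stalled
nstat -az > nstat.txt                                   # full netstat/SNMP counters
ss -tim sport 8007 > ss.txt                             # per-socket TCP info incl. skmem
perf record -a -e tcp:tcp_rcv_space_adjust -- sleep 2   # trace for ~2 seconds
perf script > perf.txt
# plus a tcpdump of the stalled connection, as discussed above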
 
Please also try to get these traces once in a stuck state (always on the 6.17.11-3-test-pve kernel):
Code:
perf record -a -e tcp:tcp_rcv_space_adjust,tcp:tcp_rcvbuf_grow
perf script

Further, it would be interesting to get a tcpdump of the full TCP session; this should be doable with the proxmox-backup-client benchmark using the following script:
Code:
#!/bin/bash

# make sure no tcpdump is left running when the script exits or is interrupted
trap 'killall tcpdump; exit 1' EXIT HUP INT TERM

# export so proxmox-backup-client picks these up from the environment
export PBS_REPOSITORY='your-repo'
export PBS_PASSWORD='your-password'
iface=your-interface
elapsed=0

# repeat the benchmark until one run takes 60s or longer (i.e. got stuck);
# dump.pcap then contains the capture of that last, stuck run
while [ $elapsed -lt 60 ]
do
    rm -f dump.pcap
    start=$(date +%s)
    tcpdump -i "$iface" -w dump.pcap port 8007 &
    proxmox-backup-client benchmark
    end=$(date +%s)
    elapsed=$((end - start))
    killall tcpdump
    echo $elapsed
done

Edit: Please adapt the 60-second elapsed threshold if needed, but 1 minute should be plenty for the benchmark if it is not stuck.
 
Hello,


we're currently on PBS 4.1.0 and kernel 6.17.2-2-pve. A few days ago a VM backup got stuck for hours; the backup job had to be interrupted and the VM required a hard reset to come back online. We are using LACP bonding but the default MTU.
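For reference, this is roughly how the bond mode and MTU can be checked (bond0 is a placeholder for the actual bond interface):
Code:
ip -d link show bond0          # shows mtu and bonding mode (802.3ad for LACP)
cat /proc/net/bonding/bond0    # detailed bond/LACP status per member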

Logs from backup:
Code:
INFO: Starting Backup of VM 265 (qemu)
INFO: Backup started at 2025-12-12 03:33:08
INFO: status = running
INFO: VM Name: anonymized
INFO: include disk 'scsi0' 'zfs:vm-265-disk-0' 161G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/265/2025-12-12T02:33:08Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '55ff0f38-3b86-4d42-a96d-c4abd0296dfa'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (9.3 GiB of 161.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 9.3 GiB dirty of 161.0 GiB total
INFO:  16% (1.6 GiB of 9.3 GiB) in 3s, read: 536.0 MiB/s, write: 536.0 MiB/s
INFO:  28% (2.6 GiB of 9.3 GiB) in 6s, read: 368.0 MiB/s, write: 368.0 MiB/s
INFO:  40% (3.8 GiB of 9.3 GiB) in 9s, read: 386.7 MiB/s, write: 386.7 MiB/s
INFO:  55% (5.1 GiB of 9.3 GiB) in 12s, read: 452.0 MiB/s, write: 450.7 MiB/s
INFO:  65% (6.1 GiB of 9.3 GiB) in 15s, read: 334.7 MiB/s, write: 334.7 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 265 failed - interrupted by signal
INFO: Failed at 2025-12-12 05:10:24
ERROR: Backup job failed - interrupted by signal

TASK ERROR: interrupted by signal


This was the only failed VM backup out of about 300 or more VMs. The speed seems quite OK, but the VM got stuck with high CPU load and was not responsive anymore. Searching the logs inside the VM showed a lot of disk I/O errors, but at least the log could still be written. Since many people here report lots of failed backups, do you think this is related?

Not sure if I should revert to the old kernel before Christmas and New Year...
 
@Chris I've been running that script since mid-day and couldn't replicate a single stall (which it should have detected through the elapsed-time measurement). I'll revert to 6.14 for the nightly backups and restart the tests tomorrow.

@And1986 I'd recommend you revert to the last 6.14 kernel for the holidays, for your sanity's sake. With any luck we'll have a fix afterwards :)
 
@Chris, I sent you the link via DMs.

Also, I discussed this problem with a colleague, and he mentioned that when he recently built a kernel (version 6.16/6.17? I don't know exactly), he found that many parameters and options had been modified, renamed, or moved, which made his old config unusable - there was no automatic "migration" of the old kernel config to the new parameters.
He claimed that this resulted in a lot of problems, for example with containers, etc.

I don't know if this is related in any way, but maybe this story will be useful.