Super slow, timeout, and VM stuck while backing up after updating to PVE 9.1.1 and PBS 4.0.20

Could it be that the problem is not solely due to the kernel? A few days after I had downgraded to version 6.14, the backup stopped working from one of the two cluster nodes. In my case, the network configuration is an ALB bond with an MTU of 1500. I can ping the PBS from one node, but not from the other; after restarting the networking service, it started working again, the same thing that happened with version 6.17.2.
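
For context, a balance-alb bond in /etc/network/interfaces looks roughly like the sketch below; the interface names and address are placeholders, not the actual config:

Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-mode balance-alb
        bond-miimon 100
        mtu 1500

auto vmbr0
iface vmbr0 inet static
        address 10.x.y.c/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        mtu 1500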
 
In our case, we also use a network ALB bond with an MTU of 1500. If you kill the backup/VM task on the PBS side, it just continues on normally with the rest of the job on the PVE side.


no need to restart the network service
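
Killing it from the PBS CLI should also work; a generic sketch (the UPID is whatever the task list shows for the hung backup):

Code:
# on the PBS host: find the hung backup task and stop it
proxmox-backup-manager task list
proxmox-backup-manager task stop 'UPID:...'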
 
Could it be that the problem is not solely due to the kernel? A few days after I had downgraded to version 6.14, the backup stopped working from one of the two cluster nodes. In my case, the network configuration is an ALB bond with an MTU of 1500. I can ping the PBS from one node, but not from the other; after restarting the networking service, it started working again, the same thing that happened with version 6.17.2.
those very much sound like two different issues. this thread here is about single connections stalling, not the whole network no longer working.
 
testing whether the https://kernel.ubuntu.com/mainline/v6.16/ and https://kernel.ubuntu.com/mainline/v6.16.12/ kernels are showing the issue on your system(s) would be highly appreciated to narrow down the cause. we still cannot reproduce any issues on kernels after 6.17.4 in our test lab.
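
installing a mainline build for a test boot goes roughly like this (the exact .deb filenames differ per release, take them from the directory listing; note that mainline kernels do not ship out-of-tree modules such as ZFS):

Code:
# download the image and modules packages for amd64 (filenames are examples)
wget https://kernel.ubuntu.com/mainline/v6.16/amd64/linux-image-unsigned-6.16.0-..._amd64.deb
wget https://kernel.ubuntu.com/mainline/v6.16/amd64/linux-modules-6.16.0-..._amd64.deb
apt install ./linux-image-unsigned-*.deb ./linux-modules-*.deb
reboot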
Ok, so I'll test 6.16, 6.16.12, and maybe 6.15 later today. It usually takes around 1.5 hours until the stall occurs.

Also, I will try to kill just the stalled backup and see whether it recovers with the next one, as well as run iperf while the backup is halted, just to get a grip on what actually happens.
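
For the iperf test I plan on something along these lines (stream count and duration are just my choice):

Code:
# on the PBS side: iperf3 -s
# on the affected PVE node, 8 parallel streams for 60 seconds:
iperf3 -c 10.x.y.a -P 8 -t 60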

I had a look at the switch stats as well, and it doesn't seem to log any problems with either rx or tx on the corresponding ports.

Update 1: 6.16.0 went through with the backups, testing 6.16.12 now
 
Also, the output of nstat after the backup hangs might be of interest, especially since your ss outputs show out-of-order packets.
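
e.g. something like this; the counter patterns given are just the ones most likely to matter here:

Code:
# dump all TCP counters, including zero values
nstat -az
# or only the out-of-order / window-pruning related ones
nstat -az TcpExtTCPOFOQueue TcpExtTCPRcvCollapsed TcpExtPruneCalled TcpRetransSegs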
 
Ok, 6.16.0 and 6.16.12 were both fine, but I tested each only once, so it might not have triggered. Back on 6.17.11-2-test-pve, we get the stall once again. What's interesting this time is that prx-hanspeter got stuck and then recovered after 17 minutes; it finished the rest of the backups successfully afterwards:

Code:
241: 2025-12-12 16:05:38 INFO: Starting Backup of VM 241 (qemu)
241: 2025-12-12 16:05:38 INFO: status = running
241: 2025-12-12 16:05:38 INFO: VM Name: lab-ldap
241: 2025-12-12 16:05:38 INFO: include disk 'ide0' 'local-efidisks:241/vm-241-disk-0.qcow2' 32G
241: 2025-12-12 16:05:38 INFO: include disk 'efidisk0' 'san-prx:241/vm-241-disk-0.qcow2' 528K
241: 2025-12-12 16:05:38 INFO: include disk 'tpmstate0' 'san-prx:241/vm-241-disk-0.raw' 4M
241: 2025-12-12 16:05:38 INFO: backup mode: snapshot
241: 2025-12-12 16:05:38 INFO: ionice priority: 7
241: 2025-12-12 16:05:38 INFO: creating Proxmox Backup Server archive 'vm/241/2025-12-12T15:05:38Z'
241: 2025-12-12 16:05:38 INFO: attaching TPM drive to QEMU for backup
241: 2025-12-12 16:05:39 INFO: could not determine block node size of drive 'tpmstate0-backup' - using fallback
241: 2025-12-12 16:05:40 INFO: drive-ide0: attaching fleecing image local-zfs:vm-241-fleece-0 to QEMU
241: 2025-12-12 16:05:40 INFO: started backup task 'dffb0bbc-0b38-4828-a139-bd33e5bf2392'
241: 2025-12-12 16:05:40 INFO: resuming VM again
241: 2025-12-12 16:05:40 INFO: efidisk0: dirty-bitmap status: OK (drive clean)
241: 2025-12-12 16:05:40 INFO: ide0: dirty-bitmap status: OK (664.0 MiB of 32.0 GiB dirty)
241: 2025-12-12 16:05:40 INFO: tpmstate0-backup: dirty-bitmap status: created new
241: 2025-12-12 16:05:40 INFO: using fast incremental mode (dirty-bitmap), 664.0 MiB dirty of 32.0 GiB total
241: 2025-12-12 16:05:43 INFO: 100% (664.0 MiB of 664.0 MiB) in 3s, read: 221.3 MiB/s, write: 216.0 MiB/s
241: 2025-12-12 16:23:08 INFO: backup was done incrementally, reused 31.37 GiB (98%)
241: 2025-12-12 16:23:08 INFO: transferred 664.02 MiB in 1048 seconds (648.8 KiB/s)
241: 2025-12-12 16:23:08 INFO: adding notes to backup
241: 2025-12-12 16:23:08 INFO: removing (old) fleecing image 'local-zfs:vm-241-fleece-0'
241: 2025-12-12 16:23:08 INFO: Finished Backup of VM 241 (00:17:30)

Code:
root@prx-backup:~# uname -a
Linux prx-backup 6.17.11-2-test-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.11-2 (2025-12-09T09:02Z) x86_64 GNU/Linux
root@prx-backup:~# ss -ti sport 8007
State                Recv-Q                Send-Q                 Local Address:Port                  Peer Address:Port
ESTAB                0                     0                  [::ffff:10.x.y.a]:8007              [::ffff:10.x.y.c]:48288
         cubic wscale:7,10 rto:207 rtt:6.582/11.374 ato:40 mss:8948 pmtu:9000 rcvmss:3072 advmss:8948 cwnd:10 ssthresh:16 bytes_sent:1084107 bytes_retrans:123 bytes_acked:1083984 bytes_received:3703857790 segs_out:317478 segs_in:315112 data_segs_out:2343 data_segs_in:314619 send 109Mbps lastsnd:423225 lastrcv:76 lastack:76 pacing_rate 131Mbps delivery_rate 3.33Gbps delivered:2344 app_limited busy:3592ms retrans:0/1 dsack_dups:1 rcv_rtt:207.33 rcv_space:146392 rcv_ssthresh:592739 minrtt:0.044 rcv_ooopack:890 snd_wnd:1065728 rcv_wnd:3072
ESTAB                0                     0                  [::ffff:10.x.y.a]:8007              [::ffff:10.x.y.b]:46712
         cubic wscale:10,10 rto:201 rtt:0.333/0.496 ato:40 mss:8948 pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 bytes_sent:861063 bytes_acked:861063 bytes_received:181206715 segs_out:17834 segs_in:17552 data_segs_out:382 data_segs_in:17280 send 2.15Gbps lastsnd:53439 lastrcv:191 lastack:191 pacing_rate 4.29Gbps delivery_rate 2.95Gbps delivered:383 app_limited busy:405ms rcv_rtt:207.33 rcv_space:95745 rcv_ssthresh:246825 minrtt:0.04 rcv_ooopack:75 snd_wnd:193536 rcv_wnd:4096


root@prx-gisela:~# uname -a
Linux prx-gisela 6.17.2-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.2-2 (2025-11-26T12:33Z) x86_64 GNU/Linux
root@prx-gisela:~# ss -ti dport 8007
State                Recv-Q                Send-Q                                Local Address:Port                                 Peer Address:Port                
ESTAB                0                     1115932                                 10.x.y.b:46712                                 10.x.y.a:8007                
         cubic wscale:10,10 rto:201 rtt:0.21/0.033 ato:91 mss:8948 pmtu:9000 rcvmss:7199 advmss:8948 cwnd:2 ssthresh:2 bytes_sent:187424293 bytes_retrans:827242 bytes_acked:186597052 bytes_received:862120 segs_out:27880 segs_in:19159 data_segs_out:27600 data_segs_in:390 send 682Mbps lastrcv:26207 pacing_rate 815Mbps delivery_rate 387Mbps delivered:27494 busy:2153952ms rwnd_limited:2153620ms(100.0%) retrans:0/109 rcv_rtt:0.594 rcv_space:97899 rcv_ssthresh:193315 notsent:1115932 minrtt:0.048 snd_wnd:4096 rcv_wnd:193536 rehash:3

root@prx-takefive:~# uname -a
Linux prx-takefive 6.14.11-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.11-2 (2025-09-12T09:46Z) x86_64 GNU/Linux
root@prx-takefive:~# ss -ti dport 8007 
State                Recv-Q                Send-Q                                Local Address:Port                                 Peer Address:Port                
ESTAB                0                     3299824                                 10.x.y.c:48288                                 10.x.y.a:8007                
         cubic wscale:10,7 rto:201 rtt:0.147/0.009 ato:40 mss:8948 pmtu:9000 rcvmss:7199 advmss:8948 cwnd:2 ssthresh:2 bytes_sent:3719684000 bytes_retrans:11586850 bytes_acked:3708097151 bytes_received:1084019 segs_out:468201 segs_in:320239 data_segs_out:467707 data_segs_in:2344 send 974Mbps lastsnd:123 lastrcv:141561 lastack:123 pacing_rate 1.16Gbps delivery_rate 501Mbps delivered:466241 busy:1725945ms rwnd_limited:1718617ms(99.6%) retrans:0/1488 dsack_dups:4 rcv_rtt:0.702 rcv_space:107185 rcv_ssthresh:1065675 notsent:3299824 minrtt:0.044 snd_wnd:3072 rcv_wnd:1065728 rehash:10
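
Worth noting: both stalled senders show rwnd_limited at ~100% and a snd_wnd of only a few KiB, so the PBS side apparently stops opening its receive window. To track how that evolves during a stall, a loop like this should do (interval and log path are arbitrary):

Code:
# sample connection state and TCP counters every 10 seconds
while true; do
    date
    ss -ti dport 8007
    nstat -az TcpExtTCPOFOQueue TcpExtTCPRcvCollapsed
    sleep 10
done | tee -a /root/stall-trace.log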

The nstat output from prx-backup is attached. I also did a one-minute iperf run from both affected nodes while the backup was still stalled, and both went reasonably fast (not exactly line speed, but close enough). I didn't expect otherwise, tbh, since those are entirely new TCP connections:

Code:
# takefive
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.47  sec  13.3 GBytes  1.89 Gbits/sec  966            sender
[  5]   0.00-60.47  sec  0.00 Bytes  0.00 bits/sec                  receiver
[  7]   0.00-60.47  sec  12.7 GBytes  1.81 Gbits/sec  1184            sender
[  7]   0.00-60.47  sec  0.00 Bytes  0.00 bits/sec                  receiver
[  9]   0.00-60.47  sec  12.9 GBytes  1.83 Gbits/sec  827            sender
[  9]   0.00-60.47  sec  0.00 Bytes  0.00 bits/sec                  receiver
[ 11]   0.00-60.47  sec  13.6 GBytes  1.93 Gbits/sec  859            sender
[ 11]   0.00-60.47  sec  0.00 Bytes  0.00 bits/sec                  receiver
[ 13]   0.00-60.47  sec  11.8 GBytes  1.68 Gbits/sec  884            sender
[ 13]   0.00-60.47  sec  0.00 Bytes  0.00 bits/sec                  receiver
[ 15]   0.00-60.47  sec  14.6 GBytes  2.07 Gbits/sec  932            sender
[ 15]   0.00-60.47  sec  0.00 Bytes  0.00 bits/sec                  receiver
[ 17]   0.00-60.47  sec  13.2 GBytes  1.87 Gbits/sec  862            sender
[ 17]   0.00-60.47  sec  0.00 Bytes  0.00 bits/sec                  receiver
[ 19]   0.00-60.47  sec  13.5 GBytes  1.92 Gbits/sec  794            sender
[ 19]   0.00-60.47  sec  0.00 Bytes  0.00 bits/sec                  receiver
[SUM]   0.00-60.47  sec   106 GBytes  15.0 Gbits/sec  7308             sender
[SUM]   0.00-60.47  sec  0.00 Bytes  0.00 bits/sec                  receiver
Code:
# gisela
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.55  sec  14.9 GBytes  2.11 Gbits/sec  960            sender
[  5]   0.00-60.55  sec  0.00 Bytes  0.00 bits/sec                  receiver
[  7]   0.00-60.55  sec  13.7 GBytes  1.95 Gbits/sec  1160            sender
[  7]   0.00-60.55  sec  0.00 Bytes  0.00 bits/sec                  receiver
[  9]   0.00-60.55  sec  15.4 GBytes  2.19 Gbits/sec  1049            sender
[  9]   0.00-60.55  sec  0.00 Bytes  0.00 bits/sec                  receiver
[ 11]   0.00-60.55  sec  14.0 GBytes  1.99 Gbits/sec  1091            sender
[ 11]   0.00-60.55  sec  0.00 Bytes  0.00 bits/sec                  receiver
[ 13]   0.00-60.55  sec  16.6 GBytes  2.36 Gbits/sec  1008            sender
[ 13]   0.00-60.55  sec  0.00 Bytes  0.00 bits/sec                  receiver
[ 15]   0.00-60.55  sec  13.8 GBytes  1.96 Gbits/sec  856            sender
[ 15]   0.00-60.55  sec  0.00 Bytes  0.00 bits/sec                  receiver
[ 17]   0.00-60.55  sec  14.2 GBytes  2.01 Gbits/sec  1121            sender
[ 17]   0.00-60.55  sec  0.00 Bytes  0.00 bits/sec                  receiver
[ 19]   0.00-60.55  sec  14.5 GBytes  2.06 Gbits/sec  835            sender
[ 19]   0.00-60.55  sec  0.00 Bytes  0.00 bits/sec                  receiver
[SUM]   0.00-60.55  sec   117 GBytes  16.6 Gbits/sec  8080             sender
[SUM]   0.00-60.55  sec  0.00 Bytes  0.00 bits/sec                  receiver

I can let the stall sit there until tonight (21:30) if you need me to run anything else.
 

Attachments
