Super slow, timeouts, and VMs stuck while backing up after updating to PVE 9.1.1 and PBS 4.0.20

Please check if the Kernel 6.17.4-1 in the pbs-test repo solves the issue for you if possible. Thanks!
Yes, I just installed it. I'll let you know if there are any problems. However, the issue was intermittent, so I think it will take a few days to verify. For instance, I haven't had any problems with 6.17.2 over the last two days either.
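
(After the reboot, a quick sanity check that the new kernel is actually the one running:)
Code:
# the running kernel should now report the version from the test repo
uname -r    # -> 6.17.4-1-pve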
 
Make sure to also set MTU 9000 again for testing if you changed this back to default.
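
A quick way to check and temporarily set it (just a sketch; the interface name ens19 is taken from a config posted later in this thread, adjust it to your setup):
Code:
# show the current MTU of the storage interface
ip -br link show ens19
# set MTU 9000 for a quick test only; make it permanent in /etc/network/interfaces
ip link set ens19 mtu 9000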
 
Please check if the Kernel 6.17.4-1 in the pbs-test repo solves the issue for you if possible. Thanks!

I will try it tomorrow by adding the test repo to my PBS.

Could you confirm that 6.14 will still be available via proxmox-boot-tool kernel pin 6.14.11-4-pve --next-boot for a downgrade in case of issues with the new kernel as well?

Thank you!
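
For anyone following along, this is roughly what I plan to do (a sketch only, assuming the classic one-line apt format, the Trixie-based PBS 4 repositories and the pbstest component name; the kernel meta-package name matches the package list posted below):
Code:
# add the pbs-test repository
echo "deb http://download.proxmox.com/debian/pbs trixie pbstest" > /etc/apt/sources.list.d/pbs-test.list
apt update
# install the 6.17 kernel meta-package and reboot into it
apt install proxmox-kernel-6.17
reboot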
 
Could you confirm that 6.14 will still be available via proxmox-boot-tool kernel pin 6.14.11-4-pve --next-boot for a downgrade in case of issues with the new kernel as well?
Well, it seems the old kernels are not removed.
Here is my /boot dir, and as you can see, all the kernels are still available:

Code:
root@pbs:~# ls -l /boot/
total 523016
-rw-r--r-- 1 root root   296189 Oct 10 03:04 config-6.14.11-4-pve
-rw-r--r-- 1 root root   296152 Jul 22 05:04 config-6.14.8-2-pve
-rw-r--r-- 1 root root   302240 Oct 21 06:55 config-6.17.2-1-pve
-rw-r--r-- 1 root root   302196 Nov 26 06:33 config-6.17.2-2-pve
-rw-r--r-- 1 root root   302297 Dec  3 09:42 config-6.17.4-1-pve
drwxr-xr-x 2 root root     4096 Sep  8 08:34 efi
drwxr-xr-x 6 root root     4096 Dec  4 02:27 grub
-rw-r--r-- 1 root root 78716133 Nov 17 03:59 initrd.img-6.14.11-4-pve
-rw-r--r-- 1 root root 78374240 Sep  8 08:37 initrd.img-6.14.8-2-pve
-rw-r--r-- 1 root root 84867602 Nov 26 04:14 initrd.img-6.17.2-1-pve
-rw-r--r-- 1 root root 84869803 Dec  1 01:17 initrd.img-6.17.2-2-pve
-rw-r--r-- 1 root root 84938101 Dec  4 02:27 initrd.img-6.17.4-1-pve
-rw-r--r-- 1 root root   151020 Nov 17  2024 memtest86+ia32.bin
-rw-r--r-- 1 root root   152064 Nov 17  2024 memtest86+ia32.efi
-rw-r--r-- 1 root root   155992 Nov 17  2024 memtest86+x64.bin
-rw-r--r-- 1 root root   157184 Nov 17  2024 memtest86+x64.efi
drwxr-xr-x 2 root root     4096 Dec  4 02:27 pve
-rw-r--r-- 1 root root  8941891 Oct 10 03:04 System.map-6.14.11-4-pve
-rw-r--r-- 1 root root  8938356 Jul 22 05:04 System.map-6.14.8-2-pve
-rw-r--r-- 1 root root  9125340 Oct 21 06:55 System.map-6.17.2-1-pve
-rw-r--r-- 1 root root  9125340 Nov 26 06:33 System.map-6.17.2-2-pve
-rw-r--r-- 1 root root  9129241 Dec  3 09:42 System.map-6.17.4-1-pve
-rw-r--r-- 1 root root 14912616 Oct 10 03:04 vmlinuz-6.14.11-4-pve
-rw-r--r-- 1 root root 14908520 Jul 22 05:04 vmlinuz-6.14.8-2-pve
-rw-r--r-- 1 root root 15367272 Oct 21 06:55 vmlinuz-6.17.2-1-pve
-rw-r--r-- 1 root root 15367272 Nov 26 06:33 vmlinuz-6.17.2-2-pve
-rw-r--r-- 1 root root 15817512 Dec  3 09:42 vmlinuz-6.17.4-1-pve
root@pbs:~#

And also from the package version popup:
Code:
proxmox-backup: 4.0.0 (running kernel: 6.17.4-1-pve)
proxmox-backup-server: 4.1.0-1 (running version: 4.1.0)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17.4-1-pve-signed: 6.17.4-1
proxmox-kernel-6.17: 6.17.4-1
proxmox-kernel-6.17.2-2-pve-signed: 6.17.2-2
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
proxmox-kernel-6.14.11-4-pve-signed: 6.14.11-4
proxmox-kernel-6.14: 6.14.11-4
proxmox-kernel-6.14.8-2-pve-signed: 6.14.8-2
ifupdown2: 3.3.0-1+pmx11
libjs-extjs: 7.0.0-5
proxmox-backup-docs: 4.1.0-1
proxmox-backup-client: 4.1.0-1
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.2
pve-xtermjs: 5.5.0-3
smartmontools: 7.4-pve1
zfsutils-linux: 2.3.4-pve1
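
So the downgrade path you asked about should still work. A sketch based on the pin syntax you quoted (check the exact version string with kernel list first):
Code:
# list the kernels known to the bootloader
proxmox-boot-tool kernel list
# boot 6.14 only on the next reboot
proxmox-boot-tool kernel pin 6.14.11-4-pve --next-boot
# or pin it permanently / remove the pin again
proxmox-boot-tool kernel pin 6.14.11-4-pve
proxmox-boot-tool kernel unpin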
 
Unfortunately we can't test this with the newest kernel because this is our only cluster which we need for production... But I'm curious what others will report. Thanks in advance.
I wanted to add though: Until recently I thought this is what we get for not paying for a proper PVE/PBS subscription, but it seems you actually pushed this into the stable subscription repository despite our reports. Is that correct?
 
Unfortunately we can't test this with the newest kernel because this is our only cluster which we need for production... But I'm curious what others will report. Thanks in advance.
I wanted to add though: Until recently I thought this is what we get for not paying for a proper PVE/PBS subscription, but it seems you actually pushed this into the stable subscription repository despite our reports. Is that correct?

The new Linux kernel version was part of the 4.1 release; the issue went unnoticed during the extended public testing phase as well as for the Proxmox VE 9.1 release a week earlier, which shipped with the same kernel. Further, the issue is very specific, as it only affects setups with MTU 9000, and even then it is not triggered consistently.

In the meantime we managed to bisect to the commit which fixes the issue, using the reproducer we were able to obtain.
 
Can you link to the commit that possibly fixes (or causes) this issue? We ran into the same problem. I have downgraded to 6.14.11-4-pve to see whether the backups tonight will hang again or not - my manual tests weren't able to trigger the condition anymore, but over the last few days it occurred only after about 40-50% of the backups were done.

If that works I can test 6.17.4-1 tomorrow night.

We're using a mix of 10 and 25 GbE Intel cards with MTU 9000, no LACP, no VLAN on the storage NIC; PBS runs in a VM using TrueNAS Scale-backed NFS storage. This worked fine until we updated from PBS 4.0 to 4.1, with currently approx. 17 TiB used on the backup storage.

Most of the VMs are stored on a different TrueNAS Scale (running 6 enterprise SSDs in a 2x3 RAIDZ1 plus an Optane mirror as ZIL) and run directly from NFS. I'm still flabbergasted by how extremely well this runs, tbh - 140 VMs, mixed Linux/Windows workloads, desktop and enterprise software. Only <10% of them are DB servers on local ZFS enterprise-SSD storage (replicated via ZFS in the cluster).

Kudos to you for making KVM usable and manageable.
 
I was not involved in the investigation myself, but IIRC Chris mentioned that the following commit was the one that fixed the issue:
https://git.proxmox.com/?p=mirror_ubuntu-kernels.git;a=commitdiff;h=82400d46

There is a suspected commit for the cause, but that one has not been verified for certain.
Thank you, I obviously missed that post. Damn, that looks like typical LKML sorcery - and it seems absolutely plausible that it could cause stalls on specific TCP connections only. As I said, I can test it tomorrow night; tonight is reserved for backups that will finally (hopefully) work again :)
 
The new Linux kernel version was part of the 4.1 release; the issue went unnoticed during the extended public testing phase as well as for the Proxmox VE 9.1 release a week earlier, which shipped with the same kernel. Further, the issue is very specific, as it only affects setups with MTU 9000, and even then it is not triggered consistently.

In the meantime we managed to bisect to the commit which fixes the issue, using the reproducer we were able to obtain.
Hi Chris,
in my production environment I have two physical PBS instances: the first has two 10 Gbps NICs bonded with LACP and a 9000 MTU, while the second has two 1 Gbps NICs bonded with LACP and a 1500 MTU. Both are experiencing issues with kernel 6.17.2-1 or 6.17.2-2, but worked perfectly with 6.14.x. Therefore, the problem is NOT exclusively related to LACP with MTU 9000, but also affects LACP with MTU 1500. :(

I'm currently working around this by backing up to two virtual PBS instances running PBS version 3.4.8 and replicating the backups to the physical PBS via a push sync job. Unfortunately, I must report that push jobs to PBS 4 also fail randomly. To add further detail, both PBS 4 instances use ZFS datastores with namespaces.

Below is the log of the failed sync job pushes:

Code:
2025-11-29T02:06:31+01:00: starting new backup on datastore 'Pool_RZ_1' from ::ffff:172.16.160.152: "ns/cls-001/vm/111/2025-11-28T19:33:19Z"
2025-11-29T02:06:31+01:00: download 'index.json.blob' from previous backup 'vm/111/2025-11-27T19:33:13Z'.
2025-11-29T02:06:31+01:00: add blob "/mnt/datastore/Pool_RZ_1/ns/cls-apw-001/vm/111/2025-11-28T19:33:19Z/qemu-server.conf.blob" (451 bytes, comp: 451)
2025-11-29T02:06:31+01:00: register chunks in 'drive-scsi0.img.fidx' from previous backup 'vm/111/2025-11-27T19:33:13Z'.
2025-11-29T02:06:31+01:00: download 'drive-scsi0.img.fidx' from previous backup 'vm/111/2025-11-27T19:33:13Z'.
2025-11-29T02:06:31+01:00: created new fixed index 1 ("ns/cls-001/vm/111/2025-11-28T19:33:19Z/drive-scsi0.img.fidx")
2025-12-02T14:11:31+01:00: backup failed: connection error
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: removing failed backup
2025-12-02T14:11:31+01:00: removing backup snapshot "/mnt/datastore/Pool_RZ_1/ns/cls-001/vm/111/2025-11-28T19:33:19Z"
2025-12-02T14:11:31+01:00: TASK ERROR: connection error: connection reset
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: PUT /fixed_index: 400 Bad Request: Problems reading request body: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
2025-12-02T14:11:31+01:00: POST /fixed_chunk: 400 Bad Request: error reading a body from connection
 
Please test kernel 6.17.4-1, as available in the pbs-test repository.
 
Please test kernel 6.17.4-1, as available in the pbs-test repository.
Ok, so my testing showed what we already know - with 6.14.11-4-pve, all backups went through smoothly tonight.

I installed 6.17.4-1-pve and rebooted the PBS, and the first manual backup I tried started stalling after approx. 4 GiB of transfer.

This did in fact also halt the VM on PVE, making the console and monitor inaccessible. Yesterday it recovered from this state after about 2 hours; today I just killed the VM.

After some digging with the help of our mighty AI overlords, I set the following sysctls:

Code:
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.rmem_default=262144
sysctl -w net.ipv4.tcp_rmem="4096 262144 134217728"

For the record, the defaults are:
Code:
net.core.rmem_max = 212992
net.core.rmem_default = 212992
net.ipv4.tcp_rmem = 4096 131072 33554432

[EDIT: these are the defaults for 6.14.11-4-pve:
net.core.rmem_max = 212992
net.core.rmem_default = 212992
net.ipv4.tcp_rmem = 4096 131072 6291456]


Very noticeably, net.core.rmem_max == net.core.rmem_default == 208 KiB, which is 52 4k pages, i.e. a maximum of 52 packets in the TCP ingest pipeline if I understand this correctly - fewer still if the MTU is 9000 - which seems low-ish to my amateurish understanding of the networking stack. ChatGPT theorized that this needs to be larger for the patch to tcp_can_ingest to be able to work at all.
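
A quick back-of-the-envelope check of that claim (assuming 4 KiB pages and roughly MTU-sized segments):
Code:
# default receive buffer of 212992 bytes, split into pages / jumbo frames
echo $((212992 / 4096))   # = 52 four-KiB pages
echo $((212992 / 9000))   # = 23 MTU-9000-sized frames at most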

With these settings I was able to successfully back up this VM multiple times, with or without the dirty bitmap present (e.g. a backup after a VM stop/start and then subsequent backups after downloading random files into the VM to have some chunks changed).

So, 6.17.4-1-pve *only* works for me when the net.core.rmem_max, net.core.rmem_default and net.ipv4.tcp_rmem values are bigger than default.
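
In case anyone wants to try the same, a minimal sketch of making these settings persistent across reboots (assuming a standard /etc/sysctl.d layout; the file name is just an example) - apply with sysctl --system or a reboot:
Code:
# /etc/sysctl.d/90-pbs-rcvbuf.conf (example file name)
net.core.rmem_default = 262144
net.core.rmem_max = 134217728
net.ipv4.tcp_rmem = 4096 262144 134217728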

I tested the same sysctl-values with 6.17.2-2-pve. This did not work, the backup stalled again after a few GiB and also stalled the VM on the PVE-node.

Let me know if you need more or more detailed information. Since I now have known-good configurations and some reproducible failure triggers, I'm willing to test whatever you throw at me.

EDIT to make it crystal clear: 6.17.4-1-pve in pbs-test does *not* by itself fix the problem. Also, since the VMs that stall during backup do get stuck, this *can lead to data corruption*: I've had a few write errors followed by necessary fsck runs on Linux VMs that stalled during the backup (luckily no further damage so far).
 
First off, thanks for testing!

So, 6.17.4-1-pve *only* works for me when the net.core.rmem_max, net.core.rmem_default and net.ipv4.tcp_rmem values are bigger than default.
This is an interesting discrepancy between your findings and ours. So far, none of the backups failed when using kernel 6.17.4-1 here. It could maybe be related to the network setup and topology; on our end, a bond with MTU 9000 and LACP with direct connections between the NICs was used.

I tested the same sysctl-values with 6.17.2-2-pve. This did not work, the backup stalled again after a few GiB and also stalled the VM on the PVE-node.
This on the other hand is consistent with our findings.

Let me know if you need more or more detailed information. Since I now have known-good configurations and some reproducible failure triggers, I'm willing to test whatever you throw at me.
Could you maybe check what the threshold value for the buffer size is in your case? Also, the output of ss -ti dport 8007 on the PVE node and ss -ti sport 8007 on the PBS during backup might be of interest, for both the good and the stalled state.
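
For example, something along these lines to record the socket state over time (the log file path is just an example; use the dport variant on the PVE node):
Code:
# on the PBS: log the backup connection state every 10 seconds
while true; do
    date
    ss -ti sport 8007
    sleep 10
done >> /root/ss-sport-8007.log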
 
First off, thanks for testing!


This is an interesting discrepancy between your findings and ours. So far, none of the backups failed when using kernel 6.17.4-1 here. It could maybe be related to the network setup and topology; on our end, a bond with MTU 9000 and LACP with direct connections between the NICs was used.
Absolutely.

We don't use a bond, but we do use a bridge on the NIC for the storage net on the PVE node with MTU 9000 (slightly anonymized):
Code:
# pve-node
auto ens27f1np1
iface ens27f1np1 inet manual
        mtu 9000

auto vmbr1
iface vmbr1 inet static
        address 10.x.y.z/24
        bridge-ports ens27f1np1
        bridge-stp off
        bridge-fd 0
        mtu 9000
#STORAGE
This is the bridge the second NIC of the PBS VM is assigned to:
Code:
# pbs storage nic
auto ens19
iface ens19 inet static
        address 10.x.y.a/24
        mtu 9000
The storage is mounted on the PBS via NFS:
Code:
storage-backup.my.domain:/mnt/backup/Proxmox on /backup type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,fatal_neterrors=none,proto=tcp,nconnect=16,timeo=600,retrans=2,sec=sys,clientaddr=10.x.y.a,local_lock=none,addr=10.x.y.b)
The VM-storage comes from a different TrueNAS, but on the same MTU-9000 network:
Code:
nfs: san-prx
        export /mnt/SSD_SAN/prox_bbg_store
        path /mnt/pve/san-prx
        server storage-san.my.domain
        content vztmpl,images,iso,rootdir
        options nconnect=16,noatime,nodiratime
        prune-backups keep-all=1
So each backup passes through the same NIC at least twice (not ideal, I know).

The NIC is:
Code:
43:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
43:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
This on the other hand is consistent with our findings.


Could you maybe check what the threshold value for the buffer size is in your case? Also, the output of ss -ti dport 8007 on the PVE node and ss -ti sport 8007 on the PBS during backup might be of interest, for both the good and the stalled state.

Ok, some results.

6.17.4-1-pve with altered sysctls, during a working backup (10.x.y.a is the PBS on the storage network, 10.x.y.z is the storage-network IP of the PVE node the backed-up VM is running on): see attached 1.txt

When I tried the same with the default sysctl values, I could not reproduce it anymore (tried around 10 backups of 5-10 GiB each). I'll try again later. Unfortunately it seems not as reproducible as I thought it would be; I must have gotten "lucky" earlier today.

6.17.4-1-pve with default sysctls, during a working backup: see attached 2.txt

By threshold value, do you mean experimenting with values between the defaults and the ones I set? I will do that as soon as I can reproduce the stalling again.

EDIT: at least this is consistent with the backups over the last few days failing only after approx. half of the 140-some VMs had been backed up, i.e. only after a few hundred GiB.
 


OK, I let the nightly backup start on stock 6.17.4-1-pve and ran into the stall a minute into the first backup, after it had written 17 GiB at approx. 250 MiB/s. And bingo: the rcv_wnd is 7168, not 180224 or similar as it was when it worked.

Code:
#PBS
root@prx-backup:~# ss -ti sport 8007
State           Recv-Q            Send-Q                              Local Address:Port                               Peer Address:Port          
ESTAB           0                 0                            [::ffff:10.x.y.a]:8007                       [::ffff:10.x.y.z]:53414          
         cubic wscale:7,10 rto:201 rtt:0.117/0.075 ato:40 mss:8948 pmtu:9000 rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:1641323 bytes_acked:1641323 bytes_received:6420848127 segs_out:608708 segs_in:626512 data_segs_out:5276 data_segs_in:626049 send 6.12Gbps lastsnd:7932 lastrcv:192 lastack:192 pacing_rate 12.2Gbps delivery_rate 3.29Gbps delivered:5277 app_limited busy:3030ms rcv_rtt:206.497 rcv_space:130377 rcv_ssthresh:275097 minrtt:0.04 rcv_ooopack:9 snd_wnd:454272 rcv_wnd:7168

#PVE-Node
root@prx-hanspeter:~# ss -ti dport 8007                                                                                                         
State             Recv-Q              Send-Q                            Local Address:Port                            Peer Address:Port         
ESTAB             0                   1646484                             10.x.y.z:53414                            10.x.y.a:8007         
         cubic wscale:10,7 rto:201 rtt:0.17/0.022 ato:50 mss:8948 pmtu:9000 rcvmss:7199 advmss:8948 cwnd:2 ssthresh:2 bytes_sent:6443149459 bytes_retrans:183852 bytes_acked:6442965608 bytes_received:1643437 segs_out:731147 segs_in:614702 data_segs_out:730668 data_segs_in:5292 send 842Mbps lastsnd:4 lastrcv:38316 lastack:4 pacing_rate 1.01Gbps delivery_rate 480Mbps delivered:730654 busy:1990239ms rwnd_limited:1989678ms(100.0%) retrans:0/31 dsack_dups:15 rcv_rtt:0.607 rcv_space:97894 rcv_ssthresh:454209 notsent:1646484 minrtt:0.049 snd_wnd:7168 rcv_wnd:454272 rehash:2


I have now stopped the backup, set the sysctls as below, restarted proxmox-backup-proxy and proxmox-backup and restarted the backup:
Code:
sysctl -w net.core.rmem_default=262144                                                                                      
sysctl -w net.core.rmem_max=134217728                                                                                        
sysctl -w net.ipv4.tcp_rmem="4096 262144 134217728"
Code:
root@prx-backup:~# ss -ti sport 8007
ESTAB           0                0                            [::ffff:10.x.y.a]:8007                        [::ffff:10.x.y.z]:51148          
         cubic wscale:7,12 rto:205 rtt:4.258/5.234 ato:40 mss:8948 pmtu:9000 rcvmss:8948 advmss:8948 cwnd:18 ssthresh:18 bytes_sent:1142970 bytes_acked:1142970 bytes_received:832914641 segs_out:15259 segs_in:32379 data_segs_out:800 data_segs_in:32132 send 303Mbps lastsnd:111 lastrcv:2 lastack:2 pacing_rate 363Mbps delivery_rate 2.96Gbps delivered:801 app_limited busy:1252ms rcv_rtt:0.094 rcv_space:98011 rcv_ssthresh:217282 minrtt:0.054 rcv_ooopack:2 snd_wnd:750592 rcv_wnd:151552

It has now backed up 2 VMs with approx. 40 GiB dirty each, at rates between 100 and 350 MiB/s. On the third VM it stalled yet again:
Code:
root@prx-backup:~# ss -ti sport 8007                                                                                                            
State           Recv-Q           Send-Q                              Local Address:Port                                Peer Address:Port                                                                                                                                                            
ESTAB           0                0                            [::ffff:10.x.y.a]:8007                        [::ffff:10.x.y.z]:49616      
         cubic wscale:7,12 rto:201 rtt:0.121/0.07 ato:40 mss:8948 pmtu:9000 rcvmss:8192 advmss:8948 cwnd:4 ssthresh:4 bytes_sent:2098387 bytes_retrans:246 bytes_acked:2098141 bytes_received:13852668567 segs_out:242051 segs_in:562366 data_segs_out:9270 data_segs_in:559253 send 2.37Gbps lastsnd:18817 lastrcv:97 lastack:97 pacing_rate 2.84Gbps delivery_rate 2.75Gbps delivered:9269 app_limited busy:23248ms retrans:0/2 rcv_rtt:206.523 rcv_space:172933 rcv_ssthresh:433661 minrtt:0.047 rcv_ooopack:82 snd_wnd:422784 rcv_wnd:8192                                                              
root@prx-hanspeter:~# ss -ti dport 8007                                                                                                        
State             Recv-Q              Send-Q                            Local Address:Port                            Peer Address:Port        
ESTAB             0                   1299100                             10.x.y.z:49616                            10.x.y.a:8007        
         cubic wscale:12,7 rto:201 rtt:0.173/0.021 ato:42 mss:8948 pmtu:9000 rcvmss:7199 advmss:8948 cwnd:2 ssthresh:2 bytes_sent:13853721213 bytes_retrans:937958 bytes_acked:13852783256 bytes_received:2098141 segs_out:1626583 segs_in:242066 data_segs_out:1623470 data_segs_in:9268 send 828Mbps lastsnd:165 lastrcv:21799 lastack:165 pacing_rate 993Mbps delivery_rate 333Mbps delivered:1623381 busy:518968ms rwnd_limited:514301ms(99.1%) retrans:0/134 dsack_dups:43 rcv_rtt:0.677 rcv_space:98940 rcv_ssthresh:422735 notsent:1299100 minrtt:0.042 rcv_ooopack:2 snd_wnd:8192 rcv_wnd:422784 rehash:2

So for now, I'll go back to 6.14 and let the backups finish. Let me know if I can test anything else.
 