Backup speed regression: 2h20m → 8h39m after PVE 8→9 / QEMU 9.2→10.1 / Ceph 18→19 upgrade

portedaix

Member
Sep 4, 2023
Hi,

Since upgrading on February 11, 2026, backup duration for a VM with a 6TB Ceph RBD disk has jumped from 2h20m to ~8h39m. The slowdown appeared immediately and is reproducible on every backup since. Note: the VM is shut down every evening for power savings, so the dirty bitmap is recreated every morning.


VERSIONS
- PVE: 9.1.6 (kernel 6.17.13-2-pve)
- QEMU: 10.1.2-7
- Ceph: 19.2.3-pve4
- PBS: 4.1.4 (kernel 6.17.13-2-pve)


VM SETUP
- scsi0: 64GB on NVMe-backed Ceph pool
- scsi1: 6TB on HDD-backed Ceph pool (9x HDD OSDs, 3 nodes)
- Backup to PBS with fleecing on local-zfs
- Cluster shuts down every evening → dirty bitmaps always invalidated → every backup is a full 6TB read


BEFORE/AFTER (from PBS task logs)

Feb 10 — last backup BEFORE upgrade (PVE 8 / QEMU 9.2 / Ceph 18):
Start: 2026-02-10T11:00:05 End: 2026-02-10T13:20:22 → 2h 20m ✓

Upgrade on Feb 11: Ceph 18→19 at 14:15, QEMU 9.2→10.1 + PVE 8→9 at 15:55

Feb 11 — first backup AFTER upgrade, started 90 minutes later:
Start: 2026-02-11T17:27:59 End: 2026-02-12T02:06:48 → 8h 39m ✗

Both backups show 99% deduplication, so the data volume transferred is essentially identical.
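A quick sanity check on those timestamps: assuming every backup reads the full 6TB (decimal, 6e12 bytes) because the nightly shutdown invalidates the dirty bitmap, the wall-clock durations imply these average read rates:

```shell
# Average read rate implied by the task-log timestamps, assuming a
# full 6 TB read per backup (bitmap invalidated by nightly shutdown).
bytes=6000000000000
before=$((2*3600 + 20*60 + 17))   # Feb 10: 11:00:05 -> 13:20:22 = 8417 s
after=$((8*3600 + 38*60 + 49))    # Feb 11: 17:27:59 -> 02:06:48 = 31129 s
echo "before upgrade: $((bytes / before / 1000000)) MB/s"
echo "after upgrade:  $((bytes / after  / 1000000)) MB/s"
```

This prints ~712 MB/s before and ~192 MB/s after. The pre-upgrade average is close to the raw rbd bench figure below, while the post-upgrade average sits near the throttled read rate (somewhat above the observed 100-135 MB/s, presumably because sparse regions complete faster).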


KEY OBSERVATION
During backup, vzdump reads the 6TB disk at a consistent 100-135 MB/s.
Direct rbd bench on the same volume: 780 MiB/s (16 threads, 4MB blocks).
That is a 5-6x gap between raw Ceph read speed and the backup pipeline.
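To put a number on that gap (taking 120 MB/s as an assumed midpoint of the observed 100-135 MB/s range):

```shell
# Ratio of raw RBD read throughput to the vzdump read rate.
# The benchmark was run along these lines (pool/image names are placeholders):
#   rbd bench --io-type read --io-size 4M --io-threads 16 <pool>/<image>
rbd_bs=$((780 * 1024 * 1024))   # rbd bench: 780 MiB/s in bytes/s
vz_bs=$((120 * 1000 * 1000))    # vzdump:   ~120 MB/s in bytes/s
echo "gap: ~$((rbd_bs / vz_bs))x"   # integer division, so this floors
```

This prints "gap: ~6x", consistent with the 5-6x range above.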


WHAT I HAVE RULED OUT
- PBS kernel 6.17.2 TCP regression: running 6.17.13, not affected
- Network: bond0 is 20 Gbps, iperf confirms full speed
- Ceph health: HEALTH_OK, all 12 OSDs up
- PG distribution: well balanced, ~28 PGs primary per HDD OSD
- mclock miscalibration: 3 OSDs had low osd_mclock_max_capacity_iops_hdd (239-286), corrected to 478 — no effect on backup speed
- osd.6 missing NVMe DB/WAL: fixed via ceph-bluestore-tool bluefs-bdev-migrate — no effect on backup speed
- detect_zeroes=off on scsi1: tested, no change
- Fleecing disabled: tested without --fleecing, no change
- rbd_cache=false: tested, no change
- rbd_read_from_replica_policy: already 'default'

The 5-6x gap between rbd bench (780 MiB/s) and the vzdump read speed (~120 MB/s), appearing immediately after the upgrade, points to a possible regression in the QEMU backup block job or in libproxmox-backup-qemu under QEMU 10.x / PVE 9.x.
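One more back-of-the-envelope figure supporting this: if the backup pipeline read at the rbd bench rate, a full pass over the disk would land close to the pre-upgrade duration, which is consistent with the pipeline, not Ceph, being the new bottleneck (same assumptions as above: 6TB decimal, 780 MiB/s):

```shell
# Projected duration of a full 6 TB read at the raw rbd bench rate.
bytes=6000000000000
bench=$((780 * 1024 * 1024))   # 780 MiB/s in bytes/s
secs=$((bytes / bench))
echo "projected full read: $((secs / 3600))h $((secs % 3600 / 60))m"
```

This prints "projected full read: 2h 2m", right next to the observed pre-upgrade 2h20m.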

Has anyone else seen this? Any suggestions welcome.
 


Hi,

Thank you for the detailed report!

Could you please share the output of `pveversion -v` from the Proxmox VE node, the output of `proxmox-backup-manager versions --verbose` from the Proxmox Backup Server, and the storage config (`/etc/pve/storage.cfg`) from the PVE side?

Have you tried booting the older kernel on the PBS or PVE host?
Do all VMs have slow backups, or only the VMs stored on Ceph?
How is the restore speed?