Hi,
Since upgrading on February 11, 2026, backup duration for a VM with a 6TB Ceph RBD disk has jumped from 2h20m to ~8h39m. The slowdown appeared immediately after the upgrade and has been reproducible on every backup since. Note: the VM is shut down every evening to save power, so the dirty bitmap is recreated each morning.
VERSIONS
- PVE: 9.1.6 (kernel 6.17.13-2-pve)
- QEMU: 10.1.2-7
- Ceph: 19.2.3-pve4
- PBS: 4.1.4 (kernel 6.17.13-2-pve)
VM SETUP
- scsi0: 64GB on NVMe-backed Ceph pool
- scsi1: 6TB on HDD-backed Ceph pool (9x HDD OSDs, 3 nodes)
- Backup to PBS with fleecing on local-zfs
- Cluster shuts down every evening → dirty bitmaps always invalidated → every backup is a full 6TB read
BEFORE/AFTER (from PBS task logs)
Feb 10 — last backup BEFORE upgrade (PVE 8 / QEMU 9.2 / Ceph 18):
Start: 2026-02-10T11:00:05 End: 2026-02-10T13:20:22 → 2h 20m ✓
Upgrade on Feb 11: Ceph 18→19 at 14:15, QEMU 9.2→10.1 + PVE 8→9 at 15:55
Feb 11 — first backup AFTER upgrade, started 90 minutes later:
Start: 2026-02-11T17:27:59 End: 2026-02-12T02:06:48 → 8h 39m ✗
Both backups show 99% deduplication — identical data volume.
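For reference, the durations above work out as follows (GNU date; the log timestamps are assumed to share one timezone, which is all the subtraction needs):

```shell
# Before the upgrade (Feb 10 task log):
s1=$(date -ud "2026-02-10T11:00:05" +%s)
e1=$(date -ud "2026-02-10T13:20:22" +%s)
dur1=$((e1 - s1))
echo "before: ${dur1}s"   # 2h 20m 17s

# After the upgrade (Feb 11 task log):
s2=$(date -ud "2026-02-11T17:27:59" +%s)
e2=$(date -ud "2026-02-12T02:06:48" +%s)
dur2=$((e2 - s2))
echo "after:  ${dur2}s"   # 8h 38m 49s

# Relative duration, after vs. before:
echo "ratio: $((dur2 * 100 / dur1))%"
```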
KEY OBSERVATION
During backup, vzdump reads the 6TB disk at a consistent 100-135 MB/s.
Direct rbd bench on the same volume: 780 MiB/s (16 threads, 4MB blocks).
That is a 5-6x gap between raw Ceph read speed and the backup pipeline.
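One way to narrow this down (a sketch; pool/image names below are placeholders, not my actual config): rbd bench drives librbd with 16 parallel threads, while the backup job reads through QEMU's block layer with far less parallelism. qemu-img bench goes through the same QEMU block layer, so comparing low vs. high queue depth on the same image should show whether the gap is queue-depth-limited reads rather than Ceph itself:

```shell
# librbd directly, 16 threads (what produced the 780 MiB/s figure):
rbd bench --io-type read --io-size 4M --io-threads 16 ceph-hdd/vm-100-disk-1

# QEMU's block layer at queue depth 1 (roughly a sequential backup read)
# vs. queue depth 16, 4M requests, 1000 requests each:
qemu-img bench -f raw -d 1  -c 1000 -s 4M "rbd:ceph-hdd/vm-100-disk-1:conf=/etc/ceph/ceph.conf"
qemu-img bench -f raw -d 16 -c 1000 -s 4M "rbd:ceph-hdd/vm-100-disk-1:conf=/etc/ceph/ceph.conf"
```

If depth 1 lands near the observed 100-135 MB/s and depth 16 near the rbd bench figure, that would point at the backup job's read pattern rather than the storage.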
WHAT I HAVE RULED OUT
- PBS kernel 6.17.2 TCP regression: running 6.17.13, not affected
- Network: bond0 is 20 Gbps, iperf confirms full speed
- Ceph health: HEALTH_OK, all 12 OSDs up
- PG distribution: well balanced, ~28 PGs primary per HDD OSD
- mclock miscalibration: 3 OSDs had low osd_mclock_max_capacity_iops_hdd (239-286), corrected to 478 — no effect on backup speed
- osd.6 missing NVMe DB/WAL: fixed via ceph-bluestore-tool bluefs-bdev-migrate — no effect on backup speed
- detect_zeroes=off on scsi1: tested, no change
- Fleecing disabled: tested without --fleecing, no change
- rbd_cache=false: tested, no change
- rbd_read_from_replica_policy: already 'default'
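For anyone wanting to reproduce the rule-outs above, the tests were along these lines (VMID and storage names are placeholders; adjust to your setup):

```shell
# backup without fleecing
vzdump 100 --storage pbs-store --fleecing 0

# disable detect_zeroes on the big disk (keep the rest of the drive string as-is)
qm set 100 --scsi1 ceph-hdd:vm-100-disk-1,detect_zeroes=0

# disable the librbd client cache for the next VM start
ceph config set client rbd_cache false

# check the replica read policy
ceph config get client rbd_read_from_replica_policy
```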
Given that the 5-6x gap between rbd bench (~780 MiB/s) and the vzdump read speed (~120 MB/s) appeared immediately after the upgrade, could this be a regression in the QEMU backup block job or in libproxmox-backup-qemu in QEMU 10.x / PVE 9.x?
Has anyone else seen this? Any suggestions welcome.