Backup speed regression: 2h20m → 8h39m after PVE 8→9 / QEMU 9.2→10.1 / Ceph 18→19 upgrade

portedaix

Hi,

Since upgrading on February 11, 2026, VM backup duration has jumped from 2h20m to ~8h39m for a VM with a 6TB Ceph RBD disk. The slowdown was immediate and has been reproducible on every backup since. Note: the VM is shut down every evening for power savings, so the dirty bitmap is created new every morning.


VERSIONS
- PVE: 9.1.6 (kernel 6.17.13-2-pve)
- QEMU: 10.1.2-7
- Ceph: 19.2.3-pve4
- PBS: 4.1.4 (kernel 6.17.13-2-pve)


VM SETUP
- scsi0: 64GB on NVMe-backed Ceph pool
- scsi1: 6TB on HDD-backed Ceph pool (9x HDD OSDs, 3 nodes)
- Backup to PBS with fleecing on local-zfs
- Cluster shuts down every evening → dirty bitmaps always invalidated → every backup is a full 6TB read
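Because of that, every task log should show the bitmap being recreated; a quick check against the task log (a sketch; the per-VM vzdump log path is an assumption):

```shell
# "created new" => bitmap lost, full 6TB read; "OK" => incremental run
LOG=/var/log/vzdump/qemu-162.log   # assumed standard per-VM vzdump log
if [ -f "$LOG" ]; then
    grep 'dirty-bitmap status' "$LOG"
fi
```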


BEFORE/AFTER (from PBS task logs)

Feb 10 — last backup BEFORE upgrade (PVE 8 / QEMU 9.2 / Ceph 18):
Start: 2026-02-10T11:00:05 End: 2026-02-10T13:20:22 → 2h 20m ✓

Upgrade on Feb 11: Ceph 18→19 at 14:15, QEMU 9.2→10.1 + PVE 8→9 at 15:55

Feb 11 — first backup AFTER upgrade, started 90 minutes later:
Start: 2026-02-11T17:27:59 End: 2026-02-12T02:06:48 → 8h 39m ✗

Both backups show 99% deduplication — identical data volume.


KEY OBSERVATION
During backup, vzdump reads the 6TB disk at a consistent 100-135 MB/s.
Direct rbd bench on the same volume: 780 MiB/s (16 threads, 4MB blocks).
That is a 5-6x gap between raw Ceph read speed and the backup pipeline.


WHAT I HAVE RULED OUT
- PBS kernel 6.17.2 TCP regression: running 6.17.13, not affected
- Network: bond0 is 20 Gbps, iperf confirms full speed
- Ceph health: HEALTH_OK, all 12 OSDs up
- PG distribution: well balanced, ~28 PGs primary per HDD OSD
- mclock miscalibration: 3 OSDs had low osd_mclock_max_capacity_iops_hdd (239-286), corrected to 478 — no effect on backup speed
- osd.6 missing NVMe DB/WAL: fixed via ceph-bluestore-tool bluefs-bdev-migrate — no effect on backup speed
- detect_zeroes=off on scsi1: tested, no change
- Fleecing disabled: tested without --fleecing, no change
- rbd_cache=false: tested, no change
- rbd_read_from_replica_policy: already 'default'

The 5-6x gap between rbd bench (780 MiB/s) and the vzdump read speed (~120 MiB/s), appearing immediately after the upgrade, suggests a regression in the QEMU backup block job or in libproxmox-backup-qemu under QEMU 10.x / PVE 9.x. Could that be it?

Has anyone else seen this? Any suggestions welcome.
 


Hi,

Thank you for the output!

Could you please share the output of `pveversion -v` from the Proxmox VE side, the output of `proxmox-backup-manager versions --verbose` from the Proxmox Backup Server, and the storage config (`/etc/pve/storage.cfg`) from the PVE side?

Have you tried booting from the older kernel on the PBS or PVE side?
Are all VMs slow to back up, or only the VMs stored on Ceph?
How about restore speed?
 
Hi Moayad, thanks for your reply. Below the requested data plus two controlled tests that strongly narrow down the issue.

1. Versions (current state)

PVE cluster nodes (ren6 / ren7 / ren9):
  • pve-manager: 9.1.7, pve-qemu-kvm: 10.1.2-7, qemu-server: 9.1.6
  • libproxmox-backup-qemu0: 2.0.2, proxmox-backup-client: 4.1.5-1
  • kernel 6.17.13-2-pve, Ceph 19.2.3

Standalone PVE node (ren01, isolated, hosts only the firewall VM):
  • pve-manager: 9.1.9, pve-qemu-kvm: 10.1.2-7, qemu-server: 9.1.8
  • libproxmox-backup-qemu0: 2.0.2, proxmox-backup-client: 4.1.5-1
  • kernel 6.17.13-2-pve, storage = local LVM-thin (no Ceph)

PBS server (ren2):
  • proxmox-backup-server: 4.1.7 (4.1.10-1 installed, running 4.1.7), kernel 6.17.13-3-pve
  • Datastore: ZFS mirror with SLOG and special device

2. Older PVE kernel test (already done)

Installed proxmox-kernel-6.5.13-6-pve-signed and proxmox-kernel-6.8.12-15-pve-signed and rebooted cluster nodes onto them. Backup speed of VM 162 (Ceph RBD) did not change — same ~100-135 MiB/s, same ~9h total. Removed those kernels two weeks ago.

PBS-side kernel rollback: not yet tested. Can do this if you think it's worth it (PBS still has 6.8.12-9 installed).

3. All VMs on Ceph are slow, VM on LVM-thin is fast — controlled test

Code:
VM                     Storage               Mode      Bitmap       Read speed
---------------------  --------------------  --------  -----------  ------------------------
VM 162 (omv6, 6T)      Ceph RBD ceph6+nvme   snapshot  created new  100-135 MiB/s
VM 161 (deb6, 40G)     Ceph RBD              snapshot  created new  same range
VM 101 (pfsense, 32G)  local-lvm (LVM-thin)  snapshot  created new  390 MiB/s avg (450 peak)

Same QEMU 10.1.2-7, same libproxmox-backup-qemu0 2.0.2, same proxmox-backup-client 4.1.5-1, same kernel 6.17.13-2-pve, same target PBS, same network, same backup mode (snapshot), same bitmap state (created new). Only difference is the storage backend.

Combined with my earlier rbd bench numbers (780 MiB/s, 16 threads, 4M blocks), Ceph itself is not the bottleneck. The slowdown appears to be in the librbd <-> pve-backup-stream interaction in QEMU 10.1, not in the generic backup pipeline.

4. Restore speed

Restored VM 161 from latest snapshot to a throwaway VMID on ceph-nvme:
  • Total: 40 GiB transferred in 363.22s = 112.77 MB/s (with 48% zero blocks skipped on the PBS side)
  • During the actual data-write phase (after zero-skip), throughput dropped to ~17-30 MB/s on the non-zero portion

Restore is as slow as backup. The bottleneck is bidirectional, affecting the PBS <-> Ceph RBD path in both directions. Same versions, same nodes, same Ceph cluster.

What I'd appreciate next

If there is a way to enable librbd debug logging or QEMU block-job tracing during a snapshot backup so I can capture per-IO timings, please tell me which knob — happy to provide traces. Also open to trying any specific block-driver tunable on QEMU side (aio, iothread, cache, queue depth, etc.) you'd suspect.
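For example, is it the client-side `debug_rbd` logging? Something like this in ceph.conf on the PVE node (an assumption on my side that these are the relevant subsystems; very verbose, to be reverted right after):

```
[client]
debug_rbd   = 20/20
debug_rados = 10/10
log_file    = /var/log/ceph/client.$name.$pid.log
```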
 
Thank you for the test and the output!

At this point, this looks more like the Ceph RBD access path on the PVE side than a generic PBS-side issue. Before going deeper into tracing, could you please test one non-critical Ceph-backed VM with `krbd` enabled and then rerun one backup there?

If you do not want to change the existing storage globally, the safer variant is to create a temporary second RBD storage entry pointing to the same pool with `krbd 1` and test with one non-critical VM on that storage ID.

After that test, please post the output of the following commands:
Code:
qm showcmd <vmid> --pretty
rbd showmapped

together with the backup task log.
 
Hi Moayad, thanks for the suggestion. Done, results below — interesting, krbd does not help.

Setup

Created a temporary RBD storage pointing to the same Ceph pool with krbd 1:
Code:
rbd: ceph-nvme-krbd
    content images
    krbd 1
    pool pool_nvme_vm

Then I full-cloned VM 161 (deb6, 40 GiB Debian Linux, the smallest VM I have on Ceph that's still meaningful) onto that storage as VM 701.

Note: the PVE clone first failed with rbd: image vm-701-disk-0: image uses unsupported features: 0x40 (journaling, which krbd does not support). I worked around it by setting rbd config pool set pool_nvme_vm rbd_default_features 61 (= layering+exclusive-lock+object-map+fast-diff+deep-flatten, no journaling) before the clone, then restored the default afterwards. Worth noting that PVE's default RBD image features include journaling on this cluster; that may bite other people who try the same workaround.
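For anyone else hitting the same clone failure: the bitmask behind that workaround, plus an alternative I did not test (the feature bit values are from the Ceph RBD docs; the `rbd feature disable` route is an assumption):

```shell
# RBD feature bits: layering=1, striping=2, exclusive-lock=4,
# object-map=8, fast-diff=16, deep-flatten=32, journaling=64 (0x40)
echo $(( 1 + 4 + 8 + 16 + 32 ))   # -> 61, all of the above minus striping and journaling
# What I ran before the clone (then restored the pool default):
#   rbd config pool set pool_nvme_vm rbd_default_features 61
# Untested alternative: strip journaling from an existing image instead:
#   rbd feature disable pool_nvme_vm/vm-701-disk-0 journaling
```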

qm showcmd 701 --pretty

Code:
/usr/bin/kvm \
  -id 701 \
  -name 'deb6-krbd-test,debug-threads=on' \
  ...
  -object 'iothread,id=iothread-virtioscsi0' \
  -object '{"id":"throttle-drive-scsi0","limits":{},"qom-type":"throttle-group"}' \
  ...
  -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' \
  -blockdev '{"detect-zeroes":"on","discard":"ignore","driver":"throttle","file":{"cache":{"direct":true,"no-flush":false},"detect-zeroes":"on","discard":"ignore","driver":"raw","file":{"aio":"io_uring","cache":{"direct":true,"no-flush":false},"detect-zeroes":"on","discard":"ignore","driver":"host_device","filename":"/dev/rbd-pve/b603a1d4-026f-4ff3-812a-5edcc9759980/pool_nvme_vm/vm-701-disk-0","node-name":"e2af3871ba7d5abf43f2c5e052b7d8a","read-only":false},"node-name":"f2af3871ba7d5abf43f2c5e052b7d8a","read-only":false},"node-name":"drive-scsi0","read-only":false,"throttle-group":"throttle-drive-scsi0"}' \
  -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,device_id=drive-scsi0,bootindex=100,write-cache=on' \
  ...
  -machine 'hpet=off,type=pc+pve0'

So QEMU is using driver=host_device on /dev/rbd-pve/.../vm-701-disk-0 with aio=io_uring, an iothread, and cache.direct=true; there is no librbd in this path.

rbd showmapped

Code:
id  pool          namespace  image          snap  device
0   pool_nvme_vm             vm-701-disk-0  -     /dev/rbd0

Backup task log (truncated to key lines)

Code:
INFO: Starting Backup of VM 701 (qemu)
INFO: VM Name: deb6-krbd-test
INFO: include disk 'scsi0' 'ceph-nvme-krbd:vm-701-disk-0' 40G
INFO: backup mode: snapshot
INFO: scsi0: dirty-bitmap status: created new
INFO:   5% (2.0 GiB of 40.0 GiB) in 48s,  read: 43.1 MiB/s,  write: 43.1 MiB/s
INFO:  10% (4.0 GiB of 40.0 GiB) in 1m 38s, read: 21.9 MiB/s,  write: 21.9 MiB/s
INFO:  20% (8.1 GiB of 40.0 GiB) in 2m 58s, read: 66.9 MiB/s,  write: 66.9 MiB/s
INFO:  30% (12.2 GiB of 40.0 GiB) in 4m 12s, read: 117.3 MiB/s, write: 117.3 MiB/s
INFO:  40% (..., zero-skip burst, read 2.7 GiB/s effective)
INFO:  80% (32.2 GiB of 40.0 GiB) in 4m 43s, read: 160.0 MiB/s, write: 74.7 MiB/s
INFO:  85% (34.0 GiB of 40.0 GiB) in 5m 32s, read: 34.5 MiB/s,  write: 34.5 MiB/s
INFO:  90% (36.0 GiB of 40.0 GiB) in 6m 21s, read: 36.4 MiB/s,  write: 36.4 MiB/s
INFO:  95% (38.0 GiB of 40.0 GiB) in 7m 20s, read: 30.2 MiB/s,  write: 30.2 MiB/s
INFO: 100% (40.0 GiB of 40.0 GiB) in 7m 27s, read: 772.0 MiB/s, write: 0 B/s
INFO: backup is sparse: 19.43 GiB (48%) total zero data
INFO: backup was done incrementally, reused 19.74 GiB (49%)
INFO: transferred 40.00 GiB in 447 seconds (91.6 MiB/s)
INFO: Finished Backup of VM 701 (00:07:36)
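As a sanity check, the summary rate is consistent with the totals in the log:

```shell
# 40 GiB transferred in 447 s, as in the task log above
gib=40; secs=447
awk -v g="$gib" -v s="$secs" 'BEGIN { printf "%.1f MiB/s\n", g * 1024 / s }'
# -> 91.6 MiB/s, matching the log's summary line
```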

Comparison summary

Code:
Test                                   Client    Read avg     Notes
-------------------------------------- --------- ------------ ----------------------
VM 162 (omv6, 6T) Ceph RBD             librbd    100-135 MiB/s 6T full read, 9h
VM 161 (deb6, 40G) Ceph RBD restore    librbd    113 MiB/s    PBS->Ceph (restore)
VM 701 (clone of 161) ceph-nvme-krbd   krbd      91.6 MiB/s   THIS test
VM 101 (pfsense, 32G) local LVM-thin   n/a       390 MiB/s    on isolated PVE node

Conclusion from this test

krbd does not noticeably change the throughput. Both librbd and krbd are stuck in the same ~90-130 MiB/s band on this cluster. The throughput on a node using local LVM-thin (same QEMU 10.1.2-7, same proxmox-backup-client, same target PBS) is ~3-4x higher.

So the bottleneck is not in librbd specifically — it's downstream of the QEMU block driver, in one of:
  • The pve-backup-stream / libproxmox-backup-qemu0 pipeline (would explain why local LVM is also fast — but how does it depend on the storage backend then?)
  • The Ceph network path (cluster network is 10G X520 active-backup over a dedicated VLAN, rbd bench direct on the same RBD image gives 780 MiB/s with 16 threads / 4M blocks, so raw bandwidth is fine)
  • A small-IO read pattern that the backup pipeline uses, where Ceph RBD adds round-trip latency that LVM-thin doesn't have. That would also be consistent with krbd not helping (kernel client still has to round-trip to OSDs).

Happy to capture more data if you can suggest a probe — for example:
  • Specific QEMU block-job tracing knob during the backup
  • ceph daemon osd.X perf dump before/during/after to see queue depth on the primary OSDs
  • Per-IO timing via rbd bench --io-size 64K --io-threads 1 (small-IO single-thread, closer to what backup might be doing) — happy to run this if relevant

Or if you have a known knob on the QEMU block-driver side that would change how backup IO sequences itself against RBD (queue depth, batch size, alignment), I'll try it.
 
Thank you for the test!

Could you please run the following read-only benchmarks on the same RBD image used for VM 701?

Bash:
rbd bench --io-type read --io-size 64K --io-threads 1 --io-total 1G pool_nvme_vm/vm-701-disk-0
rbd bench --io-type read --io-size 1M  --io-threads 1 --io-total 4G pool_nvme_vm/vm-701-disk-0
rbd bench --io-type read --io-size 1M  --io-threads 16 --io-total 4G pool_nvme_vm/vm-701-disk-0
rbd bench --io-type read --io-size 4M  --io-threads 1 --io-total 4G pool_nvme_vm/vm-701-disk-0

If practical, one more backup test with a fresh (full) run of the same VM 701 and different worker counts would also be useful:
Bash:
vzdump 701 --storage <pbs-storage> --mode snapshot --remove 0 --notes-template '{{guestname}}' --performance max-workers=1
and, if you can make it a comparable fresh run again:
Bash:
vzdump 701 --storage <pbs-storage> --mode snapshot --remove 0 --notes-template '{{guestname}}' --performance max-workers=64

This should help us see whether the worker count matters. For the runs to be comparable, both need to read fresh data: if the second run becomes a tiny incremental backup, it will not tell us much...
 
Update with the bigger-VM test (VM 162, the original 6 TiB problem case from my first post). Cluster: 12 OSDs, 3 mons, HEALTH_OK, replication 3.

=== rbd bench (scratch image in pool_nvme_vm, deleted after) ===

| Test                                         | Throughput |
|----------------------------------------------|------------|
| read  --io-size 64K --io-threads 1  (1 GiB)  | 184 MiB/s  |
| read  --io-size 1M  --io-threads 1  (4 GiB)  | 627 MiB/s  |
| read  --io-size 1M  --io-threads 16 (4 GiB)  | 1.7 GiB/s  |
| write --io-size 4M  --io-threads 1  (4 GiB)  | 82 MiB/s   |

(rbd bench in 19.x requires --io-type explicitly.)

So Ceph itself can deliver ~1.7 GiB/s read with 16 parallel streams. The cluster is fine.

=== vzdump max-workers, small stopped VM (sparse, dedup'd) ===

VM 199, 32 GiB on ceph6 pool, stopped, 89% sparse, prior backup chunks 100% reused.

| max-workers | Elapsed | MiB/s |
|-------------|---------|-------|
| 1 | 63 s | 520 |
| 64 | 34 s | 964 |

Ratio 1.85x. (Caveat: these are rbd-side rates with full dedup, not new-data wire bandwidth.)
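(Side note: if a higher worker count had helped, I assume it could be pinned node-wide rather than per-run; untested assumption that the `performance` key is honored in `/etc/vzdump.conf` on this PVE version:)

```
# /etc/vzdump.conf -- node-wide vzdump defaults
performance: max-workers=64
```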

=== vzdump on the actual 6 TiB problem VM ===

VM 162, 6 TiB on ceph6 pool, running, no dirty bitmap available (host had been rebooted ~22h before). Tested once with max-workers=64. Aborted at 4% to free cluster IO; the rate was very stable across the four 1% intervals:

| Interval | Rate |
|------------------|------------|
| 0% -> 1% (11m51s)| 89.3 MiB/s |
| 1% -> 2% (10m57s)| 96.8 MiB/s |
| 2% -> 3% (12m41s)| 83.5 MiB/s |
| 3% -> 4% (11m37s)| 91.1 MiB/s |
| Avg over 4% | ~90 MiB/s |

Extrapolated full backup: ~19.5h.
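The extrapolation is just the sampled rate applied to the full size:

```shell
# 6 TiB at the ~90 MiB/s observed over the sampled 4%
size_mib=$(( 6 * 1024 * 1024 ))   # 6 TiB expressed in MiB
secs=$(( size_mib / 90 ))
printf '%dh %02dm\n' $(( secs / 3600 )) $(( secs % 3600 / 60 ))
# -> 19h 25m, i.e. the ~19.5h above
```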

For reference, pre-upgrade (PVE 8) the same backup ran in 2h20m -> ~770 MiB/s. So we're at ~8.5x slower, even with max-workers=64.

=== Takeaway ===

* The Ceph backend is healthy and can deliver high-bandwidth reads when parallelized at the rbd layer (1.7 GiB/s).
* On a small stopped VM with deduped chunks, max-workers helps almost 2x.
* On the real 6 TiB production case, max-workers=64 does NOT recover the regression. Backup throughput stays around 90 MiB/s -- about 1/8th of pre-upgrade speed.

So the parallelization knob isn't the dominant factor for the actual regression case. The bottleneck looks to be above the rbd client, not in the worker count.

Anything you'd like next? Some options on my side:
- librbd debug log during a slow segment of the same VM
- perf top on the source host while a slow backup runs
- QEMU block-job tracing
- testing with iothread changes on the disk
- testing on the older kernel still installed on the PBS host

Happy to gather whichever is most useful.