Backup speed regression: 2h20m → 8h39m after PVE 8→9 / QEMU 9.2→10.1 / Ceph 18→19 upgrade

portedaix

Hi,

Since upgrading on February 11, 2026, VM backup duration has jumped from 2h20m to ~8h39m for a VM with a 6TB Ceph RBD disk. The slowdown was immediate and has been reproducible on every backup since. Note: the VM is shut down every evening for power savings, so the dirty bitmap is created new every morning.


VERSIONS
- PVE: 9.1.6 (kernel 6.17.13-2-pve)
- QEMU: 10.1.2-7
- Ceph: 19.2.3-pve4
- PBS: 4.1.4 (kernel 6.17.13-2-pve)


VM SETUP
- scsi0: 64GB on NVMe-backed Ceph pool
- scsi1: 6TB on HDD-backed Ceph pool (9x HDD OSDs, 3 nodes)
- Backup to PBS with fleecing on local-zfs
- Cluster shuts down every evening → dirty bitmaps always invalidated → every backup is a full 6TB read
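For completeness, the nightly job boils down to this (a reconstruction; the PBS storage ID is a placeholder):

Code:
vzdump 162 --storage <pbs-storage> --mode snapshot --fleecing enabled=1,storage=local-zfs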


BEFORE/AFTER (from PBS task logs)

Feb 10 — last backup BEFORE upgrade (PVE 8 / QEMU 9.2 / Ceph 18):
Start: 2026-02-10T11:00:05 End: 2026-02-10T13:20:22 → 2h 20m ✓

Upgrade on Feb 11: Ceph 18→19 at 14:15, QEMU 9.2→10.1 + PVE 8→9 at 15:55

Feb 11 — first backup AFTER upgrade, started 90 minutes later:
Start: 2026-02-11T17:27:59 End: 2026-02-12T02:06:48 → 8h 39m ✗

Both backups show 99% deduplication — identical data volume.


KEY OBSERVATION
During backup, vzdump reads the 6TB disk at a consistent 100-135 MiB/s.
Direct rbd bench on the same volume: 780 MiB/s (16 threads, 4MB blocks).
That is a 5-6x gap between raw Ceph read speed and the backup pipeline.
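For reference, the bench invocation behind that 780 MiB/s figure was essentially this (reconstructed; the --io-total value is from memory):

Code:
rbd bench --io-type read --io-size 4M --io-threads 16 --io-total 10G <pool>/<volume>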


WHAT I HAVE RULED OUT
- PBS kernel 6.17.2 TCP regression: running 6.17.13, not affected
- Network: bond0 is 20 Gbps, iperf confirms full speed
- Ceph health: HEALTH_OK, all 12 OSDs up
- PG distribution: well balanced, ~28 PGs primary per HDD OSD
- mclock miscalibration: 3 OSDs had low osd_mclock_max_capacity_iops_hdd (239-286), corrected to 478 (commands sketched below) — no effect on backup speed
- osd.6 missing NVMe DB/WAL: fixed via ceph-bluestore-tool bluefs-bdev-migrate (also sketched below) — no effect on backup speed
- detect_zeroes=off on scsi1: tested, no change
- Fleecing disabled: tested without --fleecing, no change
- rbd_cache=false: tested, no change
- rbd_read_from_replica_policy: already 'default'
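For reproducibility, the mclock and DB/WAL fixes above went roughly like this. OSD IDs other than osd.6 and the device paths are illustrative placeholders, not the exact ones used:

Code:
# raise the miscalibrated mclock capacity on each affected OSD (osd.3 as an example ID)
ceph config set osd.3 osd_mclock_max_capacity_iops_hdd 478

# osd.6: attach an NVMe DB device, then migrate BlueFS data onto it (OSD stopped)
systemctl stop ceph-osd@6
ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-6 --dev-target /dev/nvme0n1p3
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-6 \
    --devs-source /var/lib/ceph/osd/ceph-6/block --dev-target /var/lib/ceph/osd/ceph-6/block.db
systemctl start ceph-osd@6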

The 5-6x gap between rbd bench (780 MiB/s) and the vzdump read speed (~120 MiB/s), appearing immediately after the upgrade, makes me suspect a regression in the QEMU backup block job or in libproxmox-backup-qemu in QEMU 10.x / PVE 9.x. Does that sound plausible?

Has anyone else seen this? Any suggestions welcome.
 


Hi,

Thank you for the output!

Could you please share the output of `pveversion -v` from Proxmox VE, the output of `proxmox-backup-manager versions --verbose` from the Proxmox Backup Server, and the storage config, i.e. `/etc/pve/storage.cfg`, from the PVE side?

Have you tried booting an older kernel on the PBS or PVE side?
Are all VMs slow to back up, or only the VMs stored on Ceph?
How does restore speed look?
 
Hi Moayad, thanks for your reply. Below are the requested data plus two controlled tests that strongly narrow down the issue.

1. Versions (current state)

PVE cluster nodes (ren6 / ren7 / ren9):
  • pve-manager: 9.1.7, pve-qemu-kvm: 10.1.2-7, qemu-server: 9.1.6
  • libproxmox-backup-qemu0: 2.0.2, proxmox-backup-client: 4.1.5-1
  • kernel 6.17.13-2-pve, Ceph 19.2.3

Standalone PVE node (ren01, isolated, hosts only the firewall VM):
  • pve-manager: 9.1.9, pve-qemu-kvm: 10.1.2-7, qemu-server: 9.1.8
  • libproxmox-backup-qemu0: 2.0.2, proxmox-backup-client: 4.1.5-1
  • kernel 6.17.13-2-pve, storage = local LVM-thin (no Ceph)

PBS server (ren2):
  • proxmox-backup-server: 4.1.7 (4.1.10-1 installed, running 4.1.7), kernel 6.17.13-3-pve
  • Datastore: ZFS mirror with SLOG and special device

2. Older PVE kernel test (already done)

Installed proxmox-kernel-6.5.13-6-pve-signed and proxmox-kernel-6.8.12-15-pve-signed and rebooted cluster nodes onto them. Backup speed of VM 162 (Ceph RBD) did not change — same ~100-135 MiB/s, same ~9h total. Removed those kernels two weeks ago.

PBS-side kernel rollback: not yet tested. Can do this if you think it's worth it (PBS still has 6.8.12-9 installed).
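If you want that data point, the plan would be to pin the older kernel on the PBS host like this (assuming the host boots via proxmox-boot-tool and the ABI name matches what `proxmox-boot-tool kernel list` shows):

Code:
proxmox-boot-tool kernel pin 6.8.12-9-pve
reboot
# and to return to the default afterwards:
proxmox-boot-tool kernel unpin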

3. All VMs on Ceph are slow, VM on LVM-thin is fast — controlled test

Code:
VM                     Storage               Mode      Bitmap        Read speed
---------------------  --------------------  --------  ------------  ------------------------
VM 162 (omv6, 6T)      Ceph RBD ceph6+nvme   snapshot  created new   100-135 MiB/s
VM 161 (deb6, 40G)     Ceph RBD              snapshot  created new   same range
VM 101 (pfsense, 32G)  local-lvm (LVM-thin)  snapshot  created new   390 MiB/s avg (450 peak)

Same QEMU 10.1.2-7, same libproxmox-backup-qemu0 2.0.2, same proxmox-backup-client 4.1.5-1, same kernel 6.17.13-2-pve, same target PBS, same network, same backup mode (snapshot), same bitmap state (created new). Only difference is the storage backend.

Combined with my earlier rbd bench numbers (780 MiB/s, 16 threads, 4M blocks), Ceph itself is not the bottleneck. The slowdown appears to be in the librbd <-> pve-backup-stream interaction in QEMU 10.1, not in the generic backup pipeline.

4. Restore speed

Restored VM 161 from latest snapshot to a throwaway VMID on ceph-nvme:
  • Total: 40 GiB transferred in 363.22s = 112.77 MB/s (with 48% zero blocks skipped on the PBS side)
  • During the actual data-write phase (after zero-skip), throughput dropped to ~17-30 MB/s on the non-zero portion

Restore is as slow as backup: the bottleneck affects the PBS <-> Ceph RBD path in both directions. Same versions, same nodes, same Ceph cluster.

What I'd appreciate next

If there is a way to enable librbd debug logging or QEMU block-job tracing during a snapshot backup so I can capture per-IO timings, please tell me which knob — happy to provide traces. Also open to trying any specific block-driver tunable on QEMU side (aio, iothread, cache, queue depth, etc.) you'd suspect.
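For the librbd side, my best guess at a starting point (an assumption on my part, happy to use whatever knob you prefer) would be raising client log levels in the [client] section of /etc/pve/ceph.conf before starting the backup, then reverting:

Code:
[client]
    debug rbd = 20/20
    debug objecter = 10/10
    log file = /var/log/ceph/qemu-client-$pid.log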
 
Thank you for the test and the output!

At this point, this looks more like the Ceph RBD access path on the PVE side than a generic PBS-side issue. Before going deeper into tracing, could you please test one non-critical Ceph-backed VM with `krbd` enabled and then rerun one backup there?

If you do not want to change the existing storage globally, the safer variant is to create a temporary second RBD storage entry pointing to the same pool with `krbd 1` and test with one non-critical VM on that storage ID.

After that test, please post the output of the following commands:
Code:
qm showcmd <vmid> --pretty
rbd showmapped

together with the backup task log.
 
Hi Moayad, thanks for the suggestion. Done, results below. Interestingly, krbd does not help.

Setup

Created a temporary RBD storage pointing to the same Ceph pool with krbd 1:
Code:
rbd: ceph-nvme-krbd
    content images
    krbd 1
    pool pool_nvme_vm

Then full-cloned VM 161 (deb6, 40 GiB Linux Debian, the smallest VM I have on Ceph that's still meaningful) onto that storage as VM 701. Note: PVE clone failed first with rbd: image vm-701-disk-0: image uses unsupported features: 0x40 (journaling, not supported by krbd). I worked around it by setting rbd config pool set pool_nvme_vm rbd_default_features 61 (= layering+exclusive-lock+object-map+fast-diff+deep-flatten, no journaling) before the clone, then restored the default after. Worth noting that PVE's default RBD image features include journaling on this cluster — that may bite other people who try the same workaround.
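Spelled out, the workaround sequence was essentially this (reconstructed from my shell history; the clone line is from memory):

Code:
rbd config pool set pool_nvme_vm rbd_default_features 61
qm clone 161 701 --full 1 --storage ceph-nvme-krbd --name deb6-krbd-test
rbd config pool remove pool_nvme_vm rbd_default_features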

qm showcmd 701 --pretty

Code:
/usr/bin/kvm \
  -id 701 \
  -name 'deb6-krbd-test,debug-threads=on' \
  ...
  -object 'iothread,id=iothread-virtioscsi0' \
  -object '{"id":"throttle-drive-scsi0","limits":{},"qom-type":"throttle-group"}' \
  ...
  -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' \
  -blockdev '{"detect-zeroes":"on","discard":"ignore","driver":"throttle","file":{"cache":{"direct":true,"no-flush":false},"detect-zeroes":"on","discard":"ignore","driver":"raw","file":{"aio":"io_uring","cache":{"direct":true,"no-flush":false},"detect-zeroes":"on","discard":"ignore","driver":"host_device","filename":"/dev/rbd-pve/b603a1d4-026f-4ff3-812a-5edcc9759980/pool_nvme_vm/vm-701-disk-0","node-name":"e2af3871ba7d5abf43f2c5e052b7d8a","read-only":false},"node-name":"f2af3871ba7d5abf43f2c5e052b7d8a","read-only":false},"node-name":"drive-scsi0","read-only":false,"throttle-group":"throttle-drive-scsi0"}' \
  -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,device_id=drive-scsi0,bootindex=100,write-cache=on' \
  ...
  -machine 'hpet=off,type=pc+pve0'

So QEMU is using driver=host_device on /dev/rbd-pve/.../vm-701-disk-0 with aio=io_uring, iothread, cache.direct=true, no librbd in this path.

rbd showmapped

Code:
id  pool          namespace  image          snap  device
0   pool_nvme_vm             vm-701-disk-0  -     /dev/rbd0

Backup task log (truncated to key lines)

Code:
INFO: Starting Backup of VM 701 (qemu)
INFO: VM Name: deb6-krbd-test
INFO: include disk 'scsi0' 'ceph-nvme-krbd:vm-701-disk-0' 40G
INFO: backup mode: snapshot
INFO: scsi0: dirty-bitmap status: created new
INFO:   5% (2.0 GiB of 40.0 GiB) in 48s,  read: 43.1 MiB/s,  write: 43.1 MiB/s
INFO:  10% (4.0 GiB of 40.0 GiB) in 1m 38s, read: 21.9 MiB/s,  write: 21.9 MiB/s
INFO:  20% (8.1 GiB of 40.0 GiB) in 2m 58s, read: 66.9 MiB/s,  write: 66.9 MiB/s
INFO:  30% (12.2 GiB of 40.0 GiB) in 4m 12s, read: 117.3 MiB/s, write: 117.3 MiB/s
INFO:  40% (..., zero-skip burst, read 2.7 GiB/s effective)
INFO:  80% (32.2 GiB of 40.0 GiB) in 4m 43s, read: 160.0 MiB/s, write: 74.7 MiB/s
INFO:  85% (34.0 GiB of 40.0 GiB) in 5m 32s, read: 34.5 MiB/s,  write: 34.5 MiB/s
INFO:  90% (36.0 GiB of 40.0 GiB) in 6m 21s, read: 36.4 MiB/s,  write: 36.4 MiB/s
INFO:  95% (38.0 GiB of 40.0 GiB) in 7m 20s, read: 30.2 MiB/s,  write: 30.2 MiB/s
INFO: 100% (40.0 GiB of 40.0 GiB) in 7m 27s, read: 772.0 MiB/s, write: 0 B/s
INFO: backup is sparse: 19.43 GiB (48%) total zero data
INFO: backup was done incrementally, reused 19.74 GiB (49%)
INFO: transferred 40.00 GiB in 447 seconds (91.6 MiB/s)
INFO: Finished Backup of VM 701 (00:07:36)

Comparison summary

Code:
Test                                    Client   Read avg       Notes
--------------------------------------  -------  -------------  ----------------------
VM 162 (omv6, 6T) Ceph RBD              librbd   100-135 MiB/s  6T full read, ~9h
VM 161 (deb6, 40G) Ceph RBD restore     librbd   113 MiB/s      PBS->Ceph (restore)
VM 701 (clone of 161) ceph-nvme-krbd    krbd     91.6 MiB/s     THIS test
VM 101 (pfsense, 32G) local LVM-thin    n/a      390 MiB/s      on isolated PVE node

Conclusion from this test

krbd does not noticeably change the throughput. Both librbd and krbd are stuck in the same ~90-130 MiB/s band on this cluster. The throughput on a node using local LVM-thin (same QEMU 10.1.2-7, same proxmox-backup-client, same target PBS) is ~3-4x higher.

So the bottleneck is not in librbd specifically; it sits somewhere common to both RBD paths, in one of:
  • The pve-backup-stream / libproxmox-backup-qemu0 pipeline (would explain why local LVM is also fast — but how does it depend on the storage backend then?)
  • The Ceph network path (cluster network is 10G X520 active-backup over a dedicated VLAN, rbd bench direct on the same RBD image gives 780 MiB/s with 16 threads / 4M blocks, so raw bandwidth is fine)
  • A small-IO read pattern that the backup pipeline uses, where Ceph RBD adds round-trip latency that LVM-thin doesn't have. That would also be consistent with krbd not helping (kernel client still has to round-trip to OSDs).
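Back-of-envelope for that last hypothesis (my arithmetic, assuming the backup job keeps roughly one 4 MiB request in flight per disk):

Code:
~90 MiB/s observed / 4 MiB per request  ->  ~44 ms per request
16 requests in flight at the same latency would still allow ~1.4 GiB/s,
which is why rbd bench with 16 threads (780 MiB/s measured) never sees the stall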

Happy to capture more data if you can suggest a probe — for example:
  • Specific QEMU block-job tracing knob during the backup
  • ceph daemon osd.X perf dump before/during/after to see queue depth on the primary OSDs
  • Per-IO timing via rbd bench --io-size 64K --io-threads 1 (small-IO single-thread, closer to what backup might be doing) — happy to run this if relevant

Or if you have a known knob on the QEMU block-driver side that would change how backup IO sequences itself against RBD (queue depth, batch size, alignment), I'll try it.
 
Thank you for the test!

Could you please run the following read-only benchmarks on the same RBD image used for VM 701?

Bash:
rbd bench --io-type read --io-size 64K --io-threads 1 --io-total 1G pool_nvme_vm/vm-701-disk-0
rbd bench --io-type read --io-size 1M  --io-threads 1 --io-total 4G pool_nvme_vm/vm-701-disk-0
rbd bench --io-type read --io-size 1M  --io-threads 16 --io-total 4G pool_nvme_vm/vm-701-disk-0
rbd bench --io-type read --io-size 4M  --io-threads 1 --io-total 4G pool_nvme_vm/vm-701-disk-0

If practical, one more backup test with a fresh or full run of the same VM 701 and a different worker count would also be useful:
Bash:
vzdump 701 --storage <pbs-storage> --mode snapshot --remove 0 --notes-template '{{guestname}}' --performance max-workers=1
and, if you can make it a comparable fresh run again:
Bash:
vzdump 701 --storage <pbs-storage> --mode snapshot --remove 0 --notes-template '{{guestname}}' --performance max-workers=64

This should help us see whether the worker count matters. For that, the two runs need to be comparable: if the second run becomes a tiny incremental backup, it will not tell us much...
 
Update with the bigger-VM test (VM 162, the original 6 TiB problem case from my first post). Cluster: 12 OSDs, 3 mon, HEALTH_OK, replication 3.

=== rbd bench (scratch image in pool_nvme_vm, deleted after) ===

| Test                                      | Throughput |
|-------------------------------------------|------------|
| read  --io-size 64K --io-threads 1  -t 1G | 184 MiB/s  |
| read  --io-size 1M  --io-threads 1  -t 4G | 627 MiB/s  |
| read  --io-size 1M  --io-threads 16 -t 4G | 1.7 GiB/s  |
| write --io-size 4M  --io-threads 1  -t 4G | 82 MiB/s   |

(rbd bench in 19.x requires --io-type explicitly.)

So Ceph itself can deliver ~1.7 GiB/s read with 16 parallel streams. The cluster is fine.

=== vzdump max-workers, small stopped VM (sparse, dedup'd) ===

VM 199, 32 GiB on ceph6 pool, stopped, 89% sparse, prior backup chunks 100% reused.

| max-workers | Elapsed | MiB/s |
|-------------|---------|-------|
| 1 | 63 s | 520 |
| 64 | 34 s | 964 |

Ratio 1.85x. (Caveat: these are rbd-side rates with full dedup, not new-data wire bandwidth.)

=== vzdump on the actual 6 TiB problem VM ===

VM 162, 6 TiB on ceph6 pool, running, no dirty-bitmap available (host had been rebooted ~22h before). Tested once with max-workers=64. Aborted at 4% to free cluster IO; the rate was very stable across the four 1% intervals:

| Interval          | Rate       |
|-------------------|------------|
| 0% -> 1% (11m51s) | 89.3 MiB/s |
| 1% -> 2% (10m57s) | 96.8 MiB/s |
| 2% -> 3% (12m41s) | 83.5 MiB/s |
| 3% -> 4% (11m37s) | 91.1 MiB/s |
| Avg over 4%       | ~90 MiB/s  |

Extrapolated full backup: ~19.5h.

For reference, pre-upgrade (PVE 8) the same backup ran in 2h20m -> ~770 MiB/s. So we're at ~8.5x slower, even with max-workers=64.

=== Takeaway ===

* The Ceph backend is healthy and can deliver high-bandwidth reads when parallelized at the rbd layer (1.7 GiB/s).
* On a small stopped VM with deduped chunks, max-workers helps almost 2x.
* On the real 6 TiB production case, max-workers=64 does NOT recover the regression. Backup throughput stays around 90 MiB/s -- about 1/8th of pre-upgrade speed.

So the parallelization knob isn't the dominant factor for the actual regression case. The bottleneck looks like it sits above the rbd client, not in the worker count.

Anything you'd like next? Some options on my side:
- librbd debug log during a slow segment of the same VM,
- perf top on the source host while a slow backup runs (capture sketch below),
- QEMU block-job tracing,
- testing with iothread changes on the disk,
- testing on the older kernel still installed on the PBS host.

Happy to gather whichever is most useful.
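For the perf-top option, this is what I'd capture unless you prefer something else (a sketch; the pgrep pattern and sampling interval are my guesses):

Code:
# CPU profile of the QEMU process while a slow segment runs (VM 162's kvm process)
perf top -p $(pgrep -f 'kvm -id 162')

# OSD latency snapshot every 5 s alongside it
watch -n 5 ceph osd perf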
 
Thank you for the output!

The `rbd bench` numbers are from a scratch image. That confirms the pool/cluster can deliver high throughput, but it does not fully rule out an issue specific to the actual VM image's object layout. We recently had a case where pool-level tests looked fine but the affected VM image itself benchmarked badly; recreating that RBD image fixed the backup speed. So before going into QEMU tracing, could you please benchmark the actual VM 162 RBD image directly? First identify the disk image from the output of `qm config 162`, then for the main RBD image, please run the following:


Bash:
rbd info <pool>/<image>
rbd status <pool>/<image>
rbd snap ls <pool>/<image>
rbd bench --io-type read --io-size 64K --io-threads 1 --io-total 1G <pool>/<image>
rbd bench --io-type read --io-size 1M --io-threads 1 --io-total 4G <pool>/<image>
rbd bench --io-type read --io-size 1M --io-threads 16 --io-total 4G <pool>/<image>
rbd bench --io-type read --io-size 4M --io-threads 1 --io-total 4G <pool>/<image>

These are read-only, but they will add read load to the VM image. If those numbers are also good on the actual VM image, then the next useful data would be from the source host during a slow backup segment:

Bash:
ceph -s
ceph osd perf
ceph osd pool stats

This should tell us whether the slowdown is already visible on the actual RBD image or only inside the live QEMU backup job.
 
Hi Moayad — image-level results below. Short version: you nailed it. The image itself benches dramatically worse than a scratch image on the same pool, especially in single-thread small/medium I/O.

Image identification

Code:
qm config 162 | grep ^scsi
scsi1: ceph6:vm-162-disk-1,iothread=1,size=6T   <- the slow one

rbd info ceph6/vm-162-disk-1

Code:
rbd image 'vm-162-disk-1':
    size 6 TiB in 1572864 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 391f9debb88
    block_name_prefix: rbd_data.391f9debb88
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    op_features:
    flags:
    create_timestamp: Mon Dec  9 00:52:26 2024
    access_timestamp: Mon May  4 21:22:20 2026
    modify_timestamp: Mon May  4 22:02:40 2026

rbd status ceph6/vm-162-disk-1

Code:
Watchers:
    watcher=10.10.10.60:0/4149010201 client.37842145 cookie=129470446980176

(single watcher = host running VM 162, exclusive-lock held)

rbd snap ls ceph6/vm-162-disk-1 : empty (no snapshots)

Image was created 2024-12-09 under Ceph 18 (Reef) / QEMU 9.2 / PVE 8 and has been written to continuously since. The cluster was upgraded to Ceph 19 (Squid) / QEMU 10.1 / PVE 9 on 2026-02-11. Backup time jumped from 2h20m -> 8h39m -> 19.5h on this same image, no other change.

Bench results on the actual VM image (VM running, exclusive-lock held)

Code:
Test                                             | Image vm-162-disk-1 | Scratch, same pool (post #7) | Ratio
-------------------------------------------------|---------------------|------------------------------|-------
read --io-size 64K --io-threads  1 --io-total 1G |   69 MiB/s          |  184 MiB/s                   | 0.37x
read --io-size  1M --io-threads  1 --io-total 4G |   72 MiB/s          |  627 MiB/s                   | 0.11x
read --io-size  1M --io-threads 16 --io-total 4G |  814 MiB/s          | 1700 MiB/s                   | 0.48x
read --io-size  4M --io-threads  1 --io-total 4G |   78 MiB/s          | (not tested previously)      |  -

Striking pattern: on the scratch image, going from 64K to 1M single-thread gives a 3.4x speedup (184 -> 627 MiB/s). On the actual VM image, the same change gives 1.04x (69 -> 72 MiB/s). The single-thread ceiling is hard at ~75 MiB/s regardless of I/O size - classic signature of a per-object latency bottleneck, not bandwidth or I/O coalescing. Adding 16 parallel threads roughly hides it (814 MiB/s) but the QEMU backup pipeline doesn't fan out enough to recover, which matches what I saw earlier with max-workers=64 (~90 MiB/s, no improvement).

This is exactly the scenario you described: pool is fine, scratch image is fine on the same pool, but this specific RBD image benches very badly. The image was created and continuously written under Reef and lived through the migration to Squid without being rewritten - strongly suggests its on-disk object layout (BlueStore allocation, RocksDB onode placement, possibly compression state) is "stale" relative to the Squid read path.
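Before recreating, two probes I can capture for the regression hunt, if you want the data point (my assumption that these commands behave the same on Squid; osd.3 stands in for a primary OSD of this image):

Code:
# BlueStore allocator fragmentation score on a primary OSD for this image
ceph daemon osd.3 bluestore allocator score block

# sanity check on one data object of the image (prefix from rbd info above)
rados -p ceph6 stat rbd_data.391f9debb88.0000000000000000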

Cluster state at end of benches (clean)

Code:
ceph -s
  cluster:
    health: HEALTH_OK
  data:
    pools:   3 pools, 289 pgs
    objects: 991.86k objects, 3.8 TiB
    usage:   12 TiB used, 24 TiB / 36 TiB avail
    pgs:     289 active+clean
  io:
    client: 72 MiB/s rd, 223 KiB/s wr, 33 op/s rd, 27 op/s wr

ceph osd perf : commit_latency and apply_latency = 0-1 ms across all 12 OSDs.

Question - recreate procedure

Given the image is 6 TiB and the VM hosts the production NFS/SMB filer, what is your recommended procedure to recreate it?

Options I'm considering:

a) qm move-disk 162 scsi1 ceph-nvme --delete 1, then qm move-disk 162 scsi1 ceph6 --delete 1. Live, no downtime, but doubles the read pressure for the duration of the round-trip and needs ~6 TiB temporary headroom on ceph-nvme (I have it).

b) Stop VM -> rbd cp ceph6/vm-162-disk-1 ceph6/vm-162-disk-1-new -> rename -> start VM (spelled out at the end of this post). Faster (single copy pass) but requires downtime.

c) rbd deep-cp instead of rbd cp - does that produce a fresh BlueStore layout, or is it functionally equivalent to cp for our purposes?

Any preference, or other recommended path? Also, is there any diagnostic you'd want me to capture before the recreate, in case the recreate fixes it and you want to keep the data point for the regression search?