[SOLVED] Cannot backup 3.6TB Ceph guest - IOWait through the roof

chrispage1

I have a 3.6TB guest (running on Ceph) which was backing up fine. We upgraded our nodes on Saturday, and now I'm having to cancel the backup because it makes the guest unresponsive while it runs. From the log it looks like it's having to rebuild from scratch (the dirty bitmaps are created new), but I thought this wouldn't be an issue.


Code:
INFO: starting new backup job: vzdump 111 --storage office --remove 0 --mode snapshot --node pve02 --notes-template '{{guestname}}'
INFO: Starting Backup of VM 111 (qemu)
INFO: Backup started at 2022-05-08 06:55:18
INFO: status = running
INFO: VM Name: shared2
INFO: include disk 'scsi0' 'ceph_data:vm-111-disk-3' 100G
INFO: include disk 'scsi1' 'ceph_data:vm-111-disk-0' 100G
INFO: include disk 'scsi2' 'ceph_data:vm-111-disk-1' 3500G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/111/2022-05-08T05:55:18Z'
INFO: started backup task 'bd1779ec-4357-4929-9bd7-5180d059a2b0'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: created new
INFO: scsi1: dirty-bitmap status: created new
INFO: scsi2: dirty-bitmap status: created new
INFO:   0% (244.0 MiB of 3.6 TiB) in 3s, read: 81.3 MiB/s, write: 12.0 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 111 failed - interrupted by signal
INFO: Failed at 2022-05-08 06:59:21
ERROR: Backup job failed - interrupted by signal
TASK ERROR: interrupted by signal

Here is the IOWait of our guest during backup:

View attachment 36681

And the strain isn't particularly excessive on Ceph either:

View attachment 36682

I thought the whole idea of the Proxmox backup is that it doesn't have to talk to the guest and thus doesn't interfere? Any help is appreciated!

Chris.
 
a VM backup always sits between the guest I/O and the storage providing the guest images, copying the old data to the backup before allowing the write through to the regular image. so a slow backup target can definitely slow down guest I/O.
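
To illustrate the copy-before-write behaviour described above, here is a toy model in Python. It is only a sketch: the block size, the image write time and the target speeds are made-up numbers for illustration (12 MiB/s simply echoes the write rate in the first log), and the function name is hypothetical.

```python
# Toy model of copy-before-write during a snapshot-mode backup.
# All numbers are illustrative assumptions, not measurements.

def guest_write_latency_ms(block_mib, target_mib_per_s, image_write_ms, already_backed_up):
    """Latency the guest sees for one write to a single block."""
    if already_backed_up:
        # The old data is already in the backup, so the write goes straight through.
        return image_write_ms
    # Otherwise the old block must reach the backup target first.
    copy_ms = block_mib / target_mib_per_s * 1000.0
    return copy_ms + image_write_ms

BLOCK_MIB = 4  # assumed block size for this illustration

fast = guest_write_latency_ms(BLOCK_MIB, 200.0, 2.0, already_backed_up=False)
slow = guest_write_latency_ms(BLOCK_MIB, 12.0, 2.0, already_backed_up=False)
done = guest_write_latency_ms(BLOCK_MIB, 12.0, 2.0, already_backed_up=True)

print(f"fast target: {fast:.1f} ms, slow target: {slow:.1f} ms, block already copied: {done:.1f} ms")
```

In this model a write to a block that has not been backed up yet waits on the backup target, which is why the guest can stall even when Ceph itself is not busy.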

what kind of performance do you get when doing a benchmark to PBS from your PVE host?
 
Thanks for your response. We're backing up to a spinning rust off-site storage so it won't be anything fast. Is there any way I can work around this, i.e. back up locally and then remotely? We need remote backups as part of our backup strategy.

Thanks,
Chris.
 
yeah, you can have a fast PBS with less space available locally and sync that to a bigger, slower remote (or rather, have that remote sync from the local fast PBS). this would decouple the backup and the sync:

https://pbs.proxmox.com/docs/managing-remotes.html

you just need to pay attention to pruning - you want your local PBS to keep enough copies so that the sync has time to pick each snapshot up before it gets pruned.
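
A rough way to sanity-check the pruning point above, as a Python sketch. It assumes a simple keep-last style retention with evenly spaced backups and ignores how long the sync itself takes. The function and parameter names are hypothetical, not PBS options.

```python
def snapshot_at_risk(backup_interval_h, keep_last, sync_interval_h):
    """Return True if a local snapshot could be pruned before a sync picks it up."""
    # With keep-last N and one backup every backup_interval_h hours, a snapshot
    # stays on the local PBS for roughly N * backup_interval_h hours.
    local_lifetime_h = keep_last * backup_interval_h
    # The sync has to run at least once inside that window.
    return local_lifetime_h <= sync_interval_h

# Daily backups, daily sync: keeping a single copy locally is too tight,
# keeping a week of copies leaves plenty of room.
print(snapshot_at_risk(backup_interval_h=24, keep_last=1, sync_interval_h=24))
print(snapshot_at_risk(backup_interval_h=24, keep_last=7, sync_interval_h=24))
```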
 
Thanks for your help - I'll take a look at this. Luckily we have enough local storage for this at the minute :)
 
Hi @fabian,

I have finally managed to get around to running an on-site PBS that syncs off to remote. Unfortunately, we're still getting high IOWait and unresponsive services as a result, meaning I'm unable to complete the backup. I don't suppose you can offer any more insight?

In terms of IO delay on the backup server, it's sitting between 0% and 0.5% when running backups and CPU usage ~15%. Below is an example of a 40G VM being backed up locally -

Code:
INFO: scsi0: dirty-bitmap status: created new
INFO:   2% (920.0 MiB of 40.0 GiB) in 3s, read: 306.7 MiB/s, write: 134.7 MiB/s
INFO:   3% (1.5 GiB of 40.0 GiB) in 6s, read: 192.0 MiB/s, write: 162.7 MiB/s
INFO:   4% (1.9 GiB of 40.0 GiB) in 9s, read: 150.7 MiB/s, write: 150.7 MiB/s
INFO:   6% (2.4 GiB of 40.0 GiB) in 12s, read: 174.7 MiB/s, write: 174.7 MiB/s
INFO:   7% (2.9 GiB of 40.0 GiB) in 15s, read: 170.7 MiB/s, write: 168.0 MiB/s
INFO:   8% (3.4 GiB of 40.0 GiB) in 18s, read: 177.3 MiB/s, write: 161.3 MiB/s
INFO:   9% (3.9 GiB of 40.0 GiB) in 21s, read: 146.7 MiB/s, write: 146.7 MiB/s
INFO:  10% (4.3 GiB of 40.0 GiB) in 24s, read: 154.7 MiB/s, write: 154.7 MiB/s
INFO:  12% (4.8 GiB of 40.0 GiB) in 27s, read: 180.0 MiB/s, write: 180.0 MiB/s
INFO:  13% (5.3 GiB of 40.0 GiB) in 30s, read: 160.0 MiB/s, write: 160.0 MiB/s
INFO:  14% (5.8 GiB of 40.0 GiB) in 33s, read: 162.7 MiB/s, write: 144.0 MiB/s
INFO:  15% (6.3 GiB of 40.0 GiB) in 36s, read: 160.0 MiB/s, write: 160.0 MiB/s
INFO:  18% (7.6 GiB of 40.0 GiB) in 39s, read: 442.7 MiB/s, write: 204.0 MiB/s
INFO:  19% (8.0 GiB of 40.0 GiB) in 42s, read: 150.7 MiB/s, write: 149.3 MiB/s
INFO:  21% (8.4 GiB of 40.0 GiB) in 45s, read: 149.3 MiB/s, write: 148.0 MiB/s
INFO:  22% (8.9 GiB of 40.0 GiB) in 48s, read: 169.3 MiB/s, write: 161.3 MiB/s
INFO:  24% (9.6 GiB of 40.0 GiB) in 51s, read: 234.7 MiB/s, write: 165.3 MiB/s
INFO:  25% (10.1 GiB of 40.0 GiB) in 54s, read: 170.7 MiB/s, write: 170.7 MiB/s
INFO:  26% (10.6 GiB of 40.0 GiB) in 57s, read: 154.7 MiB/s, write: 144.0 MiB/s
INFO:  29% (11.8 GiB of 40.0 GiB) in 1m, read: 409.3 MiB/s, write: 172.0 MiB/s
INFO:  32% (12.9 GiB of 40.0 GiB) in 1m 3s, read: 397.3 MiB/s, write: 184.0 MiB/s
INFO:  34% (13.7 GiB of 40.0 GiB) in 1m 6s, read: 254.7 MiB/s, write: 166.7 MiB/s
INFO:  36% (14.4 GiB of 40.0 GiB) in 1m 9s, read: 256.0 MiB/s, write: 186.7 MiB/s
INFO:  38% (15.4 GiB of 40.0 GiB) in 1m 12s, read: 348.0 MiB/s, write: 174.7 MiB/s
INFO:  40% (16.3 GiB of 40.0 GiB) in 1m 15s, read: 281.3 MiB/s, write: 173.3 MiB/s
INFO:  43% (17.6 GiB of 40.0 GiB) in 1m 18s, read: 440.0 MiB/s, write: 169.3 MiB/s
INFO:  46% (18.5 GiB of 40.0 GiB) in 1m 21s, read: 336.0 MiB/s, write: 198.7 MiB/s
INFO:  48% (19.6 GiB of 40.0 GiB) in 1m 24s, read: 342.7 MiB/s, write: 213.3 MiB/s
INFO:  50% (20.0 GiB of 40.0 GiB) in 1m 27s, read: 154.7 MiB/s, write: 154.7 MiB/s
INFO:  51% (20.5 GiB of 40.0 GiB) in 1m 30s, read: 156.0 MiB/s, write: 150.7 MiB/s
INFO:  52% (21.2 GiB of 40.0 GiB) in 1m 33s, read: 237.3 MiB/s, write: 170.7 MiB/s
INFO:  54% (21.8 GiB of 40.0 GiB) in 1m 36s, read: 230.7 MiB/s, write: 186.7 MiB/s
INFO:  57% (23.0 GiB of 40.0 GiB) in 1m 39s, read: 389.3 MiB/s, write: 190.7 MiB/s
INFO:  58% (23.5 GiB of 40.0 GiB) in 1m 42s, read: 192.0 MiB/s, write: 165.3 MiB/s
INFO:  60% (24.2 GiB of 40.0 GiB) in 1m 45s, read: 232.0 MiB/s, write: 164.0 MiB/s
INFO:  62% (24.8 GiB of 40.0 GiB) in 1m 48s, read: 216.0 MiB/s, write: 166.7 MiB/s
INFO:  63% (25.5 GiB of 40.0 GiB) in 1m 51s, read: 216.0 MiB/s, write: 173.3 MiB/s
INFO:  65% (26.2 GiB of 40.0 GiB) in 1m 54s, read: 229.3 MiB/s, write: 169.3 MiB/s
INFO:  67% (26.8 GiB of 40.0 GiB) in 1m 57s, read: 228.0 MiB/s, write: 118.7 MiB/s
INFO:  68% (27.3 GiB of 40.0 GiB) in 2m, read: 180.0 MiB/s, write: 160.0 MiB/s
INFO:  69% (27.7 GiB of 40.0 GiB) in 2m 3s, read: 136.0 MiB/s, write: 136.0 MiB/s
INFO:  70% (28.2 GiB of 40.0 GiB) in 2m 6s, read: 153.3 MiB/s, write: 152.0 MiB/s
INFO:  71% (28.7 GiB of 40.0 GiB) in 2m 9s, read: 156.0 MiB/s, write: 153.3 MiB/s
INFO:  72% (29.1 GiB of 40.0 GiB) in 2m 12s, read: 142.7 MiB/s, write: 138.7 MiB/s
INFO:  74% (29.6 GiB of 40.0 GiB) in 2m 15s, read: 181.3 MiB/s, write: 152.0 MiB/s
INFO:  75% (30.2 GiB of 40.0 GiB) in 2m 18s, read: 221.3 MiB/s, write: 157.3 MiB/s
INFO:  77% (30.9 GiB of 40.0 GiB) in 2m 21s, read: 213.3 MiB/s, write: 170.7 MiB/s
INFO:  78% (31.4 GiB of 40.0 GiB) in 2m 24s, read: 194.7 MiB/s, write: 168.0 MiB/s
INFO:  79% (31.9 GiB of 40.0 GiB) in 2m 27s, read: 169.3 MiB/s, write: 168.0 MiB/s
INFO:  80% (32.4 GiB of 40.0 GiB) in 2m 30s, read: 156.0 MiB/s, write: 154.7 MiB/s
INFO:  82% (32.9 GiB of 40.0 GiB) in 2m 33s, read: 177.3 MiB/s, write: 177.3 MiB/s
INFO:  83% (33.5 GiB of 40.0 GiB) in 2m 36s, read: 186.7 MiB/s, write: 160.0 MiB/s
INFO:  85% (34.0 GiB of 40.0 GiB) in 2m 39s, read: 185.3 MiB/s, write: 176.0 MiB/s
INFO:  86% (34.5 GiB of 40.0 GiB) in 2m 42s, read: 184.0 MiB/s, write: 160.0 MiB/s
INFO:  87% (35.0 GiB of 40.0 GiB) in 2m 45s, read: 152.0 MiB/s, write: 149.3 MiB/s
INFO:  88% (35.5 GiB of 40.0 GiB) in 2m 48s, read: 185.3 MiB/s, write: 144.0 MiB/s
INFO:  90% (36.0 GiB of 40.0 GiB) in 2m 51s, read: 158.7 MiB/s, write: 154.7 MiB/s
INFO:  91% (36.6 GiB of 40.0 GiB) in 2m 54s, read: 190.7 MiB/s, write: 165.3 MiB/s
INFO:  92% (37.0 GiB of 40.0 GiB) in 2m 57s, read: 150.7 MiB/s, write: 138.7 MiB/s
INFO:  93% (37.5 GiB of 40.0 GiB) in 3m, read: 181.3 MiB/s, write: 148.0 MiB/s
INFO:  95% (38.1 GiB of 40.0 GiB) in 3m 3s, read: 178.7 MiB/s, write: 178.7 MiB/s
INFO:  96% (38.6 GiB of 40.0 GiB) in 3m 6s, read: 196.0 MiB/s, write: 124.0 MiB/s
INFO:  98% (39.3 GiB of 40.0 GiB) in 3m 9s, read: 218.7 MiB/s, write: 181.3 MiB/s
INFO:  99% (39.8 GiB of 40.0 GiB) in 3m 12s, read: 186.7 MiB/s, write: 178.7 MiB/s
INFO: 100% (40.0 GiB of 40.0 GiB) in 3m 14s, read: 94.0 MiB/s, write: 64.0 MiB/s
INFO: backup is sparse: 8.05 GiB (20%) total zero data
INFO: backup was done incrementally, reused 9.30 GiB (23%)
INFO: transferred 40.00 GiB in 194 seconds (211.1 MiB/s)

As far as our remote backup goes, I can sync the 40G backup using a sync job in about 3 minutes so it seems the performance is quite good there too.
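
The average rate in the summary line of the log above can be re-derived, and used for a back-of-the-envelope estimate for the 3.6 TiB guest. This is only a sketch: it assumes the same rate would hold for the big disk and ignores zero or reused data, so treat the estimate as a rough figure.

```python
import re

SUMMARY = "INFO: transferred 40.00 GiB in 194 seconds (211.1 MiB/s)"

def average_mib_per_s(line):
    """Recompute the average rate from a vzdump 'transferred' summary line."""
    m = re.search(r"transferred ([\d.]+) GiB in (\d+) seconds", line)
    if m is None:
        raise ValueError("not a 'transferred ... GiB in ... seconds' line")
    gib, seconds = float(m.group(1)), int(m.group(2))
    return gib * 1024 / seconds

def hours_for(size_tib, mib_per_s):
    """Time for a full pass over size_tib at a constant rate."""
    return size_tib * 1024 * 1024 / mib_per_s / 3600

rate = average_mib_per_s(SUMMARY)
print(f"average rate: {rate:.1f} MiB/s")
print(f"full pass over 3.6 TiB at that rate: {hours_for(3.6, rate):.1f} h")
```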

> what kind of performance do you get when doing a benchmark to PBS from your PVE host?

What would you recommend to use in the way of benchmarking this?
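
For reference, `proxmox-backup-client` has a `benchmark` subcommand that can be pointed at a repository from the PVE host. A minimal Python wrapper might look like the sketch below. The repository string is a made-up placeholder and the helper names are hypothetical. The script only prints the command and does not run it by itself.

```python
import shlex
import shutil
import subprocess

def benchmark_command(repository):
    """Build the argv for a PBS client benchmark against one repository."""
    return ["proxmox-backup-client", "benchmark", "--repository", repository]

def run_benchmark(repository):
    """Run the benchmark if the client is installed; return its exit code or None."""
    cmd = benchmark_command(repository)
    if shutil.which(cmd[0]) is None:
        return None  # client not installed on this machine
    return subprocess.run(cmd, check=False).returncode

# Placeholder repository in the usual user@realm@host:datastore form.
REPO = "backup@pbs@pbs.example.com:datastore1"
print(shlex.join(benchmark_command(REPO)))
```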

Thanks,
Chris.
 
Just to confirm this issue has been resolved by changing the storage to use KRBD.

Although, I'm not quite sure why this should be the case?
 
see my reply in the other thread - qemu 6.2 had a regression for reading via librbd that affected bigger disks - the most recent pve-qemu-kvm version (6.2.0-8) contains the fix (well, revert of the problematic changes).
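
To check whether a node already has that package version, the `pve-qemu-kvm` line from `pveversion -v` can be compared against 6.2.0-8. The sketch below uses a simplified numeric comparison, not full Debian version ordering, and the function names are hypothetical.

```python
import re

FIXED = (6, 2, 0, 8)  # pve-qemu-kvm 6.2.0-8, the version named in the reply above

def parse_version(text):
    """Extract (major, minor, patch, revision) from e.g. 'pve-qemu-kvm: 6.2.0-8'."""
    m = re.search(r"(\d+)\.(\d+)\.(\d+)-(\d+)", text)
    if m is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in m.groups())

def at_least_fixed_version(version_line):
    """True if the installed package is 6.2.0-8 or newer."""
    return parse_version(version_line) >= FIXED

for line in ("pve-qemu-kvm: 6.2.0-5", "pve-qemu-kvm: 6.2.0-8", "pve-qemu-kvm: 6.2.0-10"):
    print(line, "->", at_least_fixed_version(line))
```

Keep in mind that a VM which is already running keeps using the QEMU binary it was started with, so it needs a restart or migration after the package upgrade.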