Running snapshot backup stalls VM - Backup throughput is low

Rainerle

Hi,
we are trying to replace our current backup solution with PBS, so I installed PBS on an older ThomasKrenn 4HE Intel Dual-CPU RI2436 with some disks...
Bash:
root@pbs01:~# zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
backup01  65.2T  46.1T  19.1T        -         -    35%    70%  1.00x    ONLINE  -
backup02   116T  80.2T  36.2T        -         -    25%    68%  1.00x    ONLINE  -
rpool     28.5G  14.1G  14.4G        -         -    48%    49%  1.00x    ONLINE  -
root@pbs01:~# zpool status
  pool: backup01
 state: ONLINE
  scan: scrub repaired 0B in 15:33:49 with 0 errors on Sun Jun 13 15:57:50 2021
config:

        NAME                        STATE     READ WRITE CKSUM
        backup01                    ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x5000cca073188978  ONLINE       0     0     0
            wwn-0x5000cca073707980  ONLINE       0     0     0
            wwn-0x5000cca05cd8bfa4  ONLINE       0     0     0
            wwn-0x5000cca0731f58ec  ONLINE       0     0     0
            wwn-0x5000cca0731964e4  ONLINE       0     0     0
            wwn-0x5000cca073707ab0  ONLINE       0     0     0
          raidz2-1                  ONLINE       0     0     0
            wwn-0x5000cca073152f5c  ONLINE       0     0     0
            wwn-0x5000cca073082160  ONLINE       0     0     0
            wwn-0x5000cca05cda79c4  ONLINE       0     0     0
            wwn-0x5000cca05cd8bb8c  ONLINE       0     0     0
            wwn-0x5000cca0731e6f14  ONLINE       0     0     0
            wwn-0x5000cca05cd292a0  ONLINE       0     0     0
          raidz2-3                  ONLINE       0     0     0
            wwn-0x5000cca073073f08  ONLINE       0     0     0
            wwn-0x5000cca073041d4c  ONLINE       0     0     0
            wwn-0x5000cca073088884  ONLINE       0     0     0
            wwn-0x5000cca073081d14  ONLINE       0     0     0
            wwn-0x5000cca0731e746c  ONLINE       0     0     0
            wwn-0x5000cca07306c9e4  ONLINE       0     0     0
        cache
          wwn-0x500117310096e2cc    ONLINE       0     0     0
          wwn-0x500117310096e35c    ONLINE       0     0     0

errors: No known data errors

  pool: backup02
 state: ONLINE
  scan: scrub repaired 0B in 4 days 22:26:31 with 0 errors on Tue Jun 15 20:09:59 2021
config:

        NAME                        STATE     READ WRITE CKSUM
        backup02                    ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x5000cca2a1c35899  ONLINE       0     0     0
            wwn-0x5000cca2a1c36418  ONLINE       0     0     0
            wwn-0x5000cca2a1c86d6f  ONLINE       0     0     0
            wwn-0x5000cca2a1c86e24  ONLINE       0     0     0
            wwn-0x5000cca2a1c36498  ONLINE       0     0     0
            wwn-0x5000cca284d2a159  ONLINE       0     0     0
            wwn-0x5000cca2a1c87255  ONLINE       0     0     0
            wwn-0x5000cca2a1c360dc  ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            wwn-0x5001173100b4900c  ONLINE       0     0     0
            wwn-0x5001173100b48f38  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:02:18 with 0 errors on Sun Jun 13 00:26:23 2021
config:

        NAME                                               STATE     READ WRITE CKSUM
        rpool                                              ONLINE       0     0     0
          mirror-0                                         ONLINE       0     0     0
            ata-SuperMicro_SSD_SMC0515D95116BCB1138-part3  ONLINE       0     0     0
            ata-SuperMicro_SSD_SMC0515D90717B6J5904-part3  ONLINE       0     0     0

errors: No known data errors
root@pbs01:~#

One Proxmox cluster has already been running backups to this pbs01 for some time without problems. Now I am trying to run backups from our production environment on weekends, and the backup freezes one of our SPoF VMs (the MySQL master DB server). The VM becomes slow in intervals, is fine again for a prolonged time, and then freezes again. I had to stop the backup.
INFO: Starting Backup of VM 167 (qemu)
INFO: Backup started at 2021-07-03 00:28:31
INFO: status = running
INFO: VM Name: mysql01-a
INFO: exclude disk 'scsi0' 'ceph-proxmox-VMs:vm-167-disk-0' (backup=no)
INFO: include disk 'scsi1' 'ceph-proxmox-VMs:vm-167-disk-1' 8G
INFO: include disk 'scsi2' 'ceph-proxmox-VMs:vm-167-disk-2' 384G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/167/2021-07-02T22:28:31Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '762e6197-0428-4a99-a130-86bb53afd4f3'
INFO: resuming VM again
INFO: scsi1: dirty-bitmap status: OK (616.0 MiB of 8.0 GiB dirty)
INFO: scsi2: dirty-bitmap status: OK (296.5 GiB of 384.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 297.1 GiB dirty of 392.0 GiB total
INFO: 0% (52.0 MiB of 297.1 GiB) in 3s, read: 17.3 MiB/s, write: 17.3 MiB/s
INFO: 1% (3.0 GiB of 297.1 GiB) in 14m 36s, read: 3.4 MiB/s, write: 3.4 MiB/s
INFO: 2% (6.0 GiB of 297.1 GiB) in 32m 14s, read: 2.9 MiB/s, write: 2.9 MiB/s
INFO: 3% (8.9 GiB of 297.1 GiB) in 47m 28s, read: 3.3 MiB/s, write: 3.3 MiB/s
INFO: 4% (11.9 GiB of 297.1 GiB) in 1h 5m 4s, read: 2.9 MiB/s, write: 2.9 MiB/s
INFO: 5% (14.9 GiB of 297.1 GiB) in 1h 23m 27s, read: 2.8 MiB/s, write: 2.8 MiB/s
INFO: 6% (17.8 GiB of 297.1 GiB) in 1h 41m 27s, read: 2.8 MiB/s, write: 2.8 MiB/s
INFO: 7% (20.8 GiB of 297.1 GiB) in 2h 3m 37s, read: 2.3 MiB/s, write: 2.3 MiB/s
INFO: 8% (23.8 GiB of 297.1 GiB) in 2h 20m 48s, read: 3.0 MiB/s, write: 3.0 MiB/s
INFO: 9% (26.8 GiB of 297.1 GiB) in 2h 37m 58s, read: 3.0 MiB/s, write: 3.0 MiB/s
INFO: 10% (29.7 GiB of 297.1 GiB) in 2h 57m 3s, read: 2.7 MiB/s, write: 2.7 MiB/s
INFO: 11% (32.7 GiB of 297.1 GiB) in 3h 11m 51s, read: 3.4 MiB/s, write: 3.4 MiB/s
INFO: 12% (35.7 GiB of 297.1 GiB) in 3h 26m 9s, read: 3.5 MiB/s, write: 3.5 MiB/s
INFO: 13% (38.6 GiB of 297.1 GiB) in 3h 40m 3s, read: 3.7 MiB/s, write: 3.7 MiB/s
INFO: 14% (41.6 GiB of 297.1 GiB) in 3h 52m 53s, read: 3.9 MiB/s, write: 3.9 MiB/s
INFO: 15% (44.6 GiB of 297.1 GiB) in 4h 6m 15s, read: 3.8 MiB/s, write: 3.8 MiB/s
INFO: 16% (47.5 GiB of 297.1 GiB) in 4h 20m 23s, read: 3.6 MiB/s, write: 3.6 MiB/s
INFO: 17% (50.5 GiB of 297.1 GiB) in 4h 33m 25s, read: 3.9 MiB/s, write: 3.9 MiB/s
INFO: 18% (53.5 GiB of 297.1 GiB) in 4h 48m 42s, read: 3.3 MiB/s, write: 3.3 MiB/s
INFO: 19% (56.4 GiB of 297.1 GiB) in 5h 8m 12s, read: 2.6 MiB/s, write: 2.6 MiB/s
INFO: 20% (59.4 GiB of 297.1 GiB) in 5h 23m 6s, read: 3.4 MiB/s, write: 3.4 MiB/s
INFO: 21% (62.4 GiB of 297.1 GiB) in 5h 35m 30s, read: 4.1 MiB/s, write: 4.1 MiB/s
INFO: 22% (65.4 GiB of 297.1 GiB) in 5h 49m 44s, read: 3.6 MiB/s, write: 3.6 MiB/s
INFO: 23% (68.3 GiB of 297.1 GiB) in 6h 2m 42s, read: 3.9 MiB/s, write: 3.9 MiB/s
INFO: 24% (71.3 GiB of 297.1 GiB) in 6h 16m 45s, read: 3.6 MiB/s, write: 3.6 MiB/s
INFO: 25% (74.3 GiB of 297.1 GiB) in 6h 30m 44s, read: 3.6 MiB/s, write: 3.6 MiB/s
INFO: 26% (77.2 GiB of 297.1 GiB) in 6h 43m 45s, read: 3.9 MiB/s, write: 3.9 MiB/s
INFO: 27% (80.2 GiB of 297.1 GiB) in 6h 57m, read: 3.8 MiB/s, write: 3.8 MiB/s
INFO: 28% (83.2 GiB of 297.1 GiB) in 7h 10m 10s, read: 3.9 MiB/s, write: 3.9 MiB/s
INFO: 29% (86.2 GiB of 297.1 GiB) in 7h 22m 25s, read: 4.1 MiB/s, write: 4.1 MiB/s
INFO: 30% (89.1 GiB of 297.1 GiB) in 7h 34m 30s, read: 4.2 MiB/s, write: 4.2 MiB/s
INFO: 31% (92.1 GiB of 297.1 GiB) in 7h 46m 52s, read: 4.1 MiB/s, write: 4.1 MiB/s
INFO: 32% (95.1 GiB of 297.1 GiB) in 8h 1s, read: 3.8 MiB/s, write: 3.8 MiB/s
INFO: 33% (98.0 GiB of 297.1 GiB) in 8h 12m 5s, read: 4.2 MiB/s, write: 4.2 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
ERROR: Backup of VM 167 failed - interrupted by signal
INFO: Failed at 2021-07-03 08:52:10
ERROR: Backup job failed - interrupted by signal

TASK ERROR: interrupted by signal

Looking at the logs, I find the read and write throughput very slow - my old USB stick is faster...

The network path between the PBS and the Proxmox cluster looks like this:
pbs
-> 10G Dual LACP
-> 10G Switch cluster
-> 2x 10G active/passive
-> Firewall
-> 1G VRRP Uplink
-> Internet
-> 1G VRRP Uplink
-> Firewall
-> 10G active/passive
-> 10G Switch Cluster
-> 40G MLAG
-> 100G active/passive Switch
-> 100G MLAG
-> Proxmox Node

Network speeds are normally never a problem, and there is nothing in our monitoring indicating problems at that time.
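
For reference, one way to sanity-check that path between a PVE node and the PBS would be to measure round-trip latency and single-stream TCP throughput directly; a minimal sketch, assuming iperf3 is installed on both ends (the hostname is the one we use for the backup repository):
Bash:
# round-trip latency from the PVE node to the PBS
ping -c 20 pbs01.muc.contoso.com

# single-stream TCP throughput (run "iperf3 -s" on pbs01 first);
# a single stream is the interesting number here, since the backup
# upload is (as far as I know) essentially one TLS connection
iperf3 -c pbs01.muc.contoso.com -t 30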

The Proxmox cluster itself is a 5-node cluster with the hardware and configuration mentioned in this benchmark post.

Proxmox Cluster:
proxmox-ve: 6.3-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-8
pve-kernel-helper: 6.3-8
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.10-pve1
ceph-fuse: 15.2.10-pve1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-9
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-5
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-10
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

pbs01:
proxmox-backup: 1.0-4 (running kernel: 5.4.119-1-pve)
proxmox-backup-server: 1.1.10-1 (running version: 1.1.10)
pve-kernel-5.4: 6.4-3
pve-kernel-helper: 6.4-3
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.98-1-pve: 5.4.98-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
ifupdown2: 3.0.0-1+pve3
libjs-extjs: 6.0.1-10
proxmox-backup-docs: 1.1.10-1
proxmox-backup-client: 1.1.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-6
pve-xtermjs: 4.7.0-3
smartmontools: 7.2-pve2
zfsutils-linux: 2.0.4-pve1


I currently have no idea where to look further to get this backup working.

Best regards
Rainer
 
That the VM stalls during a slow backup is rather normal and has to do with how the QEMU backup is done
(if the VM wants to write to a block that has not been backed up yet, the write is paused, the block is backed up, and only then can the write continue).

The backup is rather latency-sensitive (as are most TCP connections).
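
To get a feeling for why latency matters: the VM data is uploaded in 4 MiB chunks, so when chunk uploads end up mostly serialized, the per-request round-trip time caps the achievable throughput. A rough back-of-the-envelope, with 200 ms per request assumed purely for illustration:
Code:
# upper bound for serialized 4 MiB chunk uploads at an assumed 200 ms per request
awk 'BEGIN { printf "%.1f MiB/s\n", 4 / 0.2 }'    # => 20.0 MiB/s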

Can you check what the benchmark below says (on one of the PVE boxes)?

Code:
proxmox-backup-client benchmark --repository <repository>


Also, maybe it would be a better design to have a PBS instance locally that syncs to a remote one.
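
Roughly, that would mean a small PBS next to the production cluster that takes the backups over the local network, and the offsite PBS then pulls from it with a sync job. A minimal sketch of the pull side, run on the offsite PBS (names, password, fingerprint and schedule are placeholders; check 'proxmox-backup-manager remote create --help' and 'proxmox-backup-manager sync-job create --help' for the exact options of your version):
Code:
# define the on-site PBS as a remote
proxmox-backup-manager remote create pbs-onsite \
    --host pbs-onsite.example.com \
    --auth-id sync@pbs \
    --password 'SECRET' \
    --fingerprint '64:d3:ff:...'

# pull its datastore on a schedule
proxmox-backup-manager sync-job create pull-ceph-prod \
    --store ceph-prod \
    --remote pbs-onsite \
    --remote-store ceph-prod \
    --schedule 'daily'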
 
Hi Dominik,

this is the output directly on pbs01:
Bash:
root@pbs01:~# proxmox-backup-client benchmark --repository pve-prod
Password for "root@pam": ************
Uploaded 911 chunks in 5 seconds.
Time per request: 5510 microseconds.
TLS speed: 761.13 MB/s
SHA256 speed: 367.21 MB/s
Compression speed: 458.08 MB/s
Decompress speed: 771.52 MB/s
AES256/GCM speed: 1933.45 MB/s
Verify speed: 244.16 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 761.13 MB/s (62%)  │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 367.21 MB/s (18%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 458.08 MB/s (61%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 771.52 MB/s (64%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 244.16 MB/s (32%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1933.45 MB/s (53%) │
└───────────────────────────────────┴────────────────────┘
root@pbs01:~#

This is from a local PVE cluster connected via 10G. I have not yet experienced problems like the ones mentioned above on it.
Bash:
root@proxmox01:~# proxmox-backup-client benchmark --repository ceph-muc01@pbs@10.20.38.149:ceph-muc01
Uploaded 823 chunks in 5 seconds.
Time per request: 6091 microseconds.
TLS speed: 688.61 MB/s
SHA256 speed: 304.79 MB/s
Compression speed: 435.34 MB/s
Decompress speed: 624.07 MB/s
AES256/GCM speed: 1572.09 MB/s
Verify speed: 217.16 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 688.61 MB/s (56%)  │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 304.79 MB/s (15%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 435.34 MB/s (58%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 624.07 MB/s (52%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 217.16 MB/s (29%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1572.09 MB/s (43%) │
└───────────────────────────────────┴────────────────────┘
root@proxmox01:~#

And this is the output from a node of the 5-node cluster. As mentioned, the traffic goes via "the Internet". This benchmark was just run now, during the day; backups run over night, when it is quieter.
Code:
root@proxmox07:~# proxmox-backup-client benchmark --repository ceph-prod@pbs@pbs01.muc.contoso.com:ceph-prod
Uploaded 31 chunks in 6 seconds.
Time per request: 218478 microseconds.
TLS speed: 19.20 MB/s
SHA256 speed: 1454.41 MB/s
Compression speed: 552.04 MB/s
Decompress speed: 1030.83 MB/s
AES256/GCM speed: 2811.19 MB/s
Verify speed: 630.11 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 19.20 MB/s (2%)    │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 1454.41 MB/s (72%) │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 552.04 MB/s (73%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 1030.83 MB/s (86%) │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 630.11 MB/s (83%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 2811.19 MB/s (77%) │
└───────────────────────────────────┴────────────────────┘
root@proxmox07:~#

Best regards and thanks for having a look at it!
Rainer
 
OK, so it seems that the reads from the disks are not too fast... Also, your versions are a bit behind (PVE 6.3 vs. the current PVE 6.4).

Can you try upgrading?

Also, what is the output of 'qm status 167 --verbose'?
 
So here is the status... the upgrade is being planned.
Code:
root@proxmox07:~# qm status 167 --verbose
blockstat:
        scsi0:
                account_failed: 1
                account_invalid: 1
                failed_flush_operations: 0
                failed_rd_operations: 0
                failed_unmap_operations: 0
                failed_wr_operations: 0
                flush_operations: 56
                flush_total_time_ns: 615039
                idle_time_ns: 4350755657979293
                invalid_flush_operations: 0
                invalid_rd_operations: 0
                invalid_unmap_operations: 0
                invalid_wr_operations: 0
                rd_bytes: 52105216
                rd_merged: 0
                rd_operations: 1985
                rd_total_time_ns: 496696532
                timed_stats:
                unmap_bytes: 0
                unmap_merged: 0
                unmap_operations: 0
                unmap_total_time_ns: 0
                wr_bytes: 0
                wr_highest_offset: 0
                wr_merged: 0
                wr_operations: 0
                wr_total_time_ns: 0
        scsi1:
                account_failed: 1
                account_invalid: 1
                failed_flush_operations: 0
                failed_rd_operations: 0
                failed_unmap_operations: 0
                failed_wr_operations: 0
                flush_operations: 1682093
                flush_total_time_ns: 52106919378
                idle_time_ns: 17092683295
                invalid_flush_operations: 0
                invalid_rd_operations: 0
                invalid_unmap_operations: 0
                invalid_wr_operations: 0
                rd_bytes: 287309824
                rd_merged: 0
                rd_operations: 29853
                rd_total_time_ns: 16554491409
                timed_stats:
                unmap_bytes: 0
                unmap_merged: 0
                unmap_operations: 0
                unmap_total_time_ns: 0
                wr_bytes: 667660431360
                wr_highest_offset: 8580497408
                wr_merged: 0
                wr_operations: 36717616
                wr_total_time_ns: 51506473094381
        scsi2:
                account_failed: 1
                account_invalid: 1
                failed_flush_operations: 0
                failed_rd_operations: 0
                failed_unmap_operations: 0
                failed_wr_operations: 3
                flush_operations: 92497667
                flush_total_time_ns: 36433677406400
                idle_time_ns: 1571525
                invalid_flush_operations: 0
                invalid_rd_operations: 0
                invalid_unmap_operations: 0
                invalid_wr_operations: 0
                rd_bytes: 19527878459392
                rd_merged: 0
                rd_operations: 992646514
                rd_total_time_ns: 340969136956621
                timed_stats:
                unmap_bytes: 0
                unmap_merged: 0
                unmap_operations: 0
                unmap_total_time_ns: 0
                wr_bytes: 15450759467008
                wr_highest_offset: 412307423232
                wr_merged: 0
                wr_operations: 470304549
                wr_total_time_ns: 1458541544871098
cpus: 8
disk: 0
diskread: 19528217874432
diskwrite: 16118419898368
maxdisk: 0
maxmem: 68719476736
mem: 64917399283
name: mysql01-a
netin: 5292002277464
netout: 19095603712408
nics:
        tap167i0:
                netin: 5531143936
                netout: 2828764638
        tap167i1:
                netin: 5286471133528
                netout: 19092774947770
pid: 2649160
proxmox-support:
        pbs-dirty-bitmap: 1
        pbs-dirty-bitmap-migration: 1
        pbs-dirty-bitmap-savevm: 1
        pbs-library-version: 1.0.3 (8de935110ed4cab743f6c9437357057f9f9f08ea)
        pbs-masterkey: 1
        query-bitmap-info: 1
qmpstatus: running
running-machine: pc-i440fx-5.2+pve0
running-qemu: 5.2.0
status: running
template:
uptime: 6078540
vmid: 167
 
