Slow backups and poor performance of VMs during backups

dylan.uia0

Member
Mar 28, 2022
I am hoping I provide enough info up front to give a good idea of what is going on, but I am a little lost and have a lot of questions. I will do my best to keep this updated as questions are answered, and link to or summarize each answer/solution.

The setup:

We have 4 HP DL360p Gen9 servers: 3 PVE nodes and a PBS. Each node has 2 SSDs in a ZFS mirror; the PBS has 2 SSDs in a mirror for boot and 8x 2.4TB SAS drives in a RAIDZ2. Shared storage for the nodes is a Jetstor with 7x 1.92TB SAS SSDs in RAID 10 plus a hot spare. Each node and the PBS uses 1 of a possible 2 10Gb connections to a switch for all traffic (I know it is suggested to separate the networks; we are looking into it). The Cisco switch has 16 10Gb ports, 8 of which are wirespeed (where we are plugged in), while the other 8 are 2:1 oversubscribed. The Jetstor has 4x 10Gb links but is currently only using one (we are looking to increase this).
Node 1:
proxmox-ve: 7.3-1 (running kernel: 5.15.83-1-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-helper: 7.3-3
pve-kernel-5.15: 7.3-1
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-2
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
Node 2:
proxmox-ve: 7.3-1 (running kernel: 5.15.83-1-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-helper: 7.3-3
pve-kernel-5.15: 7.3-1
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-2
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
Node 3:
proxmox-ve: 7.3-1 (running kernel: 5.19.17-2-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-helper: 7.3-3
pve-kernel-5.15: 7.3-1
pve-kernel-5.19: 7.2-15
pve-kernel-5.13: 7.1-9
pve-kernel-5.19.17-2-pve: 5.19.17-2
pve-kernel-5.19.17-1-pve: 5.19.17-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-2
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
PBS:
proxmox-backup: 2.3-1 (running kernel: 5.15.83-1-pve)
proxmox-backup-server: 2.3.2-1 (running version: 2.3.2)
pve-kernel-helper: 7.3-3
pve-kernel-5.15: 7.3-1
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-1-pve: 5.13.19-3
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
proxmox-backup-docs: 2.3.2-1
proxmox-backup-client: 2.3.2-1
proxmox-mini-journalreader: 1.2-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-xtermjs: 4.16.0-1
smartmontools: 7.2-pve3
zfsutils-linux: 2.1.9-pve1
Note: Node 3 is a little different (it runs the newer 5.19 kernel), as we were troubleshooting another issue and support suggested it.

The problems:

There are 2 main problems; they may be related, I'm not sure...
1) We see speeds of only 350MB/s during backups. While there is currently about 5TB of VM storage in use, it only takes about 10 minutes to back up what it needs. We would like to increase this if possible, as that would theoretically increase restore speed as well. [Sounds like this is due to the storage configuration in PBS. We will investigate this later.]
2) The main problem is that during a backup, VMs start to suffer: they become less responsive, web servers load sites more slowly, databases are slow, etc.

What I have tried/done so far:

  1. I tried changing the MTU to 4000, and then to 9000 (the only 2 jumbo frame sizes supported by the Jetstor), and things went to crap: <75MB/s on backups, and it seemed to be recreating a bunch of bitmaps, so instead of 10 minutes a backup was taking over an hour. I did make sure I changed all nodes, the PBS, the Jetstor, and the switch to the higher MTU.
  2. Also per Proxmox support suggestion, I changed machines over to VirtIO SCSI single from the LSI option I had.
  3. Checked that I have SR-IOV enabled, per this post. I don't see anything in the BIOS specifically for I/OAT however.
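
Before a second MTU attempt, it may be worth verifying that jumbo frames actually pass end-to-end; a do-not-fragment ping sized to the candidate MTU is a quick check. A sketch (the target address is a placeholder; the payload is just the MTU minus the 20-byte IPv4 and 8-byte ICMP headers):

```shell
# ICMP payload for a do-not-fragment test: MTU - 20 (IPv4) - 8 (ICMP)
MTU=9000
PAYLOAD=$((MTU - 20 - 8))
# The probe itself (run against each node/PBS/Jetstor address):
echo "ping -M do -s ${PAYLOAD} -c 3 10.22.13.90"
```

If the ping fails with "message too long", some hop in the path is still at the lower MTU.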

What I will do/try:

  1. Try using iothread on the VMs that have the most I/O (Postgres and FileMaker server).
  2. I can play around with compression and see if anything can be gained there (zstd, lzo, gzip, pigz). Currently the compression shows zstd, but the field is grayed out and I did not set it manually; the default says 0, so I'm not sure what is actually in effect.
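
On item 1: iothread is enabled per disk, alongside the VirtIO SCSI single controller. A sketch of the CLI form, assuming VM 211 from the logs in this thread (adjust IDs and volume names; the controller change only takes effect after a full VM stop/start):

```
qm set 211 --scsihw virtio-scsi-single
qm set 211 --scsi0 SharedStorage:vm-211-disk-0,iothread=1
```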

Future considerations:

  1. Changing the network topology to what is recommended.
  2. Adding some SSD/NVMe cache to PBS in front of the spinning drives.
  3. Changing node 1 and 2 to UEFI boot.

Questions I have:

  1. [Answered] Would it be worth separating the backups into multiple jobs to parallelize it?
  2. [Answered] What is the read/write relative to on a backup? Is it reading from PBS to know what needs to be written?
  3. [Answered] Is backing up single threaded?
  4. [Answered] If we separate the corosync network, would it be worth changing the migration method to insecure, like here?
  5. With dual Xeon E5-2690 v3 in each PVE, and dual Xeon E5-2670 v3 in PBS, are we CPU bottlenecked?
  6. If so, would a second attempt at upping the MTU be worth it? So that for the same number of packets handled, more data is moved (as I understand it).
  7. If the backup server can only do 350MB/s, but migrations can saturate the full 10Gb between nodes, what might be causing the slowness while doing backups?
  8. Has anyone had luck changing the queue size on the NICs, like here?
  9. How about changing the MTU, like here, or here?
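
On question 6: the raw header saving from a bigger MTU is small and easy to sanity-check (assuming 40 bytes of IPv4+TCP headers and ignoring Ethernet framing):

```shell
# Payload bytes carried per packet after 40 bytes of IPv4+TCP headers
for MTU in 1500 4000 9000; do
  PAY=$((MTU - 40))
  echo "MTU $MTU -> $PAY payload bytes per packet"
done
```

Header overhead is already under 3% at MTU 1500, so the real win from jumbo frames is roughly 6x fewer packets per second at a given throughput, i.e. less per-packet CPU and interrupt work.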
 
Since I reached the character limit, here are some more test results I have:

Information I have so far:

Code:
Uploaded 292 chunks in 5 seconds.
Time per request: 17438 microseconds.
TLS speed: 240.52 MB/s
SHA256 speed: 402.37 MB/s
Compression speed: 390.95 MB/s
Decompress speed: 579.71 MB/s
AES256/GCM speed: 1158.82 MB/s
Verify speed: 238.02 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 240.52 MB/s (19%)  │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 402.37 MB/s (20%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 390.95 MB/s (52%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 579.71 MB/s (48%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 238.02 MB/s (31%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1158.82 MB/s (32%) │
└───────────────────────────────────┴────────────────────┘
Code:
Connecting to host 10.22.13.90, port 5201
[  5] local 10.22.13.20 port 34008 connected to 10.22.13.90 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.01 GBytes  8.64 Gbits/sec   13   1.19 MBytes   
[  5]   1.00-2.00   sec  1016 MBytes  8.53 Gbits/sec   13   1.35 MBytes   
[  5]   2.00-3.00   sec  1.04 GBytes  8.96 Gbits/sec    0   1.37 MBytes   
[  5]   3.00-4.00   sec  1.01 GBytes  8.66 Gbits/sec    0   1.39 MBytes   
[  5]   4.00-5.00   sec  1.02 GBytes  8.72 Gbits/sec    0   1.41 MBytes   
[  5]   5.00-6.00   sec   939 MBytes  7.88 Gbits/sec   23   1.42 MBytes   
[  5]   6.00-7.00   sec  1004 MBytes  8.42 Gbits/sec    9   1.43 MBytes   
[  5]   7.00-8.00   sec  1.04 GBytes  8.90 Gbits/sec  138   1.44 MBytes   
[  5]   8.00-9.00   sec  1021 MBytes  8.57 Gbits/sec    0   1.45 MBytes   
[  5]   9.00-10.00  sec  1.04 GBytes  8.98 Gbits/sec  152   1.02 MBytes   
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.0 GBytes  8.63 Gbits/sec  348             sender
[  5]   0.00-10.03  sec  10.0 GBytes  8.59 Gbits/sec                  receiver
Code:
Connecting to host 10.22.13.20, port 5201
[  5] local 10.22.13.90 port 36052 connected to 10.22.13.20 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.07 GBytes  9.16 Gbits/sec   14   3.13 MBytes   
[  5]   1.00-2.00   sec  1.03 GBytes  8.86 Gbits/sec   20   3.13 MBytes   
[  5]   2.00-3.00   sec  1.03 GBytes  8.85 Gbits/sec   20   3.13 MBytes   
[  5]   3.00-4.00   sec  1.03 GBytes  8.85 Gbits/sec    2   3.13 MBytes   
[  5]   4.00-5.00   sec  1.03 GBytes  8.86 Gbits/sec    0   3.13 MBytes   
[  5]   5.00-6.00   sec  1.06 GBytes  9.08 Gbits/sec    1   3.13 MBytes   
[  5]   6.00-7.00   sec  1.06 GBytes  9.13 Gbits/sec    0   3.13 MBytes   
[  5]   7.00-8.00   sec  1.06 GBytes  9.07 Gbits/sec    0   3.13 MBytes   
[  5]   8.00-9.00   sec  1.03 GBytes  8.87 Gbits/sec    0   3.13 MBytes   
[  5]   9.00-10.00  sec  1.03 GBytes  8.86 Gbits/sec    0   3.13 MBytes   
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.4 GBytes  8.96 Gbits/sec   57             sender
[  5]   0.00-10.04  sec  10.4 GBytes  8.92 Gbits/sec                  receiver
Code:
Connecting to host 10.22.13.30, port 5201
[  5] local 10.22.13.20 port 47820 connected to 10.22.13.30 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.08 GBytes  9.25 Gbits/sec    7   1.24 MBytes
[  5]   1.00-2.00   sec  1.09 GBytes  9.38 Gbits/sec   15   1.50 MBytes
[  5]   2.00-3.00   sec  1.09 GBytes  9.40 Gbits/sec   12   1.51 MBytes
[  5]   3.00-4.00   sec  1.09 GBytes  9.37 Gbits/sec    1   1.51 MBytes
[  5]   4.00-5.00   sec  1.09 GBytes  9.33 Gbits/sec    0   1.51 MBytes
[  5]   5.00-6.00   sec  1.09 GBytes  9.39 Gbits/sec    0   1.51 MBytes
[  5]   6.00-7.00   sec  1.09 GBytes  9.39 Gbits/sec    0   1.53 MBytes
[  5]   7.00-8.00   sec  1.09 GBytes  9.38 Gbits/sec    0   1.55 MBytes
[  5]   8.00-9.00   sec  1.09 GBytes  9.40 Gbits/sec    0   1.60 MBytes
[  5]   9.00-10.00  sec  1.08 GBytes  9.28 Gbits/sec    0   1.66 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.9 GBytes  9.36 Gbits/sec   35             sender
[  5]   0.00-10.04  sec  10.9 GBytes  9.32 Gbits/sec                  receiver
Code:
Connecting to host 10.22.13.20, port 5201
[  5] local 10.22.13.30 port 39724 connected to 10.22.13.20 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.09 GBytes  9.40 Gbits/sec   17   1.32 MBytes
[  5]   1.00-2.00   sec  1.09 GBytes  9.39 Gbits/sec    0   1.34 MBytes
[  5]   2.00-3.00   sec  1.09 GBytes  9.39 Gbits/sec   49   1.35 MBytes
[  5]   3.00-4.00   sec  1.09 GBytes  9.39 Gbits/sec   15   1.36 MBytes
[  5]   4.00-5.00   sec  1.09 GBytes  9.38 Gbits/sec    0   1.44 MBytes
[  5]   5.00-6.00   sec  1.09 GBytes  9.39 Gbits/sec   14   1.92 MBytes
[  5]   6.00-7.00   sec  1.09 GBytes  9.39 Gbits/sec    0   2.00 MBytes
[  5]   7.00-8.00   sec  1.09 GBytes  9.39 Gbits/sec    0   2.23 MBytes
[  5]   8.00-9.00   sec  1.09 GBytes  9.39 Gbits/sec   31   2.32 MBytes
[  5]   9.00-10.00  sec  1.09 GBytes  9.39 Gbits/sec    6   2.33 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.9 GBytes  9.39 Gbits/sec  132             sender
[  5]   0.00-10.04  sec  10.9 GBytes  9.35 Gbits/sec                  receiver
Code:
INFO: starting new backup job: vzdump 211 --mode snapshot --compress 0 --remove 0 --notes-template '{{guestname}}' --node proxmox1 --storage local
INFO: Starting Backup of VM 211 (qemu)
INFO: Backup started at 2023-02-14 15:56:59
INFO: status = running
INFO: VM Name: zabbix
INFO: include disk 'scsi0' 'SharedStorage:vm-211-disk-0' 10G
INFO: include disk 'scsi1' 'SharedStorage:vm-211-disk-1' 32G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/var/lib/vz/dump/vzdump-qemu-211-2023_02_14-15_56_59.vma'
INFO: started backup task 'fc40c958-70d8-4b80-a657-f07ceb3d3ced'
INFO: resuming VM again
INFO: 3% (1.6 GiB of 42.0 GiB) in 3s, read: 540.9 MiB/s, write: 516.8 MiB/s
INFO: 6% (2.7 GiB of 42.0 GiB) in 6s, read: 389.0 MiB/s, write: 370.1 MiB/s
INFO: 7% (3.1 GiB of 42.0 GiB) in 11s, read: 83.3 MiB/s, write: 82.7 MiB/s
INFO: 9% (4.1 GiB of 42.0 GiB) in 14s, read: 343.8 MiB/s, write: 325.8 MiB/s
INFO: 11% (4.7 GiB of 42.0 GiB) in 17s, read: 186.1 MiB/s, write: 180.7 MiB/s
INFO: 12% (5.1 GiB of 42.0 GiB) in 21s, read: 113.3 MiB/s, write: 110.4 MiB/s
INFO: 13% (5.5 GiB of 42.0 GiB) in 25s, read: 106.0 MiB/s, write: 104.6 MiB/s
INFO: 14% (5.9 GiB of 42.0 GiB) in 28s, read: 123.2 MiB/s, write: 119.6 MiB/s
INFO: 15% (6.4 GiB of 42.0 GiB) in 33s, read: 92.8 MiB/s, write: 85.8 MiB/s
INFO: 16% (6.8 GiB of 42.0 GiB) in 38s, read: 83.4 MiB/s, write: 83.0 MiB/s
INFO: 17% (7.2 GiB of 42.0 GiB) in 43s, read: 94.2 MiB/s, write: 92.5 MiB/s
INFO: 18% (7.6 GiB of 42.0 GiB) in 46s, read: 128.1 MiB/s, write: 127.2 MiB/s
INFO: 19% (8.2 GiB of 42.0 GiB) in 50s, read: 142.3 MiB/s, write: 128.9 MiB/s
INFO: 20% (8.6 GiB of 42.0 GiB) in 53s, read: 154.7 MiB/s, write: 153.1 MiB/s
INFO: 21% (9.0 GiB of 42.0 GiB) in 56s, read: 146.2 MiB/s, write: 144.2 MiB/s
INFO: 22% (9.5 GiB of 42.0 GiB) in 59s, read: 168.0 MiB/s, write: 165.5 MiB/s
INFO: 23% (10.0 GiB of 42.0 GiB) in 1m 2s, read: 169.2 MiB/s, write: 165.7 MiB/s
INFO: 25% (10.7 GiB of 42.0 GiB) in 1m 5s, read: 245.3 MiB/s, write: 186.9 MiB/s
INFO: 26% (11.3 GiB of 42.0 GiB) in 1m 8s, read: 175.2 MiB/s, write: 163.3 MiB/s
INFO: 27% (11.6 GiB of 42.0 GiB) in 1m 11s, read: 126.9 MiB/s, write: 112.3 MiB/s
INFO: 29% (12.3 GiB of 42.0 GiB) in 1m 14s, read: 219.9 MiB/s, write: 157.0 MiB/s
INFO: 30% (13.0 GiB of 42.0 GiB) in 1m 17s, read: 235.6 MiB/s, write: 176.1 MiB/s
INFO: 33% (14.0 GiB of 42.0 GiB) in 1m 20s, read: 344.5 MiB/s, write: 127.2 MiB/s
INFO: 35% (15.1 GiB of 42.0 GiB) in 1m 23s, read: 380.8 MiB/s, write: 131.2 MiB/s
INFO: 39% (16.6 GiB of 42.0 GiB) in 1m 26s, read: 521.3 MiB/s, write: 16.4 MiB/s
INFO: 43% (18.3 GiB of 42.0 GiB) in 1m 29s, read: 584.7 MiB/s, write: 124.0 KiB/s
INFO: 47% (20.0 GiB of 42.0 GiB) in 1m 32s, read: 578.0 MiB/s, write: 4.0 KiB/s
INFO: 51% (21.8 GiB of 42.0 GiB) in 1m 35s, read: 595.3 MiB/s, write: 1.3 KiB/s
INFO: 56% (23.6 GiB of 42.0 GiB) in 1m 38s, read: 612.0 MiB/s, write: 7.3 MiB/s
INFO: 60% (25.3 GiB of 42.0 GiB) in 1m 41s, read: 589.7 MiB/s, write: 60.2 MiB/s
INFO: 64% (27.0 GiB of 42.0 GiB) in 1m 44s, read: 595.3 MiB/s, write: 94.7 MiB/s
INFO: 68% (28.7 GiB of 42.0 GiB) in 1m 47s, read: 565.4 MiB/s, write: 678.7 KiB/s
INFO: 72% (30.3 GiB of 42.0 GiB) in 1m 50s, read: 549.7 MiB/s, write: 45.1 MiB/s
INFO: 76% (32.0 GiB of 42.0 GiB) in 1m 53s, read: 573.3 MiB/s, write: 8.0 MiB/s
INFO: 80% (33.6 GiB of 42.0 GiB) in 1m 56s, read: 568.4 MiB/s, write: 436.6 MiB/s
INFO: 83% (35.1 GiB of 42.0 GiB) in 1m 59s, read: 500.3 MiB/s, write: 408.2 MiB/s
INFO: 85% (35.9 GiB of 42.0 GiB) in 2m 2s, read: 269.7 MiB/s, write: 190.0 MiB/s
INFO: 86% (36.2 GiB of 42.0 GiB) in 2m 6s, read: 81.7 MiB/s, write: 49.1 MiB/s
INFO: 89% (37.6 GiB of 42.0 GiB) in 2m 9s, read: 470.2 MiB/s, write: 382.3 MiB/s
INFO: 90% (38.2 GiB of 42.0 GiB) in 2m 12s, read: 212.7 MiB/s, write: 178.3 MiB/s
INFO: 91% (38.6 GiB of 42.0 GiB) in 2m 15s, read: 140.2 MiB/s, write: 138.6 MiB/s
INFO: 92% (39.0 GiB of 42.0 GiB) in 2m 18s, read: 140.2 MiB/s, write: 111.5 MiB/s
INFO: 94% (39.6 GiB of 42.0 GiB) in 2m 21s, read: 180.0 MiB/s, write: 116.8 MiB/s
INFO: 95% (39.9 GiB of 42.0 GiB) in 2m 28s, read: 53.3 MiB/s, write: 33.0 MiB/s
INFO: 96% (40.7 GiB of 42.0 GiB) in 2m 31s, read: 263.4 MiB/s, write: 186.4 MiB/s
INFO: 99% (41.7 GiB of 42.0 GiB) in 2m 34s, read: 356.5 MiB/s, write: 118.1 MiB/s
INFO: 100% (42.0 GiB of 42.0 GiB) in 2m 35s, read: 262.0 MiB/s, write: 43.6 MiB/s
INFO: backup is sparse: 21.46 GiB (51%) total zero data
INFO: transferred 42.00 GiB in 155 seconds (277.5 MiB/s)
INFO: archive file size: 20.54GB
INFO: adding notes to backup
INFO: Finished Backup of VM 211 (00:02:36)
INFO: Backup finished at 2023-02-14 15:59:35
INFO: Backup job finished successfully
TASK OK
Code:
INFO: starting new backup job: vzdump 209 --storage local --node proxmox1 --remove 0 --notes-template '{{guestname}}' --compress 0 --mode snapshot
INFO: Starting Backup of VM 209 (qemu)
INFO: Backup started at 2023-02-14 16:21:39
INFO: status = running
INFO: VM Name: FlexScada
INFO: include disk 'scsi0' 'SharedStorage:vm-209-disk-1' 114G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: skip unused drive 'SharedStorage:vm-209-disk-0' (not included into backup)
INFO: creating vzdump archive '/var/lib/vz/dump/vzdump-qemu-209-2023_02_14-16_21_39.vma'
INFO: started backup task '9df683a3-a826-4ef6-9ac4-91c80ce64c53'
INFO: resuming VM again
INFO: 1% (1.7 GiB of 114.0 GiB) in 3s, read: 564.1 MiB/s, write: 432.8 MiB/s
INFO: 2% (3.3 GiB of 114.0 GiB) in 6s, read: 553.6 MiB/s, write: 317.4 MiB/s
INFO: 4% (4.9 GiB of 114.0 GiB) in 9s, read: 569.7 MiB/s, write: 4.7 MiB/s
INFO: 5% (6.6 GiB of 114.0 GiB) in 12s, read: 560.0 MiB/s, write: 64.0 KiB/s
INFO: 7% (8.3 GiB of 114.0 GiB) in 15s, read: 570.1 MiB/s, write: 177.1 MiB/s
INFO: 8% (9.5 GiB of 114.0 GiB) in 18s, read: 432.0 MiB/s, write: 413.8 MiB/s
INFO: 9% (11.0 GiB of 114.0 GiB) in 33s, read: 101.0 MiB/s, write: 90.4 MiB/s
INFO: 10% (11.5 GiB of 114.0 GiB) in 36s, read: 178.1 MiB/s, write: 172.3 MiB/s
INFO: 11% (13.1 GiB of 114.0 GiB) in 42s, read: 273.5 MiB/s, write: 247.7 MiB/s
INFO: 12% (14.3 GiB of 114.0 GiB) in 48s, read: 204.8 MiB/s, write: 182.5 MiB/s
INFO: 13% (15.0 GiB of 114.0 GiB) in 52s, read: 165.8 MiB/s, write: 160.9 MiB/s
INFO: 14% (16.2 GiB of 114.0 GiB) in 1m 2s, read: 126.5 MiB/s, write: 112.5 MiB/s
INFO: 15% (17.2 GiB of 114.0 GiB) in 1m 6s, read: 261.8 MiB/s, write: 256.4 MiB/s
INFO: 16% (18.5 GiB of 114.0 GiB) in 1m 12s, read: 220.3 MiB/s, write: 198.1 MiB/s
INFO: 17% (20.1 GiB of 114.0 GiB) in 1m 15s, read: 530.9 MiB/s, write: 478.0 MiB/s
INFO: 18% (21.5 GiB of 114.0 GiB) in 1m 21s, read: 235.7 MiB/s, write: 231.2 MiB/s
INFO: 19% (22.2 GiB of 114.0 GiB) in 1m 26s, read: 145.0 MiB/s, write: 113.7 MiB/s
INFO: 20% (23.0 GiB of 114.0 GiB) in 1m 29s, read: 278.0 MiB/s, write: 75.0 MiB/s
INFO: 21% (24.0 GiB of 114.0 GiB) in 1m 32s, read: 339.0 MiB/s, write: 89.6 MiB/s
INFO: 22% (25.6 GiB of 114.0 GiB) in 1m 35s, read: 562.0 MiB/s, write: 3.8 MiB/s
INFO: 23% (26.6 GiB of 114.0 GiB) in 1m 39s, read: 247.0 MiB/s, write: 92.5 MiB/s
INFO: 24% (28.3 GiB of 114.0 GiB) in 1m 42s, read: 589.0 MiB/s, write: 43.2 MiB/s
INFO: 26% (30.0 GiB of 114.0 GiB) in 1m 45s, read: 563.7 MiB/s, write: 21.8 MiB/s
INFO: 27% (31.8 GiB of 114.0 GiB) in 1m 48s, read: 613.7 MiB/s, write: 39.7 MiB/s
INFO: 29% (33.4 GiB of 114.0 GiB) in 1m 51s, read: 559.3 MiB/s, write: 18.7 KiB/s
INFO: 30% (34.3 GiB of 114.0 GiB) in 1m 54s, read: 308.7 MiB/s, write: 119.9 MiB/s
INFO: 31% (35.5 GiB of 114.0 GiB) in 2m 1s, read: 178.4 MiB/s, write: 159.1 MiB/s
INFO: 32% (36.8 GiB of 114.0 GiB) in 2m 11s, read: 134.5 MiB/s, write: 124.4 MiB/s
INFO: 33% (38.2 GiB of 114.0 GiB) in 2m 21s, read: 141.4 MiB/s, write: 128.7 MiB/s
INFO: 34% (38.8 GiB of 114.0 GiB) in 2m 27s, read: 100.9 MiB/s, write: 99.9 MiB/s
INFO: 35% (40.2 GiB of 114.0 GiB) in 2m 42s, read: 93.3 MiB/s, write: 85.3 MiB/s
INFO: 36% (41.5 GiB of 114.0 GiB) in 3m, read: 77.0 MiB/s, write: 75.6 MiB/s
INFO: 37% (42.6 GiB of 114.0 GiB) in 3m 12s, read: 91.2 MiB/s, write: 74.9 MiB/s
INFO: 38% (43.5 GiB of 114.0 GiB) in 3m 21s, read: 105.3 MiB/s, write: 103.1 MiB/s
INFO: 39% (45.0 GiB of 114.0 GiB) in 3m 27s, read: 255.3 MiB/s, write: 238.1 MiB/s
INFO: 40% (46.6 GiB of 114.0 GiB) in 3m 40s, read: 123.8 MiB/s, write: 114.7 MiB/s
INFO: 41% (47.5 GiB of 114.0 GiB) in 3m 48s, read: 112.8 MiB/s, write: 111.8 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
 
Hi,
I'll try to answer a few of your questions:
  1. With dual Xeon E5-2690 v3 in each PVE, and dual Xeon E5-2670 v3 in PBS, are we CPU bottlenecked?
Well, what does the load look like during backup? If you have a lot of IO wait on the PVE side, I recommend trying to lower the max-workers setting, see here: https://forum.proxmox.com/threads/t...ad-behavior-during-backup.118430/#post-513106
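
If it helps, that setting can be made persistent; a sketch of the node-wide form (syntax as of PVE 7.3; the value 8 is only a starting point to experiment with, not a recommendation):

```
# /etc/vzdump.conf -- applies to all backup jobs started on this node
performance: max-workers=8
```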

  1. Is backing up single threaded?
I think the reads in QEMU are handled in a single thread, but issued asynchronously. If you don't use iothread for the disks, they will be handled in the main thread, which can reduce performance.

  1. If the backup server can only do 350MB/s, but migrations can saturate the full 10Gb between nodes, what might be causing the slowness while doing backups?
Disk migrations or RAM migrations? What does CPU/IO/network load look like during backup?

  1. Would it be worth separating the backups into multiple jobs to parallelize it?
There can be only one concurrent backup job per node currently, relevant feature request: https://bugzilla.proxmox.com/show_bug.cgi?id=3347

  1. If we separate the corosync network, would it be worth changing the migration method to insecure, like here?
Using a dedicated network for corosync is highly recommended. You could change to insecure (if the network is local), but I thought migrations were fast already, so I wouldn't recommend it. The default is always better tested.

  1. What is the read/write relative to on a backup? Is it reading from PBS to know what needs to be written?
Reads are what has been read from the source disks and writes are what needed to be written to PBS. While metadata is read from PBS to avoid uploading already present chunks, I don't think those reads are counted anywhere in the progress output.
 
I tried changing the MTU to 4000, and 9000
The MTU _must_ be changed at every point/device in the network that cross-communicates, including the network switch ports. If you are sharing VM and backup traffic, all compute nodes connected to the switch must be set to the same MTU. If you don't, the retransmits will overwhelm your network and negate any benefits.

Also keep in mind that you can't skimp on your backup server/network - they are in the critical path of production traffic during the backup:
https://git.proxmox.com/?p=pve-qemu.git;a=blob_plain;f=backup.txt

specifically, changes/writes to existing data can be affected by a slow backup:
Code:
1.) read old data before it gets overwritten
2.) write that data into the backup archive
3.) write new data (VM write)


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Disk migrations or RAM migrations? What does CPU/IO/network load look like during backup?
Sorry...I did not think about that. I just did a VM migration, so I guess that would be the RAM moving not storage, since storage is shared. But I think it still shows that the network links and nodes are certainly capable of the speeds. So it is almost certainly something to do with storage or storage access. Does that seem reasonable?

Well, what does the load look like during backup? If you have a lot of IO wait on the PVE side, I recommend trying to lower the max-workers setting, see here: https://forum.proxmox.com/threads/t...ad-behavior-during-backup.118430/#post-513106
I will watch the IO wait on the node with the suffering VMs and see what it looks like during a backup. I will also check what CPU usage and throughput we are seeing during a backup (as asked about in the previous quote).

Using a dedicated network for corosync is highly recommended. You could change to insecure (if the network is local), but I thought migrations were fast already, so I wouldn't recommend it. The default is always better tested.
So by default, since we only set up the one connection, all 3 purposes (corosync, storage, and migration) are going over the one link. We could move JUST corosync to a 1Gb interface to ensure large amounts of data moving around do not cause an issue for corosync. Since we only have the 2 10Gb connections, and people are looking for faster access, we may try having migrations and storage share a path. Unless that is unreasonable?
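
For reference, once a dedicated link exists, the migration network and transport can be pinned cluster-wide; a sketch of the datacenter.cfg form (the subnet is a made-up placeholder):

```
# /etc/pve/datacenter.cfg
migration: network=10.22.14.0/24,type=insecure
```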

The MTU _must_ be changed at every point/device in the network that cross-communicates
Yes. Every node, the switch, PBS, and the Jetstor all had matching MTUs at 4000. That did not seem to work, and at some point we also tried 9000, which also seemed to not work.

Also keep in mind that you can't skimp on your backup server/network
* slow backup storage can slow down VM during backup
Well, look at that. Right in the link you sent. Haha. Sounds like the solution for the backup speed is to get better storage in there. Either a RAIDZ/Z2 with SSDs, or maybe we can try mirrors/stripes with the SAS drives we have? We could also try throwing in 2 PCIe NVMe SSDs in a mirror; I imagine they don't have to be very big, as it would just be a write cache (or a special device, it seems).
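
One caveat on the special-device idea: in ZFS a special vdev holds pool metadata (and optionally small blocks), not a general write cache, and losing it loses the pool, so it should be mirrored. A sketch, with placeholder pool and device names:

```
# Add a mirrored special vdev to the PBS datastore pool ("backup" is assumed)
zpool add backup special mirror /dev/nvme0n1 /dev/nvme1n1
# Optionally steer small blocks to the special vdev as well
zfs set special_small_blocks=64K backup
```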
 
A very quick peek at your iperf does seem to indicate that the links are good at the network layer.
I can't advise you on how to improve the responsiveness of your existing storage system. I do think that using FIO to establish a baseline might give you a better handle on what is possible in terms of reads/writes and concurrency.
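
A possible fio starting point on the PBS datastore, if it helps (the path, sizes, and job counts are assumptions; 4M blocks roughly match PBS chunk sizes):

```
fio --name=pbs-baseline --directory=/mnt/datastore/backup \
    --rw=randwrite --bs=4M --size=4G --numjobs=4 --iodepth=16 \
    --ioengine=libaio --direct=1 --group_reporting
```

Running it once while idle and once during a backup should show how much headroom the RAIDZ2 actually has.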

We always use 9000 MTU when possible and it has never presented a problem. All of our testing is done with jumbo MTU: https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/

good luck


 
A very quick peek at your iperf does seem to indicate that the links are good at the network layer.
Thank you for taking the time to check it out and provide some feedback. I appreciate it.

I do think that using FIO to establish a baseline might give you a better handle on what is possible
I will look into running this, maybe both during and outside of a backup. I'll see what I can do about making the system as quiet as possible for testing too, but I don't have high hopes for that.
 
What does CPU/IO/network load look like during backup?
I am watching a backup taking place on node 2, and it is currently the only backup running (this VM backs up offset from everything else). The majority of the time the IO wait is <0.20%, but a few times it has climbed to 4.20%, then hovered around 1.00-4.20% before settling back down to <0.20%. I saw a peak as high as 6.23%. Speeds still appear good during this time, though.

EDIT: Watching the main backup task on node 1 (where all but 2 machines are), IO delay gets as high as 18% from what I have seen, and those jumps do coincide with drops in backup performance. Could the issue be specific to a node? I will try moving VMs to node 3 when these are done to see how the next few backups perform. I think doing this will make it recreate the dirty bitmaps (correct me if I'm wrong), so I'll look at a handful of backups after the first one to get an idea.

CPU usage for node 1 (graphed via Grafana). Backup starts at 8, so I did a little before to show the difference.
Screenshot 2023-02-16 at 8.21.59 AM.png

CPU usage for node 2 (graphed via Grafana). Backup starts at 7, so I did a little before to show the difference (still running).
Screenshot 2023-02-16 at 7.32.09 AM.png

PBS stats:
Screenshot 2023-02-16 at 7.40.40 AM.png

Code:
TenGigabitEthernet1/2 is up, line protocol is up (connected)
Hardware is Ten Gigabit Ethernet Port, address is 0022.56bb.6dc1 (bia 0022.56bb.6dc1)
Description: PROXMOX_2_DATA
MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,
reliability 255/255, txload 82/255, rxload 4/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 10Gb/s, link type is auto, media type is 10GBase-SR
input flow-control is on, output flow-control is off
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:03, output never, output hang never
Last clearing of "show interface" counters 3w5d
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 169909000 bits/sec, 128372 packets/sec
5 minute output rate 3233311000 bits/sec, 339969 packets/sec
53995403844 packets input, 40870787653620 bytes, 0 no buffer
Received 84448 broadcasts (81442 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 input packets with dribble condition detected
109242754294 packets output, 136942615103616 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier
0 output buffer failures, 0 output buffers swapped out
TenGigabitEthernet1/5 is up, line protocol is up (connected)
Hardware is Ten Gigabit Ethernet Port, address is 0022.56bb.6dc4 (bia 0022.56bb.6dc4)
Description: JETSTOR_DATA_1
MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,
reliability 255/255, txload 6/255, rxload 83/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 10Gb/s, link type is auto, media type is 10GBase-SR
input flow-control is on, output flow-control is off
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output never, output hang never
Last clearing of "show interface" counters 3w5d
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 3268920000 bits/sec, 341617 packets/sec
5 minute output rate 240462000 bits/sec, 131525 packets/sec
172276633003 packets input, 215208250998627 bytes, 0 no buffer
Received 24087 broadcasts (0 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 input packets with dribble condition detected
80319301553 packets output, 51109713699106 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier
0 output buffer failures, 0 output buffers swapped out
INFO: starting new backup job: vzdump 220 --storage BackupServer1 --mailnotification always --quiet 1 --notes-template '{{guestname}} ({{node}})' --mode snapshot
INFO: Starting Backup of VM 220 (qemu)
INFO: Backup started at 2023-02-16 07:00:01
INFO: status = running
INFO: VM Name: <hidden>
INFO: include disk 'scsi0' 'SharedStorage:vm-220-disk-0' 1T
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: pending configuration changes found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/220/2023-02-16T15:00:01Z'
INFO: started backup task 'b4a52b79-0bd4-40b3-871d-69994dce9034'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: 0% (1.1 GiB of 1.0 TiB) in 3s, read: 360.0 MiB/s, write: 1.3 MiB/s
INFO: 1% (10.3 GiB of 1.0 TiB) in 29s, read: 364.2 MiB/s, write: 5.8 MiB/s
INFO: 2% (20.7 GiB of 1.0 TiB) in 57s, read: 380.0 MiB/s, write: 3.9 MiB/s
INFO: 3% (30.8 GiB of 1.0 TiB) in 1m 25s, read: 371.3 MiB/s, write: 8.0 MiB/s
INFO: 4% (41.2 GiB of 1.0 TiB) in 1m 54s, read: 366.9 MiB/s, write: 2.3 MiB/s
INFO: 5% (51.3 GiB of 1.0 TiB) in 2m 22s, read: 369.0 MiB/s, write: 438.9 KiB/s
INFO: 6% (61.7 GiB of 1.0 TiB) in 2m 56s, read: 313.1 MiB/s, write: 2.2 MiB/s
INFO: 7% (71.9 GiB of 1.0 TiB) in 3m 23s, read: 387.9 MiB/s, write: 758.5 KiB/s
INFO: 8% (82.0 GiB of 1.0 TiB) in 3m 54s, read: 333.0 MiB/s, write: 1.9 MiB/s
INFO: 9% (92.4 GiB of 1.0 TiB) in 4m 21s, read: 393.5 MiB/s, write: 455.1 KiB/s
INFO: 10% (102.7 GiB of 1.0 TiB) in 4m 53s, read: 329.4 MiB/s, write: 1.0 MiB/s
INFO: 11% (112.8 GiB of 1.0 TiB) in 5m 25s, read: 323.0 MiB/s, write: 640.0 KiB/s
INFO: 12% (123.2 GiB of 1.0 TiB) in 6m 3s, read: 281.9 MiB/s, write: 1.2 MiB/s
INFO: 13% (133.4 GiB of 1.0 TiB) in 6m 31s, read: 371.6 MiB/s, write: 5.0 MiB/s
INFO: 14% (143.8 GiB of 1.0 TiB) in 6m 59s, read: 379.1 MiB/s, write: 1.0 MiB/s
INFO: 15% (153.6 GiB of 1.0 TiB) in 7m 27s, read: 360.3 MiB/s, write: 438.9 KiB/s
INFO: 16% (164.1 GiB of 1.0 TiB) in 7m 54s, read: 395.9 MiB/s, write: 1.9 MiB/s
INFO: 17% (174.3 GiB of 1.0 TiB) in 8m 21s, read: 386.7 MiB/s, write: 151.7 KiB/s
INFO: 18% (184.4 GiB of 1.0 TiB) in 8m 48s, read: 384.9 MiB/s, write: 303.4 KiB/s
INFO: 19% (194.8 GiB of 1.0 TiB) in 9m 20s, read: 330.9 MiB/s, write: 0 B/s
INFO: 20% (205.0 GiB of 1.0 TiB) in 9m 49s, read: 362.9 MiB/s, write: 706.2 KiB/s
INFO: 21% (215.4 GiB of 1.0 TiB) in 10m 17s, read: 377.6 MiB/s, write: 1.3 MiB/s
INFO: 22% (225.3 GiB of 1.0 TiB) in 10m 45s, read: 364.4 MiB/s, write: 15.3 MiB/s
INFO: 23% (235.6 GiB of 1.0 TiB) in 11m 13s, read: 377.3 MiB/s, write: 1.4 MiB/s
INFO: 24% (245.9 GiB of 1.0 TiB) in 11m 41s, read: 376.4 MiB/s, write: 731.4 KiB/s
INFO: 25% (256.2 GiB of 1.0 TiB) in 12m 8s, read: 388.3 MiB/s, write: 303.4 KiB/s
INFO: 26% (266.4 GiB of 1.0 TiB) in 12m 35s, read: 387.6 MiB/s, write: 455.1 KiB/s
INFO: 27% (276.8 GiB of 1.0 TiB) in 13m 3s, read: 380.6 MiB/s, write: 585.1 KiB/s
INFO: 28% (286.8 GiB of 1.0 TiB) in 13m 28s, read: 411.5 MiB/s, write: 327.7 KiB/s
INFO: 29% (297.2 GiB of 1.0 TiB) in 13m 56s, read: 378.3 MiB/s, write: 11.3 MiB/s
INFO: 30% (307.2 GiB of 1.0 TiB) in 14m 26s, read: 342.3 MiB/s, write: 1.5 MiB/s
INFO: 31% (317.8 GiB of 1.0 TiB) in 14m 54s, read: 386.7 MiB/s, write: 1.6 MiB/s
INFO: 32% (327.8 GiB of 1.0 TiB) in 15m 21s, read: 381.8 MiB/s, write: 606.8 KiB/s
INFO: 33% (338.0 GiB of 1.0 TiB) in 15m 48s, read: 386.8 MiB/s, write: 1.6 MiB/s
INFO: 34% (348.2 GiB of 1.0 TiB) in 16m 25s, read: 281.5 MiB/s, write: 1.1 MiB/s
INFO: 35% (358.6 GiB of 1.0 TiB) in 16m 52s, read: 395.3 MiB/s, write: 1.3 MiB/s
INFO: 36% (368.6 GiB of 1.0 TiB) in 17m 18s, read: 394.2 MiB/s, write: 2.5 MiB/s
INFO: 37% (379.2 GiB of 1.0 TiB) in 17m 46s, read: 384.3 MiB/s, write: 2.3 MiB/s
INFO: 38% (389.1 GiB of 1.0 TiB) in 18m 13s, read: 378.7 MiB/s, write: 1.3 MiB/s
INFO: 39% (399.5 GiB of 1.0 TiB) in 18m 41s, read: 378.9 MiB/s, write: 1.1 MiB/s
INFO: 40% (409.8 GiB of 1.0 TiB) in 19m 9s, read: 376.1 MiB/s, write: 1.4 MiB/s
INFO: 41% (419.9 GiB of 1.0 TiB) in 19m 36s, read: 383.6 MiB/s, write: 3.9 MiB/s
INFO: 42% (430.2 GiB of 1.0 TiB) in 20m 4s, read: 376.4 MiB/s, write: 585.1 KiB/s
INFO: 43% (440.6 GiB of 1.0 TiB) in 20m 33s, read: 368.4 MiB/s, write: 988.7 KiB/s
INFO: 44% (450.8 GiB of 1.0 TiB) in 21m 3s, read: 348.4 MiB/s, write: 23.6 MiB/s
INFO: 45% (460.9 GiB of 1.0 TiB) in 21m 36s, read: 311.6 MiB/s, write: 868.8 KiB/s
INFO: 46% (471.5 GiB of 1.0 TiB) in 22m 5s, read: 376.1 MiB/s, write: 847.4 KiB/s
INFO: 47% (481.3 GiB of 1.0 TiB) in 22m 27s, read: 456.7 MiB/s, write: 1.6 MiB/s
INFO: 48% (491.5 GiB of 1.0 TiB) in 22m 50s, read: 453.7 MiB/s, write: 2.3 MiB/s
INFO: 49% (502.1 GiB of 1.0 TiB) in 23m 14s, read: 452.0 MiB/s, write: 2.2 MiB/s
INFO: 50% (512.0 GiB of 1.0 TiB) in 23m 36s, read: 460.7 MiB/s, write: 0 B/s
INFO: 51% (522.5 GiB of 1.0 TiB) in 24m 4s, read: 384.3 MiB/s, write: 585.1 KiB/s
INFO: 52% (532.6 GiB of 1.0 TiB) in 24m 30s, read: 396.9 MiB/s, write: 2.6 MiB/s
INFO: 53% (542.9 GiB of 1.0 TiB) in 25m 2s, read: 328.2 MiB/s, write: 640.0 KiB/s
INFO: 54% (553.3 GiB of 1.0 TiB) in 25m 30s, read: 381.6 MiB/s, write: 1.4 MiB/s
INFO: 55% (563.4 GiB of 1.0 TiB) in 25m 57s, read: 382.2 MiB/s, write: 1.6 MiB/s
INFO: 56% (573.8 GiB of 1.0 TiB) in 26m 25s, read: 381.0 MiB/s, write: 585.1 KiB/s
INFO: 57% (583.9 GiB of 1.0 TiB) in 26m 52s, read: 382.4 MiB/s, write: 303.4 KiB/s
INFO: 58% (594.1 GiB of 1.0 TiB) in 27m 20s, read: 374.0 MiB/s, write: 292.6 KiB/s
INFO: 59% (604.2 GiB of 1.0 TiB) in 27m 45s, read: 414.2 MiB/s, write: 163.8 KiB/s
INFO: 60% (614.9 GiB of 1.0 TiB) in 28m 39s, read: 203.0 MiB/s, write: 455.1 KiB/s
INFO: 61% (624.8 GiB of 1.0 TiB) in 29m 3s, read: 421.2 MiB/s, write: 1.2 MiB/s
INFO: 62% (635.0 GiB of 1.0 TiB) in 29m 35s, read: 326.2 MiB/s, write: 768.0 KiB/s
INFO: 63% (645.3 GiB of 1.0 TiB) in 30m 7s, read: 329.4 MiB/s, write: 1.5 MiB/s
INFO: 64% (655.7 GiB of 1.0 TiB) in 30m 40s, read: 322.4 MiB/s, write: 372.4 KiB/s
INFO: 65% (665.7 GiB of 1.0 TiB) in 31m 6s, read: 394.2 MiB/s, write: 472.6 KiB/s
INFO: 66% (676.0 GiB of 1.0 TiB) in 31m 33s, read: 389.8 MiB/s, write: 151.7 KiB/s
INFO: 67% (686.4 GiB of 1.0 TiB) in 32m, read: 395.4 MiB/s, write: 5.5 MiB/s
INFO: 68% (696.3 GiB of 1.0 TiB) in 32m 26s, read: 391.8 MiB/s, write: 787.7 KiB/s
INFO: 69% (706.8 GiB of 1.0 TiB) in 32m 58s, read: 336.2 MiB/s, write: 768.0 KiB/s
INFO: 70% (716.9 GiB of 1.0 TiB) in 33m 30s, read: 321.8 MiB/s, write: 896.0 KiB/s
INFO: 71% (727.0 GiB of 1.0 TiB) in 33m 57s, read: 384.7 MiB/s, write: 151.7 KiB/s
INFO: 72% (737.5 GiB of 1.0 TiB) in 34m 25s, read: 383.7 MiB/s, write: 438.9 KiB/s
INFO: 73% (747.7 GiB of 1.0 TiB) in 34m 52s, read: 387.3 MiB/s, write: 1.0 MiB/s
INFO: 74% (757.9 GiB of 1.0 TiB) in 35m 19s, read: 386.2 MiB/s, write: 455.1 KiB/s
INFO: 75% (768.2 GiB of 1.0 TiB) in 35m 52s, read: 319.4 MiB/s, write: 620.6 KiB/s
INFO: 76% (778.4 GiB of 1.0 TiB) in 36m 19s, read: 387.1 MiB/s, write: 1.2 MiB/s
INFO: 77% (788.8 GiB of 1.0 TiB) in 36m 47s, read: 377.6 MiB/s, write: 1.1 MiB/s
 
Sorry... I did not think about that. I just did a VM migration, so I guess that would be the RAM moving, not storage, since storage is shared. But I think it still shows that the network links and nodes are certainly capable of those speeds, so it is almost certainly something to do with storage or storage access. Does that seem reasonable?
Yes, that means the network is not likely to be the bottleneck.

So by default, since we only set up the one connection, all three purposes (corosync, storage, and migration) are going over that single link. We could move just corosync to a 1Gb interface to make sure large amounts of data moving around never cause an issue for corosync. Since we only have the two 10Gb connections, and users are looking for faster access, we may try having migrations and storage share a path. Unless that is unreasonable?
Corosync doesn't need much bandwidth, but it does need low latency. And yes, putting corosync on a dedicated network is best.
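For reference, once the extra NICs exist, separating the traffic types is mostly configuration. A minimal sketch (the interface addresses, subnets, and node names below are assumptions for illustration, not from this cluster):

```
# /etc/pve/datacenter.cfg -- pin live-migration traffic to a dedicated subnet
migration: secure,network=10.10.20.0/24

# /etc/pve/corosync.conf (fragment) -- put corosync on its own 1Gb network,
# with the shared network as a fallback link. Always edit a copy, bump
# config_version in the totem section, and only then move it into place.
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.30.1   # dedicated corosync (1Gb) network
    ring1_addr: 10.10.10.1   # fallback over the existing 10Gb network
  }
}
```

With two corosync links configured, a flood of backup or migration traffic on the 10Gb side can no longer starve cluster membership traffic of low-latency delivery.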

I am watching a backup taking place on node 2, and it is currently the only backup running (this VM backs up offset from everything else). The large majority of the time the IO wait is <0.20%, but a few times it has climbed to 4.20%, hovered around 1.00-4.20%, and then settled back down to <0.20%. I saw a peak as high as 6.23%. Speeds still look good during this time, though.

EDIT: Watching the main backup task on node 1 (where all but 2 machines are), IO delay gets as high as 18% from what I have seen. And those jumps do coincide with drops in backup performance. Could an issue be specific to a node?
Could be an indication that there are times when there are too many workers for the storage. Best to try a lower value for max-workers and see what difference it makes.

What read performance do you get from the storage in other scenarios, e.g. copying large files or some benchmark?
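One common way to answer that question is a quick fio run from a node against the shared storage. A sketch (the test-file path is an assumption; point it at a scratch file on the Jetstor-backed storage, and never at a device or file holding live VM data):

```
# Sequential read, roughly what a single backup's read phase looks like
fio --name=seqread --rw=read --bs=4M --size=10G --numjobs=1 \
    --ioengine=libaio --direct=1 \
    --filename=/mnt/pve/SharedStorage/fio-test.bin

# Random read with queue depth, closer to several concurrent backup workers
fio --name=randread --rw=randread --bs=64k --iodepth=16 --size=10G \
    --ioengine=libaio --direct=1 \
    --filename=/mnt/pve/SharedStorage/fio-test.bin
```

If sequential reads are fast but random reads collapse, that would support the theory that it is the number of concurrent workers, rather than raw throughput, that hurts the storage.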

I will try moving to node 3 when these are done to see how the next few backups perform. I think doing this will make it recreate bitmaps (correct me if I'm wrong), so I'll look at a handful of backups after the first one to get an idea.
No, live migration should preserve the bitmaps ;) Shutdown will clear them, as will an aborted previous backup.
 
No, live migration should preserve the bitmaps ;) Shutdown will clear them, as will an aborted previous backup.
I am probably guilty of that... I stop them when I am testing so the performance does not suffer.

Could be a indications that there are times when there's too many workers for the storage. Best to try a lower value for max-workers and see what difference it makes
Max workers looks like it is in Datacenter -> Options, is that correct? That is currently 4. Is the thought that 4 is just too many for the storage? Or that 4 is too many for the PVE CPU? Should I step down and do 3....2...1? Or just try straight to 2?
 
I am probably guilty of that... I stop them when I am testing so the performance does not suffer.


Max workers looks like it is in Datacenter -> Options, is that correct? That is currently 4. Is the thought that 4 is just too many for the storage? Or that 4 is too many for the PVE CPU? Should I step down and do 3....2...1? Or just try straight to 2?
No, that's a different max-workers setting, used for bulk actions. For QEMU backups the default is 16. I would try a rather low value first to see what difference it makes, but the best choice is highly dependent on the setup, so I can't give a blanket statement of what value is best.

See the link in my first reply for where to set the backup-specific option:
Well, how does the load look like during backup? If you have much IO wait in on the PVE side, I recommend trying to lower the max-workers setting, see here: https://forum.proxmox.com/threads/t...ad-behavior-during-backup.118430/#post-513106
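For reference, the backup-specific knob can be set node-wide in /etc/vzdump.conf or per invocation. A sketch (the value 8 is just an example starting point, not a recommendation; tune it against your own storage):

```
# /etc/vzdump.conf -- applies to all backup jobs run on this node
performance: max-workers=8
```

Or as a one-off test against a single VM:

```
vzdump 220 --storage BackupServer1 --performance max-workers=8
```

Comparing the per-percent read rates and the node's IO delay between runs at different values should show fairly quickly whether fewer workers helps this storage.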
 
