High IO on disks during backup

rontex

New Member
Oct 12, 2023
Ukraine
During a backup, the backup process itself consumes all of the virtual machine's disk resources. The performance: max-workers=1 parameter in /etc/vzdump.conf has no effect on this. While a backup is running, processes that need access to the VM's disk stall waiting for IO; iostat inside the virtual machine shows %iowait climbing from a normal 0.1% to 50-70%.

We also tried changing the ionice and bwlimit values; this doesn't change the situation either.

Storage layout: an LVM-thin volume on top of an mdadm RAID 10 array built from 8 NVMe disks.

During a backup, IO on the physical machine itself is fine; the problem is observed only inside the virtual machine.
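The guest-side stall can also be watched without iostat: %iowait is derived from the iowait field in /proc/stat. A minimal sketch (this reads the cumulative counters since boot, whereas iostat computes deltas between samples):

```shell
# Read the aggregate "cpu" line from /proc/stat (Linux only).
# Fields: user nice system idle iowait ...; iowait is the counter %iowait is based on.
read -r _ user nice system idle iowait _ < /proc/stat
total=$((user + nice + system + idle + iowait))
awk -v io="$iowait" -v t="$total" \
    'BEGIN { printf "cumulative iowait share: %.1f%%\n", 100 * io / t }'
```

Running this inside the guest before and during the backup makes the jump easy to capture in a log.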

I have attached the virtual machine's disk settings as a picture.

We are running the latest version of pve-qemu-kvm:
ii pve-qemu-kvm 8.0.2-6

Has anyone encountered and solved a similar problem? Thank you.
 

Attachments: DiskVM.png
Model Number: VK007680KWWFQ
It is very unlikely that the problem is with the disks themselves, since the file system on the physical server is not loaded during the backup and the other virtual machines on the same server work normally. The problem affects only the virtual machine currently being backed up.
 
I assembled a test server. This time the disk subsystem is a hardware RAID 10 of 4 SAS SSDs. I installed Proxmox and created a virtual machine on Ubuntu 20 exactly as described above, except that I used ZFS instead of LVM.

I run the test with fio, both on the physical machine against the assembled storage and inside a virtual machine that lives on the same storage:

Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test3 --filename=test --bs=4k --iodepth=16 --size=20G --readwrite=randrw

Backup is created on Proxmox Backup server storage.

Results before running the backup:
In a virtual machine
Jobs: 1 (f=1): [m(1)][100.0%][r=102MiB/s,w=101MiB/s][r=26.1k,w=25.0k IOPS][eta 00m:00s]
On the server itself
Jobs: 1 (f=1): [m(1)][20.0%][r=4820KiB/s,w=4784KiB/s][r=1205,w=1196 IOPS][eta 08m:24s]

During backup:
In a virtual machine:
Jobs: 1 (f=1): [m(1)][89.5%][r=100KiB/s,w=76KiB/s][r=25,w=19 IOPS][eta 00m:15s]
On the server itself:
Jobs: 1 (f=1): [m(1)][25.4%][r=10.8MiB/s,w=11.0MiB/s][r=2759,w=2818 IOPS][eta 10m:32s]
After stopping the backup, IOPS instantly recover to their pre-backup values. Clearly, once the backup starts, the virtual machine's IOPS drop to unusable levels, while on the host they even grow.
The options performance: max-workers=1 and ionice: 8 in /etc/vzdump.conf do not affect the situation at all.
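For reference, the settings tried correspond to an /etc/vzdump.conf along these lines (the bwlimit value shown is an illustrative assumption, not one from this thread):

```
# /etc/vzdump.conf -- backup defaults; none of these helped in this case
performance: max-workers=1
ionice: 8
# bwlimit: 100000   # KiB/s cap, illustrative value; also had no effect
```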
Is there any way to reduce the impact of backup on the performance of the file system inside the virtual machine?
 
Hi,
I assume you are using snapshot-mode backups and are running into a situation where the backup target is the bottleneck. To keep a snapshot-mode backup consistent, blocks the VM is about to overwrite have to be sent to the backup target before the new data is persisted to disk. This causes issues if the backup target is slow or has high latency. See this issue for details: https://bugzilla.proxmox.com/show_bug.cgi?id=3231#c4.

To improve performance, you can write the backup to local storage first and then move it to an off-host target, or set up a faster backup target.
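As a sketch of the first option (the VMID 100, dump directory, and remote path are assumptions for illustration), the two-step approach could look like:

```
# 1) snapshot-mode backup to fast local storage, so guest writes only
#    wait on the local disks (hypothetical VMID 100)
vzdump 100 --mode snapshot --dumpdir /var/lib/vz/dump

# 2) move the archive to the off-host target afterwards, off the hot path
rsync -av /var/lib/vz/dump/ backup-host:/srv/backups/
```

The trade-off is that you temporarily need local space for the archive and lose PBS deduplication for the first hop.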

File system - LVM-thin volume is used, which is created on Raid 10 mdadm
On a side note: we only support ZFS as a software RAID solution, so it is best to use hardware RAID or ZFS.
 
That's right, I'm using snapshot as my backup type. The freeze-fs-on-backup=0 option is set on the virtual machine.
The target server where the backup is sent should not be a bottleneck: it is reachable over a 10G link, and the file system on the backup server is not under load even with 9 parallel backups from different servers.

Thanks for the answer, but it’s not clear why the parameters set in /etc/vzdump.conf have no effect on the situation?
Why is the result the same with performance: max-workers=1 and with performance: max-workers=16?
 
It is reachable over a 10G link, and the file system on the backup server is not under load even with 9 parallel backups from different servers.
Can you check iostats and cpu usage on the Proxmox Backup Server? What hardware are you using on the PBS side?

Thanks for the answer, but it’s not clear why the parameters set in /etc/vzdump.conf have no effect on the situation?
Why is the result the same with performance: max-workers=1 and with performance: max-workers=16?
max-workers controls how many tasks QEMU spawns for block-copy operations. It might have an influence if your CPU/storage were overwhelmed by these operations, which you showed not to be the case.

I suggest you set up a local PBS instance in a VM and run a backup to that target, to rule out the network and/or the backup target storage as the root cause of your performance issues.

You can also run a benchmark using proxmox-backup-client benchmark --repository <your-target-repo> and share your findings.
 
Can you check iostats and cpu usage on the Proxmox Backup Server? What hardware are you using on the PBS side?

Code:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    0.56    2.72    0.00   95.72

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util

sda              5.50     48.00     0.00   0.00    9.36     8.73    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.05   7.60
sdaa             8.00     62.00     0.00   0.00    9.00     7.75    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.07  10.20
sdab             7.50     56.00     0.00   0.00    6.13     7.47    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.05   7.00
sdac             5.00     44.00     0.00   0.00    8.10     8.80    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.04   5.20
sdad             7.00     54.00     0.00   0.00    6.29     7.71    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.04   7.60
sdae             5.00     44.00     0.00   0.00   12.90     8.80    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.06   8.40
....

The situation is similar on the other disks. This is with one backup in progress plus garbage collection running.
You can also run a benchmark using proxmox-backup-client benchmark --repository <your-target-repo> and share your findings.
Code:
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 141.43 MB/s (11%)  │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 1638.34 MB/s (81%) │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 404.69 MB/s (54%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 529.89 MB/s (44%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 398.18 MB/s (53%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1145.54 MB/s (31%) │
└───────────────────────────────────┴────────────────────┘

Perhaps the problem is high latency due to the backup server being far away: the average ping is 15-16 ms. I'll try to set up a similar test backup server in the same location and run the tests to rule out network latency as the cause. I'll report back after the tests.
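A rough sanity check fits this theory. If we assume (and it is only an assumption about the mechanism) that a guest write to a block not yet backed up cannot complete until one round trip to the backup target has finished, latency alone bounds the write IOPS:

```shell
# Upper bound on serialized write IOPS when every write waits one RTT.
rtt_ms=16   # measured average ping from the thread
awk -v rtt="$rtt_ms" \
    'BEGIN { printf "max ~%.1f synchronous write IOPS at %d ms RTT\n", 1000 / rtt, rtt }'
```

1000 / 16 = 62.5 IOPS as an upper bound is the same order of magnitude as the 19-25 write IOPS measured inside the VM during the backup.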
 
I set up a similar backup server, but in the same location as the main server. The problem has become much less noticeable and quite tolerable.
Apparently the cause of the delays was high network latency; a ping of 20 ms turned out to be critical. Thanks to all who responded. The issue can be considered resolved.
 