High IO on disks during backup

rontex

New Member
Oct 12, 2023
Ukraine
During a backup, the backup process itself consumes all of the virtual machine's disk resources. The performance: max-workers=1 parameter in /etc/vzdump.conf has no effect on this. While a backup is running, processes that need access to the VM's disk stall waiting for IO; iostat inside the virtual machine shows %iowait climbing from a normal 0.1% to 50-70%.

We also tried changing the ionice and bwlimit values; this doesn't change the situation either.

Storage layout: an LVM-thin volume on top of an mdadm RAID 10 array built from 8 NVMe disks.

During a backup, IO on the physical machine itself is fine; the problem is observed only inside the virtual machine.
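The guest-side stall can also be watched without iostat: %iowait is derived from the iowait field in /proc/stat. A minimal sketch (this reads the cumulative counters since boot, whereas iostat computes deltas between samples):

```shell
# Read the aggregate "cpu" line from /proc/stat (Linux only).
# Fields: user nice system idle iowait ...; iowait is the counter %iowait is based on.
read -r _ user nice system idle iowait _ < /proc/stat
total=$((user + nice + system + idle + iowait))
awk -v io="$iowait" -v t="$total" \
    'BEGIN { printf "cumulative iowait share: %.1f%%\n", 100 * io / t }'
```

Running this inside the guest before and during the backup makes the jump easy to capture in a log.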

I have attached the virtual machine's disk settings as a picture.

We are running the latest version of pve-qemu-kvm:
ii pve-qemu-kvm 8.0.2-6

Has anyone encountered and solved a similar problem? Thank you.
 

Attachments: DiskVM.png
Model Number: VK007680KWWFQ
It is very unlikely that the problem is with the disks themselves, since the file system on the physical server is not loaded during the backup and the other virtual machines on the same server work normally. The problem affects only the virtual machine currently being backed up.
 
I assembled a test server. This time the disk subsystem is a hardware RAID 10 of 4 SAS SSDs. I installed Proxmox and created a virtual machine on Ubuntu 20 exactly as described above, except that I used ZFS instead of LVM.

I run the test with fio, both on the physical machine against the assembled storage and inside a virtual machine that lives on the same storage:

Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test3 --filename=test --bs=4k --iodepth=16 --size=20G --readwrite=randrw

Backup is created on Proxmox Backup server storage.

Results before running the backup:
In a virtual machine
Jobs: 1 (f=1): [m(1)][100.0%][r=102MiB/s,w=101MiB/s][r=26.1k,w=25.0k IOPS][eta 00m:00s]
On the server itself
Jobs: 1 (f=1): [m(1)][20.0%][r=4820KiB/s,w=4784KiB/s][r=1205,w=1196 IOPS][eta 08m:24s]

During backup:
In a virtual machine:
Jobs: 1 (f=1): [m(1)][89.5%][r=100KiB/s,w=76KiB/s][r=25,w=19 IOPS][eta 00m:15s]
On the server itself:
Jobs: 1 (f=1): [m(1)][25.4%][r=10.8MiB/s,w=11.0MiB/s][r=2759,w=2818 IOPS][eta 10m:32s]
After stopping the backup, IOPS instantly recover to their pre-backup values. Clearly, once the backup starts, the virtual machine's IOPS drop to unusable levels, while on the host they even grow.
The options performance: max-workers=1 and ionice: 8 in /etc/vzdump.conf do not affect the situation at all.
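For reference, the settings tried correspond to an /etc/vzdump.conf along these lines (the bwlimit value shown is an illustrative assumption, not one from this thread):

```
# /etc/vzdump.conf -- backup defaults; none of these helped in this case
performance: max-workers=1
ionice: 8
# bwlimit: 100000   # KiB/s cap, illustrative value; also had no effect
```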
Is there any way to reduce the impact of backup on the performance of the file system inside the virtual machine?
 
Hi,
I assume you are using snapshot-mode backups and are running into a situation where the backup target is the bottleneck. To keep a snapshot-mode backup consistent, blocks the VM is about to overwrite have to be sent to the backup target before the new data is persisted to disk. This causes issues if the backup target is slow or has high latency. See this issue for details: https://bugzilla.proxmox.com/show_bug.cgi?id=3231#c4.

To improve performance, you can write the backup to local storage first and then move it to an off-host target, or set up a faster backup target.
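As a sketch of the first option (the VMID 100, dump directory, and remote path are assumptions for illustration), the two-step approach could look like:

```
# 1) snapshot-mode backup to fast local storage, so guest writes only
#    wait on the local disks (hypothetical VMID 100)
vzdump 100 --mode snapshot --dumpdir /var/lib/vz/dump

# 2) move the archive to the off-host target afterwards, off the hot path
rsync -av /var/lib/vz/dump/ backup-host:/srv/backups/
```

The trade-off is that you temporarily need local space for the archive and lose PBS deduplication for the first hop.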

File system - LVM-thin volume is used, which is created on Raid 10 mdadm
On a side note: we only support ZFS as a software RAID solution, so it is best to use hardware RAID or ZFS.
 
That's right, I'm using snapshot as my backup type. The freeze-fs-on-backup=0 option is set on the virtual machine.
The target server where the backup is sent should not be a bottleneck: it is reachable over a 10G link, and the file system on the backup server is not under load even with 9 parallel backups from different servers.

Thanks for the answer, but it’s not clear why the parameters set in /etc/vzdump.conf have no effect on the situation?
Why is the result the same with performance: max-workers=1 and with performance: max-workers=16?
 
It is reachable over a 10G link, and the file system on the backup server is not under load even with 9 parallel backups from different servers.
Can you check iostats and cpu usage on the Proxmox Backup Server? What hardware are you using on the PBS side?

Thanks for the answer, but it’s not clear why the parameters set in /etc/vzdump.conf have no effect on the situation?
Why is the result the same with performance: max-workers=1 and with performance: max-workers=16?
max-workers controls how many tasks QEMU spawns for block-copy operations. It might have an influence if your CPU/storage were overwhelmed by these operations, which you showed not to be the case.

I suggest you set up a local PBS instance in a VM and run a backup to that target, to rule out the network and/or the backup target storage as the root cause of your performance issues.

You can also run a benchmark using proxmox-backup-client benchmark --repository <your-target-repo> and share your findings.
 
Can you check iostats and cpu usage on the Proxmox Backup Server? What hardware are you using on the PBS side?

Code:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    0.56    2.72    0.00   95.72

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util

sda              5.50     48.00     0.00   0.00    9.36     8.73    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.05   7.60
sdaa             8.00     62.00     0.00   0.00    9.00     7.75    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.07  10.20
sdab             7.50     56.00     0.00   0.00    6.13     7.47    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.05   7.00
sdac             5.00     44.00     0.00   0.00    8.10     8.80    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.04   5.20
sdad             7.00     54.00     0.00   0.00    6.29     7.71    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.04   7.60
sdae             5.00     44.00     0.00   0.00   12.90     8.80    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.06   8.40
....

The situation is similar on the other disks. This is with one backup in progress plus garbage collection running.
You can also run a benchmark using proxmox-backup-client benchmark --repository <your-target-repo> and share your findings.
Code:
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 141.43 MB/s (11%)  │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 1638.34 MB/s (81%) │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 404.69 MB/s (54%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 529.89 MB/s (44%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 398.18 MB/s (53%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1145.54 MB/s (31%) │
└───────────────────────────────────┴────────────────────┘

Perhaps the problem is high latency due to the backup server being far away: the average ping is 15-16 ms. I'll try to set up a similar test backup server in the same location and run the tests to rule out network latency as the cause. I'll report back after the tests.
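A rough sanity check fits this theory. If we assume (and it is only an assumption about the mechanism) that a guest write to a block not yet backed up cannot complete until one round trip to the backup target has finished, latency alone bounds the write IOPS:

```shell
# Upper bound on serialized write IOPS when every write waits one RTT.
rtt_ms=16   # measured average ping from the thread
awk -v rtt="$rtt_ms" \
    'BEGIN { printf "max ~%.1f synchronous write IOPS at %d ms RTT\n", 1000 / rtt, rtt }'
```

1000 / 16 = 62.5 IOPS as an upper bound is the same order of magnitude as the 19-25 write IOPS measured inside the VM during the backup.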
 
I set up a similar backup server, but in the same location as the main server. The problem has become much less noticeable and quite tolerable.
Apparently the cause of the delays was high network latency; a ping of 20 ms turned out to be critical. Thanks to all who responded. The issue can be considered resolved.
 