High iowait when inside VM

Alexey Pavlyuts

Hi All,

I am running Proxmox Backup Server inside a PVE VM. Yes, I know it is not best practice, but I can't afford dedicated hardware for it. The backup data is placed on a ZFS raid5 pool of 4 disks; the disks are passed through to the VM as whole units.
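For reference, the whole-disk passthrough was done roughly like this (the VM ID and disk IDs below are placeholders, not my real ones):

Code:
# attach each physical disk to the VM as a whole unit
qm set 200 -scsi1 /dev/disk/by-id/scsi-<disk1-id>
qm set 200 -scsi2 /dev/disk/by-id/scsi-<disk2-id>
qm set 200 -scsi3 /dev/disk/by-id/scsi-<disk3-id>
qm set 200 -scsi4 /dev/disk/by-id/scsi-<disk4-id>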

What I see is extremely high iowait inside the VM at backup times and almost no iowait increase on the host.

iowait.png
The green line is the VM, the red one is the host; the spikes on the host are related to VM replication inside the PVE cluster.

The Backup server VM config:

1744961294848.png
All this seems quite strange to me. I have not found a way to allocate more IO to the VM, especially for passed-through block devices.

Unfortunately, I can't pass it through as a PCI device because an HP RAID controller in direct mode is used.

I would very much appreciate any suggestions on this case!
 
Hi,
The backup data is placed on a ZFS raid5 pool of 4 disks; the disks are passed through to the VM as whole units.
What disks are you using? Also note that RAID5 will perform badly with the IO load produced by PBS. Using RAID10 with an additional mirror as special device for fast metadata access will typically give you better performance.
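Such a layout could be created roughly as follows; this is only a sketch, and the pool name and disk/SSD paths are placeholders to adapt to your hardware:

Code:
# RAID10: two mirrored pairs striped together
zpool create -o ashift=12 backup \
    mirror /dev/disk/by-id/disk-a /dev/disk/by-id/disk-b \
    mirror /dev/disk/by-id/disk-c /dev/disk/by-id/disk-d
# optional: small mirrored SSDs as special vdev to keep metadata on fast storage
zpool add backup special mirror /dev/disk/by-id/ssd-a /dev/disk/by-id/ssd-b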

What I see is extremely high iowait inside the VM at backup times and almost no iowait increase on the host.
During backup, PBS needs to write many small chunk files to disk, so this will produce IO. I assume that this is not at all related to the fact that you run PBS in a VM, but rather to the performance limit of your storage.

I would suggest doing some baseline performance testing with tools such as fio.
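For example, a random-write run on the datastore path could look something like this (the directory, size and job counts are just examples to adapt):

Code:
# run inside the PBS VM against the datastore filesystem
fio --name=pbs-baseline --directory=/mnt/datastore/backup \
    --rw=randwrite --bs=4k --iodepth=16 --numjobs=4 \
    --ioengine=libaio --size=2G --runtime=60 --time_based \
    --group_reporting

Since PBS chunks are typically in the MiB range, repeating the run with a larger block size (e.g. --bs=1M) is also worth a look.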
 
What disks are you using?
Toshiba enterprise-level 10K SAS HDDs, 4 pcs of 1.2 TB each.

Also note that RAID5 will perform badly with the IO load produced by PBS. Using RAID10 with an additional mirror as special device for fast metadata access will typically give you better performance.
Frankly, I do not care too much about performance, as it is quite enough to complete all backup tasks for my small 4-node cluster. I am more concerned about the IO overload alarms I am getting in Zabbix. I have not experimented with a bandwidth limit yet, but that could probably calm the system down.
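If I do try the bandwidth limit, it looks like a node-wide default can be set in /etc/vzdump.conf (value in KiB/s, the number is just an example):

Code:
# /etc/vzdump.conf - default bandwidth limit for backup jobs, in KiB/s
bwlimit: 51200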

Thank you for the hint on RAID10, I will think about it. The current array is 3.6 TB, but PBS turned out to be much more space-saving than I expected; all my backups are just 700 GB at the moment, so I can probably afford RAID10 2+2 with a size of 2.4 TB. The disks are cheap; the problem is that my servers have a limited bay count.

During backup, PBS needs to write many small chunk files to disk, so this will produce IO. I assume that this is not at all related to the fact that you run PBS in a VM, but rather to the performance limit of your storage.

I would suggest doing some baseline performance testing with tools such as fio.

I understand that PBS and ZFS are not the best friends. I tried to research this and found some mentions that PBS creates a lot of IO load when ZFS is used as the backup store, and that this has some deep and unclear reasons.

Also, could you recommend the best ZFS configuration, meaning ashift and the use of ZFS compression? Should the ZFS ashift match the PBS block size, and if so, what block size should it match? Does ZFS compression make any sense when PBS compresses all the data itself?
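Just to be clear which knobs I mean, checking the current values would be something like this (the pool/dataset name is a placeholder):

Code:
# show the pool's ashift and the dataset's compression/recordsize
zpool get ashift backup
zfs get compression,recordsize backup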
 
Also, it seems like it was a bad idea to set up the datastore for immediate verification. As I understand it, verify means reading all the snapshot content not yet verified, so it creates a great IO load by itself, and it is especially bad with concurrent writes and reads from different disk areas.
I have turned it off and scheduled a separate verify job, so let's see how the performance changes.
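On the CLI side, I assume this corresponds to the datastore's "verify new" option plus a standalone verify job, roughly like below; the datastore name and schedule are placeholders, and the exact option and command names should be checked against proxmox-backup-manager help:

Code:
# stop verifying every snapshot right after backup
proxmox-backup-manager datastore update backup --verify-new false
# run verification as its own scheduled job instead
proxmox-backup-manager verify-job create daily-verify --store backup --schedule daily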
 
Also, it seems like it was a bad idea to set up the datastore for immediate verification. As I understand it, verify means reading all the snapshot content not yet verified, so it creates a great IO load by itself, and it is especially bad with concurrent writes and reads from different disk areas.
It turned out that immediate verification was causing the IO overload. When I turned it off, iowait dropped from a 38% peak to a 13% peak, which seems affordable now.

I think the cause was that the jobs were set to start at once on all the servers; the big server then runs its backup jobs one by one, and every VM/CT backup immediately triggers a verification job. Verification takes even longer than the backup itself, so too many concurrent jobs start, blocking each other and raising iowait to high levels.
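Staggering the per-node start times should avoid that. As an illustration only (the same staggering can be set in the Datacenter backup job schedules), cron-style vzdump entries shifted by an hour per node:

Code:
# node1 crontab entry
0 21 * * * root vzdump --all --storage pbs-backup --mode snapshot --quiet 1
# node2 crontab entry
0 22 * * * root vzdump --all --storage pbs-backup --mode snapshot --quiet 1
# node3 crontab entry
0 23 * * * root vzdump --all --storage pbs-backup --mode snapshot --quiet 1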
 