High(ish) IO Delays during PBS (VM in PVE) running verify jobs and backup jobs

May 4, 2021
Hello everyone. I am a bit lost at this stage, but maybe I am being too concerned about it as well. Your opinion is appreciated.

Server setup:

• DL380 Gen9 (141 GB RAM, 2x Xeon E5-2620 v3) running TrueNAS Scale with a ZFS pool (4x 10 TB disks in RAIDZ1, plus a mirrored 1 TB NVMe SLOG pool).
• DL360 Gen9 (128 GB RAM, 2x Xeon E5-2620 v3) running PVE. PVE has 1x Samsung MZPLJ6T4HALA-00007 (6.4 TB) as VM storage in a single-disk ZFS setup.
• Dell server running TrueNAS Scale, which receives replication from the DL380 for further data redundancy.

PVE (7.4.3) hosts:

Proxmox Backup Server (2.4.1), with the system drive on that 6.4 TB Samsung disk. The datastore disks sit on TrueNAS Scale: Proxmox mounts the NFS share as storage, the virtual disks are created on it inside Proxmox and passed to the VM, so the VM is not aware it is running over NFS. There are 10 datastore VM disks in total.

See the attached 110.txt and storage.txt for further details.

The connection between Proxmox VE and TrueNAS Scale is direct (no switch) via 2x 40 Gb QSFP ports in a balanced bond (80 Gb total), using Direct Attach Copper cables. All network interfaces are set to 9000 MTU, including inside the PBS VM.
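
To rule out an MTU mismatch, jumbo frames can be verified end to end roughly like this (the interface name and IP are examples, not my exact setup):

Code:
# check the configured MTU on the bond
ip link show bond0

# 8972 = 9000 minus 28 bytes of IP/ICMP headers; -M do sets the
# Don't Fragment flag, so this fails if any hop drops jumbo frames
ping -M do -s 8972 10.10.10.2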

Issue:

After I replaced the old DL380 with the current DL360 (fresh Proxmox installation; VMs restored from backups), things changed. When Proxmox Backup Server runs verify jobs, I get around 30% IO delay on average on PVE. When backups come in over WAN traffic to the PBS VM (1 Gigabit fibre download), I get around 6% IO delay on PVE. I don't remember seeing these delays before; they used to sit around 0.0%. Verify jobs also now take hours where before they finished in minutes.
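
For anyone wanting to reproduce the observation, this is roughly how the IO wait can be watched on the PVE host while a verify job runs (a sketch, assuming the sysstat package is installed):

Code:
apt install sysstat

# %iowait in the CPU line matches the IO delay graph in the PVE GUI;
# the per-device columns show which disk is actually waiting
iostat -xm 2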

I have a suspicion that NFS is to blame, but I can't find anything wrong; see the Studies section.

Concern:

Since I replaced the old DL380 with the current DL360, I am concerned the swap itself is the issue. I thought the difference between a DL380 and a DL360 was really just the chassis type and layout.

Studies:

iperf3 results between Proxmox and TrueNAS Scale average 20 Gbit/s on a single connection. When I run 4 connections in one go, I get around 17.5 Gbit/s per connection, meaning I am saturating that 80 Gb bond well enough (I think, haha).
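
Roughly the commands used (the IP is an example; -P 4 runs the parallel streams within a single test):

Code:
# on the TrueNAS side
iperf3 -s

# on the PVE side: single stream, then 4 parallel streams
iperf3 -c 10.10.10.2
iperf3 -c 10.10.10.2 -P 4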

Proxmox Backup Client Benchmark on local LAN results:

Time per request: 9201 microseconds.
TLS speed: 455.83 MB/s
SHA256 speed: 344.63 MB/s
Compression speed: 341.65 MB/s
Decompress speed: 515.24 MB/s
AES256/GCM speed: 1037.65 MB/s
Verify speed: 183.61 MB/s

Proxmox Backup Client Benchmark over WAN results:

Time per request: 230241 microseconds.
TLS speed: 18.22 MB/s
SHA256 speed: 369.91 MB/s
Compression speed: 403.25 MB/s
Decompress speed: 604.02 MB/s
AES256/GCM speed: 1183.09 MB/s
Verify speed: 243.91 MB/s
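
Both result sets above come from the standard client benchmark; the repository string here is just an example:

Code:
proxmox-backup-client benchmark --repository root@pam@pbs.example.com:datastore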

I've read here on the forum that PBS writes data in small chunks, so I thought that running an iSCSI disk from TrueNAS to PVE would improve IO, since iSCSI is known to be better for small read/write operations. No joy; it's pretty much the same.

I've also tried amending the NFS share options inside PVE (see storage.txt) to see if it would improve NFS performance; that didn't help either.
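
For context, this is the kind of thing I was tweaking in /etc/pve/storage.cfg (the storage name, IP, paths, and mount options are illustrative, not my exact config; see storage.txt for that):

Code:
nfs: truenas-pbs
        server 10.10.10.2
        export /mnt/tank/pbs
        path /mnt/pve/truenas-pbs
        content images
        options vers=4.2,rsize=1048576,wsize=1048576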

Notes:
The DL360 that replaced the DL380 has the same configuration; all components were moved to the new chassis. The only difference is that Proxmox used to be installed on a single NVMe drive, whereas it now sits on a mirrored pool of 2 SAS drives on a RAID card. (From internet research I assured myself that Proxmox doesn't care what it is installed on, and that it doesn't affect VM performance as long as the VM disks are not stored on the RAID card, which they are not.)

Conclusion:

Either I do indeed have some issue somewhere or I am wrong and these IO delays are fine for my setup.

I would appreciate pointers to any relevant topic on the internet, any user input, or, even better, an actual solution to my problem, hehe.
 

Attachments

  • storage.txt
  • 110.txt
I've managed to fix the issue myself after days of trying different things. The issue was simply TrueNAS Scale, though I can't tell why.

I get much faster speeds between TrueNAS CORE and PVE, reaching almost 80 Gb/s, versus only 19 Gb/s between TrueNAS SCALE and PVE.
I thought that maybe something was wrong with my SCALE installation, so I ran a fresh install of SCALE, but no joy at all. When I installed TrueNAS CORE and simply set up the same configuration as on SCALE, out of the box with no advanced tuning, CORE was just plain faster and fully saturated the 80 Gb connection. My IO delays are now on average 0.08 to 0.125%.

Hopefully someone will find this helpful. For now I am sticking with TrueNAS CORE, as I don't need the SCALE features.
 
Thank you! After months of struggling with Proxmox Backup Server, I found this post, and it is so helpful to me.

What started off as a single installation of TrueNAS Scale about a year ago has grown into four different NAS installations spread across a cluster.

My original environment was approximately:
  • TrueNAS Scale with ZFS over iSCSI, using TheGrandWazoo's API integration.
  • RAIDZ1, 3x magnetic 14 TB drives
  • Four small underpowered hypervisors sharing the NAS and small workloads
  • Mostly 1 Gbit/s
I then did another TrueNAS Scale installation, attached it to two new hosts, and added some heavy workloads.

Note: Against the advice of Proxmox, I use 2x virtualized Proxmox Backup Servers. I do this because I have to deploy limited resources remotely.

At this point I had around 20 VMs all working nicely and backing up most of the time. Almost 100% success rate.

I then migrated another data centre, and then things exploded. The environment is now approximately:
  • Four TrueNAS Scale installations with ZFS over iSCSI, running magnetic and SSD drives, all RAIDZ1
  • About 50 VMs
  • Mostly 10 Gbit/s
Somewhere between 20 VMs and around 40 VMs being backed up daily, I started experiencing problems.

Since there are two PBS installations it is easy to compare, but both run on TrueNAS Scale.

I chose Scale because of Debian familiarity, but after reading your post I can see how this might actually have been a big mistake. I did some background reading, and I also don't really need Scale; so far it seems that Core is much more mature.

Anyway, the one specific problem I encounter with backups is similar to yours:

- Verification on one NAS is super fast, and on the other it is super slow.

When backups don't work, a VM starts backing up and then, after around 3% or 5% or so, it just stops. Some VMs always back up and some just never do. Just before the abort I often see many repeating lines like this:

Code:
Jul 12 04:06:12 example.hv.com QEMU[3644222]: kvm: iSCSI GET_LBA_STATUS failed at lba 139804672: SENSE KEY:ILLEGAL_REQUEST(5) ASCQ:INVALID_FIELD_IN_CDB(0x2400)

Here is the interesting part:

The cluster and the NASes attached to the VMs perform exceptionally, all of the time! Only when I do backups do things fall apart...

I tried to eliminate numerous things in my network and haven't slept much:
  • More bandwidth between nodes
  • Becoming an expert at iotop and reading IO.
  • Converting block sizes from 4K to 8K to 16K on the ZFS (see the sketch after this list).
  • Going down the rabbit hole of VirtIO vs VirtIO Single, IO Thread, and Async IO io_uring vs threads. I stopped just before going mad.
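
For the block-size tests: note that volblocksize on a ZFS zvol is fixed at creation time, so each test means recreating the disk. A minimal sketch, assuming a pool called tank and a hypothetical test zvol (in practice the storage plugin creates these for you):

Code:
# volblocksize cannot be changed after creation; make a fresh test zvol
zfs create -V 100G -o volblocksize=16K tank/vm-100-disk-test

# confirm the block size took effect
zfs get volblocksize tank/vm-100-disk-test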
Now I think I have a fairly sizeable problem: I have to move 50 production VMs on Scale back to Core across 4x NAS.

I did a bit of research, and maybe I can do it with a ZFS pool export/import, if I can migrate the workloads away or schedule downtime.
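
A minimal sketch of that idea, assuming Core's OpenZFS version supports every feature flag enabled on the pool (the pool name is an example):

Code:
# on TrueNAS Scale, with all VMs stopped and iSCSI shares disabled
zpool export tank

# after booting TrueNAS Core against the same disks
zpool import tank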

So I really think you're onto something, thank you!

It might well be that the folks over at iXsystems have one crucial difference between their FreeBSD and Debian versions, something that severely affects Proxmox Backup Server with ZFS over iSCSI...

I'll update this post once I have made some progress again...
 
