(Too?) Big IO Delay peaks

SargOn

Active Member
Jan 19, 2018
PVE version 5.2-2
Kernel Version
- Linux 4.15.17-1-pve #1 SMP PVE 4.15.17-9 (Wed, 9 May 2018 13:31:43 +0200)
CPU
- 24 x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (2 Sockets)
Storage:
- 8 x SSD SATA3 INTEL S3520 800GB 6Gb/s 3D MLC in a ZFS pool (RAID 5 HW... yeah, we didn't know ZFS in the beginning...)
- NO ZIL (same issue with 32 GB ZIL)
- NO L2ARC
- RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
RAM:
- 256GB

There is no need to describe the complaints I get from my users (about 20 Windows VMs) when this happens:

[Attachment: graph of node IO delay peaks]

I understand that I can expect occasional latency spikes or a general slowdown when intensive IO processes are running (inside a VM, or when migrating VMs between nodes, for instance)... but this behaviour completely stalls all VMs on the node for long stretches... Maybe something related to improper IO queue management...?

Any advice (besides recreating a pure SW ZFS RAID...)?

Thx in advance.
 
Have you checked the per-VM graphs during the same peak, to see if you can locate a particular VM that was doing a large amount of IO at the time? And then use this information to investigate whatever is running on that VM?
 
Yes, I know which VM is the problem (not always the same one: large data being moved between disks, intense DB activity, etc...), but I guess a single VM shouldn't hang the whole node along with the rest of the VMs... Right? Thx
 
Yes, I know which VM is the problem (not always the same one: large data being moved between disks, intense DB activity, etc...), but I guess a single VM shouldn't hang the whole node along with the rest of the VMs... Right? Thx

The VM isn't hanging the other VMs; it's saturating the disks, meaning the other VMs then become slow due to slow/delayed IO requests.

In Proxmox you can edit the VM in question and apply disk I/O limits, which may help ensure that particular VM never saturates 100% of your storage's available capacity.
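For reference, a minimal sketch of what such a limit could look like from the host shell, assuming a hypothetical VM ID 101 with its disk on a storage called local-zfs (adjust both to your setup); the same bandwidth/IOPS fields are also exposed in the GUI in the disk's advanced options:

Code:
# Hypothetical VM ID and disk spec - re-specify the existing disk and add limits
qm set 101 --scsi0 local-zfs:vm-101-disk-0,mbps_rd=200,mbps_wr=200,iops_rd=5000,iops_wr=5000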
 
Ok, so do you think that graph and that IO delay are normal with these specs if you don't apply per-VM IO policies...? Thx
 
Ok, so do you think that graph and that IO delay are normal with these specs if you don't apply per-VM IO policies...? Thx

The graph is showing you have huge I/O wait. If a process running in a VM requires more IO than your underlying storage can provide, then yes, without any limits it will use as much IO as possible and cause a backlog of requests.
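Not from the thread, but a couple of host-side commands that may help confirm the saturation while a peak is happening (standard ZFS/Linux tools; exact output varies by version):

Code:
# Per-vdev bandwidth and operations, refreshed every second
zpool iostat -v 1
# Request latency histograms (available on newer ZFS-on-Linux releases)
zpool iostat -w 1
# Accumulated per-process IO on the host; the busiest kvm process is the offending VM
iotop -oPa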
 
Ok, so I understand it's also normal to get big lock-ups when migrating a VM between nodes (not sharing vmdata). Thx
 
Ok, so I understand it's also normal to get big lock-ups when migrating a VM between nodes (not sharing vmdata). Thx

Well, migrating a VM involves a large amount of data. At the end of the day, think of it like a hosepipe and a bucket of water: the hose can only carry so much water at once. You can limit how fast you pour from the bucket so the water travels through the pipe without overflowing, or you can pour the whole bucket at once and the water backs up and floods over.

You're running your underlying storage in RAID 5, which is never going to give you amazing raw IO capacity, so what you're seeing is simply these disks not being able to keep up with the requested workload.
 
Sorry, most probably I'm just missing something... but at this point I think there is something else going on... Looking at this:

This is the result of running CrystalDiskMark inside a VM on node 1 (the case studied in this thread):

[Attachment: CrystalDiskMark results screenshot, node 1]

Getting:

Code:
-----------------------------------------------------------------------
CrystalDiskMark 5.5.0 x64 (C) 2007-2017 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 1) :  5402.759 MB/s
  Sequential Write (Q= 32,T= 1) :  5173.051 MB/s
  Random Read 4KiB (Q= 32,T= 1) :   432.205 MB/s [105518.8 IOPS]
Random Write 4KiB (Q= 32,T= 1) :   370.257 MB/s [ 90394.8 IOPS]
         Sequential Read (T= 1) :  3279.101 MB/s
        Sequential Write (T= 1) :  2545.794 MB/s
   Random Read 4KiB (Q= 1,T= 1) :    55.562 MB/s [ 13564.9 IOPS]
  Random Write 4KiB (Q= 1,T= 1) :    41.203 MB/s [ 10059.3 IOPS]

  Test : 1024 MiB [C: 40.3% (16.0/39.7 GiB)] (x5)  [Interval=0 sec]
  Date : 2020/04/23 19:11:16
    OS : Windows Server 2012 R2 Server Standard (full installation) [6.3 Build 9600] (x64)


This is the result of running the same test in a VM on node 2 (an older system, also HW RAID 5, with 6 similar Intel SSDs instead of 8):

[Attachment: CrystalDiskMark results screenshot, node 2]

getting:

Code:
-----------------------------------------------------------------------
CrystalDiskMark 5.5.0 x64 (C) 2007-2017 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 1) :  3806.527 MB/s
  Sequential Write (Q= 32,T= 1) :  1895.046 MB/s
  Random Read 4KiB (Q= 32,T= 1) :   188.317 MB/s [ 45975.8 IOPS]
Random Write 4KiB (Q= 32,T= 1) :   142.426 MB/s [ 34772.0 IOPS]
         Sequential Read (T= 1) :  2120.249 MB/s
        Sequential Write (T= 1) :  1506.049 MB/s
   Random Read 4KiB (Q= 1,T= 1) :    16.572 MB/s [  4045.9 IOPS]
  Random Write 4KiB (Q= 1,T= 1) :    15.794 MB/s [  3856.0 IOPS]

  Test : 1024 MiB [C: 56.6% (33.7/59.7 GiB)] (x5)  [Interval=0 sec]
  Date : 2020/04/23 19:12:24
    OS : Windows Server 2016 Server Standard (full installation) [10.0 Build 14393] (x64)

Average CPU use on node 1 is higher, and there are more (idle) VMs running, but comparing relative values with node 2 I still think something is wrong with node 1... I've also run the test on a 3rd node (older than the 2nd, 2 x SSDs), with results similar to node 2 in relative terms:

[Attachment: CrystalDiskMark results screenshot, node 3]

Getting for node 3:

Code:
-----------------------------------------------------------------------
CrystalDiskMark 5.5.0 x64 (C) 2007-2017 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 1) :  2171.304 MB/s
  Sequential Write (Q= 32,T= 1) :  1434.091 MB/s
  Random Read 4KiB (Q= 32,T= 1) :    97.524 MB/s [ 23809.6 IOPS]
Random Write 4KiB (Q= 32,T= 1) :   101.282 MB/s [ 24727.1 IOPS]
         Sequential Read (T= 1) :  1229.369 MB/s
        Sequential Write (T= 1) :  1000.323 MB/s
   Random Read 4KiB (Q= 1,T= 1) :    31.355 MB/s [  7655.0 IOPS]
  Random Write 4KiB (Q= 1,T= 1) :    29.973 MB/s [  7317.6 IOPS]

  Test : 1024 MiB [C: 78.6% (38.9/49.5 GiB)] (x5)  [Interval=0 sec]
  Date : 2020/04/23 18:26:28
    OS : Windows Server 2016 Server Standard (full installation) [10.0 Build 17763] (x64)


BTW... after some tests in a VM on the 3rd node, it seems that performance increases when activating the "IO thread" option for the VirtIO drive (the VMs on nodes 1 and 2 use VirtIO SCSI).
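As a hedged illustration, enabling that option from the CLI might look like the following, assuming a hypothetical VM 101 with its disk on local-zfs; iothread is normally combined with the VirtIO SCSI single controller so each disk gets its own IO thread:

Code:
# Hypothetical VM ID/disk - adjust to your environment
qm set 101 --scsihw virtio-scsi-single
qm set 101 --scsi0 local-zfs:vm-101-disk-0,iothread=1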

Thx in advance.
 
After reviewing many, many ZFS parameters I've decided to set sync=disabled on the pool (we have a UPS and a controller with battery) and the problem is gone...
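For anyone following along, a sketch of that property change, assuming the pool is called rpool (replace with your actual pool name):

Code:
# Disable sync-write commits for the whole pool (inherited by all datasets/zvols)
zfs set sync=disabled rpool
# Verify the current value; 'standard' is the default
zfs get sync rpool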

Thx.
 
Wow… sync=disabled is a real threat on a VM server. If an application issues a sync write, it is usually doing so on purpose. Sync writes ensure that the data has landed safely on the storage before success is signalled back to the request. The first examples that come to mind are database applications, but also metadata for some file systems.

ZFS may keep data in its caches for up to 5 s before flushing it out to disk, which is considerably dangerous. If you need a lot of sync writes, I'd consider adding a ZIL to your zpool.
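A hedged sketch of adding a separate log device (SLOG) for that purpose, with placeholder pool and device names; ideally the device is a small, power-loss-protected SSD/NVMe, and mirroring the log adds safety:

Code:
# Placeholder pool name and device ids
zpool add rpool log /dev/disk/by-id/nvme-EXAMPLE
# or, mirrored log devices:
zpool add rpool log mirror /dev/disk/by-id/nvme-EXAMPLE-1 /dev/disk/by-id/nvme-EXAMPLE-2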
 
I'm studying enabling sync only for critical VMs... The rest are used for remote development, so I'm happy if, in the worst case, at least the FS doesn't get corrupted, which should be guaranteed by ZFS.
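Since sync is a per-dataset property and Proxmox puts each VM disk on its own zvol, that selective approach could look roughly like this (dataset names are placeholders based on a typical rpool/data layout):

Code:
# List VM disks and their current sync setting
zfs list -o name,sync -r rpool/data
# Re-enable sync writes only on the zvol backing a critical (e.g. SQL) VM disk
zfs set sync=standard rpool/data/vm-110-disk-0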

Keeping sync disabled on the root zpool also provides, if I'm not wrong, the extra performance needed to avoid IO delay when migrating or backing up/restoring VMs.

Anyway, we also have a UPS system which provides around 1 h of runtime before a managed shutdown if needed.
 
I'm studying enabling sync only for critical VMs... The rest are used for remote development, so I'm happy if, in the worst case, at least the FS doesn't get corrupted, which should be guaranteed by ZFS.

Keeping sync disabled on the root zpool also provides, if I'm not wrong, the extra performance needed to avoid IO delay when migrating or backing up/restoring VMs.

Backup/Restore/Migration will not trigger sync writes, only async ones! If you experience throughput issues with any of them, it's not due to sync writes but because something else is straining your resources.

Anyway, we also have a UPS system which provides around 1 h of runtime before a managed shutdown if needed.

I'd be more wary of sudden host crashes, which could then totally ruin your guests' disks. A reliable UPS is one of the requirements for daring to switch off sync writes on a zpool, but there's more to it than reliable power.

Any advice (besides recreating a pure SW ZFS RAID...)?

You really should re-create your zpool properly, rather than setting really dangerous options on it.
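To make that concrete, a rough sketch of what a rebuilt pool could look like once the controller exposes the disks directly (HBA/JBOD mode); pool name and disk ids are placeholders, and for VM workloads striped mirrors are often preferred over raidz for IOPS:

Code:
# Placeholder names - 8 disks as 4 mirrored pairs, ashift=12 for 4K sectors
zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
  mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4 \
  mirror /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6 \
  mirror /dev/disk/by-id/ata-DISK7 /dev/disk/by-id/ata-DISK8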
 
Backup/Restore/Migration will not trigger sync writes, only async ones! If you experience throughput issues with any of them, it's not due to sync writes but because something else is straining your resources.

It's not a throughput issue... it's a problem with IO delay leaving the rest of the VMs totally stuck... The same issue arises when migrating any VM between nodes... so I don't understand why this occurs if those writes are not synced.

I'd be more wary of sudden host crashes, which could then totally ruin your guests' disks. A reliable UPS is one of the requirements for daring to switch off sync writes on a zpool, but there's more to it than reliable power.

Can you point out any other scenario for this, please?

You really should re-create your zpool properly, rather than setting really dangerous options on it.

It's my main suspect... I think that disabling HW RAID and letting ZFS manage the disks directly could fix this issue... but can I be sure about that...?

Thanks for your comments.
 
It's not a throughput issue... it's a problem with IO delay leaving the rest of the VMs totally stuck... The same issue arises when migrating any VM between nodes... so I don't understand why this occurs if those writes are not synced.

If migration sucks, then you will have to take a look at the network as well. On the source host, reading from the storage shouldn't cause high i/o latency.


Can you point out any other scenario for this, please?

Pardon? What other scenario do you need? I just had a sudden PVE host reboot on Thursday… these things can - and will - happen. Better to be safe in that regard.


It's my main suspect... I think that disabling HW RAID and letting ZFS manage the disks directly could fix this issue... but can I be sure about that...?

I cannot say… I don't have all the specs of your setup and we are just discussing single aspects of this, but I'd never trust my ZFS pools to *any* HW raid controller… never…!
 
If migration sucks, then you will have to take a look at the network as well. On the source host, reading from the storage shouldn't cause high i/o latency.

On the source...

EDIT: After some tests, indeed, a simple VM migration doesn't seem to severely impact IO delay on the destination. Almost the same with sync=standard or sync=disabled.

Pardon? What other scenario do you need? I just had a sudden PVE host reboot on Thursday… these things can - and will - happen. Better to be safe in that regard.

Sudden reboot...? Man, what kind of HW do you have... ?


...just joking, sorry :p

I cannot say… I don't have all the specs of your setup and we are just discussing single aspects of this, but I'd never trust my ZFS pools to *any* HW raid controller… never…!

Yes, this was a newbie error... In the beginning we didn't have a ZFS pool... we had EXT4...

Anyway, I understand your point of view, but I think we don't have to see everything in black or white... As you mention... it depends on the real, particular scenario. The important thing is never ending up with a corrupted FS, and for particular VMs (SQL) we can enable sync=standard (maybe not for all of them if it's a dev or lab VM).

https://forum.proxmox.com/threads/zfs-sync-disabled.37900/

Thx.
 
Sudden reboot...? Man, what kind of HW do you have... ?



...just joking, sorry :p

Yeah… very funny… ;) Actually, the reboot was triggered by a runaway LXC container which led to repeated OOM kills. The point is… you can never be sure…
 
Yeah… very funny… ;) Actually, the reboot was triggered by a runaway LXC container which led to repeated OOM kills. The point is… you can never be sure…

Absolutely agree.

I will try to run more tests... but for now I will keep sync=standard, and only if I run into severe issues while working will I think about temporarily disabling it. And if I find time (and resources) I will rebuild the RAID properly for ZFS.

Thanks for your comments and advice.
 
