(Too?) Big IO Delay peaks

SargOn

Active Member
Jan 19, 2018
PVE version 5.2-2
Kernel Version
- Linux 4.15.17-1-pve #1 SMP PVE 4.15.17-9 (Wed, 9 May 2018 13:31:43 +0200)
CPU
- 24 x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (2 Sockets)
Storage:
- 8 x SSD SATA3 INTEL S3520 800GB 6Gb/s 3D MLC in a ZFS pool (RAID 5 HW... yeah, we didn't know ZFS in the beginning...)
- NO ZIL (same issue with 32 GB ZIL)
- NO L2ARC
- RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
RAM:
- 256GB

There is no need to describe the complaints I get from my users (about 20 Windows VMs) when this happens:

[Attachment: graph of node IO delay peaks]

I understand that I can expect occasional latency spikes or a general slowdown when intensive IO processes are running (inside a VM, or when migrating VMs between nodes, for instance)... but this behaviour completely stalls all VMs on the node for long stretches... Maybe something related to improper IO queue management...?

Any advice (besides recreating a pure SW ZFS RAID...)?

Thx in advance.
 
Have you checked the per-VM graphs during the same peak, to see if you can locate a particular VM that was doing a large amount of IO at the time? And then use this information to investigate whatever is running on that VM?
 
Yes, I know which VM is the problem (not always the same one: large data being moved between disks, intense DB activity, etc...), but I guess a single VM shouldn't hang the whole node along with the rest of the VMs... Right? Thx
 
Yes, I know which VM is the problem (not always the same one: large data being moved between disks, intense DB activity, etc...), but I guess a single VM shouldn't hang the whole node along with the rest of the VMs... Right? Thx

The VM isn't hanging the other VMs; it's saturating the disks, meaning the other VMs then become slow due to slow/delayed IO requests.

In Proxmox you can edit the VM in question and apply disk I/O limits, which may help ensure that particular VM never saturates 100% of your storage's available capacity.
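For reference, a minimal sketch of what such a limit could look like from the host shell, assuming a hypothetical VM ID 101 with its disk on a storage called local-zfs (adjust both to your setup); the same bandwidth/IOPS fields are also exposed in the GUI in the disk's advanced options:

Code:
# Hypothetical VM ID and disk spec - re-specify the existing disk and add limits
qm set 101 --scsi0 local-zfs:vm-101-disk-0,mbps_rd=200,mbps_wr=200,iops_rd=5000,iops_wr=5000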
 
Ok, so do you think that graph and that IO delay are normal with these specs if you don't apply per-VM IO policies...? Thx
 
Ok, so do you think that graph and that IO delay are normal with these specs if you don't apply per-VM IO policies...? Thx

The graph is showing you have huge I/O wait. If a process running in a VM requires more IO than your underlying storage can provide, then yes, without any limits it will use as much IO as possible and cause a backlog of requests.
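Not from the thread, but a couple of host-side commands that may help confirm the saturation while a peak is happening (standard ZFS/Linux tools; exact output varies by version):

Code:
# Per-vdev bandwidth and operations, refreshed every second
zpool iostat -v 1
# Request latency histograms (available on newer ZFS-on-Linux releases)
zpool iostat -w 1
# Accumulated per-process IO on the host; the busiest kvm process is the offending VM
iotop -oPa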
 
Ok, so I understand it's also normal to get big lock-ups when migrating a VM between nodes (not sharing vmdata). Thx
 
Ok, so I understand it's also normal to get big lock-ups when migrating a VM between nodes (not sharing vmdata). Thx

Well, migrating a VM involves a large amount of data. At the end of the day, think of it like a hosepipe and a bucket of water: the hose can only carry so much water at once. You can limit how fast you pour from the bucket so the water travels through the pipe without overflowing, or you can pour the whole bucket at once and the water backs up and floods over.

You're running your underlying storage in RAID 5, which is never going to give you amazing raw IO capacity, so what you're seeing is simply these disks not being able to keep up with the requested workload.
 
Sorry, most probably I'm just missing something... but at this point I think there is something else going on... Looking at this:

This is the result of running CrystalDiskMark inside a VM on node 1 (the case studied in this thread):

[Attachment: CrystalDiskMark results screenshot, node 1]

Getting:

Code:
-----------------------------------------------------------------------
CrystalDiskMark 5.5.0 x64 (C) 2007-2017 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 1) :  5402.759 MB/s
  Sequential Write (Q= 32,T= 1) :  5173.051 MB/s
  Random Read 4KiB (Q= 32,T= 1) :   432.205 MB/s [105518.8 IOPS]
Random Write 4KiB (Q= 32,T= 1) :   370.257 MB/s [ 90394.8 IOPS]
         Sequential Read (T= 1) :  3279.101 MB/s
        Sequential Write (T= 1) :  2545.794 MB/s
   Random Read 4KiB (Q= 1,T= 1) :    55.562 MB/s [ 13564.9 IOPS]
  Random Write 4KiB (Q= 1,T= 1) :    41.203 MB/s [ 10059.3 IOPS]

  Test : 1024 MiB [C: 40.3% (16.0/39.7 GiB)] (x5)  [Interval=0 sec]
  Date : 2020/04/23 19:11:16
    OS : Windows Server 2012 R2 Server Standard (full installation) [6.3 Build 9600] (x64)


This is the result of running the same test in a VM on node 2 (an older system, also HW RAID 5, with 6 similar Intel SSDs instead of 8):

[Attachment: CrystalDiskMark results screenshot, node 2]

getting:

Code:
-----------------------------------------------------------------------
CrystalDiskMark 5.5.0 x64 (C) 2007-2017 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 1) :  3806.527 MB/s
  Sequential Write (Q= 32,T= 1) :  1895.046 MB/s
  Random Read 4KiB (Q= 32,T= 1) :   188.317 MB/s [ 45975.8 IOPS]
Random Write 4KiB (Q= 32,T= 1) :   142.426 MB/s [ 34772.0 IOPS]
         Sequential Read (T= 1) :  2120.249 MB/s
        Sequential Write (T= 1) :  1506.049 MB/s
   Random Read 4KiB (Q= 1,T= 1) :    16.572 MB/s [  4045.9 IOPS]
  Random Write 4KiB (Q= 1,T= 1) :    15.794 MB/s [  3856.0 IOPS]

  Test : 1024 MiB [C: 56.6% (33.7/59.7 GiB)] (x5)  [Interval=0 sec]
  Date : 2020/04/23 19:12:24
    OS : Windows Server 2016 Server Standard (full installation) [10.0 Build 14393] (x64)

Average CPU use on node 1 is higher, and there are more (idle) VMs running, but comparing relative values with node 2 I still think something is wrong with node 1... I've also run the test on a 3rd node (older than the 2nd, 2 x SSDs), with results similar to node 2 in relative terms:

[Attachment: CrystalDiskMark results screenshot, node 3]

Getting for node 3:

Code:
-----------------------------------------------------------------------
CrystalDiskMark 5.5.0 x64 (C) 2007-2017 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 1) :  2171.304 MB/s
  Sequential Write (Q= 32,T= 1) :  1434.091 MB/s
  Random Read 4KiB (Q= 32,T= 1) :    97.524 MB/s [ 23809.6 IOPS]
Random Write 4KiB (Q= 32,T= 1) :   101.282 MB/s [ 24727.1 IOPS]
         Sequential Read (T= 1) :  1229.369 MB/s
        Sequential Write (T= 1) :  1000.323 MB/s
   Random Read 4KiB (Q= 1,T= 1) :    31.355 MB/s [  7655.0 IOPS]
  Random Write 4KiB (Q= 1,T= 1) :    29.973 MB/s [  7317.6 IOPS]

  Test : 1024 MiB [C: 78.6% (38.9/49.5 GiB)] (x5)  [Interval=0 sec]
  Date : 2020/04/23 18:26:28
    OS : Windows Server 2016 Server Standard (full installation) [10.0 Build 17763] (x64)


BTW... after some tests in a VM on the 3rd node, it seems that performance increases when activating the "IO thread" option for the VirtIO drive (the VMs on nodes 1 and 2 use VirtIO SCSI).
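As a hedged illustration, enabling that option from the CLI might look like the following, assuming a hypothetical VM 101 with its disk on local-zfs; iothread is normally combined with the VirtIO SCSI single controller so each disk gets its own IO thread:

Code:
# Hypothetical VM ID/disk - adjust to your environment
qm set 101 --scsihw virtio-scsi-single
qm set 101 --scsi0 local-zfs:vm-101-disk-0,iothread=1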

Thx in advance.
 
After reviewing many, many ZFS parameters I've decided to set sync=disabled on the pool (we have a UPS and a controller with battery) and the problem is gone...
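For anyone following along, a sketch of that property change, assuming the pool is called rpool (replace with your actual pool name):

Code:
# Disable sync-write commits for the whole pool (inherited by all datasets/zvols)
zfs set sync=disabled rpool
# Verify the current value; 'standard' is the default
zfs get sync rpool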

Thx.
 
Wow… sync=disabled is a real threat on a VM server. If an application issues a sync write, it is usually doing so on purpose. Sync writes ensure that the data has landed safely on the storage before success is signalled back to the request. The first examples that come to mind are database applications, but also metadata for some file systems.

ZFS may keep data in its caches for up to 5 s before flushing it out to disk, which is considerably dangerous. If you need a lot of sync writes, I'd consider adding a ZIL to your zpool.
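A hedged sketch of adding a separate log device (SLOG) for that purpose, with placeholder pool and device names; ideally the device is a small, power-loss-protected SSD/NVMe, and mirroring the log adds safety:

Code:
# Placeholder pool name and device ids
zpool add rpool log /dev/disk/by-id/nvme-EXAMPLE
# or, mirrored log devices:
zpool add rpool log mirror /dev/disk/by-id/nvme-EXAMPLE-1 /dev/disk/by-id/nvme-EXAMPLE-2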
 
I'm studying enabling sync only for critical VMs... The rest are used for remote development, so I'm happy if, in the worst case, at least the FS doesn't get corrupted, which should be guaranteed by ZFS.
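Since sync is a per-dataset property and Proxmox puts each VM disk on its own zvol, that selective approach could look roughly like this (dataset names are placeholders based on a typical rpool/data layout):

Code:
# List VM disks and their current sync setting
zfs list -o name,sync -r rpool/data
# Re-enable sync writes only on the zvol backing a critical (e.g. SQL) VM disk
zfs set sync=standard rpool/data/vm-110-disk-0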

Keeping sync disabled on the root zpool also provides, if I'm not wrong, the extra performance needed to avoid IO delay when migrating or backing up/restoring VMs.

Anyway, we also have a UPS system which provides around 1 h of runtime before a managed shutdown if needed.
 
I'm studying enabling sync only for critical VMs... The rest are used for remote development, so I'm happy if, in the worst case, at least the FS doesn't get corrupted, which should be guaranteed by ZFS.

Keeping sync disabled on the root zpool also provides, if I'm not wrong, the extra performance needed to avoid IO delay when migrating or backing up/restoring VMs.

Backup/Restore/Migration will not trigger sync writes, only async ones! If you experience throughput issues with any of them, it's not due to sync writes but because something else is straining your resources.

Anyway, we also have a UPS system which provides around 1 h of runtime before a managed shutdown if needed.

I'd be more wary of sudden host crashes, which could then totally ruin your guests' disks. A reliable UPS is one of the requirements for daring to switch off sync writes on a zpool, but there's more to it than reliable power.

Any advice (besides recreating a pure SW ZFS RAID...)?

You really should re-create your zpool properly, rather than setting really dangerous options on it.
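To make that concrete, a rough sketch of what a rebuilt pool could look like once the controller exposes the disks directly (HBA/JBOD mode); pool name and disk ids are placeholders, and for VM workloads striped mirrors are often preferred over raidz for IOPS:

Code:
# Placeholder names - 8 disks as 4 mirrored pairs, ashift=12 for 4K sectors
zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
  mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4 \
  mirror /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6 \
  mirror /dev/disk/by-id/ata-DISK7 /dev/disk/by-id/ata-DISK8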
 
Backup/Restore/Migration will not trigger sync writes, only async ones! If you experience throughput issues with any of them, it's not due to sync writes but because something else is straining your resources.

It's not a throughput issue... it's a problem with IO delay leaving the rest of the VMs totally stuck... The same issue arises when migrating any VM between nodes... so I don't understand why this occurs if those writes are not synced.

I'd be more wary of sudden host crashes, which could then totally ruin your guests' disks. A reliable UPS is one of the requirements for daring to switch off sync writes on a zpool, but there's more to it than reliable power.

Can you point out any other scenario for this, please?

You really should re-create your zpool properly, rather than setting really dangerous options on it.

It's my main suspect... I think that disabling HW RAID and letting ZFS manage the disks directly could fix this issue... but can I be sure about that...?

Thanks for your comments.
 
It's not a throughput issue... it's a problem with IO delay leaving the rest of the VMs totally stuck... The same issue arises when migrating any VM between nodes... so I don't understand why this occurs if those writes are not synced.

If migration sucks, then you will have to take a look at the network as well. On the source host, reading from the storage shouldn't cause high i/o latency.


Can you point out any other scenario for this, please?

Pardon? What other scenario do you need? I just had a sudden PVE host reboot on Thursday… these things can - and will - happen. Better to be safe in that regard.


It's my main suspect... I think that disabling HW RAID and letting ZFS manage the disks directly could fix this issue... but can I be sure about that...?

I cannot say… I don't have all the specs of your setup and we are just discussing single aspects of this, but I'd never trust my ZFS pools to *any* HW raid controller… never…!
 
If migration sucks, then you will have to take a look at the network as well. On the source host, reading from the storage shouldn't cause high i/o latency.

On the source...

EDIT: After some tests, indeed, a simple VM migration doesn't seem to severely impact IO delay on the destination. Almost the same with sync=standard or sync=disabled.

Pardon? What other scenario do you need? I just had a sudden PVE host reboot on Thursday… these things can - and will - happen. Better to be safe in that regard.

Sudden reboot...? Man, what kind of HW do you have... ?


...just joking, sorry :p

I cannot say… I don't have all the specs of your setup and we are just discussing single aspects of this, but I'd never trust my ZFS pools to *any* HW raid controller… never…!

Yes, this was a newbie error... In the beginning we didn't have a ZFS pool... we had EXT4...

Anyway, I understand your point of view, but I think we don't have to see everything in black or white... As you mention... it depends on the real, particular scenario. The important thing is never ending up with a corrupted FS, and for particular VMs (SQL) we can enable sync=standard (maybe not for all of them if it's a dev or lab VM).

https://forum.proxmox.com/threads/zfs-sync-disabled.37900/

Thx.
 
Sudden reboot...? Man, what kind of HW do you have... ?



...just joking, sorry :p

Yeah… very funny… ;) Actually, the reboot was triggered by a runaway LXC container which led to repeated OOM kills. The point is… you can never be sure…
 
Yeah… very funny… ;) Actually, the reboot was triggered by a runaway LXC container which led to repeated OOM kills. The point is… you can never be sure…

Absolutely agree.

I will try to run more tests... but for now I will keep sync=standard, and only if I run into severe issues while working will I think about temporarily disabling it. And if I find time (and resources) I will rebuild the RAID properly for ZFS.

Thanks for your comments and advice.
 
