Proxmox 6.1 with ZFS - High IO and Server Crash

Mecanik

I cannot get my head around this, as I'm experiencing some very weird server IO delay and server crashes. It seems I have the exact same issue as described in [1][2].

The setup has ZFS with a ZIL and L2ARC, yet RAM usage was climbing past 100 GB, and the L2ARC was (and still is) set to 150 GB read/write on the SSD.

Current drives:

  1. 2× 960 GB NVMe SSD (SAMSUNG MZQLB960HAJR-00007)
  2. 2× 6 TB SATA HDD (HGST HUS726T6TALE6L1), soft RAID

ZFS is currently running with 2 pools:
  1. rpool on the SSDs
  2. vmpool on the HDDs
The server was crashing (hanging) with three Windows VMs running, so I had to make some changes. The current pool status:

Bash:
NAME            SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool           888G   115G   773G        -         -     5%    12%  1.00x    ONLINE  -
  mirror        888G   115G   773G        -         -     5%  12.9%      -  ONLINE
    nvme0n1p2      -      -      -        -         -      -      -      -  ONLINE
    nvme1n1p2      -      -      -        -         -      -      -      -  ONLINE
vmpool         5.44T  72.5G  5.37T        -         -     0%     1%  1.00x    ONLINE  -
  mirror       5.44T  72.5G  5.37T        -         -     0%  1.30%      -  ONLINE
    sdb            -      -      -        -         -      -      -      -  ONLINE
    sda            -      -      -        -         -      -      -      -  ONLINE
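
For completeness, zpool list hides any log or cache vdevs; zpool status would confirm whether the ZIL/L2ARC devices are still attached (nothing pool-specific assumed here, just the standard command):

Bash:
# log vdevs show up under "logs", L2ARC devices under "cache"
zpool status -v rpool
zpool status -v vmpool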

I know and understand that normal spinning HDDs cannot cope with much, which is why I set up ZFS with the cache and ARC, hoping it would improve things; to be honest it did, because the speed was amazing.

But I don't understand why the IO suddenly spikes at random and the server hangs... can someone advise me what to do?

PS: HW cannot be modified.

[1] https://forum.proxmox.com/threads/proxmox-ve-new-server-high-io-delay.39162/
[2] https://forum.proxmox.com/threads/zfs-high-io-again.55331/
 
Point of interest: I have reset the ZFS module options as follows:

Bash:
# ZFS Intent Log, or ZIL, to buffer WRITE operations.
# ARC and L2ARC which are meant for READ operations.

# ARC
# 5 gb min - 10 gb max
options zfs zfs_arc_min=5368709120
options zfs zfs_arc_max=10737418240

# metadata limit
# 5 gb
options zfs zfs_arc_meta_limit=5368709120

# L2ARC
options zfs l2arc_noprefetch=0
options zfs l2arc_write_max=104857600
options zfs l2arc_write_boost=104857600

The node has 256 GB RAM by the way.
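
For reference, these options go into /etc/modprobe.d/zfs.conf; with ZFS on the root pool they only take full effect after rebuilding the initramfs and rebooting, although zfs_arc_max can also be changed at runtime (the value below just mirrors the 10 GiB setting above):

Bash:
# rebuild the boot image so the module options apply from early boot
update-initramfs -u
# or apply the ARC ceiling immediately without a reboot
echo 10737418240 > /sys/module/zfs/parameters/zfs_arc_max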

After this, some testing:

HDD:

Bash:
vmpool# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][99.0%][r=47.1MiB/s,w=15.5MiB/s][r=12.1k,w=3980 IOPS][eta 00m:01s]
test: (groupid=0, jobs=1): err= 0: pid=29699: Thu Mar 12 11:16:23 2020
  read: IOPS=7725, BW=30.2MiB/s (31.6MB/s)(3070MiB/101728msec)
   bw (  KiB/s): min=15888, max=327320, per=99.81%, avg=30843.93, stdev=30337.62, samples=203
   iops        : min= 3972, max=81830, avg=7710.94, stdev=7584.41, samples=203
  write: IOPS=2581, BW=10.1MiB/s (10.6MB/s)(1026MiB/101728msec); 0 zone resets
   bw (  KiB/s): min= 5248, max=109048, per=99.82%, avg=10308.83, stdev=10098.90, samples=203
   iops        : min= 1312, max=27262, avg=2577.18, stdev=2524.73, samples=203
  cpu          : usr=0.90%, sys=9.50%, ctx=240931, majf=6, minf=645
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=30.2MiB/s (31.6MB/s), 30.2MiB/s-30.2MiB/s (31.6MB/s-31.6MB/s), io=3070MiB (3219MB), run=101728-101728msec
  WRITE: bw=10.1MiB/s (10.6MB/s), 10.1MiB/s-10.1MiB/s (10.6MB/s-10.6MB/s), io=1026MiB (1076MB), run=101728-101728msec

SSD:

Bash:
rpool# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)


Jobs: 1 (f=1): [m(1)][100.0%][r=143MiB/s,w=47.1MiB/s][r=36.7k,w=12.1k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=27153: Thu Mar 12 11:17:28 2020
  read: IOPS=37.4k, BW=146MiB/s (153MB/s)(3070MiB/21005msec)
   bw (  KiB/s): min=90216, max=457309, per=99.96%, avg=149603.33, stdev=67137.64, samples=42
   iops        : min=22554, max=114327, avg=37400.81, stdev=16784.39, samples=42
  write: IOPS=12.5k, BW=48.8MiB/s (51.2MB/s)(1026MiB/21005msec); 0 zone resets
   bw (  KiB/s): min=30440, max=152159, per=99.97%, avg=50001.64, stdev=22364.86, samples=42
   iops        : min= 7610, max=38039, avg=12500.38, stdev=5591.14, samples=42
  cpu          : usr=3.59%, sys=47.41%, ctx=114660, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=146MiB/s (153MB/s), 146MiB/s-146MiB/s (153MB/s-153MB/s), io=3070MiB (3219MB), run=21005-21005msec
  WRITE: bw=48.8MiB/s (51.2MB/s), 48.8MiB/s-48.8MiB/s (51.2MB/s-51.2MB/s), io=1026MiB (1076MB), run=21005-21005msec

Which looks good to me, however...

- The node avoids high IO delay only if I throttle the VMs' read/write IOPS drastically (see the sketch below)
- RAM usage goes well over the configured limit... why?
- When cloning a VM, the IO delay goes up to 60%
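
(By "throttle the VMs" I mean Proxmox's per-disk IO limits; the VM ID and volume name below are just placeholders:)

Bash:
# example only: cap a VM's first SCSI disk at ~500 read/write IOPS
qm set 100 --scsi0 vmpool:vm-100-disk-0,iops_rd=500,iops_wr=500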

Any thoughts ?
 
Yep, it randomly got stuck again. IO delay goes up to 60% and nothing can be done anymore, not even a reboot...
 
Bash:
# ZFS Intent Log, or ZIL, to buffer WRITE operations.

ZIL is meant for sync writes, all other writes directly hit the disk.
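
Whether sync writes are honoured, forced, or skipped on a dataset is controlled by its sync property; a quick way to check (using the pool names from the post above):

Bash:
# "standard" honours application fsync/O_SYNC, "always" forces it, "disabled" bypasses the ZIL
zfs get sync rpool vmpool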

Bash:
# ARC and L2ARC which are meant for READ operations.

# ARC
# 5 gb min - 10 gb max
options zfs zfs_arc_min=5368709120
options zfs zfs_arc_max=10737418240

# metadata limit
# 5 gb
options zfs zfs_arc_meta_limit=5368709120

Why on earth would you run a setup with such a low ARC? L2ARC is totally useless in this case and worsens the performance. L2ARC uses up a lot of ARC, so in the end you will have even less than you configured.
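
On a 256 GB node a much larger ARC is usually the better trade-off; a minimal sketch, the 64 GiB figure is only an example:

Bash:
# /etc/modprobe.d/zfs.conf - example only: allow the ARC to grow to 64 GiB on a 256 GB host
options zfs zfs_arc_max=68719476736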

- RAM usage goes well over the configured limit... why?

How do you detect that? Please monitor with arcstat
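
arcstat and arc_summary ship with ZFS on Linux, for example:

Bash:
# ARC size, hit/miss rates and target size, sampled every 5 seconds
arcstat 5
# one-off detailed report, including the L2ARC and its header overhead in ARC
arc_summary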
 
ZIL is meant for sync writes, all other writes directly hit the disk.

Why on earth would you run a setup with such a low ARC? L2ARC is totally useless in this case and worsens the performance. L2ARC uses up a lot of ARC, so in the end you will have even less than you configured.

How do you detect that? Please monitor with arcstat

Over 50% of the 256 GB of RAM was in use while I was testing. Even with that much RAM in use, the system hung the moment you tried to move a big file.

After disabling the L2ARC, performance dropped to around 100 IOPS compared to what I posted above; after enabling it again, performance improved dramatically.
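
(For the record, I toggled it by removing and re-adding the cache vdev; the partition name below is only illustrative, not my actual device:)

Bash:
# detach the L2ARC (cache) device from the pool
zpool remove vmpool nvme0n1p4
# attach it again as a cache vdev
zpool add vmpool cache nvme0n1p4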

It was "all good", until I tried to clone a machine of 30 gb, at that point because the ARC/L2ARC had only a limit of 25 gb, the system once again crashed.

My desired result would be some improvement over plain SATA operation, without killing the host or using up all of its RAM...
 
