On our Proxmox servers we've been surprised at how much slower the VM clients are than the host. How much disk IO overhead is normal or acceptable for VMs?
These tests were done on a server running Proxmox VE 6.2-15; writing to the dpool on the host machine itself, and then performing the same write benchmark within the client, which has its disk stored on the same dpool.
Code:
NAME                                            STATE     READ WRITE CKSUM
dpool                                           ONLINE       0     0     0
  mirror-0                                      ONLINE       0     0     0
    ata-WDC_WD3000FYYZ-01UL1B1_WD-WCC132177822  ONLINE       0     0     0
    ata-WDC_WD3000FYYZ-DATTO-1_WD-WCC131198607  ONLINE       0     0     0
  mirror-1                                      ONLINE       0     0     0
    ata-WDC_WD3000FYYZ-DATTO-1_WD-WCC131330748  ONLINE       0     0     0
    ata-WDC_WD3000FYYZ-DATTO-1_WD-WCC1F0366551  ONLINE       0     0     0
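For reference, the dataset properties that most affect sync-write behaviour can be read on the host like this; the zvol name below is only a guess based on the VM config further down, so check zfs list first:
Code:
# on the Proxmox host; the zvol name is assumed, check 'zfs list -t volume' first
zfs get sync,compression,recordsize,atime dpool
zfs get sync,volblocksize dpool/vm-105-disk-0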
Testing overall throughput of large file writes...
Code:
root@chestnut:~# dd if=/dev/zero of=/dpool/test bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 9.10997 s, 118 MB/s
I ran 5 tests of this same command, with the following results:
MIN: 99.6 MB/s MAX: 119 MB/s AVG: 108.92 MB/s
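A loop along these lines reproduces the five-run measurement (a sketch, not the exact wrapper used):
Code:
# repeat the same dsync write five times; dd prints its stats on stderr
for i in 1 2 3 4 5; do
    dd if=/dev/zero of=/dpool/test bs=1G count=1 oflag=dsync 2>&1 | tail -n 1
    rm -f /dpool/test
done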
On this same host, now testing from a VM:
Code:
boot:
cores: 8
cpu: host
ide2: local:iso/ubuntu-20.04-live-server-amd64.iso,media=cdrom
memory: 12288
name: cloudberry
net0: e1000=2A:5B:65:68:19:BE,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
parent: before_install
scsi0: VHDs:vm-105-disk-0,cache=none,size=500G
scsihw: virtio-scsi-single
smbios1: uuid=c73ce295-12a4-41fb-9925-0f29dbfefd7e
sockets: 1
vmgenid: 9114d856-946b-42ef-a55d-66d0676f0201
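For reference, the cache mode on scsi0 can be changed from the host with qm set; the extra options in the comment are just other knobs that exist on the same line, not settings we have benchmarked:
Code:
# switch the existing scsi0 disk of VM 105 between cache modes
qm set 105 --scsi0 VHDs:vm-105-disk-0,cache=none,size=500G
qm set 105 --scsi0 VHDs:vm-105-disk-0,cache=writeback,size=500G
# other options available on this line (untested here): iothread=1, aio=native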
With cache=none, the results of 5 runs are as follows (command: dd if=/dev/zero of=test bs=1G count=1 oflag=dsync):
MIN: 32.4 MB/s MAX: 112 MB/s AVG: 89.22 MB/s
Result: roughly an 18% average performance hit for the VM vs. bare metal on large writes, and a much greater spread in the results. 32 MB/s??
Now testing smaller writes using the following command:
dd if=/dev/zero of=test bs=512 count=1000 oflag=dsync
Average of 5 runs on bare metal, the Proxmox host:
MIN: 41.3 kB/s MAX: 52.2 kB/s AVG: 49.08 kB/s
Average of 5 runs on the Ubuntu VM described above:
MIN: 21.4 kB/s MAX: 22.4 kB/s AVG: 22.04 kB/s
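Back-of-the-envelope, those averages work out to roughly 49,080 / 512 ≈ 96 synchronous 512-byte writes per second on the host versus 22,040 / 512 ≈ 43 per second in the VM, i.e. about 10 ms vs. 23 ms per dsync write, so the gap looks like added per-write latency rather than lost bandwidth.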
Result: 55% performance hit, which hardly seems acceptable?
We also tested with cache=writeback and the results were within a percentage point or two. On another system, where the ZFS raid is configured with an L2ARC cache and a ZIL, the max throughput on the client is 90% lower than on the host.
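To see where the sync writes actually land on that system (ZIL device vs. data vdevs), watching the pool from the host while the guest benchmark runs is informative; the pool name here is just the one from above:
Code:
# run on the host while the guest benchmark is going
zpool iostat -v dpool 1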
So in summary, we're seeing a 30-90% disk IO performance penalty on ZFS raid between the host and the client.
Any suggestions on how to resolve this would be much appreciated.