Strange IO performance issues

SirMonkey

Member
Aug 12, 2021
Hey all, I'm hoping someone can explain some of the IO performance issues I'm encountering.

The Build
I've just built a new home server; the hardware is:
Gigabyte X570 Aorus Elite
Ryzen 9 3900X
2x Samsung 32GB 3200MHz ECC DDR4 UDIMM
LSI SAS 9201-8i HBA
2x WD Blue SN550 1TB M.2
2x WD RED wd10jfcx-68n6gn0 1TB
4x huh721010AL5200 10TB

The 10TB drives are the only ones connected to the LSI controller, leaving 4 ports for expansion. All others are connected to the onboard SATA controller. I installed Proxmox 7.0-8.

ZFS Pools
rpool: mirror array of the 2x WD RED wd10jfcx-68n6gn0 1TB (Proxmox OS install, created during setup)
NVMe: mirror array of the 2x WD Blue SN550 1TB M.2
PLEX: RAIDZ1 array of the 4x huh721010AL5200 10TB. Compression=on (which I think means lz4, can someone confirm?), ashift=12.
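(For anyone wanting to check the compression question on the box itself: on a pool where the lz4_compress feature is active, compression=on selects lz4. A quick way to look, using the pool name from above:)
Code:
# feature state; if active/enabled, compression=on means lz4
zpool get feature@lz4_compress PLEX

# achieved compression ratio so far
zfs get compressratio PLEX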

Issues
Very slow write speeds to the PLEX pool
High IO activity on a single ZFS pool (PLEX) seems to be affecting all of them.

When rsyncing all of the data from my old Plex server to my new VM on the PLEX pool, I'm able to achieve full gigabit wire speed for about 2-3 minutes, after which the speed drops to about 20-40 MB/s, and as soon as this happens the IO response time across all storage pools shoots up to insane amounts. Here is a screenshot of taskmgr from a VM running on the NVMe pool while the rsync is running. As soon as I stop the rsync, disk response time returns to normal.


[Attached screenshot: Task Manager in the VM showing disk response times spiking during the rsync]

So for my first question: why would activity on one ZFS pool affect the IO response time on another?
My current theory is that maybe they're all sharing the same cache in RAM, that this is what fills up in about 2-3 minutes, and that after that none of the pools have a cache, but I'm not really sure how to validate or fix that. I also might be entirely wrong as I'm a bit of a ZFS noob! :D
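For anyone who wants to point me at the right counters, these are presumably the relevant global knobs (paths from the Linux OpenZFS module, as shipped with Proxmox):
Code:
# ARC (read cache) size vs. its ceiling; the ARC is shared by all pools
grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats

# the write throttle kicks in once a pool's not-yet-flushed data hits this
cat /sys/module/zfs/parameters/zfs_dirty_data_max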

My second question: why can't the PLEX RAIDZ1 pool sustain writes at full gigabit line speed?
I did a bit of digging into this myself but couldn't figure it out. Here are my steps:

I first destroyed the pool so I could do a write performance test against all four disks independently. I did this by writing 10G to each disk using dd; my results were:
Code:
root@themox:~# for d in 'a' 'b' 'c' 'd'; do dd if=/dev/zero of=/dev/sd$d count=1024 bs=10M; done
1024+0 records in
1024+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 50.7757 s, 211 MB/s
1024+0 records in
1024+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 52.193 s, 206 MB/s
1024+0 records in
1024+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 50.7544 s, 212 MB/s
1024+0 records in
1024+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 52.5369 s, 204 MB/s

To me, the above shows that the system can write to all four disks perfectly fine, as that's roughly the speed I'd expect from a 7200rpm disk, and that none of the disks, cables, or the HBA card are causing the issue, right?
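(If dd is considered too blunt an instrument, I could redo this per disk with fio; something like the following should give a comparable sequential number. Destructive, so only with the pool destroyed, and /dev/sda is a placeholder for each disk under test.)
Code:
fio --name=seqwrite --filename=/dev/sda --rw=write --bs=1M --size=10G \
    --ioengine=libaio --iodepth=8 --direct=1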

I then recreated the PLEX pool with the same defaults as before (compression=on, ashift=12), recreated the Plex VM disk on the pool, and restarted my rsync while paying attention to the output of
Code:
 watch zpool iostat -v PLEX
This then confused me even more! :D

Just like before, the rsync initially started copying at full line speed, with rsync reportedly writing at around 110 MB/s. It continued to do this for around 3 minutes before dropping to about 30 MB/s and nuking the IO response times across all the pools.

The bit that confused me, though, was the output of zpool iostat -v PLEX:
Code:
root@themox:~# zpool iostat -v PLEX
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
PLEX                        4.01T  31.6T      1    670  10.7K  34.0M
  raidz1                    4.01T  31.6T      1    670  10.7K  34.0M
    scsi-35000cca2739b1074      -      -      0    217  3.26K  10.4M
    scsi-35000cca26ab9aad4      -      -      0    126  2.09K  6.82M
    scsi-35000cca2739b1640      -      -      0    218  3.45K  10.4M
    scsi-35000cca2739b14ec      -      -      0    108  1.85K  6.45M
--------------------------  -----  -----  -----  -----  -----  -----

For some reason ZFS is only writing to each disk at around 10 MB/s, and we know from the previous test that these disks are capable of around 20x that.
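(Side note: without an interval argument, zpool iostat prints averages since the pool was imported, so the numbers above also fold in the fast first few minutes. For live rates, something like this is more representative:)
Code:
zpool iostat -v PLEX 5    # fresh per-interval stats every 5 seconds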

So to summarize:
1. How can I prevent high IO activity on one pool from affecting the response time of the others?
2. Why doesn't the RAIDZ1 (PLEX) pool write to disk faster?


Looking forward to your replies!
Many thanks in advance!
 
SirMonkey said:
So to summarize:
1. How can I prevent high IO activity on one pool from affecting the response time of the others?
2. Why doesn't the RAIDZ1 (PLEX) pool write to disk faster?
I got the same mainboard and ECC RAM (and yes, it's RGB ECC XD) for my gaming PC. Keep in mind that only one of the M.2 slots is connected to the CPU. The other one is connected to the mainboard's chipset and shares its bandwidth (4 PCIe 4.0 lanes) with all the other stuff on the mainboard. So all 6 SATA ports + NIC + sound card + one M.2 slot share the same 4 PCIe lanes. If you use that M.2, SATA and NIC performance may drop, and vice versa.

You will always get write amplification. Mine is between factor 3 (big async sequential writes) and factor 81 (4K random sync writes). If I write at 10 MB/s inside the VM with a factor 81 write amplification, the host has to write at 810 MB/s to store it. And if the pool on the host is only capable of writing at 200 MB/s, I wouldn't be able to write at more than 2.47 MB/s inside the VM. How bad your write amplification is depends on several factors: your pool setup, the filesystem used inside the guest, your block sizes, whether you do sync or async writes, and so on. But you will always get more or less write amplification.
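A rough way to measure it for a live workload (just a sketch, assuming sysstat's iostat in the guest: compare what the guest thinks it writes with what the host actually commits over the same window):
Code:
# inside the guest: average MB/s written to the virtual disk
iostat -m 5

# on the host: per-interval writes actually hitting the pool
zpool iostat PLEX 5

# write amplification (guest -> host) is roughly host MB/s / guest MB/s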

Did you change the volblocksize for that pool? If not, and you are using the default 8K volblocksize, you are wasting TBs of usable capacity. With 4 disks and an ashift of 12 it looks like this:
4K/8K volblocksize = you lose 50% of raw capacity (10TB used for parity + 10TB wasted on padding)
16K/32K volblocksize = you lose 33% of raw capacity (10TB for parity + 3.2TB wasted on padding)
64K volblocksize = you lose 27% of raw capacity (10TB for parity + 0.8TB wasted on padding)
So I would at least increase the volblocksize to 16K. Volblocksize is only used for zvols, so datasets or LXCs shouldn't be affected by that padding overhead, because a default recordsize of 128K is used there instead.
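Checking and changing that looks roughly like this (the zvol and storage names below are placeholders; volblocksize is fixed at creation, so existing disks would have to be recreated or moved):
Code:
# what an existing VM disk was created with:
zfs get volblocksize PLEX/vm-100-disk-0

# tell Proxmox to create future zvols on that storage with 16K blocks:
pvesm set your-plex-storage --blocksize 16k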
 
Hey Dunuin! Thanks for getting back!
Dunuin said:
I got the same mainboard and ECC RAM (and yes, it's RGB ECC XD) for my gaming PC. Keep in mind that only one of the M.2 slots is connected to the CPU. The other one is connected to the mainboard's chipset and shares its bandwidth (4 PCIe 4.0 lanes) with all the other stuff on the mainboard. So all 6 SATA ports + NIC + sound card + one M.2 slot share the same 4 PCIe lanes. If you use that M.2, SATA and NIC performance may drop, and vice versa.
Nice! It was my old gaming rig as well, until I got the M-ITX one for my Node 202 case (and yes, that one has RGB XD). I was aware of the second slot sharing with the mainboard chipset, but didn't think it would ever be an issue as I thought PCIe 4.0 lanes were close to 2 GB/s each. The SSDs are also Gen 3, but if I start seeing issues I guess I'd start with a separate NIC.

Dunuin said:
You will always get write amplification. Mine is between factor 3 (big async sequential writes) and factor 81 (4K random sync writes). If I write at 10 MB/s inside the VM with a factor 81 write amplification, the host has to write at 810 MB/s to store it. And if the pool on the host is only capable of writing at 200 MB/s, I wouldn't be able to write at more than 2.47 MB/s inside the VM. How bad your write amplification is depends on several factors: your pool setup, the filesystem used inside the guest, your block sizes, whether you do sync or async writes, and so on. But you will always get more or less write amplification.
Interesting, I've never really thought about amplification as data goes up the storage stack; performance and capacity losses though, definitely. How do you find out the write amplification for a workload? I'd have thought this would have been sequential, as I'm just rsyncing files to a newly formatted disk.

As for what I'm using within the guest VM, it's Ubuntu 20.04 with its OS disk on the NVMe pool and a separate disk on the PLEX pool formatted with xfs:
Code:
root@plex-server:~# xfs_info /dev/sdb
meta-data=/dev/sdb               isize=512    agcount=10, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=2684354550, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Dunuin said:
Did you change the volblocksize for that pool? If not, and you are using the default 8K volblocksize, you are wasting TBs of usable capacity. With 4 disks and an ashift of 12 it looks like this:
4K/8K volblocksize = you lose 50% of raw capacity (10TB used for parity + 10TB wasted on padding)
16K/32K volblocksize = you lose 33% of raw capacity (10TB for parity + 3.2TB wasted on padding)
64K volblocksize = you lose 27% of raw capacity (10TB for parity + 0.8TB wasted on padding)
So I would at least increase the volblocksize to 16K. Volblocksize is only used for zvols, so datasets or LXCs shouldn't be affected by that padding overhead, because a default recordsize of 128K is used there instead.
Nope, I used the defaults like a mug! XD I'm going to remake the PLEX pool shortly, I think.
Would this also have anything to do with the fact that there seems to be 16.34 TB used on the PLEX pool after creating a single 10TB disk? And maybe the performance issues....

I'm basically hoping this is the answer to all my problems XD

Thanks
 
SirMonkey said:
Hey Dunuin! Thanks for getting back!

Nice! It was my old gaming rig as well, until I got the M-ITX one for my Node 202 case (and yes, that one has RGB XD). I was aware of the second slot sharing with the mainboard chipset, but didn't think it would ever be an issue as I thought PCIe 4.0 lanes were close to 2 GB/s each. The SSDs are also Gen 3, but if I start seeing issues I guess I'd start with a separate NIC.
It's not only the NIC + sound card + M.2 + SATA. It's also all the USB ports and one of the PCIe slots. You are right, 4x PCIe 4.0 is 8 GB/s of bandwidth. But sum up all the stuff:
6x SATA = 3.6 GB/s
1x M.2 = 8 GB/s
1x NIC = 0.125 GB/s
1x PCIe slot (16x physical, 4x electrical) = 8 GB/s
3x USB 3.2 Gen2 = 7.5 GB/s
4x USB 3.2 Gen1 = 5 GB/s
8x USB 2.0
1x sound
...
So that's basically 32 GB/s of bandwidth that needs to be pushed between chipset and CPU through an 8 GB/s link. It should be fine as long as you are not using a Gen4 M.2 or that one PCIe slot, which is why that slot is basically useless.
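If you want to see what link a given device actually trained at, lspci shows the negotiated speed and width (the 01:00.0 address is just a placeholder, look yours up with plain lspci first):
Code:
lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'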

SirMonkey said:
Interesting, I've never really thought about amplification as data goes up the storage stack; performance and capacity losses though, definitely. How do you find out the write amplification for a workload? I'd have thought this would have been sequential, as I'm just rsyncing files to a newly formatted disk.

As for what I'm using within the guest VM, it's Ubuntu 20.04 with its OS disk on the NVMe pool and a separate disk on the PLEX pool formatted with xfs:
Code:
root@plex-server:~# xfs_info /dev/sdb
meta-data=/dev/sdb               isize=512    agcount=10, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=2684354550, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
I'm right now doing a lot of write amplification tests in this thread.
There's a lot of stuff like this:
4K sync/async write/read on a 2-disk ZFS mirror using ashift=12 & volblocksize=4K, comparing ext4 with journaling vs ext4 without journaling vs xfs with journaling:
Code:
                     ext4 journal  ext4 no journal  xfs journal  ext4 journal  ext4 no journal  xfs journal
                     sync          sync             sync         async         async            async
Write Performance    1.87 MiB/s    4.26 MiB/s       2.4 MiB/s    125 MiB/s     133 MiB/s        129 MiB/s
Read Performance     9.25 MiB/s    9.67 MiB/s       9.67 MiB/s   295 MiB/s     282 MiB/s        295 MiB/s
W.A. fio -> guest    5.11x         2.24x            2.86x        1.01x         1x               1x
W.A. guest -> host   10.02x        9.67x            12.34x       3.67x         2.17x            4.20x
W.A. host -> NAND    1.11x         1.15x            1.09x        1.46x         1.46x            1.41x
W.A. total           56.84x        25.19x           38.38x       5.44x         3.16x            6.53x
W.A. total is the total write amplification from the fio benchmark inside the guest down to the NAND cells of the SSDs. So I only got 2.4/9.67 MiB/s write/read speed, with a factor 38 write amplification, when doing 4K sync writes to XFS. All the same but with async instead of sync IO gets up to 129/295 MiB/s with a write amplification factor of only 6.53. You can see that sync IO is really terrible... so don't wonder if your DBs are quite slow^^
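For reference, the fio invocations for those two cases look roughly like this inside a guest (a sketch, the test file path is a placeholder):
Code:
# 4K random sync writes, every write flushed before the next:
fio --name=sync4k --filename=/mnt/test/fio.bin --rw=randwrite --bs=4k \
    --size=1G --ioengine=psync --fsync=1

# same pattern async, queued writes with no per-write flush:
fio --name=async4k --filename=/mnt/test/fio.bin --rw=randwrite --bs=4k \
    --size=1G --ioengine=libaio --iodepth=16 --direct=1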
SirMonkey said:
Nope, I used the defaults like a mug! XD I'm going to remake the PLEX pool shortly, I think.
Would this also have anything to do with the fact that there seems to be 16.34 TB used on the PLEX pool after creating a single 10TB disk? And maybe the performance issues....

I'm basically hoping this is the answer to all my problems XD

Thanks
Yes, most of that additional 6TB should be the padding overhead, which you could prevent by using a bigger volblocksize.
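The rough arithmetic behind that (8K volblocksize, ashift=12, 4 disks in RAIDZ1):
Code:
# each 8K block = two 4K data sectors + one 4K parity sector = 3 sectors,
# but RAIDZ1 allocates in multiples of (parity + 1) = 2 sectors,
# so it rounds up to 4 sectors: 16K of raw disk for every 8K written.
# that is 2x raw usage instead of the ideal ~1.33x for a 4-disk RAIDZ1,
# and the difference surfaces as extra 'used' space, which is the right
# ballpark for a 10TB zvol showing ~16TB used.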
 