Poor disk performance

vanadiumjade

New Member
Nov 20, 2020
Hi,

I have a dell r910:
2 x Intel E7-4860 10core 20 thread
256 gig ram
Dell H200 HBA controller (factory firmware)
3 x Samsung 4TB SSD.

I am getting abysmal disk performance, struggling to get past 70-100 MB/s.
The disks are capable of 500+ MB/s, so I would have thought 300+ MB/s should be achievable.

No idea if this is a proxmox issue - but want to know what steps I could take to investigate and how I could improve disk performance.

dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 12.0071 s, 89.4 MB/s

dd if=/dev/zero of=/tmp/test2.img bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 44.5972 s, 11.5 kB/s

Jade
 
You should tell us what kind of storage you are using (LVM/ZFS/Ceph/...), what the options for that storage are, how the drives are used with it (for example raidz1 or some other kind of raid), and what storage options your guests use. It would also be useful if you could tell us the SSD model. If it is a consumer SSD, super slow sync writes can be normal.
 

Samsung 870 SSD:
https://www.pbtech.co.nz/product/HD...CEG09d6Ci-9Suz1A7Qp1LqEbh3R0Mx7xoCfKcQAvD_BwE

At the moment the drives are broken into two groups.

1 disk ZFS; next week I was going to add a mirror drive. This is where all the VM disks live.

2 disks ZFS, no raid. This is used just for TrueNAS NFS shares, where the above VMs store some of their data. (For example, the Nextcloud config is stored on a VM disk in group one, and all user data is stored on an NFS share to TrueNAS on the 2nd pool.)

The data is backed up to another server, so I went with the simplest option to start with.
My last setup was super redundant on the host, but it was really hard to rebuild. The concept this time is that the VMs are easy to rebuild with some tested scripts from another host.

This is a personal home lab, so I don't have an endless money tree. :(
So I will be kicking myself if I spent 2k on drives that are useless! Doh.

Thanks in advance.
Jade
 
dd if=/dev/zero of=/tmp/test2.img bs=512 count=1000 oflag=dsync
What is your ashift? If you sync write 512B blocks to a pool that is set up with an ashift of 12 (4K block size), you get massive write amplification. Never write blocks that are smaller than your smallest block size. Also, your consumer SSDs have no power-loss protection, so they can't cache sync writes; because of that, writes are slower and can't be optimized prior to writing, so your write amplification will explode again.

An example: let's say you try to write a 512B block to a 4K (ashift=12) ZFS pool, and that pool tries to write it to an SSD that actually works internally with a 16K block size (but is lying and reporting itself as using 512B or 4K LBA) and can only erase blocks of 128K:

So you write a 512B block. That is smaller than the blocksize ZFS is using, so it can't just write it; it will need to write 4K. Then your SSD receives the 4K block and can't write it as-is, because it works internally with 16K, so it will write it as a 16K block. But it can't write a 16K block without erasing a complete 128K block first. So it needs to read seven 16K blocks, erase the full 128K, and write the seven 16K blocks plus your new 16K block again. Because this is done as a sync write, the SSD can't do anything else until this write is complete (so no parallelization is possible) and will write one block after another.
So if you write 1000x 512B, actually 1000x 128K is read and written. That's bad for performance and bad for the wear of the SSD.

With an enterprise SSD that wouldn't be such a big problem, because it can still cache sync writes: no data can be lost on a power outage thanks to the power-loss protection.
An enterprise SSD would just merge 1000x 4K into 32x 128K in cache, so it only needs to write 32x 128K.
So for the same amount of data (1000x 512B = 512K), a consumer SSD writes for example 128M (1000x 128K) and an enterprise SSD only 4M (32x 128K). That's why enterprise SSDs are way faster if you have workloads with a lot of sync writes.

With the "dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync" command it isn't that bad because you are just writing a big block where the SSDs can handle it much better because you are not forcing it to write blocks that are smaller than it can handle.

And raidz1 is also bad for IOPS if you use these 3 SSDs that way. A mirror (raid1) or striped mirror (raid10) is generally recommended as VM storage, because you don't need to do all the parity calculations, so there is less overhead. With any raidz1/2/3 you would also need to increase your volblocksize, or you waste a lot of capacity due to padding overhead. And a bigger volblocksize may in turn be bad for stuff like DBs that do a lot of small writes.
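For illustration, a striped mirror over four disks is one pool with two mirror vdevs; the pool name and the /dev/disk/by-id/... paths here are placeholders, not your actual devices:

zpool create -o ashift=12 tank mirror /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2 mirror /dev/disk/by-id/DISK3 /dev/disk/by-id/DISK4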

By the way...89.4 MB/s for a sync write isn't slow.

And to quote the FAQ of the official Proxmox ZFS benchmark again:
Can I use consumer or pro-sumer SSDs, as these are much cheaper than enterprise-class SSD?
No. Never. These SSDs won't provide the required performance, reliability or endurance. See the fio results from before and/or run your own fio tests.
So if you can still return the SSDs, you might want to get something that is enterprise grade instead if you don't like the sync write performance.
But then it is also important not to get an enterprise SSD that is only rated for read-intensive workloads. You want an SSD that is made for write-heavy workloads, or at least mixed workloads.
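If you want numbers you can compare against, a small fio sync-write run in the same spirit as the benchmark paper looks roughly like this; the parameters and the /yourpool/fio-test.file path are illustrative, not the exact ones from the paper:

fio --name=syncwrite --filename=/yourpool/fio-test.file --size=1G --rw=write --bs=4k --ioengine=psync --direct=1 --sync=1 --iodepth=1 --numjobs=1 --runtime=60 --time_based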
 
Thanks so much for the response!

It's probably way too late to return them, so I'm keen at this point to get them into the best shape I can.

How do I check what ashift I set? I suspect I left it at the default of 12. (I couldn't find it anywhere in the Proxmox UI, or any zpool command that returns it.)
Would that be accurate? I saw that drives often lie about sector sizes so the drive stays backward compatible with XP or something like that.

fdisk -l /dev/sdc
Disk /dev/sdc: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: Samsung SSD 870
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 31AA00C3-384F-E749-AB9B-D3A42344598B

Device Start End Sectors Size Type
/dev/sdc1 2048 7814019071 7814017024 3.7T Solaris /usr & Apple ZFS
/dev/sdc9 7814019072 7814035455 16384 8M Solaris reserved 1
 
Some old Evos were using an 8K block size internally. So maybe it's still 8K, but possibly way higher by now; nobody really knows what the SSDs are doing inside. And the block size for erasing (you always need to erase before a write) should be way higher, something like 128K, 256K, or on bad SSDs maybe even higher like 4MB. So there is not much that you can do. You could try ashift 12/13/14 and see if one of them is faster. In case of ashift=14 you would also need to create new VMs, because they are probably using a volblocksize of 8K and it would be bad to write an 8K volblocksize block to a 16K (ashift=14) pool. And for your dd command you would also need to increase the "bs" so that it isn't smaller than your ashift.

You can get your ashift with this command: zpool get ashift YourPoolName
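If you have a spare disk to experiment with, you can create a throwaway pool with an explicit ashift and benchmark it, and you can also check which volblocksize your existing VM disks use; the pool names, disk path and vm-100-disk-0 below are placeholders:

zpool create -o ashift=13 testpool /dev/disk/by-id/SPARE_DISK
zfs get volblocksize YourPoolName/vm-100-disk-0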

But ashift=12 should be fine for most cases, because Windows and so on uses 4K by default too, so manufacturers should have optimized the SSD firmware to work well with 4K to get better benchmarks. Sync writes are just something no consumer really cares about, because consumers don't run databases, so consumer SSDs are simply not built to handle such a workload well. If consumers cared about sync write performance, the manufacturers would have added power-loss protection to the hardware. Same for continuous writes: consumer SSDs are built to deliver high performance for short bursts of IO, and if you write to them consistently the performance will drop. That's another thing no consumer cares about; they want to see high numbers in short benchmarks, not medium performance sustained 24/7.
 
It's ashift=12.

I might get another drive and test different ashifts until I find the best performance, then build another ZFS array.

Then get two better SSDs running in a mirror for the VM disks... and use the 4x4TB for the data array.


Thanks so much for the help here... it's hard when you only know enough to make it worse! haha


Thanks, Jade :)
 
I would like some advice on SSDs for VMs.

I'm thinking about buying a couple of these: https://www.pbtech.co.nz/product/HDDSAM883119/Samsung-PM883-Series-19TB-25in-V4-TLC-V-NAND-Enter
Would these be okay? Or if not, what would you suggest? This is for a home lab and it's already out of control lol... so I would ideally like the best bang for buck that won't have constant IO issues!

Also, what is the best configuration to run for VMs? I was thinking about one drive with a mirror, or will I get better performance with 4 drives?


Thanks in advance, Jade
 
If you are low on money you could try to find some second-hand SSDs. Server hosters will remove them after some years for warranty and reliability reasons, even if they weren't used that much. Especially SSDs made for write-intensive workloads may have a ton of TBW left. But you should ask to see a SMART report before buying, so you can check that the drives are working and how much they were used.

Just one pair of SSDs in a mirror should do the job. But the more SSDs you stripe, the more performance you should get, especially with SATA, where the SATA protocol caps the per-drive performance.
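If you do go that route, the wear and power-on data you would want to see is in the SMART output, for example (the device path is just an example, and the exact attribute names differ between vendors):

smartctl -a /dev/sdX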
 
I'm happy to spend money... but I would be unhappy to buy something that doesn't fix my issue. Would those Samsungs fit the bill?

I presume I can extend with two more drives in the future if I get low on VM space? If I did get another two drives in the future, do I need to rebuild the array, or can I just add them? Do I need to copy everything off the array and then back again so it's striped correctly?


Thanks so much for your help. Just want to make sure I'm making the right decision.


Jade

P.S. The second-hand market in NZ is not really that cheap.
 
Would those Samsungs fit the bill?
I don't own them, so I can't tell you if they are good or not. Maybe you can find some fio benchmarks online so you can compare them with your old disks. They have power-loss protection, so sync writes should be fine. Durability is not that great for an enterprise SSD (they are rated for 1.3 DWPD over only a 3-year warranty; an Evo gets 0.3 DWPD, a Pro 0.6 DWPD, and write-heavy enterprise MLC SSDs get up to 10 DWPD over a 5-year warranty). Not sure how good the performance is, but it sounds like the PM883 is made primarily for big capacity and read-intensive workloads, so I wouldn't expect really great write performance. The SM883 is the version made for mixed workloads and should have faster writes and be more durable (3 DWPD over a 5-year warranty), because it uses faster and more durable but smaller MLC NAND instead of slower and less durable but bigger TLC NAND. On paper both have the same IOPS, but I would guess under heavy load the SM883's performance wouldn't drop as low as the PM883's, because the MLC NAND should offer lower latency.
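As a rough sanity check of that endurance rating (assuming the 1.92 TB PM883 model from that listing):

1.92 TB x 1.3 DWPD x 365 days x 3 years ≈ 2.7 PB of rated writes over the warranty period.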
I presume I can extend with two more drives in the future if I get low on VM space? If I did get another two drives in the future, do I need to rebuild the array, or can I just add them? Do I need to copy everything off the array and then back again so it's striped correctly?
ZFS isn't really striping like a normal raid; it just spreads data across multiple disks. You can add more mirrors later and stripe them to extend the pool. In that case all the old data stays on the old mirror and new data will mostly be written to the new mirror until both mirrors are evenly filled. After that the pool should write half to the old and half to the new mirror. So the capacity adds up, but the write performance won't increase until both mirrors are about equally filled.
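Extending later would then just be adding another mirror vdev to the existing pool, something like this (the pool name and disk paths are placeholders):

zpool add tank mirror /dev/disk/by-id/NEW_DISK1 /dev/disk/by-id/NEW_DISK2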
 