New Install, Doing Storage Right (I hope?)

tufkal

New Member
Feb 3, 2018
Greetings all,

I currently have a small production server running <10 VMs that is having performance problems, most notably high I/O wait (WA). All of my VMs are backed up to an NFS share on a NAS weekly, so my plan is to tear the box apart next weekend, reinstall PVE, and restore the backups.

Current configuration
PVE 4.3
12 Core 6000 Series AMD Opteron CPU
24GB DDR3 ECC/REG Memory
2x1TB 7200rpm Standard Drives in RAID1 (via ZFS) for PVE & ISOs
8x1.5TB 7200rpm Enterprise Drives in RAID-Z2 (via ZFS) on 3WARE 9650SE (JBOD) w/o BBU for VMs

Projected Configuration
PVE 5.1
12 Core 6000 Series AMD Opteron CPU
48GB DDR3 ECC/REG Memory
2x1TB 7200rpm Standard Drives in RAID1 (via ZFS) for PVE & ISOs
4x500GB 3D NAND SATA3 SSDs in RAID-Z (via ZFS) on PCI-E SATA3 4 Port Controller for VMs
8x1.5TB 7200rpm Enterprise Drives in RAID-Z2 (via ZFS) on 3WARE 9650SE (JBOD) with BBU for Data Drives

As you can see the key differences are:
-More memory
-RAID-Z1 of SSDs for the VMs
-Adding a BBU to the 9650 for write caching (I think this will help? Does write caching even work in JBOD?)
-Will add chunks of storage from the 8x1.5TB pool to VMs that need it.

-----

First question: does anyone see anything alarming or glaringly wrong with that setup? Am I missing something completely that shows how little I know about PVE and ZFS?

-----

Second question: the setup. Typically I run the install wizard and only let it create the RAID1 of the 1TB drives for PVE, then on first boot run the zfs/zpool commands from the console for the other pools. The trick here is that I want to take advantage of every bit of speed ZFS can offer, including anything special for the SSDs (which I have never used with ZFS, TRIM?).

What would be my optimal zfs/zpool commands to create the SSD pool?

Likewise, for the enterprise drive pool on the 3WARE 9650, what would be my optimal zfs/zpool commands to create that large RAID-Z2 pool?

NOTES:
The only flag I think I set on my original setup was 'zfs set compression=lz4' on the large 8 drive pool.

Any help or pointers are greatly appreciated.
 
It's been a while since I looked into ZFS and I'm inexperienced, but here's what jumps out at me:

Performance on ZFS is driven by caching. There's the ARC, which is system RAM devoted to caching recently used data so access is nearly instantaneous. Then there's the L2ARC, which holds less-recently-used data; it's generally kept on fast storage, and RAM is used to index it. So if you want to increase read speeds, adding system RAM and devoting more of it to the ARC is worth doing. (It's not always worth it to have an L2ARC, since indexing it eats into the RAM you'd otherwise use for the ARC.)
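If you do bump the RAM, it can also be worth capping how much of it the ARC is allowed to grab so the VMs keep enough for themselves. A minimal sketch, assuming ZFS on Linux under PVE and an arbitrary 8GB cap (the value is in bytes, adjust for your RAM):

Code:
## cap the ARC at 8GiB - pick a number that leaves room for your VMs
echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf
update-initramfs -u    ## needed on PVE since ZFS loads from the initramfs
## after a reboot, verify the limit:
cat /sys/module/zfs/parameters/zfs_arc_max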

Now, writes are different. If I remember correctly, ZFS is designed to hold pending writes, adding them to a single transaction group, and syncing the entire pile of pending writes to disk every 5 seconds in a way that ensures disk integrity stays intact should something fail between (or during) transactions. This can speed things up in general because the transactions can be re-ordered and optimized, and since the hard drive is the slowest part of the computer this can lead to big overall gains. The problem is this:

For writes that need to be committed in the meantime (think critical stuff that can't be lost), the software generally waits until the storage subsystem (ZFS) says "yep, it's been written to disk." The way to speed this up with hardware RAID is a nonvolatile RAM write cache: the data is copied to NVRAM quickly, the controller reports back that it's been written, and if the power fails the data is still there, ready to be written once power is back up.

ZFS does something similar, with a ZFS Intent Log which grabs pending writes between those 5 second transaction writes. If you move this log to fast SSD (google ZIL and SLOG) then you can greatly reduce the wait times and increase overall performance. (Remember: by default in a virtual environment everything written from VMs waits until it gets confirmation from ZFS that it's been written, because the virtualization system doesn't know if that's a nude you downloaded that's being written, or a $250,000 e-commerce order that requires ACID compliance on its writes. Everything is being treated as critical data, so everything needs a confirmation it's been protected by being written to disk.)
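For what it's worth, you can see (and, carefully, change) that behavior per dataset; a quick read-only check, where "tank/vm-disks" is just a placeholder dataset name:

Code:
## sync=standard means sync requests are honored (and go through the ZIL),
## sync=disabled means they are not - don't flip this casually
zfs get sync,logbias tank/vm-disks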

So, I'd add a fast SSD for each ZFS pool, partition it appropriately (roughly 5 seconds' worth of the pool's maximum write throughput, to match that transaction interval), and see how things look then.
 
A bit of Googling has me interested. Instead of using SSDs for the VMs and 7200s for the data, there's this concept of ZIL/ARC and mixing them all together to get the best of both worlds. Unfortunately I know nothing about how to implement it, and there seems to be a lot of ZFS misinformation out there since SSDs came into play.

You said it's been a while since you looked at ZFS. With 8 1.5TB drives and 4 500GB SSDs, what would you recommend for setting up PVE? Something else entirely?
 
You are getting close, but you will need to make some edits to that config... A couple of observations-

The 3ware cards gave me terrible performance on another platform back when I used them; in fact I think I had that exact model and had to ditch it for another card.

That aside, it is not an HBA, and this is the glaring issue for ZFS: it will give pretty bad performance because, even in JBOD mode, the OS still sees a RAID device instead of the real drive. I tried the JBOD trick on a Dell H710 - a MUCH higher rated card than your 3ware - and it was "so-so"; switching to an HBA gave me much better perf. In fact I could actually watch data hit the disks with iostat in *MB/s*, whereas before it was like ZFS was reluctant to send data at KB/s. An HBA allows ZFS to monitor the exact state of the drive at all times.

That is a pretty old card. If you want something cheap I might suggest a Dell H310 (non-mini) or IBM M1015; either can be flashed to LSI 9211-8i IT mode (pure HBA). If you are not willing to swap in an HBA, don't use ZFS on the 3ware; if it has the BBU, just run normal LVM on it and you will get 30-40% better write perf.

SSDs... these sound like SanDisk or Samsung desktop drives. They would be fine for a RAID10 ZFS array, but only 4 won't do much for RAIDZ, and I would not use them as cache drives; you might instead give your VMs separate virtual disks on the SSD array. I remember the Samsung Pro drives have a write life of around 150TB.

RAM - I may be wrong, but I feel you need at least 64GB before adding an SSD read cache (L2ARC). You can add a write cache (SLOG) anytime, but it needs to be an industrial SSD with high write endurance, most likely not the ones you've got. The ZIL only needs a 4-5GB partition, the L2ARC 100-200GB depending on your RAM, since the L2ARC index consumes RAM. I would suggest the Intel Optane 900p for a ZIL; it has an expected write life of over 5 petabytes for under $400. If you need something budget minded, you might look at the Sun F40 or F80 on ebay. Your ZIL needs good write speed (reads don't matter); the L2ARC needs good read speed. The ZIL should always be mirrored, the L2ARC can be striped. Unfortunately I have yet to see the ideal small ZIL device for a good price, but newer tech dwarfs the speed of the older, smaller ones at a larger size and better price, so you might check a few of these or these.

RAIDZ on spinning drives... that is good for space, but not for IO. What are your 10 guests running? If they are IO intensive, e.g. a high-traffic database, you may not like it; or you can place the DB on a virtual drive on the SSD array and put bulk storage on your spinners, which could work fine. Otherwise, a ZFS RAID10 on the spinning drives will give much better IO.

The SSD cache drives are not magic; they will not cache 100% of your traffic, so you will on occasion see the raw speed of the spinners.

If you want to imagine what speed you will have on different SSD or spinners, check this: https://calomel.org/zfs_raid_speed_capacity.html
 
If you are determined to do ZFS, ideal config:

- RAID1 for PVE boot.
- RAID10 of 8 drives
- 3x PCIe SSD, 2 for ZIL/SLOG, and 1 for L2ARC read cache.
- 4x desktop SSD: I don't know what to do with these; if you have slots, use them for a faster RAID10 array if your use case can put anything on there. You could *fool* with striping these as a read cache; loss of a read cache will not harm your data, loss of a write cache WILL HARM data.

Enable compression on all ZFS pools; this will give a speed boost.

1. Install PVE as you said, using the installer to mirror the boot drives.
2. Identify all your disks properly: use lsblk, then find the ID to use in ZFS. Avoid using drive letters; plugging in a USB or other drive and rebooting could re-order your letters.

ls -l /dev/disk/by-id

Then set all your ZFS drives to GPT:
Code:
parted /dev/sdc mklabel gpt
parted /dev/sdd mklabel gpt
## do all of them

Then make your RAID10:
Code:
mkdir /mnt/bigraid10
zpool create -f -m /mnt/bigraid10 bigraid10 mirror /dev/disk/by-id/ata-ST500DM002-1BD142 /dev/disk/by-id/ata-ST500DM002-1BD144 ## that would be sdc and sdd in theory
## for brevity I will use sdx next, but you should always use the by-id mapping instead!!
zpool add bigraid10 mirror /dev/sde /dev/sdf
zpool add bigraid10 mirror /dev/sdg /dev/sdh
zpool add bigraid10 mirror /dev/sdi /dev/sdj
zfs set compression=lz4 bigraid10
## Add the ZIL (SLOG) mirror for your spinners - sdk/sdl here are two fast SSDs:
## First partition them down to ~5GB:
parted /dev/sdk mklabel gpt mkpart zfs 0% 5G
parted /dev/sdl mklabel gpt mkpart zfs 0% 5G
zpool add bigraid10 log mirror /dev/sdk1 /dev/sdl1
## add a couple of l2arc drives in a stripe (if you were to *play* with desktop ssd):
zpool add bigraid10 cache /dev/sdm /dev/sdn
## The L2ARC drives should also be partitioned down in proportion to your RAM

Then add the ZFS pool to the storage GUI in PVE; you can also add /mnt/bigraid10 as a *directory* storage where you can save ISOs, backups, CTs, etc.
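If you'd rather script that last step than click through the GUI, pvesm should be able to do it; a sketch, where the storage IDs are just names I made up and the pool/path match the example above:

Code:
## register the pool for VM disks (zvols)
pvesm add zfspool bigraid10-vm --pool bigraid10 --content images,rootdir
## and the mountpoint as a directory store for ISOs, backups, templates
pvesm add dir bigraid10-dir --path /mnt/bigraid10 --content iso,backup,vztmpl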

Some commands to help tune/maintain your pool:
Code:
zpool status                       ## pool and vdev health
zfs list                           ## datasets and space usage
arcstat 2 100                      ## ARC hit rate, sampled every 2 seconds
zpool iostat -v bigraid10 3 100    ## per-vdev IO, sampled every 3 seconds
## put this in cron weekly (after hours):
zpool scrub bigraid10
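For the weekly scrub, a cron.d entry along these lines should do it (the file name and the 02:00 Sunday schedule are just my choice; note that newer zfsutils packages may already ship a monthly scrub job in /etc/cron.d, so check before doubling up):

Code:
## /etc/cron.d/zfs-scrub -- scrub every Sunday at 02:00
0 2 * * 0 root /sbin/zpool scrub bigraid10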
 
Great code examples, now I'm torn.....

Do I go that route with a full big ZFS pool with SSD caches, or do I use the raid card's RAID6 on the spinners and LVM the SSDs.......

All of the machines are idle 95% of the time, and only a few of them need access to the big 8 drive storage.

If I kept the 8 drives on the card (and possibly upgraded the card), what's the best use of the 4x SSD? Does ZFS play nice with SSDs as far as TRIM and all that, so I can just RAIDZ them? What would be an example command to create a pool when all the members are SSDs?
 
On your original system you talk of IO issues... I am highly suspicious of the 3ware. ~8 years back I used 20-30 of those on both Windows and Linux servers on bare metal and perf was on the low side; I used one on a low-load hypervisor with 1 VM! and it was unbearable, and it wasn't even RAID5/6. I would hope you could spend the $20-30 for a non-RAID card of some LSI variant, or a RAID card that can be converted to non-RAID, like the H310/M1015. This is your lowest hanging fruit, and may even cost less than a BBU.

But don't take my word on this; google it for slowness in general, and of course more specifically with ZFS... Think about it: this card was released in 2006, 12 years ago!! It was their very first PCIe card, and it did not go well back then.

Anyway, proceed methodically:
Step 1. Benchmark your current storage with just one VM if possible; use Anvil or fio.
Step 2. Upgrade the system and run the same benchmark on a single VM on the RAID card as normal LVM. I have a feeling you will notice a *slight* improvement with a BBU, but most likely not enough to solve your issue. Worst case scenario: you will be no worse off than before.

Do all of that on the same spinning drives so you can compare apples to apples; then, if you wish, do another run on the SSDs.
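For the fio runs, something along these lines gives you a number you can compare across configs (the test file path and size are placeholders - point it at the storage you are measuring, and run the identical command each time):

Code:
## 4k random writes, direct IO, 60 seconds
fio --name=randwrite --filename=/mnt/test/fio.bin --size=4G \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 \
    --direct=1 --runtime=60 --time_based --group_reporting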

RAID5/6 will give bad write throughput on any card, and worse on a 3ware. Throwing ZFS on top of that will only drag your write performance down further; an SSD cache will most likely give a minor improvement, with frequent slowness.

With RAID6 you give up 2 of the 8 drives, which leaves you only 2 drives' worth of capacity above a RAID10 on the same platform (9TB vs 6TB); the same goes for ZFS, and you will get a 30-40% write improvement moving to ZFS mirrors instead of RAIDZ. If you are set on RAIDZ, I would suggest putting the majority of your workload on the SSDs as a separate array, and giving separate virtual disks on the spinners to the VMs that need lots of space.

Know this: ZFS is not magic. Its primary goal is NOT speed but data integrity, and in fact it makes the drives work twice as hard. This is why many users try to breathe life back into it with all these caching methods, and if your data behavior does not fit the cache scenario, you will see the bare disk speed for what it is. That's not to say you can't make ZFS acceptably fast; you just have to pay attention to a lot more details than with LVM/hardware RAID (on better cards). You don't just throw a cache drive at it and call it a day.

ZFS fully supports SSDs/TRIM. Without knowing your use case I am not sure how the SSDs fit in the picture: if you have more HBA ports/slots and your VMs can fit on them, do it; it may save you time in the long run on maintenance like updates/reboots. If not, return them if possible and upgrade to enterprise cache devices like the Optane 900p or Sun F40/F80. ZFS will definitely trash a desktop SSD in short order if you try to use it as a write cache. I have a 250GB Samsung 860 that I've used for less than a month, with only one 30GB VM that is often turned off (I am only using it for some benchmark comparisons, as a cache in front of 6 spinning drives), and in that short time it has already hit 1% of its write wear.

You still have not told us much about your use case. Are they Windows VMs? What is in your storage.cfg? Are you using raw block devices (i.e. LVM raw) or qcow2 on ext4? Are you using virtio drivers? There are lots of areas to improve perf.

Here are some old comparisons I ran following a similar train of thought (running ZFS on JBOD), but note I am using a ~$700 RAID card with 1GB cache + BBU vs your 256MB cache: https://forum.proxmox.com/threads/storage-format-benchmarks.35119/ I plan to add some benchmarks for ZFS on a bare HBA soon.

To set up a mirror, simply run the same zpool create command I show above on the first pair of disks, then add the 2nd pair to it with zpool add; you can keep adding pairs to the same pool. Or for RAIDZ, you can do zpool create myarray raidz2 disk1 disk2 disk3 disk4, etc. More on the wiki: https://pve.proxmox.com/wiki/ZFS:_Tips_and_Tricks
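Spelled out, roughly (the pool names and by-id paths below are placeholders, substitute your own):

Code:
## RAID10 style: create a mirror, then stripe more mirrors into the pool
zpool create -f tank mirror /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2
zpool add tank mirror /dev/disk/by-id/DISK3 /dev/disk/by-id/DISK4
## or one raidz2 vdev across all 8 spinners
zpool create -f bigz2 raidz2 /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2 /dev/disk/by-id/DISK3 /dev/disk/by-id/DISK4 /dev/disk/by-id/DISK5 /dev/disk/by-id/DISK6 /dev/disk/by-id/DISK7 /dev/disk/by-id/DISK8
zfs set compression=lz4 bigz2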

Lastly, an LVM volume won't complain if you bring it to 99% full, but ZFS will drop in performance once you hit about 85% capacity. So yes, this may drive you away from RAID10, but maybe you need larger disks, or simply adding 2 more drives would give a RAID10 the same capacity as the RAID6, and 2 more spindles = more IO. Obviously it boils down to how many bays/ports you have available... a great excuse to use PCIe SSD cards and keep the bays open for big spinners.

Again, note the vendor recommendation is *RAID10*: https://www.proxmox.com/en/proxmox-ve/requirements
[/babbling]
 

Very good information. Let me see if I can clear things up for you so you can make a proper recommendation.

-8 VMs, a mix of Debian and Windows 7, doing various things (PBX, file server, managed backup server, file sync node, etc.)
-Incredibly slow with the current setup of RAIDZ2 across the 8 spinners.
-All VMs are usually <5% CPU all day; all are on-demand and none too intensive when demanded.
-The 9650SE is being used because it's the 16-port version in a chassis with 12 hot-swap bays. Basically I'm using it as a simple SATA controller that lets me attach that many disks. The BBU was not used originally as I thought it wasn't needed in JBOD (still not sure on that).
-I have some Samsung 960 PROs available to help speed things up, and my original research led me to look at using them as caches for ZFS.
-VMs were created with PVE defaults (ext4/qcow2/virtio), no direct mapping to drives.

After learning some things in this thread and doing some of my own research, I am rethinking whether to use ZFS at all. I'm now thinking I should simply do LVM/mdadm RAID on the SSDs for the VMs to live on, and use the 3ware's RAID6 on the spinners. That gives me what I want in the end, fast VMs and large storage, albeit in two places.

Am I wrong to now be moving away from a single ZFS setup?
 
Let us know how that 3ware turns out... benchmark it, know your performance as an actual number, and associate that with what you experience in your VM workload... I suspect it will do about 300MB/s on RAID6 - I wouldn't want too many Windows VMs touching that. I would not touch the 3ware with ZFS, but the only way to know how bad it is..... [falls off the cliff trying 3ware].

The transfer rate on the 9650 is 300MB/s (SATA2), so putting a 550MB/s SSD on it would be a waste - go buy some cheaper SSDs to match that cheap card.

If you are on a budget, you are only about $50 away from getting a proper ZFS system up, and you already spent ~$1500 by the sound of it, so why not fix it right? You could use a pair of LSI cards at $20-30 each, giving you yet more performance by splitting your load across multiple cards. Putting more spindles on a single 3ware just piles up your bottleneck. Why do that, when there's this- https://www.ebay.com/itm/Dell-PERC-H310-RAID-Controller-Card-153/232681137395

The Samsung 960 is a desktop SSD; using them in an array would be fine, but they are not intended to be used as a write cache in a server. You say your workload is low (knowing the OS is nice, but what are your Debian guests actually doing?), so they may work out fine, but that would be a waste of a costly SSD, and you may not even need it (you say it's low load). The write cache only needs 5GB max, and most experts say don't waste your time on a read cache unless you have 64GB of RAM; even then the read cache needs around 200GB max. If your budget has come to an end, you can get a 2nd-hand F40 with the proper ratings for about $100, sometimes less-
https://www.ebay.com/itm/Sun-Flash-Accelerator-F40-400GB-Solid-State-Memory-p-n-7026993/391966591725
If you are on a tighter budget, you could try the F20 for a measly $20-
https://www.ebay.com/itm/Sun-Micros...s-Flash-Accelerator-F20-SAS-Card/382376113388

ZFS could work extremely well on any of your SSDs as long as they are not plugged into a 3ware card; if you are set on that card, I would avoid ZFS.

If it were me, I would ZFS the SSDs on a stand-alone HBA and run the majority of my workload on that, definitely anything write intensive. Put the other drives on another stand-alone HBA, and most likely you will not need any caching at all, as the majority of your workload should be coming from the SSDs. ZFS should give you great read performance on the spinning drives; if your application is mostly reads it will work very well.

If you are not willing to address the glaring issue at hand, a *12-year-old SATA2* card, then stay away from ZFS. Otherwise you are dumping heaps of cash into a wheelbarrow with holes in it - go to town with that.

If you do want to pay attention, start by benching your Win7 VM: run Anvil's storage benchmark, open Task Manager > Resource Monitor and watch the disk queue length there, open the PVE server page and watch the server summary for IO wait - use all the tools to tell you where you stand.
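On the PVE host itself, iostat from the sysstat package is handy for the same thing while a benchmark runs (assuming you don't mind installing it):

Code:
apt-get install sysstat
## %util and await per device, refreshed every 2 seconds
iostat -xm 2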
 
You were really worried about ZFS and SSD TRIM - well, the 3ware does not support TRIM, unless they have come out with an update (I don't think Broadcom cares to support them anymore).

This is what you could be looking at-
https://forum.proxmox.com/threads/ssd-on-raid1-beware-of-disk-degradation-over-the-time.19055/

On another note, if you are going to sink more into that card by getting a BBU: I think it has a setting to force writeback without a BBU, which should show you the exact performance you'd get from a BBU without wasting further time/$$.
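If I remember the 3ware CLI right, it is something along these lines - treat the controller/unit numbers as guesses, and only force writeback for testing, since a power loss without a BBU can eat data:

Code:
## tw_cli syntax from memory - verify your controller and unit numbers first
tw_cli /c0 show
tw_cli /c0/u0 set cache=on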
 
OK ok ok!! I'm going to buy an HBA, your case is rock solid. It's a reality check that a card exists that can do everything I want the 9650 to do, but better, for <$50.

The plan is going to be ZFS RAIDZ for the SSDs on one H310, and ZFS RAIDZ2 for the spinners on another H310. There are really only a few VMs that need more than 100GB of storage and will need space from the bigger spinner zpool, so I think keeping the SSDs and spinners separate makes the most sense (VMs get pure SSD).

For around $80 it looks like I can get two, and then I can stop worrying and get down to business.

You mentioned 'non-mini' in a previous post about the H310, and referenced a way to flash it with new firmware. I'd be interested in that process, and in making sure I get the right one.

Do you have a PO box I can send my 9650 to you so you can break it over your knee in victory?!? :p
 
I bought a pair for $20 once ;) As long as it looks like a normal PCIe card, it's not the mini - that will be really obvious.

Flashing takes ~5-10 minutes and a couple reboots- https://techmattr.wordpress.com/201...-mode-dell-perc-h200-and-h310/comment-page-1/

Towards the bottom of that post there is a summary of the commands in 6-7 lines... Most importantly, be sure to note the SAS address prior to flashing. I would install/flash one card at a time to avoid mixing up the units.

Just keep that 3ware someplace where it can collect lots of dust.
 
...ok something is bugging me, and since I have the attention of someone who knows quite a bit I'll just ask.

I'm concerned about doing a RAIDZ of SSDs, no matter what the HBA/card. RAIDZ is glorified RAID5. Won't all that parity I/O tear those SSDs up super fast?

Maybe I was not clear, but I wanted to do RAIDZ of 4 500GB drives and get 1.5TB of space for VMs.

Perhaps I should be considering an LVM stripe of 2 ZFS 500x500 mirrors (stripe of mirrors, RAID10 style). Lose 500GB but gain years of use on the 4 drives?

Or am I overthinking this, and the parity in RAIDZ is not as bad as the parity in traditional RAID5, so I can just go RAIDZ and get N-1 space out of N SSDs without worry?
 
Well, let's think it through using normal RAID levels and keeping ZFS out of it. Let's say you're writing a 100 megabyte file:
  • With RAID1 each device is writing the whole file, so you'll have 100MB written per device.
  • With RAID0 the file is striped across all the devices, so each device gets 100MB / (# of drives); with your four drives that would be 25MB each.
  • With RAID5, each device gets 100MB / (# of drives minus one) = 33.3MB written: one third of the data to each of the three data drives, plus another 33.3MB of parity going to the remaining drive.
That's typical, right? Again, it's been a long time since I thought this all through but that's how I'm remembering it.

I'd say your RAIDZ writes will be better than if you were running a mirror.
 

Your example is so simple it makes me want to copy & paste it to a wiki somewhere. Makes perfect sense. For some reason I thought that with SSDs, TRIM, and parity there was some underlying big no-no about using SSDs in a parity RAID. I'm overthinking things now...
 
Short take: I would call that "not the whole story", but most likely (you say you have light use) you won't notice any issue for several years... I know, sorry, blah blah incoming-

Rest of the story and other important points-
ZFS already writes a transaction log of *synchronous data* (the ZIL) to the drives; then when you do RAIDZ you add RAID5/6-style parity on top of that, so one write command can end up generating something like 6 writes to the disks - yes, it works the disks harder. I hold this company in very high regard when it comes to ZFS (they are the backers of FreeNAS), VERY important info-
https://www.ixsystems.com/blog/o-slog-not-slog-best-configure-zfs-intent-log/

If you want to know how ZFS behaves with SSDs vs HDDs, read this: https://storagegaga.wordpress.com/2...-ssds-a-better-match-than-other-file-systems/ It does not cover the effects of RAIDZ; it is more generalized.

The main issue is synchronous writes to the ZIL (IMPORTANT: NOT ALL data is synchronous; async data skips the ZIL)-

On a full SSD array with good random IO, a separate ZIL typically does not make sense (caveat later), so yes, the parity is added wear on the SSDs, but this is where storage design comes into play (is that why you are here?)... Analyzing your activity to project how much IO you plan to put on your storage will aid you in buying the right stuff. Big servers that run big databases needing big speed run SSDs, and the designer buys the right SSD, rated for the write life they need. Your situation may not mandate a huge write life; many people may warn you the drives will die in a year, but they cannot accurately say that knowing nothing about your system - they may last 5-6 years, or 8-10, who knows. As noted by NewDude, the more drives you have, the more your load is spread out and the longer they live. A dedicated ZIL, on the other hand, is a single drive that all sync writes filter through before getting pushed to the array - there is no spreading that load, it gets crushed, and this is also why the ZIL should always be mirrored: a failure could cause some data loss.

Now, the CAVEAT to the SSD deal (know *your* data): how many of your applications do sync writes (e.g. SQL databases, NFS), and how good are the SSDs at doing sync writes?? They might be blazing fast in general but terrible at sync writes. While this article is aimed at Ceph, the same tests apply to ZFS, and it will tell you how well your SSD handles sync writes-
http://www.sebastien-han.fr/blog/20...-if-your-ssd-is-suitable-as-a-journal-device/

Or if you are lazy, google the model of your ssd and see if anyone else has done sync benchmarks.
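The test in that article boils down to small synchronous writes at queue depth 1, which is what a SLOG actually sees; with fio that is roughly the following (WARNING: pointing it at a raw device destroys the data on it, so use a spare drive or a file):

Code:
## 4k sync writes, queue depth 1 - /dev/sdX is a placeholder, pick a scratch device
fio --name=synctest --filename=/dev/sdX --rw=write --bs=4k \
    --ioengine=libaio --iodepth=1 --numjobs=1 --direct=1 --sync=1 \
    --runtime=60 --time_based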

A clear example of desktop vs enterprise SSD:
The Samsung 512GB 850 Pro is warrantied to be 100% filled, erased, and refilled 300 times in its life (150TBW).
The Intel P3700 400GB is designed to be 100% filled/refilled 18,250 times (about 7.3PBW, or 10 drive writes per day for 5 years).

Convert that to pounds and tell someone you can lift it: 300 lb is impressive - how about 18,250 lb??

That is a flippin' monumental difference!!! Not all enterprise drives are equal either, so a good storage designer picks an SSD whose characteristics meet the data's needs. If a drive has enormous read speed but bad sync write speed, don't use it for the ZIL - use it as a read cache instead.

There is a good case for proper SSDs here if you are thinking of a dedicated ZIL/SLOG, and the Samsung 960 Pro gets a dishonorable mention - they talk of it falling off the chart; see the graphs here-
https://www.servethehome.com/exploring-best-zfs-zil-slog-ssd-intel-optane-nand/

Why do we put up with this cr@p?? Because we care about our data: ZFS offers a higher level of data integrity and many other features.

Your comment about LVM on top of ZFS... don't do it. ZFS already does the RAID10 you described natively.

If you create a basic pool without specifying a RAID type, the default behavior of ZFS for any further vdevs added to that pool is to stripe them. So you make a mirror pool, then you add another mirror to that pool (it gets striped), and add another mirror if you like (also striped). Optimally you add all the stripes at once, but you can add a further stripe in the future; just be aware that if the 1st stripe set is already full, its data stays where it is and no further data will go there, so the bulk of the load will go to the new mirror pair. This is just a way of adding space, but it will be "lopsided" performance-wise.

Example:
Code:
zpool create -f -m /mnt/ssdarray ssdarray mirror /dev/sdc /dev/sdd ## "mirror" is the command to raid1 those 2 disks
zpool add ssdarray mirror /dev/sde /dev/sdf  ## adds a second mirror to be striped into the pool

You could do all of that in one command, but I like to break it down because I use by-id paths and it gets hard to read. Do not use sdc, sdd like my example; drives can get re-ordered, but the /dev/disk/by-id paths will not change - use ls -l /dev/disk/by-id to see the mapping to the sd names.

Lastly - know *your* data: what is it doing?? How fast, how much, how often? This is easy since you already have an established workload: check the SMART stats on the drives to find out how much data has been written to them - not how full they are, but how much has been written over their entire lifetime. Behind the 3ware card you can check like so:
https://www.cyberciti.biz/faq/unix-linux-freebsd-3w-9xxx-smartctl-check-hard-disk-command/

On a non-RAID card, just check the /dev/sdX device directly. Find the LBAs written, and check whether you have 512-byte or 4K sectors.
http://www.virten.net/2016/12/ssd-total-bytes-written-calculator/
http://www.jdgleaver.co.uk/blog/2014/05/23/samsung_ssds_reading_total_bytes_written_under_linux.html
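For drives behind the 3ware, smartctl needs the 3ware device syntax (the port number below is a placeholder; the linked article covers the details):

Code:
## the 9650SE shows up as /dev/twa0; -d 3ware,N selects the physical port
smartctl -a -d 3ware,0 /dev/twa0
## on a plain HBA it's just the block device
smartctl -a /dev/sda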
 
