which raid

dragonslayr

For a small budget.

This is for a single host.

I've been greatly disappointed in ZFS RAID 10; the FSYNCS/SECOND just doesn't cut it. At all!

My existing Proxmox host with mdadm RAID 1 = 999, which beats the crap out of it. (Tested while guests were running, one of them a mail server.)

An old PERC 5 RAID card = 3964 on another machine.

While I'm aware I could add two SSDs to the machine to improve performance, that just defeats the whole point of software RAID.

#####################################
So, while I have a strong preference for software RAID, I've never tried mdadm RAID 10, so I don't know what I'm getting into.

As for hardware RAID, I'd get a command line I'm not familiar with (which kinda sucks in a pinch), though I'm aware you can run a Windows VM and monitor the RAID from that... sigh.


I'm a bit sour at the moment as I just found out the machine I built has to be rebuilt.

I'd really like some suggestions, or just a pat on the back and someone telling me it'll all work out some day.. ha
 
Hi Dragonslayr,

ZFS is not built for speed; it is built for its vast number of great features like volume management, compression, deduplication, snapshots and online resizing. So please do not compare it to ext4/xfs on mdadm, it will always lose with respect to speed (syncs per second). Yet with compression, ZFS will outperform mdadm with respect to throughput. Whether you really need the fsync/sec depends on your workload.
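For example, compression is a one-liner to turn on, and you can check afterwards what it buys you (the pool name is just an example):

Code:
zfs set compression=lz4 rpool    # enable LZ4 compression on the pool
zfs get compressratio rpool      # see the compression ratio actually achieved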

Software RAID is officially not supported, so if you plan to buy a subscription, it will not be covered.

As for your PERC 5 RAID numbers, I do not know how much is write-cached, but my pure SSD (enterprise grade) machine with 6x 960 GB in RAID 10 yields fewer fsyncs than your result (write cache disabled):

Code:
root@proxmox4 ~ > pveperf
CPU BOGOMIPS:      91195.92
REGEX/SECOND:      693628
HD SIZE:           7.75 GB (/dev/mapper/pve-root)
BUFFERED READS:    376.89 MB/sec
AVERAGE SEEK TIME: 0.17 ms
FSYNCS/SECOND:     3492.56
DNS EXT:           16.99 ms
DNS INT:           6.96 ms

A DL380 G6 with a P410i and two 10k SAS drives (mirrored, no write cache) also yields only 42.62 fsync/sec. On two identical notebooks with an ordinary 2.5" HDD I achieve 32.62 fsync/sec with ext4 on one and 79.75 fsync/sec with ZFS (no ZIL, only L2ARC on SSD) on the other. So these high numbers are often "cheated" and not the real power of the hard disk. An ordinary hard disk can provide at most around 160 IOPS.

Please test your systems with a real hard disk benchmark like fio.
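For example, something along these lines measures synchronous write performance far more realistically than pveperf's fsync counter (the file path, size and runtime are only an illustration):

Code:
fio --name=synctest --filename=/var/lib/vz/fio-test.bin --size=1G \
    --bs=4k --rw=randwrite --ioengine=sync --fdatasync=1 \
    --runtime=60 --time_based --group_reporting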
 
While I'm aware I could add two SSDs to the machine to improve performance, that just defeats the whole point of software RAID.

You don't need two SSDs to drastically improve ZFS performance, you only need one. It will be significantly cheaper than a decent HW RAID card, and also much faster. A single 128 GB SSD connected via SATA3 (6 Gbps), which costs about 70-80 USD, will most likely change your entire perception of what your server is capable of. Partition it with gdisk so the first partition is 16 GB and the second is 100 GB. Leave the rest free for the drive's self-maintenance.

Then read this page:
https://pve.proxmox.com/wiki/Storage:_ZFS
You will find a section called "Create a new pool with Cache and Log on one Disk". Do what it says: the first 16 GB partition will be your ZIL (write cache), the second 100 GB will be your L2ARC (read cache).
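For reference, what that wiki section boils down to is roughly the following (the pool and device names are assumptions, adjust them to your system):

Code:
# after creating the two partitions with gdisk on the SSD (here /dev/sdb)
zpool add rpool log /dev/sdb1      # 16 GB partition becomes the ZIL / SLOG
zpool add rpool cache /dev/sdb2    # 100 GB partition becomes the L2ARC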

Start using your server and give it time, most likely a few days without a reboot. Writes that would previously block the entire server and cause a hiccup will speed up instantly thanks to the write cache, and your working set will get cached in the L2ARC within a few hours or days. My similar setup (4x 2 TB HDD ZFS RAID 10, 1x 256 GB Samsung 850 Pro SSD with a 20 GB ZIL and a 200 GB L2ARC) gave a 4-5x speedup in OLTP (heavily transactional database) benchmarks compared to the hard drives alone! ZFS is amazing, you can even use RAIDZ if you have an SSD for cache.

Forget pveperf, it's buggy on ZFS. If you really want to benchmark, use sysbench or fio. Also forget mdadm: if your server loses power at the wrong time, your entire software RAID will be inaccessible and you will most likely lose all your data (it has happened to us several times).
 
Probably you need two SSDs so you can put the ZIL in a RAID 1 config? As far as I understand, if your ZIL gets corrupted you are doomed (or do you just loose the most recent writes?).
Also, I've seen recommendations everywhere for DC (enterprise class) SSDs, and the Samsung 850 Pro is not among them (the Intel DC S3710 is, and Samsung's new SM863 line is, like the 845 DC Pro was).
If you can enlighten me I would be happy, thanks
 
Probably you need two SSDs so you can put the ZIL in a RAID 1 config? As far as I understand, if your ZIL gets corrupted you are doomed (or do you just loose the most recent writes?).
Also, I've seen recommendations everywhere for DC (enterprise class) SSDs, and the Samsung 850 Pro is not among them (the Intel DC S3710 is, and Samsung's new SM863 line is, like the 845 DC Pro was).
If you can enlighten me I would be happy, thanks

The things you keep repeating here belong to the same class of FUD (fear, uncertainty and doubt) as the supposed necessity of ECC RAM or of a BBU (battery-backed cache) on a RAID controller. You are not "doomed" if you lose your ZIL, you merely lose your last couple of seconds of synchronous writes (async writes go straight to disk from RAM). And since ZFS is CoW (copy-on-write), your filesystem will stay unharmed. For example, in a database config you would lose your last transaction or two, but your DB will be just fine. On the other hand, if any hardware fails at a bad time in a server running md RAID, you will likely lose all your data (it has happened to us many times).

The OP clearly stated that his setup is a low-budget, single-host config (probably for home use), so I can't fathom why on Earth you would start recommending enterprise-grade hardware in this thread.

Read more, comment less.
Also, it's spelled "lose"; "loose" means not firmly or tightly fixed in place (adjective) or to release (as a verb). :)
 
I am busy re-reading all these great replies for the 3rd time, and will read them again after that to be sure I've got it. To those replying here, I felt the need to post a very heartfelt THANK YOU!
I really mean it. Thank you so much!

Oh, I guess I do have one question already: if I use just one SSD and put the ZIL on it, and it dies, I understand the machine will not boot and I'll be in a world of hurt. Is this true?
 
You don't need two SSDs to drastically improve ZFS performance, you only need one. It will be significantly cheaper than a decent HW RAID card, and also much faster. A single 128 GB SSD connected via SATA3 (6 Gbps), which costs about 70-80 USD, will most likely change your entire perception of what your server is capable of. Partition it with gdisk so the first partition is 16 GB and the second is 100 GB. Leave the rest free for the drive's self-maintenance.

Then read this page:
https://pve.proxmox.com/wiki/Storage:_ZFS
You will find a section called "Create a new pool with Cache and Log on one Disk". Do what it says: the first 16 GB partition will be your ZIL (write cache), the second 100 GB will be your L2ARC (read cache).

Start using your server and give it time, most likely a few days without a reboot. Writes that would previously block the entire server and cause a hiccup will speed up instantly thanks to the write cache, and your working set will get cached in the L2ARC within a few hours or days. My similar setup (4x 2 TB HDD ZFS RAID 10, 1x 256 GB Samsung 850 Pro SSD with a 20 GB ZIL and a 200 GB L2ARC) gave a 4-5x speedup in OLTP (heavily transactional database) benchmarks compared to the hard drives alone! ZFS is amazing, you can even use RAIDZ if you have an SSD for cache.

Forget pveperf, it's buggy on ZFS. If you really want to benchmark, use sysbench or fio. Also forget mdadm: if your server loses power at the wrong time, your entire software RAID will be inaccessible and you will most likely lose all your data (it has happened to us several times).

1. Is there a problem if the SSD is partitioned with fdisk and not with gparted?
2. How much should the ZIL write cache be hit? On my server it does not go above 100-150 MB, and as far as I can see the ZIL increases I/O wait too much when it flushes that data into the main pool.
3. Do you have any recommendations or tuning for ZFS? My config is 2x 2 TB drives, with a 40 GB SSD log for the ZIL and another 150 GB for the L2ARC.
 
1. Is there a problem if the SSD is partitioned with fdisk and not with gparted?
2. How much should the ZIL write cache be hit? On my server it does not go above 100-150 MB, and as far as I can see the ZIL increases I/O wait too much when it flushes that data into the main pool.
3. Do you have any recommendations or tuning for ZFS? My config is 2x 2 TB drives, with a 40 GB SSD log for the ZIL and another 150 GB for the L2ARC.

I can answer the first question: use gdisk instead of fdisk for a GPT partition table.

Now, quit hijacking my thread. haha
 
1. Is there a problem if the SSD is partitioned with fdisk and not with gparted?

On some systems fdisk is limited to MBR partition tables only; gparted can write GPT.

2. How much should the ZIL write cache be hit? On my server it does not go above 100-150 MB, and as far as I can see the ZIL increases I/O wait too much when it flushes that data into the main pool.

The ZIL is hit on every synchronous write, not on asynchronous ones. The ZIL will wear out your SSD eventually, and it does not use TRIM.
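You can see the difference on your own pool with something like this (the paths are just an example; only the dsync run goes through the ZIL):

Code:
# asynchronous writes: collected in RAM and flushed in bulk, ZIL not involved
dd if=/dev/zero of=/rpool/data/async-test.bin bs=4k count=25000

# synchronous writes: every block waits for the ZIL / log device
dd if=/dev/zero of=/rpool/data/sync-test.bin bs=4k count=25000 oflag=dsync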

3. Do you have any recommendations or tuning for ZFS? My config is 2x 2 TB drives, with a 40 GB SSD log for the ZIL and another 150 GB for the L2ARC.

As I wrote earlier in other threads of yours: read up on how the L2ARC impacts the ARC. You will not speed things up further by throwing more L2ARC at it. The useful L2ARC size is bounded by the ARC size in your RAM, since every L2ARC entry needs a header in the ARC. If you exceed this limit, you will decrease your performance.
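On ZFS on Linux you can check and cap the ARC like this (the 8 GB value is only an example):

Code:
cat /sys/module/zfs/parameters/zfs_arc_max                               # current limit in bytes, 0 = default
echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf    # cap the ARC at 8 GB
# takes effect after updating the initramfs and rebooting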
 
Oh, I guess I do have one question already: if I use just one SSD and put the ZIL on it, and it dies, I understand the machine will not boot and I'll be in a world of hurt. Is this true?

I haven't tested this, but I doubt it; ZFS is quite resilient. If you are really interested, why not try it? Install Proxmox on your HDDs, add the ZIL later, then physically disconnect the SSD during operation.
Do tell us what happens!
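If you do test it, these are the commands you would reach for afterwards (the pool and device names are assumptions):

Code:
zpool status rpool               # shows the pool state and its log/cache devices
zpool remove rpool /dev/sdb1     # detach a dead (or test-disconnected) log device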

1. Is there a problem if the SSD is partitioned with fdisk and not with gparted?
2. How much should the ZIL write cache be hit? On my server it does not go above 100-150 MB, and as far as I can see the ZIL increases I/O wait too much when it flushes that data into the main pool.
3. Do you have any recommendations or tuning for ZFS? My config is 2x 2 TB drives, with a 40 GB SSD log for the ZIL and another 150 GB for the L2ARC.

1. gdisk = GPT. Everyone (even the Proxmox wiki) says to use GPT with ZFS. Why would you want to use fdisk? Seems idiotic.
2. I don't understand your question.
3. I have no idea what's going on in your system, you should find that out. A couple of tips:
- Your ZIL is probably unnecessarily big (you need about 10 seconds of maximum I/O throughput, so 10-20 GB should be more than enough).
- The L2ARC needs lots of RAM if you have many files (like with OpenVZ).

There are many tools to check ZIL and L2ARC usage. Read this post and its comments to find out about them:
http://constantin.glez.de/blog/2011/02/frequently-asked-questions-about-flash-memory-ssds-and-zfs
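On ZFS on Linux, two quick ways to look at this, assuming a pool named rpool:

Code:
zpool iostat -v rpool 5          # per-device I/O, including the log (ZIL) and cache (L2ARC) devices
grep -E '^(size|l2_size|l2_hits|l2_misses) ' /proc/spl/kstat/zfs/arcstats   # raw ARC / L2ARC counters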
 
It looks to me like the current fdisk (Proxmox, jessie) does know about GPT:

Code:
root@hypervisor:~# fdisk -l /dev/sdm


Disk /dev/sdm: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: AAB7CAFD-DF6C-0040-B568-9A583EDA516C


Device          Start        End    Sectors   Size Type
/dev/sdm1        2048 1953507327 1953505280 931.5G Solaris /usr & Apple ZFS
/dev/sdm9  1953507328 1953523711      16384     8M Solaris reserved 1


root@hypervisor:~# gdisk -l /dev/sdm
GPT fdisk (gdisk) version 0.8.10


Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present


Found valid GPT with protective MBR; using GPT.
Disk /dev/sdm: 1953525168 sectors, 931.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): AAB7CAFD-DF6C-0040-B568-9A583EDA516C
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 1953525134
Partitions will be aligned on 2048-sector boundaries
Total free space is 3437 sectors (1.7 MiB)


Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      1953507327   931.5 GiB   BF01  zfs
   9      1953507328      1953523711   8.0 MiB     BF07
 
Thanks for your kind words, especially at the end of your answer; it is a pleasure to have my concerns answered by you :)
I do read a lot, which is why I asked about your suggestions, since they seem to contradict what I've read elsewhere.
Low budget does not mean "I'm happy to lose all my data", and a small Intel DC S3500 does not cost too much if the other solutions are not reliable (though you seem to suggest they are; I take note).
I'm not a ZFS expert like you, and I ask for the right suggestions in order to save money, time and data, rather than "just trying it myself", when others can provide good (kind) advice.
 
As for hardware RAID, I'd get a command line I'm not familiar with (which kinda sucks in a pinch), though I'm aware you can run a Windows VM and monitor the RAID from that... sigh.

Not sure which hardware RAID card you are using, but if the 'Windows' software uses the LSI 'MegaRAID Storage Management' (MSM) software, this runs perfectly on Linux as well...
I've installed it onto the Proxmox VE host itself, using alien to convert the RPMs to deb packages.
Make sure 'vivaldiframeworkd' is started, and you can graphically manage the RAID arrays on your adapter.
I'm using an LSI 9260-16i and managing it via MobaXterm's built-in X server.
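Roughly, the conversion looks like this (the package file names are only illustrative, use the ones from the LSI download):

Code:
apt-get install alien
alien --to-deb --scripts MegaRAID_Storage_Manager-*.rpm   # illustrative file name
dpkg -i megaraid-storage-manager*.deb                     # illustrative file name
/etc/init.d/vivaldiframeworkd start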
 
Not sure which hardware RAID card you are using, but if the 'Windows' software uses the LSI 'MegaRAID Storage Management' (MSM) software, this runs perfectly on Linux as well...
I've installed it onto the Proxmox VE host itself, using alien to convert the RPMs to deb packages.
Make sure 'vivaldiframeworkd' is started, and you can graphically manage the RAID arrays on your adapter.
I'm using an LSI 9260-16i and managing it via MobaXterm's built-in X server.
Hi,
and if you use an Areca SAS RAID controller you get a web interface (on its own NIC) to configure the volumes (and the CLI is also easy).

In a post above gkovacs wrote that ZFS is much faster than a hardware RAID... I guess this depends strongly on the RAID controller / system / number of disks...

With an ARC-1882I + 8x SAS drives in RAID 10 I get the following performance:
Code:
pveperf /var/bareos/spool
CPU BOGOMIPS:      83806.08
REGEX/SECOND:      2739510
HD SIZE:           898.24 GB (/dev/sdj1)
BUFFERED READS:    1027.16 MB/sec
AVERAGE SEEK TIME: 4.89 ms
FSYNCS/SECOND:     8922.59
DNS EXT:           71.52 ms
DNS INT:           0.45 ms
Udo
 
