[SOLVED] Best Practice ZFS/SW Raid HDD

PBS is installed as a VM with 2 virtual disks (sda+sdb), stored as files on the PVE host, which holds the raid5.
When you access files inside the datastore (virtual PBS sda), it's like an fio benchmark with random read+write requests.
When you access files on the raid5 directly (datastore moved to the raid5 and mounted inside PBS), it's like an fio benchmark with sequential read+write.
Which of those fio tests gives better results?!
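For illustration, the two access patterns expressed as fio jobs - paths, sizes and the read/write mix are placeholders, not settings from this thread:
# random 4k read+write, roughly how access through the virtual PBS disk behaves
fio --name=randrw --filename=/datastore/fio-test --size=4G --bs=4k --rw=randrw --rwmixread=70 --ioengine=libaio --direct=1 --iodepth=16 --runtime=60 --time_based
# sequential 1M read+write, roughly how direct access on the raid5 behaves
fio --name=seqrw --filename=/mnt/raid5/fio-test --size=4G --bs=1M --rw=rw --rwmixread=70 --ioengine=libaio --direct=1 --iodepth=16 --runtime=60 --time_based
On HDDs the sequential job will report far higher throughput, which is the point of the comparison.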
 
This answers question b. Do you perhaps also have an answer to the rest? :)
Thank you!
Just to be sure that I understand you right:
a) virtual memory: you mean the swap partition? My memory in use is about 20%. The rest is free.
b) Should removing the virtual disk layer bring a benefit? I could also switch directly to PVE + PBS on bare metal. I'm only using the VM because I don't know whether I can install both in parallel where it is installed now.
d) Yes, I read that raidz1 is about as slow as a single disk. So am I right that this is slower than a raid5? Right now I have 5 datastores (on the same raid5). If I create a vdev for each, should I get better performance? Is the space allocated to such a vdev thin provisioned? Sorry, I'm a total newbie with ZFS.
I have the option to switch the server for 30€/month more and get 2x 1TB NVMe extra for use as a special device.
 
Re a) No, swap should be configured (as a partition, in case it's ever needed), but I always disable it because when it's really needed the hardware config isn't sufficient anyway. By virtual memory I mean the kernel settings under /proc/sys/vm.
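For illustration (placeholder commands, not from this thread), checking and then disabling swap looks roughly like this:
free -h                # memory and swap actually in use
swapon --show          # active swap devices/partitions
sysctl vm.swappiness   # one of the /proc/sys/vm knobs such tuning touches
swapoff -a             # disable all swap at runtime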
Re b) Removing the virtual disk layer is a big benefit. PBS on bare metal is the best option, but then it needs its own "/" or even its own local raid store, which would mean something like turning this current PVE host into a PBS host (= one virtualization host lost in the environment). And please, always try to separate OS and data - right now you have a ~45TB "/" partition on the PVE host and you cannot switch the data to another filesystem without reinstalling the host.
Prepare: cp -a /usr/lib/tuned/throughput-performance /etc/tuned/.
What does "systemctl status tuned" say?
If it's not running, do "systemctl enable tuned --now".
What does "tuned-adm active" say?
If the answer isn't throughput-performance, do "tuned-adm profile throughput-performance" and check with "tuned-adm active" again.
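Put together, the whole sequence is roughly:
cp -a /usr/lib/tuned/throughput-performance /etc/tuned/.   # local copy of the profile for later customization
systemctl enable tuned --now                               # make sure the service is running
tuned-adm profile throughput-performance                   # activate the profile
tuned-adm active                                           # should now answer throughput-performance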
 
Okay, thank you. So I think I have to reinstall the server and may directly switch to the setup with NVMe for the boot partition (raid1) and use 4x 22TB as raid10 with ZFS.
I'd use a partition of the NVMe (0.3% of the HDD storage) as special device and will move the VM to another node, so that I have only PBS installed here. Sounds like a good plan?
Do I still need tuned?
 
Reinstalling bare metal PBS is the best option. Whether that's best with zfs is a matter of opinion, but if you also plan to use the zfs special device option I would plan for 5% of the available storage, so in case of 4x 22TB as a "mirror in 2 vdevs" it's 44TB/20 -> 2x 2TB nvme as a mirror - separate from the boot raid1 disks.
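A rough sketch of such a pool layout - pool name and device paths are hypothetical, adjust to the real hardware:
zpool create -o ashift=12 backup \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd \
  special mirror /dev/nvme2n1 /dev/nvme3n1
That gives the two data mirror vdevs (raid10 equivalent) plus a mirrored special vdev in one step.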
Tuned is just prepared as a service if you installed it, with none of the mentioned configuration changes applied yet; and while those would tune virtual memory, scheduler, mdraid etc., none of that would be applied by the kernel to zfs anyway, as it goes unused there.
When you are done with your new server setup, post your results whether you are happy then or not, since after that you still have the option to switch away from zfs again.
You've got a lot to do now, wish you good luck.
 
Dear Waltar,
thank you! I'll start with it, but I have no chance to add an extra boot device for PBS; the server configuration is fixed by the provider. Why 5%? Everywhere I read about 0.3% for the metadata cache.
I'll post the results so others with the same question can see them as well.
 
A ZFS special device can be configured to additionally store small files up to 16MB (which is effectively all files when the recordsize is 16MB as well) next to the dataset metadata. The default is to not store any small files on the special device, and then 0.5% is common, but if the functionality is there it will likely get used at some point, so I suggest 5% just to be able to do so.
The important part: if a special device is used and fails while not configured as a mirror, all data in the pool is lost as well.
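The relevant knob is the special_small_blocks property; a minimal sketch with hypothetical pool/dataset names:
zfs get special_small_blocks backup/datastore        # default 0 = only metadata goes to the special vdev
zfs set special_small_blocks=64K backup/datastore    # additionally store blocks up to 64K there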
 
It's one of the few backup solutions that makes heavy use of deduplication instead of more classic approaches like differential backups. With those differential or full backups the IO is pretty sequential but deduplication causes stuff to be scattered across the disk with lots of references causing random IO.

See the explanations from staff members, like for example this one:
The storage on the PBS is another factor. Since backups are stored in many chunks, the data is spread out randomly. Then you might have multiple clients accessing the data and some maintenance tasks at the same time. Therefore, random IO is something to look out for. This is the reason why we recommend to use SSDs if possible. They provide much better random IO performance than HDDs.
 
I wonder how relevant that actually is in practice rather than in theory; I am admittedly a light user of PBS, but in my case (approx. 40 VMs across 4 hosts) I'm able to reach a high percentage of wire speed to a relatively slow raidset of spinners in a single raid6 volume... in any case, thanks for making me aware.
 
It gets especially bad when a scrub is running in parallel. Not that unusual here that a weekly re-verify task will then run for 10 hours to verify a fraction of the 2TB datastore on a 4-HDD raidz1 + special device SSD mirror. Don't want to know how slow that would be scrubbing and re-verifying a 45TB datastore.
 
Anybody out there who has a zfs pool with a special device configured and >=64 cores, and would be so kind as to run a metadata-only test with elbencho?
elbencho is a new kind of benchmark tool for testing parallel filesystems like beegfs, lustre and so on, while being easier to use than mdtest, ior etc. for evaluating I/O performance. The tool's designer is beegfs developer Sven Breuner, https://github.com/breuner/elbencho, and it's a ~6MB tar.gz with static libs included.
R730xd, 2x E5-2683v4, 9x 8TB hdd @ H730P raid5 + special 3x 3TB nvme raid1 by mdadm; all inodes are on the raid1 and all data is in so-called extents on the raid5. You can lose 2 nvme; lose all 3 nvme and the data in the raid5 is gone too (or do a raid1 of 4 nvme). Write and rm of files isn't at its best as all the hw is quite old by now:
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md125 65630846272 15373372600 50257473672 24% /hxfs
elbencho -r -w -d -t 64 -n 64 -N 3200 -s 0 --lat -F -D /hxfs/test/nix # elbencho -h for help on options
OPERATION RESULT TYPE FIRST DONE LAST DONE
========= ================ ========== =========
MKDIRS Elapsed time : 1ms 120ms
Dirs/s : 48152 34080
Dirs total : 86 4096
Dirs latency : [ min=12us avg=1.20ms max=3.88ms ]
---
WRITE Elapsed time : 4m31.561s 5m50.821s
Files/s : 36970 37361
Files total : 10039852 13107200
Files latency : [ min=13us avg=1.63ms max=15.8ms ]
---
READ Elapsed time : 12.576s 12.640s
Files/s : 1038515 1036903
Files total : 13061407 13107200
Files latency : [ min=2us avg=59us max=62.8ms ]
---
RMFILES Elapsed time : 7m3.876s 9m7.189s
Files/s : 23571 23953
Files total : 9991588 13107200
Files latency : [ min=12us avg=2.56ms max=125ms ]
---
RMDIRS Elapsed time : 3.433s 4.526s
Dirs/s : 901 904
Dirs total : 3094 4096
Dirs latency : [ min=1.20ms avg=65.7ms max=151ms ]
---
Thanks in advance.
PS: It could be run on Ceph as well, if anyone has one :)
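On a zfs pool the equivalent run would look roughly like this (pool/path names are placeholders), with an export/import in between so the read phase cannot be answered from the ARC:
elbencho -w -d -t 64 -n 64 -N 3200 -s 0 --lat /tank/test/nix       # create the dirs and write the empty files
zpool export tank && zpool import tank                             # drop the ARC between write and read
elbencho -r -t 64 -n 64 -N 3200 -s 0 --lat -F -D /tank/test/nix    # read, then remove files and dirs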
 
It gets especially bad when a scrub is running in parallel. Not that unusual here that a weekly re-verify task will then run for 10 hours to verify a fraction of the 2TB datastore on a 4-HDD raidz1 + special device SSD mirror. Don't want to know how slow that would be scrubbing and re-verifying a 45TB datastore.
For scrub and resilvering there is a breakeven point between zfs (mirror/raidz) and conventional raid at around 25-30% usage of the zfs pool or the non-zfs raidset. Below that, zfs is faster; above it, a conventional raidset is faster because it rebuilds sequentially over all disks regardless of whether there is no filesystem on it, an empty filesystem, or one full of data - it makes no difference then. Running I/O during a scrub/resilver/rebuild slows it down, on a conventional raid by up to 15%. Recently had a 16TB hdd resilver in a raidz2 (4 vdevs of 6 disks) at 32% full and it took 32h while serving I/O.
A zfs draid is completely different with its erasure coding and has no counterpart in local raid controllers, but erasure coding is available in external storage systems, e.g. from NetApp as the E-series and from DDN as the SFA series and appliances (= Lustre included in VMs); both also do block checksums and work with virtual spares.
Btw, on a Dell ME4084 with 84x 8TB hdd (670TB) a scrub over all 84 used disks runs in around 57-59h (every weekend, Friday 6pm until Monday 5am, varying with the backup jobs on Friday and Saturday evening). It had 6x 14 disks in adapt mode, which uses virtual spares like draid, but ME4/5 storage has no checksum support. And since the firmware handles scrubs poorly on ME storage - in automatic mode they run endlessly for weeks - that should be disabled and scrubs started manually via cron from the host, doing 2 sets at a time, 3 times over the 6 sets, to get through all of them.
 
As in the bench above "READ" was answered by the fs cache, here it is done in 2 steps with the cache emptied in between, so that READ must come from the special raid1:
elbencho -w -d -t 64 -n 64 -N 3200 -s 0 --lat /hxfs/test/nix
OPERATION RESULT TYPE FIRST DONE LAST DONE
========= ================ ========== =========
MKDIRS Elapsed time : 1ms 116ms
Dirs/s : 48448 35268
Dirs total : 64 4096
Dirs latency : [ min=11us avg=1.12ms max=3.55ms ]
---
WRITE Elapsed time : 5m19.774s 5m52.207s # as it's 3nvme raid1 it's 39.3M files for the kernel/mdadm
Files/s : 37136 37214
Files total : 11875349 13107200
Files latency : [ min=12us avg=1.68ms max=51.3ms ]
---
echo 3 > /proc/sys/vm/drop_caches # empty fs cache (or reboot), for zfs do zpool export ... && zpool import ...

elbencho -r -t 64 -n 64 -N 3200 -s 0 --lat -F -D /hxfs/test/nix
OPERATION RESULT TYPE FIRST DONE LAST DONE
========= ================ ========== =========
READ Elapsed time : 14.154s 15.863s
Files/s : 845048 826274
Files total : 11961593 13107200
Files latency : [ min=5us avg=72us max=8.62ms ]
---
RMFILES Elapsed time : 7m2.491s 7m42.899s # as it's 3nvme raid1 it's 39.3M files for the kernel/mdadm
Files/s : 27819 28315
Files total : 11753581 13107200
Files latency : [ min=12us avg=2.21ms max=21.6ms ]
---
RMDIRS Elapsed time : 3.753s 4.428s
Dirs/s : 912 925
Dirs total : 3424 4096
Dirs latency : [ min=1.55ms avg=66.5ms max=162ms ]
---
 
@waltar @Dunuin just to be sure: when I use this special device for metadata in PVE as well, do I get any benefit? Because the VM disks have their own metadata?!
 
No, I don't think so; a special device is useful if you have millions of files to access, not just 100 .raw or .qcow2 files.
 
Mmh, I'm not sure, we don't use any zvols at all to this day. With a zvol you use a "virtual device" and I think it's then again like one file and probably won't help ... but this depends on the zfs code internals for block management. As we don't use it I have no experience there, sorry. But nevertheless, if using zfs you will probably do better running on a dataset instead of using zvols; take a read on these (rough sketch below the links):
https://jrs-s.net/2018/03/13/zvol-vs-qcow2-with-kvm/
https://jrs-s.net/2016/06/16/psa-snapshots-are-better-than-zvols/
https://jrs-s.net/2017/03/15/zfs-clones-probably-not-what-you-really-want/
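For what it's worth, a minimal sketch of the dataset approach on the PVE side - storage IDs and dataset names are hypothetical:
zfs create rpool/vmdata
pvesm add dir vmdata-dir --path /rpool/vmdata --content images      # qcow2/raw files living on the dataset
pvesm add zfspool vmdata-zvol --pool rpool/vmdata --content images  # the zvol-based alternative, for comparison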
 
