Need advice on ZFS / BTRFS Disk allocation within proxmox

Q-wulf

Well-Known Member
Mar 3, 2013
613
38
48
my test location
Some Background:
  • This is going to be my 6th home-lab Proxmox node.
  • I have a 3-node Proxmox+Ceph cluster that houses every critical service I operate.
  • I have a single-node 24-disk Proxmox+Ceph "cluster" I use for media storage, backups, and surveillance; basically a giant 60 TB node providing all kinds of network storage.
  • I have a single-node dual-boot Windows/Proxmox node I use for gaming (and still hope to eventually use as a passthrough gaming + media-encoding station).
Now, I recently got my hands on a surplus server and I'd like to use it for
  1. backup services, mostly slave (master-slave) VMs on ZFS
  2. tertiary backup space (the 5th copy in 3-2-1) using KVM + Rockstor on BTRFS
  3. getting to know ZFS better
  4. some general tinkering and experimenting

To this end I have filled this node with as many spare disks as I had floating around:
  1. 58.69 GB SSD - OS-Drive
  2. 58.69 GB SSD
  3. 74.53 GB
  4. 74.53 GB
  5. 149.05 GB
  6. 232.89 GB
  7. 232.89 GB
  8. 232.88 GB
  9. 298.09 GB
  10. 298.09 GB
  11. 465.76 GB
  12. 465.76 GB
  13. 465.76 GB
  14. 465.76 GB
  15. 931.51 GB
  16. 1820.00 GB
Disks 3-10 do about 70-90 MB/s
Disks 11-14 do some 150-160 MB/s
Disk 15 does around 180 MB/s
Disk 16 does 70 MB/s
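A quick back-of-the-envelope on what those disks would yield in a RAID-Z2 (a sketch only; sizes rounded down to whole GB, ignoring ZFS metadata and slop space):

```shell
# RAID-Z2 usable capacity is roughly (N - 2) * smallest disk in the vdev.
n1=5; small1=232   # disks 6-10, smallest ~232.88 GB
n2=4; small2=465   # disks 11-14, all ~465.76 GB
echo "disks 6-10:  $(( (n1 - 2) * small1 )) GB usable"
echo "disks 11-14: $(( (n2 - 2) * small2 )) GB usable"
```

This prints roughly 696 GB for the five-disk group and 930 GB for the four-disk group.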


Now, I have devoured most of the guides I could find on ZFS, even the ones about mirroring ZFS striped pools on top of partitions to build yourself a sort of RAID-60 on top of partitions. I know I can always replace a disk with a same-sized or larger one. I also know that features like

Question 1: How would you distribute the disks to
  • get the most performance out of ZFS
  • get the most storage capacity out of Rockstor

I was thinking of using
  1. either 5 disks (6-10, ~696 GB) or 4 disks (11-14, ~930 GB) as a RAID-Z2
  2. disks 3, 4, 5 via stripe as a ZIL
  3. disk 2 as an L2ARC
  4. disks 15 and 16, plus the set not used for RAID-Z2 (preferably disks 11-14), for Rockstor
Is this sound, or a bad idea?
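Spelled out as zpool commands, that layout would look something like this (a dry-run sketch that only prints the commands; the sdX device names are placeholders, and note that an unmirrored, striped log vdev means losing a log disk can cost in-flight sync writes):

```shell
# Print, rather than execute, the pool-creation commands for layout 1.-3.
cat <<'EOF'
# RAID-Z2 over disks 6-10 (placeholder device names)
zpool create -o ashift=12 tank raidz2 /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
# striped (non-redundant!) log devices, disks 3-5
zpool add tank log /dev/sdc /dev/sdd /dev/sde
# L2ARC on the spare SSD, disk 2
zpool add tank cache /dev/sdb
EOF
```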

Question 2: for Rockstor/BTRFS, would you pass the physical disks through, or put it on virtual disks (VirtIO, raw), assuming I'm using mirroring inside Rockstor?
 

alexskysilk

Renowned Member
Oct 16, 2015
868
112
68
Chatsworth, CA
www.skysilk.com
I want to help, but I really don't understand what you are trying to accomplish. Let's leave off the technology for the moment and address that: I get the sense that you just want to play with the software/hardware stacks, in which case any of your suggested configs is valid.

If there is an actual problem you want to solve, let's define it.

1. Why do you have a single-node 24-disk Proxmox+Ceph "cluster"? Wouldn't this be far more efficient/faster if you just had a FreeNAS box (hell, any Linux box)? Ceph has a lot of overhead that you're not gaining any benefit from.
2. What are your active storage needs, and how much is archival? By needs I don't mean your XviD collection from the 90s, I mean your pr0n ;)
3. Do you pay for electricity/AC? Do you have a place to house all that equipment that doesn't make you and your loved ones deaf?

As for your ZFS config, I would suggest adding 32 GB of RAM to your server and leaving off the ZIL/L2ARC drives. Their benefit is minuscule compared to RAM, and they can cause pain.
 

Q-wulf

As for your ZFS config, I would suggest adding 32 GB of RAM to your server and leaving off the ZIL/L2ARC drives. Their benefit is minuscule compared to RAM, and they can cause pain.

I have 64 GB ECC installed at the moment. That said, should I use the SSD as a log device then, or not use it at all for the ZFS pool? I guess I can scrap the log stripe then too and use it for Rockstor.

If there is an actual problem you want to solve, lets define it.

There are 2 "problems":
The basic question was the ZFS setup, given what drives I have available at present (it's only going to get better once I replace smaller disks in the 24-disk server with bigger ones and use those for this setup; eventually, probably, hopefully). I never gave ZFS much thought beyond a FreeNAS-based NAS, or as a Proxmox OS install option with Proxmox 4.

The other one is whether it is better to pass the actual hard disks to a BTRFS-based VM, or use a raw virtual disk.
I have no godly idea what the performance implications are here.




Why do you have a "single-Node 24-Disk Proxmox+ceph "Cluster"? wouldnt this be way more efficient/faster if you just had a freenas box (hell, any linux box)?
Please let me point out that I work with Ceph at datacenter scale for a living (you can see this by looking at my posts).
This is for home-lab and inner-nerd usage, and to accommodate the storage needs of three generations at my house.

Let me answer the "why" regardless:

Because using FreeNAS is really, really inefficient and not as tolerant to drive failures as the single-node Ceph cluster is. On top of that, I cannot bring up VMs on that machine if I ever feel like it. I have some 4 TB drives and some 1 TB drives. I can easily go to 8 or 10 TB drives if I need more storage capacity, by simply replacing the small ones with bigger ones and having Ceph do some backfilling (which is really quick locally).
I run a Ceph erasure-coded pool with K=20, M=4 at the moment for not-so-important data. That's an overhead of 20%, compared to 400% raw usage on a 4-way replica. AFAIK ZFS can only do 3 parity disks.
I also run a K=16, M=8 EC pool where I store backups from my 3-node cluster (that's SSD-only, btw).
I have at present a 6x Samsung 950 Pro NVMe SSD cache tier (replication 3) sitting in front of the EC pools ("at present" because I have 10 of them in total and keep cannibalizing some for other experiments).
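For clarity, the raw-space math behind that EC-versus-replica comparison works out as follows (a quick sketch; whether you call a 4-way replica "400%" or "300%" depends on whether you count the first copy as overhead):

```shell
k=20; m=4
# raw bytes stored per 100 bytes of data, plus parity overhead beyond the data itself
echo "EC ${k}+${m}: $(( (k + m) * 100 / k ))% raw usage, $(( m * 100 / k ))% parity overhead"
echo "4-way replica: 400% raw usage, 300% overhead beyond the data"
```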

I'm currently using FreeNAS and some older OpenFiler-based VMs to provide NFS/SMB services, but I'll soon move to Rockstor (as it's more lightweight and better suited to sitting on top of Ceph).

What are you active storage needs, and how much is archival? by needs I dont mean your xvid collection from the 90s, I mean your pr0n ;)

I use about 45 TB at present.
  • around 500 GB is primary backups
  • some 2 TB is really, really important backup storage for files/documents I have accumulated over 45 years
  • around 10 TB is media from my home surveillance system
  • about 5 TB of lossless music digitized from my and my dad's vinyls
  • about 24 TB of movies / TV series that I digitized in different versions (720p / 1080p / lossless Blu-ray / some 4K releases from when I was in Singapore), all in uncompressed formats
  • 3.5 TB of "home movies" my dad shot on the F-104 / F-4
do you have a place to house all that equipment that doesn't make you and your loved ones deaf?
That is called a server room. Alternatively, you can stick them in the cellar if you have one. The noise is really low, though; it's not like you need massive ventilation power once you discover the chimney effect. I could probably watch TV in the same room or hold a discussion at a regular volume.

The "server room" is fed from the house AC, which is fed from the adsorption chiller, which is solar-powered.

Do you pay for electricity/AC?
It's called solar power, backed up by nuclear energy at night and in bad weather; that makes for a really cheap combination.
 

alexskysilk

I have 64 GB ECC installed at the moment. That said, should I use the SSD as a log device then, or not use it at all for the ZFS pool? I guess I can scrap the log stripe then too and use it for Rockstor.

For ZFS, yes (don't use the SSD). For Rockstor: I don't have any experience with btrfs, so I can't say.

The basic question was the ZFS setup, given what drives I have available at present (it's only going to get better once I replace smaller disks in the 24-disk server with bigger ones and use those for this setup; eventually, probably, hopefully).

This is probably not an ideal setup for performance purposes, but it is usable. If you group like disks into separate vdevs in your pool, you'd probably get the best overall effect. It is true that you can replace disks in place with larger ones in a given vdev; just be aware that the vdev's overall size will always be a multiple of its smallest disk. Also, rebuilds take time and leave you with reduced fault tolerance for the duration. If this fits your use case, I guess it does work.

The other one is whether it is better to pass the actual hard disks to a BTRFS-based VM, or use a raw virtual disk.

Generally speaking, it's better to keep your storage separate from your compute resources, especially if it is used for things other than hypervisor storage. If you don't need hyperconvergence for space or power reasons, you're better off using one device for storage and the others for compute. As for performance, if you must run your storage with a VM head, use whatever your hardware allows; if your host supports SR-IOV, use it.

Because using FreeNAS is really, really inefficient and not as tolerant to drive failures as the single-node Ceph cluster is.

I have no idea what you mean by that.

On top of that, I cannot bring up VMs on that machine if I ever feel like it. I have some 4 TB drives and some 1 TB drives. I can easily go to 8 or 10 TB drives if I need more storage capacity, by simply replacing the small ones with bigger ones and having Ceph do some backfilling (which is really quick locally).

The use cases for VM storage and a general-purpose NAS are completely different. Erasure-coded Ceph pools have terrible performance for VM storage, and it gets worse the slower the individual OSDs are (e.g. 3.5" drives have IOPS in the <100 range). Using the same OSDs for your VMs and your GP data would necessarily reduce the performance available to your VMs. Why would you want to mix these use cases up, especially since you have so many resources you can bring to bear? Use your fastest drives for VMs, and your big drives for NAS.
 

Q-wulf

The use cases for VM storage and a general-purpose NAS are completely different. Erasure-coded Ceph pools have terrible performance for VM storage, and it gets worse the slower the individual OSDs are (e.g. 3.5" drives have IOPS in the <100 range). Using the same OSDs for your VMs and your GP data would necessarily reduce the performance available to your VMs. Why would you want to mix these use cases up, especially since you have so many resources you can bring to bear? Use your fastest drives for VMs, and your big drives for NAS.


Yes and no.

Yes, VM storage and a general-purpose NAS are completely different, and they typically do not mix.
No, because once you use a cache tier you can have both speed and large capacity on the same hardware.

Erasure-coded Ceph pools have terrible performance for VM storage, and it gets worse the slower the individual OSDs are (e.g. 3.5" drives have IOPS in the <100 range).

True, with an erasure-coded pool I'm dedicating some CPU cycles to the parity calculations. If I had a multi-node cluster, latency would also play into it (right now it doesn't). But that only happens for COLD objects, objects that would sit in cold storage anyway. The upshot is that I use a cache-tier pool in readforward mode. That is basically 2 TB of usable caching capacity (a replication-3 pool) currently sitting on 6x SSDs capable of 300K/100K IOPS.

Why would I not take advantage of this?
Every object that comes in gets written to the cache first, then over time migrates down to the EC pool once it is no longer "hot". The same goes for VM storage; the difference is that any VM data not considered "hot" gets moved to EC. Should that cold data need to be read, it is read straight from the EC pool instead of first migrating back to hot storage.


It's about better utilising the space of the cache you have AND getting the smallest overhead.

I'm not even sure why you would NOT use this option.

As for performance, if you must run your storage with a VM head, use whatever your hardware allows; if your host supports SR-IOV, use it.
- I have the option to pass through an HBA, which with certain LSI controllers can lead to data loss (it came up on the forums again yesterday)
- I have the option to pass through a single disk, like this: "scsi5: /dev/disk/by-id/scsi-SATA_ST5000VN000-1H4_Z111111"
- I have the option to use a directory / LVM on a locally mounted/available disk

I'd like to know which is fastest with Proxmox AND does not have a 99.5% chance of leading to data loss (like HBA passthrough).
 

alexskysilk

But that only happens for COLD objects, objects that would sit in cold storage anyway. The upshot is that I use a cache-tier pool in readforward mode. That is basically 2 TB of usable caching capacity (a replication-3 pool) currently sitting on 6x SSDs capable of 300K/100K IOPS.

Why would I not take advantage of this?

I can't answer that question. If I read you right, I question why you aren't just running your hot storage on your "caching" pool.

Every object that comes in gets written to the cache first, then over time migrates down to the EC pool once it is no longer "hot". The same goes for VM storage; the difference is that any VM data not considered "hot" gets moved to EC. Should that cold data need to be read, it is read straight from the EC pool instead of first migrating back to hot storage.

Makes sense, except that from what I gathered of your described use case, none of this actually comes into play. Cache doesn't solve IO bottlenecks; it smooths out spikes at best and merely hides them at worst. If your cache is sufficient to hide the performance drop of your EC pool, you're not using your pool, just your cache (see above comment).

Its about better utilising the space of your Cache you have AND getting the smallest overhead.

/shrug. You're trying to fix problems you don't have. Is your time really worthless?

I'm not even sure why you would NOT use this option.

I'm going to assume this is an honest question. You wouldn't, because it creates complexity and ties multiple functions to the same failure point without actually buying you anything; you can achieve the same thing with LVM+bcache or ZFS. I have not benchmarked Ceph in this role because it never occurred to me to try; my guess is that it would perform worse (Ceph has much more overhead), even running as a local file system (which is what you're doing on a single node). Don't get me wrong, I don't really want to talk you out of it; I'm just telling you why I wouldn't.

I'd like to know which is fastest with Proxmox AND does not have a 99.5% chance of leading to data loss (like HBA passthrough).

No clue. I never tried doing this.
 

Q-wulf

I can't answer that question
is your time really worthless?
I'm going to assume this is an honest question.
I have not benchmarked ceph in this role
my guess is that it would perform worse with Ceph

This made me wonder if I've just kept feeding a specific species of internet user.

Cache doesn't solve IO bottlenecks; it smooths out spikes at best and merely hides them at worst. If your cache is sufficient to hide the performance drop of your EC pool, you're not using your pool, just your cache (see above comment).

True, for cache, that is.
But a Ceph cache tier is not just cache; it is also tiered storage that is automatically utilised based on the rules you set up, without ever needing to manually mark what you consider hot or cold data.

In my case I have set it up to keep all data on the cache tier for 72 hours.
If in those 72 hours an object has not been accessed 3 times, it is considered "cold" and marked for demotion.
Once my cache tier hits 0.6 (60%, roughly 1.2 TB) capacity, it will start slowly moving cold objects down to the EC pool for long-term storage.
Once it is at 0.8 (80%), it will start aggressively making room, starting with the coldest objects first.
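For the curious, thresholds like these map onto Ceph cache-tier pool settings roughly as follows (a dry-run sketch that only prints the commands; "hot-pool" is a placeholder name and the exact values are illustrative, not my literal config):

```shell
# Print, rather than execute, the cache-tier tuning commands.
cat <<'EOF'
# keep objects in the cache tier for at least 72 hours (259200 s)
ceph osd pool set hot-pool cache_min_flush_age 259200
ceph osd pool set hot-pool cache_min_evict_age 259200
# require recent hits before an object is treated as "hot" on reads
ceph osd pool set hot-pool min_read_recency_for_promote 3
# start flushing cold objects at 60% full, evict aggressively from 80%
ceph osd pool set hot-pool cache_target_dirty_ratio 0.6
ceph osd pool set hot-pool cache_target_full_ratio 0.8
EOF
```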

Now why would I do this? Because the cache tier is an order of magnitude faster than the EC pool, but the EC pool can store data with less overhead, and with better protection against concurrent drive failures (yes, those happen) on a single pool, than ZFS can offer.


The LVM+bcache or ZFS suggestion is somewhat beside the point, because it does not matter whether I use a single-node Ceph, single-node ZFS, or single-node LVM+bcache setup: if the node fails, I cannot use it until I fix it.
The only difference is that with Ceph I can take a hit of multiple drives (>3) AND actually compensate for a failed RAID controller while my pool stays accessible.



But let's be honest with ourselves:
I want to help, but I really don't understand what you are trying to accomplish.

The only usable advice on my ACTUAL questions was that I should lose the SSDs for ZIL/L2ARC and trust in RAM.
The rest was about stuff that had no actual relation to the topic, to the questions clearly specified in the first post (marked in bold and underlined), or even to the SYSTEM I was looking for help with.

Instead it was used to go off on a tangent about how there could never be any use case for personal storage other than storing pornography:
What are your active storage needs, and how much is archival? By needs I don't mean your XviD collection from the 90s, I mean your pr0n


As such I will now refrain from answering off-topic questions in this thread and leave you with this:
Have a nice day / week / month / life, and thank you for your on-topic suggestions.

kind regards,
Q-Wulf
 

Q-wulf

Question still remaining open:

What is the best option for assigning storage space to a virtual machine running BTRFS and used 100% as backup space?

- I have the option to pass through an HBA, which with certain LSI controllers can lead to data loss (it came up on the forums again yesterday)
- I have the option to pass through a single disk, like this: "scsi5: /dev/disk/by-id/scsi-SATA_ST5000VN000-1H4_Z111111"
- I have the option to use a directory / LVM on a locally mounted/available disk

I'd like to know which is fastest (read/write performance) with Proxmox, does not have a 99.5% chance of leading to data loss (like HBA passthrough), and would ideally still let me access my drives should my node become unrecoverable for whatever reason (data recovery from/of vDisks).
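For reference, option 2 (whole-disk passthrough) is configured from the Proxmox CLI like this (a sketch that only prints the command; VMID 100 is an example, the disk ID is the one from my post above):

```shell
# Print, rather than execute, the disk-passthrough command.
cat <<'EOF'
qm set 100 -scsi5 /dev/disk/by-id/scsi-SATA_ST5000VN000-1H4_Z111111
EOF
```

Passing the disk by its stable /dev/disk/by-id path (rather than /dev/sdX) keeps the mapping intact if drive letters shuffle on reboot.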
 
