4 A.) ZFS Definition
The first thing people often get wrong is thinking that ZFS is just a software raid. It's way more than that. It's software raid, it's a volume manager like LVM, and it's even a filesystem. It's a complete enterprise-grade all-in-one package that manages everything from the individual disks down to single files and folders.
You really have to read some books, or at least several tutorials, to understand what it is doing and how it is doing it. It is very different from traditional raid or file systems, so don't make the mistake of thinking it will work like the other things you have used so far and are familiar with.
Maybe I should explain some common ZFS terms so you can follow the tutorial a bit better:
- Vdev:
Vdev is the short form of "virtual device" and means a single disk or a group of disks that are grouped together. So for example a single disk could be a vdev, a raidz1/raidz2 (aka raid5/raid6) of multiple disks could be a vdev, or a mirror of two or more disks could be a vdev.
All vdevs have one thing in common: no matter how many disks the vdev consists of, its IOPS performance won't be better than that of the single slowest disk in that vdev (mirrors are a partial exception, as reads can be spread across the mirror's disks).
So you can do a raidz1 (raid5) of 100 HDDs and get great throughput and a great data-to-parity ratio, but the IOPS performance will still be the same as a vdev that is just a single HDD. So think of a vdev as a single virtual device that can only do one thing at a time and needs to wait for all member disks to finish before the next operation can start.
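To make this a bit more concrete, here is a minimal sketch of creating a pool that consists of a single raidz1 vdev. The pool name "tank" and the disk names are just placeholders; on a real system you would normally use the stable /dev/disk/by-id/... names instead.
```
# Create a pool named "tank" consisting of one raidz1 vdev with three member disks.
# Disk names are placeholders; prefer /dev/disk/by-id/... paths on real hardware.
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc

# Show the pool layout: one raidz1 vdev containing the three disks.
zpool status tank
```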
- Stripe:
When you want more IOPS performance you will have to stripe multiple vdevs. You could for example stripe multiple mirror vdevs (aka raid1) to form a striped mirror (aka raid10). Striping vdevs adds up the capacity of each vdev, and the IOPS performance increases with the number of striped vdevs. So if you have 4 mirror vdevs of 2 disks each and stripe these 4 mirror vdevs together, you will get four times the IOPS performance, as work is split across all vdevs and done in parallel. But be aware that as soon as you lose a single complete vdev, the data on all vdevs is lost. So when you need IOPS performance it's better to have multiple small vdevs striped together than just a single big vdev. I wouldn't recommend it, but you could even stripe a mirror vdev (raid1) and a raidz1 vdev (raid5) to form something like a raid510 ;-).
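As a small sketch (placeholder pool and disk names again), this is how a striped mirror (raid10) of four disks would be created, and how a further mirror vdev could be striped into the pool later:
```
# Two mirror vdevs of two disks each, striped together (raid10).
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

# Stripe a third mirror vdev into the existing pool to add capacity and IOPS.
zpool add tank mirror /dev/sde /dev/sdf
```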
- Pool:
A pool is the biggest possible ZFS construct and can consist of a single vdev or of multiple vdevs that are striped together. It can't contain multiple vdevs that are not striped together. If you want multiple mirrors (raid1) but don't want a striped mirror (raid10), you will have to create multiple pools. All pools are completely independent.
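So if you wanted, say, two independent mirrors instead of one striped mirror, the sketch would look like this (pool and disk names are placeholders):
```
# Two separate pools, each with a single mirror vdev. They share nothing.
zpool create poolA mirror /dev/sda /dev/sdb
zpool create poolB mirror /dev/sdc /dev/sdd

# Lists both pools, each with its own independent capacity and health.
zpool list
```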
- Zvol:
A zvol is a volume. A block device. Think of it like an LV if you are familiar with LVM, or like a virtual disk. It can't store files or folders on its own, but you can format it with the filesystem of your choice and store files/folders on that filesystem. PVE uses these zvols to store the virtual disks of your VMs.
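Here is a small sketch of what that looks like on the command line. PVE creates and names these zvols itself when you add a virtual disk to a VM; the manual commands below (with placeholder pool and volume names) are just to illustrate the concept:
```
# Create a 32 GiB zvol. It appears as a block device, not as a mountable filesystem.
zfs create -V 32G tank/testvol

# The block device shows up under /dev/zvol/<pool>/<name> ...
ls -l /dev/zvol/tank/testvol

# ... and can be formatted with a filesystem of your choice, like a normal disk.
mkfs.ext4 /dev/zvol/tank/testvol
```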
- Volblocksize:
Every block device has a fixed block size that it works with. For HDDs this is called a sector, which nowadays usually is 4KB in size. That means no matter how small or big your data is, it has to be stored/read in full blocks that are a multiple of the block size. If you want to store 1KB of data on a HDD it will still consume the full 4KB, as a HDD knows nothing smaller than a single block. And when you want to store 42KB it will write 11 full blocks, so 44KB will be consumed to store it. What the sector size is for a HDD, the volblocksize is for a zvol. The bigger your volblocksize gets, the more capacity you will waste and the more performance you will lose when storing/accessing small amounts of data. Every zvol can use a different volblocksize, but it can only be set once at the creation of the zvol and not changed later. And when using a raidz1/raidz2/raidz3 vdev you will need to change it, because the default volblocksize of 8K is too small for that and you would otherwise lose a lot of capacity to padding overhead.
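As a sketch (placeholder names again), this is how a different volblocksize would be set when creating a zvol by hand. In PVE itself you don't create the zvols manually; the volblocksize for newly created VM disks is taken from the block size setting of the ZFS storage:
```
# volblocksize can only be set when the zvol is created, not changed afterwards.
zfs create -V 32G -o volblocksize=16k tank/testvol

# Verify which volblocksize the zvol ended up with.
zfs get volblocksize tank/testvol
```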
- Ashift:
The ashift is defined pool-wide at creation, can't be changed later, and is the smallest block size the pool can work with. Usually you want it to match the biggest sector size of all the disks the pool consists of. Let's say you have some HDDs that report a physical sector size of 512B and some that report a physical sector size of 4K. Then you usually want the ashift to correspond to 4K too, as anything smaller would cause massive read/write amplification on the disks that can't handle blocks smaller than 4K. But you can't just write ashift=4K. The ashift is noted as 2^X where you set the X. So if you want your pool to use a 512B block size you have to use an ashift of 9 (because 2^9 = 512). If you want a block size of 4K you need to write ashift=12 (because 2^12 = 4096), and so on.
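A minimal sketch with placeholder names; the PVE installer and GUI also let you choose the ashift when they create a pool for you:
```
# ashift=12 means 2^12 = 4096 byte blocks; it is fixed for the lifetime of the pool.
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

# Check which ashift the pool is actually using.
zpool get ashift tank
```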
- Dataset:
As I already mentioned, ZFS is also a filesystem. This is where datasets come into play. The root of the pool itself is also handled like a dataset, so you can store files and folders directly on it. Each dataset is its own filesystem, so don't think of them as normal folders, even if you can nest them like this: YourPool/FirstDataset/SecondDataset/ThirdDataset.
When PVE creates virtual disks for LXCs, it won't use zvols like it does for VMs; it will use datasets instead. The root filesystem PVE uses is also a dataset.
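A short sketch of that nesting (the pool and dataset names are just the placeholders from above):
```
# Each of these is its own filesystem with its own properties and mountpoint,
# even though they look like a folder hierarchy.
zfs create YourPool/FirstDataset
zfs create YourPool/FirstDataset/SecondDataset
zfs create YourPool/FirstDataset/SecondDataset/ThirdDataset

# List the pool and all datasets below it.
zfs list -r YourPool
```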
- Recordsize:
Everything a dataset stores is stored in records. The size of a record is dynamic: it will be a multiple of the ashift but never bigger than the recordsize. The default recordsize is 128K. So with an ashift of 12 (so 4K) and a recordsize of 128K, a record can be 4K, 8K, 16K, 32K, 64K or 128K. If you now want to save a 50K file it will be stored as a 64K record. If you want to store a 6K file it will create an 8K record. So it will always use the next bigger possible record size. With files that are bigger than the recordsize this is a bit different: when storing a 1M file it will create eight 128K records. So the recordsize is usually not as critical as the volblocksize for zvols, as it is quite versatile because of its dynamic nature.
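Unlike the volblocksize, the recordsize can also be changed later on an existing dataset; it then only applies to newly written records. A small sketch with the placeholder names from above:
```
# Set a 1M recordsize, e.g. for a dataset that mostly stores big sequential files.
zfs set recordsize=1M YourPool/FirstDataset

# Check the current recordsize of the dataset.
zfs get recordsize YourPool/FirstDataset
```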