Noob needs help: My zpool size is half what I expect.

brephil

Member
Apr 24, 2021
I have the following pool created:
Code:
root@pvenode01:~# zpool list -H nas_data_0
nas_data_0   21.8T   970G   20.9T   -   -   0%   4%   1.00x   ONLINE

It is made up of 2 striped raidz2 vdevs (8x 3TB drives):
Code:
root@pvenode01:~# zpool status -v  nas_data_0
  pool: nas_data_0
 state: ONLINE
  scan: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        nas_data_0     ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            sda        ONLINE       0     0     0
            sdb        ONLINE       0     0     0
            sdc        ONLINE       0     0     0
            sdd        ONLINE       0     0     0
          raidz2-1     ONLINE       0     0     0
            sde        ONLINE       0     0     0
            sdf        ONLINE       0     0     0
            sdg        ONLINE       0     0     0
            sdh        ONLINE       0     0     0
        logs
          mirror-2     ONLINE       0     0     0
            nvme2n1p1  ONLINE       0     0     0
            nvme3n1p1  ONLINE       0     0     0
        cache
          sdi          ONLINE       0     0     0
          sdj          ONLINE       0     0     0

errors: No known data errors


The problem is that when I add it to the "datacenter" as storage, it shows only 10.24 TiB, not 21.8 TiB.

Here is my /etc/pve/storage.cfg
Code:
zfspool: nas_data_0
        pool nas_data_0
        content images,rootdir
        mountpoint /nas_data_0
        sparse 1

What am I doing wrong?

pve version is 6.3-6
 
That's normal. zpool list only shows the raw size of all drives, not the usable size after subtracting the space for parity.
With your pool setup you can only use half of your drives at best, so only 12 of 24 TB are usable. And Proxmox is showing you TiB, not TB: 12 TB is only 10.91 TiB. You also always lose some space to the filesystem and so on, so 10.24 TiB is totally fine.
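If you want to double-check the math yourself, here is a rough sanity check (just a sketch of the arithmetic, not exact ZFS accounting):

Code:
# 2x raidz2 of 4x 3TB: 2 data disks per vdev -> 4 x 3 TB = 12 TB usable
# 12 TB (decimal) expressed in TiB (binary):
echo "scale=2; 12 * 10^12 / 2^40" | bc    # prints ~10.91
# Compare with what ZFS itself reports as usable space (after parity):
zfs list -o name,used,avail nas_data_0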

Also keep in mind that everything you write to a virtual HDD will waste extra space to padding (roughly 33% of each allocation) if you don't increase your volblocksize. So in reality you might only be able to store about 7.2 TiB on that pool, because those 7.2 TiB of data will consume the full 10.91 TiB due to the bad padding. Using datasets (and therefore LXCs) should be fine, because there the recordsize (128K default block size) is used and not the volblocksize (8K default block size).
You should also never completely fill up your pool. ZFS is a copy-on-write filesystem and always needs some free space to work with, or your pool gets fragmented or might even stop working (and ZFS has no defragmentation, so that's a real problem). Above 80% usage it gets slower, and above 90% it switches to panic mode. If you take that into account too, you can actually only use about 6.34 TiB right now.

Why aren't you just using a single raidz2 pool with 8 disks, or raidz3 with 9 disks if you want even more reliability? RAID never replaces a backup, so an unstriped raidz2 should be totally fine. In both cases (2x raidz2 of 4 drives or 1x raidz2 of 8 drives) you might lose the complete pool if 3 drives die at the same time. The only benefit of your setup is that, with luck, your pool might survive 2 to 4 failing drives.
Increasing the volblocksize should help against the bad padding, but you need to do that before creating your first VM, because it can't be changed for existing virtual disks later.
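If you do that via Proxmox, the volblocksize for newly created zvols can be set per storage in /etc/pve/storage.cfg with the blocksize option (just a sketch; 16k is only an illustrative value, pick whatever fits your raidz layout):

Code:
zfspool: nas_data_0
        pool nas_data_0
        content images,rootdir
        mountpoint /nas_data_0
        sparse 1
        blocksize 16k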
 
Thanks for the response! This helps... wow 33%...

I have a chassis with space for 14 HDDs and 2 SSDs for read cache. I only have 8 HDDs now, but want to expand later. I do not know if ZFS can do that yet (expand a raidz2); it seemed to be planned 2 years ago, but is it real yet? If it can, then I will just go with raidz2... otherwise my plan was to bring in sets of raidz2, or maybe raidz1? I'm experimenting at this point, so rebuilding everything over and over is OK.

BTW, this zpool is to be exposed to a TrueNAS VM and other VMs with critical data requirements, so this volblocksize is important. Thank you.
 
I do not know if ZFS can do that yet (expand a raidz2); it seemed to be planned 2 years ago, but is it real yet?
You can't expand a raidz vdev with more drives without rebuilding it (and deleting it first). The only options here are to replace the smaller drives with bigger drives one after another, or to stripe another set of disks into the pool.
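Both growth paths look roughly like this (just a sketch; device names are placeholders):

Code:
# Option 1: replace each drive with a bigger one, one at a time, letting each resilver finish
zpool set autoexpand=on nas_data_0
zpool replace nas_data_0 sda /dev/disk/by-id/BIGGER_DISK_1
# ...repeat for every disk in the vdev; the extra space becomes usable once all are replaced

# Option 2: stripe another raidz2 vdev into the existing pool
zpool add nas_data_0 raidz2 sdk sdl sdm sdn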
If it can, then I will just go with raidz2... otherwise my plan was to bring in sets of raidz2, or maybe raidz1?
Raidz1 would be bad if you have critical data requirements.
BTW, this zpool is to be exposed to a TrueNAS VM and other VMs with critical data requirements, so this volblocksize is important. Thank you.
If you want a TrueNAS VM you should attach all the drives to an HBA card and PCI passthrough that HBA, with all drives attached to it, to the TrueNAS VM. That way the VM can access the drives directly, without an additional virtualization layer in between that could cause problems and overhead/write amplification. You would then create the pool inside the TrueNAS VM and use NFS/iSCSI/SMB so that other VMs can access it.
It's not possible to directly use a ZFS pool inside a VM if you create that pool on your host. That only works with LXCs.
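An HBA passthrough boils down to something like this (a sketch; the VM ID and PCI address are placeholders, and IOMMU/VT-d has to be enabled first):

Code:
# Find the HBA's PCI address
lspci | grep -i -e sas -e hba
# Pass the whole HBA (and every disk attached to it) through to VM 100
qm set 100 -hostpci0 0000:03:00.0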
 
Thanks, I thought about a passthrough, but my L2/write cache is not attached to the HBA. Can I pass through a partition from my NVMe devices?
 
Thanks, I thought about a passthrough, but my L2/write cache is not attached to the HBA. Can I pass through a partition from my NVMe devices?
You can use "qm set" to pseudo pasthrough a partition to a VM but in that case there is a virtual virtioSCSI controller in between so I think you wouldn't got the NVMe performance. And I don't know it is possible at all to PCI passthough a NVMe SSD.
 
It is made up of 2 striped raidz2 vdevs (8x 3TB drives):
A basic question...
Doesn't that work like a raidz4? xD

What I mean is, a stripe of (4 HDDs in raidz2) x2 sounds pretty stupid to me.
Because you achieve the same with a pure mirror, without the CPU calculation overhead...
Or you do a raid10, which does the same but gives you more write speed...

Or you do a raidz2, but just one across all 8 HDDs, and you get more storage?

Or did I miss something? xD

Edit: Ah, I just saw that Denuin asked the same, so ignore me xD
 
Raidz2 + raidz2, not as a mirror but striped. I was experimenting with that to learn, for when I want to add more drives, how best to add to an existing raidz2, to grow the pool basically. As far as I know you cannot just add disks to a raidz2.

So I'm just practicing to see, if I do a raidz2 with 8 drives, how best to add another raidz to it sometime later.
 
Doesn't that work like a raidz4? xD

What I mean is, a stripe of (4 HDDs in raidz2) x2 sounds pretty stupid to me.
Because you achieve the same with a pure mirror, without the CPU calculation overhead...

Or you do a raid10, which does the same but gives you more write speed...
The point is that in any raid10-style pool you might lose the complete pool as soon as the second disk dies. It makes no difference whether you use 2 or 14 drives in mirrors: in the worst case the complete pool is lost if 2 disks of the same mirror die. And it's not uncommon for a second disk to die while resilvering, because resilvering really stresses the disks.

I still don't think that 2x raidz2 striped is a good option. You said you only have 14 slots, so you could only do one expansion to 12 drives later. And even with 12 drives you would only be able to use 18TB. With that setup 2 (worst case) to 4/6 (best case) disks may die.

If you just create a raidz2 of 8 drives you can directly use 18TB, so you don't need to expand at all, because you get that 18TB with just 8 drives. With this setup any 2 disks (worst and best case) may die.

I think a good option for reliability and storage would be a single raidz3 of 7 drives. You would be able to use 12TB, any 3 drives may fail, and you could stripe another raidz3 of 7 disks in later. As 2x raidz3 of 7 disks striped you would get 24TB of usable storage, and any 3 drives (worst case) up to 6 drives (best case) may die.

And if you need performance and reliability but capacity isn't the big problem, you could try something like a raid10 but with mirrors of 3 disks instead of 2. That way any 2 disks of a mirror may die, so the reliability is like raidz2, but it should be way faster because you get more stripes and don't need all the complex parity calculations. Keep in mind that raidz1/2/3 isn't recommended as VM storage in general, because it is too slow and because you would need to increase the volblocksize quite a lot.
With such a setup you could use 4x mirrors of 3 disks to get 12TB of usable storage, and 2 (worst case) to 8 (best case) disks may die.
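That triple-mirror layout would be created roughly like this (just a sketch; pool and device names are placeholders):

Code:
# 4 stripes of 3-way mirrors = 12 drives, 12TB usable with 3TB disks; any 2 disks per mirror may fail
zpool create tank mirror sda sdb sdc mirror sdd sde sdf mirror sdg sdh sdi mirror sdj sdk sdl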
 
OK, here is what I'm planning for this... I will be giving the Media dataset to the TrueNAS VM; the rest can go to other VMs for other purposes. Any problems/suggestions? Please note: I have another zpool for the VMs' OS disks (NVMe, 1x 1TB mirror), so this pool is just for large data sets that need good usable space, decent resiliency (2-drive failure out of 14) and good read speed (12x read). I'm hoping the write cache can help improve write performance, but that is not super important here.

Code:
Zpool Configuration
use ashift=13 (future proof)

L2ARC
Use stripe of 2 x 400 GB SSDs

ZIL
Use Mirror of 2 x 29.8 GB NVME

Array
RAIDz2 14 X 3 TB HDD

Zpool Name  = dpool01

Total RAW = 42 TB
Total Usable = 36 TB (max use should be < 80% at  28.8TB)

Datasets:
Media        
- Compression On
- recordsize = 1M (better for movies)
- quota = 15 TB

Security
- Compression On
- quota = 3 TB

Data
- Compression ON
- quota = 9.8 TB

Home
- Compression ON
- quota = 1 TB

Notes on Guest VM using these:
- Only sync writes make use of the ZIL/SLOG
- To force this, set the VM disk write cache to writethrough (or directsync; BTW, what is the difference?)
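For reference, creating that planned pool would look roughly like this (just a sketch; device names are placeholders, double-check them against your system before running anything):

Code:
zpool create -o ashift=13 dpool01 raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn
zpool add dpool01 log mirror nvme2n1p1 nvme3n1p1
zpool add dpool01 cache sdo sdp
zfs create -o compression=on -o recordsize=1M -o quota=15T dpool01/Media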
 
OK, here is what I'm planning for this... I will be giving the Media dataset to the TrueNAS VM; the rest can go to other VMs for other purposes. Any problems/suggestions?
Like I already said, you can't pass datasets through to VMs. If you want to access a dataset from a VM, your host needs to work as a NAS and share the dataset using SMB/NFS/iSCSI. In that case you don't need a TrueNAS VM at all, but you have to manage the pools, replication, auto-snapshots, shares and so on manually using the CLI, because Proxmox's GUI can't do that. If you want a TrueNAS VM to work as your NAS, you need to bring the unformatted disks into the VM and create a ZFS pool inside that VM.
- To force this, set the VM disk write cache to writethrough (or directsync; BTW, what is the difference?)
Do you want to force every write to be a sync write so that it gets cached on the SLOG? As far as I know that will make the pool even slower.
 
The point is that in any raid10-style pool you might lose the complete pool as soon as the second disk dies. It makes no difference whether you use 2 or 14 drives in mirrors: in the worst case the complete pool is lost if 2 disks of the same mirror die. And it's not uncommon for a second disk to die while resilvering, because resilvering really stresses the disks.

I still don't think that 2x raidz2 striped is a good option. You said you only have 14 slots, so you could only do one expansion to 12 drives later. And even with 12 drives you would only be able to use 18TB. With that setup 2 (worst case) to 4/6 (best case) disks may die.

If you just create a raidz2 of 8 drives you can directly use 18TB, so you don't need to expand at all, because you get that 18TB with just 8 drives. With this setup any 2 disks (worst and best case) may die.

I think a good option for reliability and storage would be a single raidz3 of 7 drives. You would be able to use 12TB, any 3 drives may fail, and you could stripe another raidz3 of 7 disks in later. As 2x raidz3 of 7 disks striped you would get 24TB of usable storage, and any 3 drives (worst case) up to 6 drives (best case) may die.

And if you need performance and reliability but capacity isn't the big problem, you could try something like a raid10 but with mirrors of 3 disks instead of 2. That way any 2 disks of a mirror may die, so the reliability is like raidz2, but it should be way faster because you get more stripes and don't need all the complex parity calculations. Keep in mind that raidz1/2/3 isn't recommended as VM storage in general, because it is too slow and because you would need to increase the volblocksize quite a lot.
With such a setup you could use 4x mirrors of 3 disks to get 12TB of usable storage, and 2 (worst case) to 8 (best case) disks may die.
raid10 with 8 disks = (4-disk raid0 + 4-disk raid0) mirrored
You're right that 2 drive failures can kill it.
But it has to be the exact corresponding HDD on both mirrors. Imagine the HDDs are numbered:
1/2/3/4 + 1/2/3/4
So HDD 1 and HDD 1 need to die on both mirrors... if HDD 1 dies on mirror 1 and HDD 2 on the other mirror, HDDs 1 & 2 are still available in the whole raid10.
If 2 die on the same mirror, it's not a problem either.

So the chance that exactly the 2 corresponding drives (out of 8) fail at the same time, or let's say within the same month before you replace one of them, is in my opinion extremely unlikely xD
But once we start talking about what happens if 3/4 drives die, the risk that corresponding drives fail gets much, much higher.

But realistically: one drive failing, and another drive dying while you are exchanging it, is definitely possible; 2 or 3 drives failing in that window is very unlikely.

That's why I think raid10 is still one of the best solutions if you have to balance speed, usable space and safety.

Raidz2 gives much more space and lets any 2 drives fail; raidz3 even gives you the option to ignore a failed drive for a long time.
But don't forget that performance drops heavily when a raidz1/2/3 has a failed drive, so you will want to replace it ASAP anyway.
What I mean is: in the end you run a much slower storage (especially for writes), with higher latency (CPU parity calculations), for the feeling that any 3 drives can fail, while raid10 gives you almost the same with much more speed, but admittedly less usable storage.

Tbh, if those are 8 HDDs I would personally definitely go for a raid10... but if they were 8 SSDs, a raidz2 (or at most raidz3) would probably be the better solution. And I keep stressing performance because HDDs are extremely slow for almost any VM usage today, except as storage for static files like Samba shares/backups/etc... xD

However, the discussion could go on endlessly. I totally agree that every RAID level has pros and cons and that it depends heavily on personal priorities. I just had a very bad experience with the performance and overhead of raidz on HDDs.

Cheers
 
Like I already said, you can't pass datasets through to VMs. If you want to access a dataset from a VM, your host needs to work as a NAS and share the dataset using SMB/NFS/iSCSI. In that case you don't need a TrueNAS VM at all, but you have to manage the pools, replication, auto-snapshots, shares and so on manually using the CLI, because Proxmox's GUI can't do that. If you want a TrueNAS VM to work as your NAS, you need to bring the unformatted disks into the VM and create a ZFS pool inside that VM.

Do you want to force every write to be a sync write so that it gets cached on the SLOG? As far as I know that will make the pool even slower.

Thanks for the help... yes maybe it would help to clarify.

I will be passing "dpool01/Media" as a large disk to the TrueNAS VM (VirtIO), and "dpool01/Security" to my Zoneminder VM, again as a large disk. The others probably as iSCSI or, more likely, NFS. Only Media is going to TrueNAS. I like some of the cloud sync options, it makes things easy to set up in the UI, and SMB is a snap to set up.

Technically I could mount it locally (on the host) and probably pass the directory through too, but I think that would not be a good idea so I did not investigate that option.

Here is the storage.cfg...

Code:
zfspool: media-zfs
        pool dpool01/Media
        content rootdir,images
        mountpoint /dpool01/Media
        sparse 1
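For reference, attaching a big disk from that storage to the TrueNAS VM could look something like this (just a sketch; the VM ID and size are placeholders):

Code:
# Allocate a new ~15T VirtIO disk (size given in GiB) for VM 100 on the media-zfs storage
qm set 100 -virtio1 media-zfs:15000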

As for the write cache setting on these (I too wanted to keep writeback enabled), this passage from the ZFS 101 article is what steered me away from that:

Adding a LOG vdev to a pool absolutely cannot and will not directly improve asynchronous write performance—even if you force all writes into the ZIL using zfs set sync=always, they still get committed to main storage in TXGs in the same way and at the same pace they would have without the LOG. The only direct performance improvements are for synchronous write latency (since the LOG's greater speed enables the sync call to return faster).

However, in an environment that already requires lots of sync writes, a LOG vdev can indirectly accelerate asynchronous writes and uncached reads as well. Offloading ZIL writes to a separate LOG vdev means less contention for IOPS on primary storage, thereby increasing performance for all reads and writes to some degree.




So, I took this to mean I really needed the VM to use write-through (to force sync writes) on that disk in order to make use of the SLOG vdev I created in the backing zpool. Either way, I am seeing good use of the NVMe mirror as I type this, while I move data into the dpool01/Media dataset (via the disk exposed to the VM). The writes are limited by my network, so I need to do some more tests to be sure.
 
Thanks for the help... yes maybe it would help to clarify.

I will be passing "dpool01/Media" as a large disk to the TrueNAS VM (VirtIO), and "dpool01/Security" to my Zoneminder VM, again as a large disk. The others probably as iSCSI or, more likely, NFS. Only Media is going to TrueNAS. I like some of the cloud sync options, it makes things easy to set up in the UI, and SMB is a snap to set up.
That would be an option. But keep in mind that you will get a lot of write amplification, because you will be running ZFS inside ZFS.
With PCI passthrough and a TrueNAS VM it looks like this:
HDDs <- HBA <- ZFS (inside VM) <- dataset <- data

Without PCI passthrough, with the TrueNAS VM using virtual HDDs, it looks like this:
HDDs <- HBA <- ZFS (on host) <- zvol <- virtio SCSI <- ZFS (inside VM) <- dataset <- data
Adding a LOG vdev to a pool absolutely cannot and will not directly improve asynchronous write performance—even if you force all writes into the ZIL using zfs set sync=always, they still get committed to main storage in TXGs in the same way and at the same pace they would have without the LOG. The only direct performance improvements are for synchronous write latency (since the LOG's greater speed enables the sync call to return faster).

However, in an environment that already requires lots of sync writes, a LOG vdev can indirectly accelerate asynchronous writes and uncached reads as well. Offloading ZIL writes to a separate LOG vdev means less contention for IOPS on primary storage, thereby increasing performance for all reads and writes to some degree.


So, I took this to mean I really needed the VM to use write-through (to force sync writes) on that disk in order to make use of the SLOG vdev I created in the backing zpool. Either way, I am seeing good use of the NVMe mirror as I type this, while I move data into the dpool01/Media dataset (via the disk exposed to the VM). The writes are limited by my network, so I need to do some more tests to be sure.
As far as I understand it, that text tells you that a SLOG won't directly help with async writes. It helps if you have workloads with lots of sync writes (which you possibly don't, if you don't run DBs and mainly write files over NFS/SMB, which use async writes), and in that specific case it can indirectly help async writes too: with a SLOG, sync writes no longer have to hit the main pool twice (ZIL plus the regular write), so the HDDs are less busy and the async writes are slowed down a bit less.
So a SLOG is pretty much useless if your workload doesn't have a lot of sync writes. And as far as I know it's a bad idea to force fast async writes to be handled as slow sync writes just so they get cached; that should only be useful if you want more data safety and don't consider async writes safe enough.
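If you still want to experiment with that, the ZFS side can be toggled and observed per dataset (just a sketch; dataset/pool names are the ones from this thread, and sync=always generally makes things slower):

Code:
# Check the current sync policy (default is "standard")
zfs get sync dpool01/Media
# Force every write on this dataset to be treated as a sync write
zfs set sync=always dpool01/Media
# Watch whether the SLOG mirror actually receives writes
zpool iostat -v dpool01 5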
 
