ZFS unable to import on boot/unable to add VD on empty zfs pool...

jaceqp

May 28, 2018
Hi there...
I've installed PM on my test lab:
- 4-core Xeon
- 32GB DDR3 ECC
- 2x600GB SAS (hardware PCI-E RAID1) - boot drive with PM itself + VM boot VDs...
- 6x1TB 2.5" mixed SATA drives (IT mode via the motherboard's integrated SATA ports)

My goal was to launch a VM NAS of some sort...

I've installed PM on the HW RAID volume, along with an OpenMediaVault VM. Then in the web GUI I created a ZFS pool (RAIDZ aka RAID5) using the 6x1TB SATA drives. On the ZFS pool I created another VD as a dedicated storage drive for OMV.
What made me wonder was a warning during a PM reboot:

systemctl:
Code:
  zfs-import-cache.service                                                                         loaded active     exited    Import ZFS pools by cache file
● zfs-import@ZFSQNAP.service                                                                       loaded failed     failed    Import ZFS pool ZFSQNAP
  zfs-mount.service                                                                                loaded active     exited    Mount ZFS filesystems
  zfs-share.service                                                                                loaded active     exited    ZFS file system shares
  zfs-volume-wait.service                                                                          loaded active     exited    Wait for ZFS Volume (zvol) links in /dev
  zfs-zed.service                                                                                  loaded active     running   ZFS Event Daemon (zed)

ZFS seems to look ok:
Code:
root@PROXTEMP:~# zfs list
NAME      USED  AVAIL     REFER  MOUNTPOINT
ZFSQNAP  2.79M  4.22T      153K  /ZFSQNAP
root@PROXTEMP:~# zpool status
  pool: ZFSQNAP
 state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        ZFSQNAP                                       ONLINE       0     0     0
          raidz1-0                                    ONLINE       0     0     0
            ata-ST1000LM035-1RK172_WL1969PY           ONLINE       0     0     0
            ata-WDC_WD10SPZX-08Z10_WD-WX61AA8DRJ83    ONLINE       0     0     0
            ata-ST1000LM035-1RK172_WL1XHFF3           ONLINE       0     0     0
            ata-ST1000LM035-1RK172_WL1LD5Q6           ONLINE       0     0     0
            ata-WDC_WD10JPVX-60JC3T1_WD-WXC1A38D5T75  ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD100_96PNT6KDT          ONLINE       0     0     0

errors: No known data errors

So after adding a ca. 4TB VD to OMV, I created a share and started a file transfer... It crashed after 30GB had been transferred, with a notice that there's not enough space...
After that I couldn't create any VD (not even a tiny one) on ZFS.

Code:
Aug 12 12:36:08 PROXTEMP pvedaemon[7136]: VM 100 creating disks failed
Aug 12 12:36:08 PROXTEMP pvedaemon[7136]: zfs error: cannot create 'ZFSQNAP/vm-100-disk-0': out of space
Aug 12 12:36:08 PROXTEMP pvedaemon[1990]: <root@pam> end task UPID:PROXTEMP:00001BE0:00035613:6114F997:qmconfig:100:root@pam: zfs error: cannot create 'ZFSQNAP/vm-100-disk-0': out of space

In the end I was forced to remove the VD and create a new one, but I still got the out-of-space error...

That was yesterday. Today (without a reboot) I can create VDs normally, so... any ideas?
ZFS Pool and all disks seem to be healthy...

Code:
root@PROXTEMP:~# zfs get written
NAME                   PROPERTY  VALUE    SOURCE
ZFSQNAP                written   153K     -
ZFSQNAP/vm-100-disk-0  written   89.5K    -
root@PROXTEMP:~# zpool status
  pool: ZFSQNAP
 state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        ZFSQNAP                                       ONLINE       0     0     0
          raidz1-0                                    ONLINE       0     0     0
            ata-ST1000LM035-1RK172_WL1969PY           ONLINE       0     0     0
            ata-WDC_WD10SPZX-08Z10_WD-WX61AA8DRJ83    ONLINE       0     0     0
            ata-ST1000LM035-1RK172_WL1XHFF3           ONLINE       0     0     0
            ata-ST1000LM035-1RK172_WL1LD5Q6           ONLINE       0     0     0
            ata-WDC_WD10JPVX-60JC3T1_WD-WXC1A38D5T75  ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD100_96PNT6KDT          ONLINE       0     0     0

Some other disk info:
Code:
root@PROXTEMP:~# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   558G  0 disk
├─sda1                         8:1    0  1007K  0 part
├─sda2                         8:2    0   512M  0 part
└─sda3                         8:3    0 557.5G  0 part
  ├─pve-swap                 253:0    0    16G  0 lvm  [SWAP]
  ├─pve-root                 253:1    0    30G  0 lvm  /
  ├─pve-data_tmeta           253:2    0     5G  0 lvm
  │ └─pve-data-tpool         253:4    0 485.6G  0 lvm
  │   ├─pve-data             253:5    0 485.6G  1 lvm
  │   ├─pve-vm--100--disk--0 253:6    0    30G  0 lvm
  │   ├─pve-vm--101--disk--0 253:7    0    30G  0 lvm
  │   └─pve-vm--102--disk--0 253:8    0    60G  0 lvm
  └─pve-data_tdata           253:3    0 485.6G  0 lvm
    └─pve-data-tpool         253:4    0 485.6G  0 lvm
      ├─pve-data             253:5    0 485.6G  1 lvm
      ├─pve-vm--100--disk--0 253:6    0    30G  0 lvm
      ├─pve-vm--101--disk--0 253:7    0    30G  0 lvm
      └─pve-vm--102--disk--0 253:8    0    60G  0 lvm
sdb                            8:16   0 931.5G  0 disk
├─sdb1                         8:17   0 931.5G  0 part
└─sdb9                         8:25   0     8M  0 part
sdc                            8:32   0 931.5G  0 disk
├─sdc1                         8:33   0 931.5G  0 part
└─sdc9                         8:41   0     8M  0 part
sdd                            8:48   0 931.5G  0 disk
├─sdd1                         8:49   0 931.5G  0 part
└─sdd9                         8:57   0     8M  0 part
sde                            8:64   0 931.5G  0 disk
├─sde1                         8:65   0 931.5G  0 part
└─sde9                         8:73   0     8M  0 part
sdf                            8:80   0 931.5G  0 disk
├─sdf1                         8:81   0 931.5G  0 part
└─sdf9                         8:89   0     8M  0 part
sdg                            8:96   0 931.5G  0 disk
├─sdg1                         8:97   0 931.5G  0 part
└─sdg9                         8:105  0     8M  0 part
zd0                          230:0    0   3.9T  0 disk

root@PROXTEMP:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   16G     0   16G   0% /dev
tmpfs                 3.2G  1.1M  3.2G   1% /run
/dev/mapper/pve-root   30G  8.4G   20G  30% /
tmpfs                  16G   43M   16G   1% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
ZFSQNAP               4.3T  256K  4.3T   1% /ZFSQNAP
/dev/fuse             128M   16K  128M   1% /etc/pve
tmpfs                 3.2G     0  3.2G   0% /run/user/0
 
If you use 6x 1TB as raidz1 and don't increase the volblocksize, you only get 3TB of total storage because you are losing 1TB to parity and 2TB to padding. So you can't create a 4TB zvol because you only have 2.4TB of usable storage (10-20% should always be kept unused).
That padding overhead is indirect. ZFS will tell you that you have 5TB of storage, but everything will be 66% bigger due to padding, so storing a 3TB zvol will need 5TB of storage.
 
...so storing a 3TB zvol will need 5TB of storage.
Uhm. That seems quite inefficient for usable storage. I mean 2.6TB usable out of 6TB while tolerating just a single drive failure... Or am I missing something crucial when creating the ZFS pool initially? Both the online ZFS calculators and the PM web GUI show I should have over 4TB usable?
In that case I might consider another 'cheapo' RAID controller and use pure RAID5 instead... My primary RAID controller on this one is a Fujitsu-branded D2516-C11 with an external BBU, bought for ca. 7-8 USD.
https://manualzz.com/doc/o/klvva/modular-raid-controllers--d2516-
Not sure about the compatibility of all these laptop-grade disks with this one...
 
You don't use ZFS just because you want raid. ZFS should be used if you want raid and also want to be sure that the data won't get corrupted. And because you want CoW, a self-healing filesystem, block-level compression, deduplication, replication, snapshots and so on. All stuff that a HW raid controller can't do.

With ZFS and raid cards it is basically the same as with ECC and non-ECC RAM. ZFS is basically the ECC for your disks. ZFS won't be as fast as a HW raid and will eat a lot of RAM and CPU performance, but without it, corruption on your disks can't be detected or corrected. So not using ZFS (or similar filesystems like Btrfs, Ceph, ...) is like not using ECC RAM. It might be cheaper and faster, but you are sacrificing system stability and data integrity.

And like I already said, you don't need to lose 3 of 6TB. You just need to set up your ZFS right. ZFS doesn't just work well out of the box. You need to learn how to use it and not just keep using the defaults. You can calculate the right volblocksize yourself or look at this table. ZFS will tell you that you can use 5 of 6TB, and that is basically true. It's true, for example, if you use a dataset. Datasets default to a 128K recordsize, and there you get no padding overhead and can write 5TB to your pool, because 128K is high enough for most raidz pools. But zvols will use the default 8K volblocksize and not the 128K recordsize, and the more drives your raidz consists of, the higher you need to choose the volblocksize, or you get a lot of padding overhead wasting space. You would need a volblocksize of at least 32K if you use 6 disks as raidz1 with an ashift of 12. Look here:
4K/8K volblocksize = 50% of raw capacity lost (17% parity + 33% padding)
16K volblocksize = 33% raw capacity lost (17% parity + 16% padding)
32K/64K/128K volblocksize = 20% raw capacity lost (17% parity + 3% padding)
256K volblocksize = 18% raw capacity lost (17% parity + 1% padding)
512K/1024K volblocksize = 17% raw capacity lost (17% parity)
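
The percentages above can be reproduced with a small sketch of how a raidz allocation is sized, as far as I understand it: a block is split into 4K sectors (ashift=12), parity sectors are added per stripe row, and the total is rounded up to a multiple of (parity + 1) sectors. The function names are made up for illustration, and this is an approximation rather than the actual ZFS code.

```python
import math

def raidz_sectors(block_bytes, ndisks, nparity=1, ashift=12):
    """Return (data_sectors, allocated_sectors) for one block on a raidz vdev."""
    sector = 1 << ashift                         # ashift=12 -> 4096-byte sectors
    data = math.ceil(block_bytes / sector)       # sectors holding actual data
    rows = math.ceil(data / (ndisks - nparity))  # stripe rows the block spans
    parity = nparity * rows                      # nparity parity sectors per row
    multiple = nparity + 1                       # allocation is padded to this multiple
    allocated = math.ceil((data + parity) / multiple) * multiple
    return data, allocated

def raw_capacity_lost(block_bytes, ndisks=6, nparity=1, ashift=12):
    """Fraction of raw space that does not hold data (parity + padding)."""
    data, allocated = raidz_sectors(block_bytes, ndisks, nparity, ashift)
    return 1 - data / allocated

if __name__ == "__main__":
    for kib in (4, 8, 16, 32, 64, 128, 256, 512, 1024):
        lost = raw_capacity_lost(kib * 1024)
        print(f"{kib:>4}K volblocksize: {lost:.0%} of raw capacity lost")
```

Calling `raw_capacity_lost` with a different `ndisks` or `nparity` gives the same kind of estimate for other raidz layouts.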

Here is a blog post by one of the ZFS developers in which he explains how raidz works at the block level and why there is padding overhead.

So if you use the default 8K volblocksize you can only use 50% of that storage (17% lost to parity + 33% lost to padding overhead). You will lose 1TB to parity. And because of padding overhead, everything you write to a zvol will be 66% bigger. So if you write 3TB to a zvol, that will consume 5TB of space (3TB of data + 2TB of padding).
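
To show where the "66% bigger" comes from, here is a back-of-the-envelope sketch. It assumes ZFS reports pool space as if only parity were lost (5 of 6TB), while an 8K zvol really only stores data in 50% of the raw space. The constant and function names are mine, not anything ZFS exposes.

```python
RAW_TB = 6.0                 # 6x 1TB disks
REPORTED_EFFICIENCY = 5 / 6  # what ZFS reports for this raidz1: only parity deducted
ACTUAL_EFFICIENCY = 0.5      # 8K volblocksize: parity + padding eat 50% of raw space

def reported_usage_tb(data_tb):
    """Space ZFS shows as used after writing data_tb of data to the zvol."""
    raw_needed = data_tb / ACTUAL_EFFICIENCY   # raw TB really consumed
    return raw_needed * REPORTED_EFFICIENCY    # expressed in 'reported' TB

# How much bigger data looks on the pool than it really is (about 1.67x).
inflation = REPORTED_EFFICIENCY / ACTUAL_EFFICIENCY

print(f"inflation factor: {inflation:.2f}")
print(f"3TB of data shows up as {reported_usage_tb(3.0):.1f}TB used")
```
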
On top of that, you can never fully utilize a pool because ZFS is a copy-on-write filesystem and always needs free space for internal operations. Once the pool is 80% full it gets slow, and if it's more than 90% full it switches into panic mode. So it's best to set an 80% or 90% quota for that pool so that no one can fill it up completely by accident. Of these 3TB you can only use 80-90%, so right now you only have 2.4-2.7TB of usable space.
Use a 32K volblocksize and you can use 3.85-4.3TB (or maybe even more if your data compresses well... you could also store 100TB or more on that pool if the data can make good use of deduplication and compression, but around 4.3TB is what you can use if the data is totally incompressible and has no similarities). Deduplication is, by the way, disabled by default because it needs an additional 5GB of RAM per 1TB of raw capacity, and you probably don't want to sacrifice an additional 30GB of RAM to be able to deduplicate a 6TB pool. Without deduplication your pool should only need around 8GB of RAM for its ARC (but it will use up to 16GB of RAM if you don't change the defaults). It will run with less RAM (like 2-4GB), but the pool will get slower.
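
The usable-space numbers are just the raw capacity minus the parity/padding loss, multiplied by the 80-90% fill limit. A minimal sketch of that arithmetic (the helper name is hypothetical, and compression and deduplication are ignored):

```python
def usable_range_tb(raw_tb, raw_lost, fill_limits=(0.8, 0.9)):
    """Usable TB range after parity/padding loss and a fill quota."""
    after_overhead = raw_tb * (1 - raw_lost)
    return tuple(after_overhead * limit for limit in fill_limits)

# 6TB raw; share of raw capacity lost per volblocksize taken from the list above.
for label, lost in (("8K", 0.50), ("32K", 0.20)):
    low, high = usable_range_tb(6.0, lost)
    print(f"{label} volblocksize: {low:.2f}-{high:.2f} TB usable")
```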

And volblocksize can't be changed later because it is only set at creation. So you need to destroy and recreate all your zvols after changing the block size for that pool (WebUI: Datacenter -> Storage -> YourPool -> Edit -> Block size). Creating a backup and restoring it will delete the existing zvol and replace it with a new one from the backup, so this works fine for changing the volblocksize as long as you have enough backup storage.
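
Outside the WebUI the same idea looks roughly like this. The sketch only builds and prints the `zfs` commands instead of running them. The zvol name is the one from the log earlier in the thread and the 100G size is just an example; on PVE you would normally let the storage layer (or a backup restore) create the zvol for you.

```python
import shlex

def zvol_create_cmd(name, size, volblocksize="32K"):
    """Build a 'zfs create' command for a zvol with an explicit volblocksize."""
    return ["zfs", "create", "-V", size, "-o", f"volblocksize={volblocksize}", name]

def zvol_blocksize_cmd(name):
    """Build a 'zfs get' command that prints only the volblocksize value."""
    return ["zfs", "get", "-H", "-o", "value", "volblocksize", name]

# Dry run: print the commands instead of executing them.
for cmd in (zvol_create_cmd("ZFSQNAP/vm-100-disk-0", "100G"),
            zvol_blocksize_cmd("ZFSQNAP/vm-100-disk-0")):
    print(shlex.join(cmd))
```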

And you don't want to increase your volblocksize too much. It is very inefficient to write data that is smaller than your volblocksize, and performance will get terribly slow because read/write amplification will get terribly high. So if you have a lot of small files you want the volblocksize to be small. Or if, for example, you have a MySQL DB that is doing 16K sync writes/reads, you don't want the volblocksize to be higher than 16K.
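
A rough way to picture that amplification, assuming aligned writes and that every touched block has to be rewritten in full (this ignores caching, compression and the raidz parity/padding on top, and the function name is made up):

```python
import math

def write_amplification(io_bytes, volblocksize):
    """Bytes the zvol has to rewrite per byte the guest actually wrote."""
    blocks_touched = math.ceil(io_bytes / volblocksize)
    return blocks_touched * volblocksize / io_bytes

K = 1024
for vbs in (8 * K, 16 * K, 32 * K, 128 * K):
    amp = write_amplification(16 * K, vbs)
    print(f"16K write on {vbs // K}K volblocksize: {amp:.0f}x")
```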

And raidz1/2/3 or any conventional raid5/6 is by design only good for sequential writes and bad for random writes. A 6-disk raidz1 only gets the IOPS of a single drive, the write throughput of 5 drives and the read throughput of 6 drives. With a 6-disk striped mirror you would get the IOPS of 3 drives, the write throughput of 3 drives and the read throughput of 6 drives. If you want to use that pool as VM storage, any raidz1/2/3 or HW raid5/6 would be a terrible choice. The biggest problem with HDDs is the low IOPS they can handle, and a striped mirror or raid10 with the same number of disks could handle that 3 times better.

A mix is also possible. You could create two raidz1 vdevs of 3 disks each and stripe them together. This would also need a 32K volblocksize, and you would lose some capacity (33% instead of 20% of raw capacity lost) and some sequential write performance (that of 4 instead of 5 drives), but your pool would be twice as fast for random writes and might survive a second failing disk.
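
The three layouts can be compared with a simple rule-of-thumb model: every vdev gives roughly one drive's worth of IOPS, a raidz vdev writes at the speed of its data disks, a mirror vdev writes at the speed of one disk, and reads can use all disks. This is only a sketch of the numbers above, not a benchmark; the read figure for the 2x raidz1 layout is my extrapolation from the same rule.

```python
def pool_estimate(vdevs):
    """vdevs: list of (kind, ndisks, nparity); returns drive-equivalents."""
    iops = write = read = 0
    for kind, ndisks, nparity in vdevs:
        iops += 1        # each vdev delivers roughly one drive's IOPS
        read += ndisks   # reads can be served by every disk
        write += 1 if kind == "mirror" else ndisks - nparity
    return {"iops": iops, "write": write, "read": read}

layouts = {
    "6-disk raidz1": [("raidz", 6, 1)],
    "3x 2-disk mirror": [("mirror", 2, 0)] * 3,
    "2x 3-disk raidz1": [("raidz", 3, 1)] * 2,
}
for name, vdevs in layouts.items():
    print(name, pool_estimate(vdevs))
```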
 
All righty then...
So for now, I've decided to abandon ZFS for "big" storage and use hardware RAID instead. I also managed to increase the total disk pool to 8x1TB SATA. That, combined with a BBU-backed RAID controller in RAID-6 mode, seems to be more fault tolerant than my 4-bay QNAP (with 2x RAID1 arrays) and its unstable/half-bricked firmware bleeding out. Overall performance atm is secondary (if it matters at all).
My goal is to temporarily move all the QNAP data to some NAS VM, share it for a few days, then move everything back after the QNAP's maintenance (including replacing some disks etc.).

Then I'll surely switch back to ZFS experiments with no pressure on total available capacity.
I just wonder what's best: mount the HW storage array on PM (ext4, let's say...) and then put the NAS VM's storage VD on it, or perhaps PCI passthrough the HW RAID controller directly to a VM and mount it from there... BTW: it's not an HBA (IT mode) one. I might reflash it someday for ZFS purposes.

PS. Since I had to plug all 8 SATA drives into the RAID controller, there was no room left for my 600GB SAS drives in RAID1. Therefore I'm now using 2x500GB SATA (on the mobo's integrated SATA ports) in a ZFS mirror for PM itself. Obviously I had to do a fresh install, but it took just minutes since it's a 'fresh' project anyway...