HDSIZE of NVMe system disk during PVE installation with ZFS

Another question - how best to organize such a "growing" storage?
That is storage to which new disks can be added so that its size increases and the configuration changes (from a single disk to a mirror or raidz).
LVM or ZFS?

Both have advantages and disadvantages. I would prefer to bind-mount a folder from the host into an LXC and then run the NFS server inside the LXC to share it.

If we build this storage based on ZFS, then how best to forward the disks connected to the host inside the container?
 
Another question - how best to organize such a "growing" storage?
That is storage to which new disks can be added so that its size increases and the configuration changes (from a single disk to a mirror or raidz).
LVM or ZFS?
It isn't that easy to extend a pool you are booting from. The easiest option would be to add new disks and create a new, additional pool with them.
If we build this storage based on ZFS, then how best to forward the disks connected to the host inside the container?
You can create a dataset (zfs create YourPool/YourNewDataset) and then bind-mount the mountpoint of that dataset into an LXC. See here for bind-mounting into unprivileged LXCs: https://pve.proxmox.com/wiki/Unprivileged_LXC_containers
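A rough sketch of how that could look (pool name, dataset name, container ID 100 and the mount paths are just placeholders for your own values):

Code:
# create a dataset on the data pool
zfs create YourPool/YourNewDataset
# bind-mount that dataset's mountpoint into LXC 100 as /mnt/media
pct set 100 -mp0 /YourPool/YourNewDataset,mp=/mnt/media

For an unprivileged LXC you will additionally need the UID/GID mapping described in the wiki article above, otherwise everything shows up as owned by "nobody" inside the container.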
 
It isn't that easy to extend a pool you are booting from.
I did not intend to expand the pool from which the system boots. I'll leave the two NVMe disks that I have already used for rpool as they are now.

I also have a big SATA disk in my server and I intend to make a new pool with this SATA disk - for example, spool.
Then I will extend spool by plugging new SATA disks into the server and adding these disks to the pool.

Could there be any problems with this while using ZFS to serve this storage pool?
 
That isn't that easy with ZFS. If you want a striped mirror you can add new drives in pairs. But ZFS won't actively equalize the data across the disks after adding them, so it takes some time until the pool reaches its full performance.
Last year a feature was added that allows extending a raidz1/2/3, but extending it will only increase its capacity; it won't increase its performance or space efficiency (parity-to-data ratio). The only way to get the full raidz performance and usable capacity is to destroy the pool and recreate it.
And remember: raid/snapshots never replace a backup. So you might want to buy additional disks to back up everything; then destroying and recreating a pool isn't that problematic, as you can move datasets/zvols between pools using the "zfs send | zfs recv" commands.
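A minimal sketch of such a move with "zfs send | zfs recv" (pool, dataset and snapshot names are placeholders):

Code:
# snapshot the dataset on the old pool
zfs snapshot -r oldpool/media@move
# send it to the new pool including all properties and child datasets
zfs send -R oldpool/media@move | zfs recv newpool/media
# check the result; afterwards the old copy can be destroyed
zfs list -r newpool/media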
 
But ZFS won't actively equalize the data across the disks after adding them, so it takes some time until the pool reaches its full performance.
I think it is not a problem - we can wait some time)
How much time will it take for 1TB of new disk capacity?

So you might want to buy additional disks to back up everything; then destroying and recreating a pool isn't that problematic, as you can move datasets/zvols between pools using the "zfs send | zfs recv" commands.
Do you mean to build the storage on single disks (not mirrored or in any raidz)?
 
I think it is not a problem - we can wait some time)
How much time will it take for 1TB of new disk capacity?
It won't do this actively on its own. That just happens passively when deleting old data and writing new data, so it might take weeks or months depending on how much you write/delete.
You can speed that up by moving all the data between pools (so everything is deleted and added again).
Do you mean to build the storage on single disks (not mirrored or in any raidz)?
No, I mean you'd best buy everything twice so you get proper backups; then deleting and recreating a pool isn't a big problem. For example, if you want 1.6TB of usable storage, get 6x 1TB disks and create two raidz1 pools using 3 disks each. Then for backups you can replicate data from the first pool to the second pool. If you then need another 800GB of storage, you can buy two additional 1TB disks, destroy the first pool and recreate it as a raidz1 of 4 disks. Move the data from the second pool back to the first pool. Then destroy the second pool, recreate it as a raidz1 of 4 disks and start the replication again, so everything from the first pool is synced to the second pool.
And the second pool doesn't have to be in the same server. You could use USB disks and only attach them every few weeks, storing them somewhere safe in the meantime. Or you could do the ZFS replication over SSH between different servers.

Or you just extend the raidz pool without destroying it, so you don't need extra space to temporarily store everything while recreating the pool, but then you won't get the additional performance. And you should always have multiple copies of your data anyway if that data is important to you.
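If you go that route, on OpenZFS versions that already ship the raidz expansion feature the extension should look roughly like this (pool, vdev and disk names are placeholders, so treat it as a sketch):

Code:
# attach one more disk to the existing raidz1 vdev of pool "spool"
zpool attach spool raidz1-0 /dev/disk/by-id/ata-NEW_DISK
# watch the expansion progress
zpool status spool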
 
But then you also lose all the ZFS functions like snapshots, block level compression, deduplication, replication, bit rot protection, ...
1) About deduplication - can we really use it? I have read everywhere - do not use deduplication...

Now I have no budget to buy additional disks to make mirrors.
But the storage capacity should grow, so I will buy additional disks and add them to the pool as single disks.

As I wrote, I plan to store video and photo files on this storage. We have all these media files archived on different disks, so the backup question is not so pressing.

2) But one thing worries me a lot - if we use a ZFS pool with single disks (not in mirror or RAIDZ mode), as I understand, they can run in striped mode only. For a stripe, fault tolerance is worse than for a single disk.

So if one disk fails, I think the whole pool will be inaccessible.
Is there any way to quickly rebuild the data pool from the good disks without data loss?

3) I see new disks zd0 and zd16 in PVE system.
What is it?

Code:
root@pve:~# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda           8:0    0   2.7T  0 disk
└─sda1        8:1    0   2.7T  0 part
zd0         230:0    0    32G  0 disk
├─zd0p1     230:1    0     1M  0 part
├─zd0p2     230:2    0     1G  0 part
└─zd0p3     230:3    0    15G  0 part
zd16        230:16   0    16G  0 disk
├─zd16p1    230:17   0     1M  0 part
├─zd16p2    230:18   0     1G  0 part
└─zd16p3    230:19   0    15G  0 part
nvme0n1     259:0    0 238.5G  0 disk
├─nvme0n1p1 259:1    0  1007K  0 part
├─nvme0n1p2 259:2    0   512M  0 part
├─nvme0n1p3 259:3    0 229.5G  0 part
└─nvme0n1p4 259:4    0   8.5G  0 part [SWAP]
nvme1n1     259:5    0 238.5G  0 disk
├─nvme1n1p1 259:6    0  1007K  0 part
├─nvme1n1p2 259:7    0   512M  0 part
├─nvme1n1p3 259:8    0 229.5G  0 part
└─nvme1n1p4 259:9    0   8.5G  0 part [SWAP]
Code:
root@pve:~# fdisk -l
Disk /dev/nvme0n1: 238.47 GiB, 256060514304 bytes, 500118192 sectors
Disk model: NE-256                                 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F73F4ECD-7205-47DA-A77F-9E91E960F91F

Device             Start       End   Sectors   Size Type
/dev/nvme0n1p1        34      2047      2014  1007K BIOS boot
/dev/nvme0n1p2      2048   1050623   1048576   512M EFI System
/dev/nvme0n1p3   1050624 482344960 481294337 229.5G Solaris /usr & Apple ZFS
/dev/nvme0n1p4 482347008 500118158  17771151   8.5G Linux swap


Disk /dev/nvme1n1: 238.47 GiB, 256060514304 bytes, 500118192 sectors
Disk model: GIGABYTE GP-GSM2NE3256GNTD             
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 629827AB-324C-4248-9AA4-1A5807E237B5

Device             Start       End   Sectors   Size Type
/dev/nvme1n1p1        34      2047      2014  1007K BIOS boot
/dev/nvme1n1p2      2048   1050623   1048576   512M EFI System
/dev/nvme1n1p3   1050624 482344960 481294337 229.5G Solaris /usr & Apple ZFS
/dev/nvme1n1p4 482347008 500118158  17771151   8.5G Linux swap


Disk /dev/sda: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk model: WDC WD30EFZX-68A
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 434F9A21-23F7-C946-8228-1DE9F0FA091D

Device     Start        End    Sectors  Size Type
/dev/sda1   2048 5860533134 5860531087  2.7T Linux filesystem
GPT PMBR size mismatch (33554431 != 67108863) will be corrected by write.
The backup GPT table is not on the end of the device.


Disk /dev/zd0: 32 GiB, 34359738368 bytes, 67108864 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disklabel type: gpt
Disk identifier: 54E4EB85-5A71-4540-9F14-6FF3FEA9B35B

Device       Start      End  Sectors Size Type
/dev/zd0p1    2048     4095     2048   1M BIOS boot
/dev/zd0p2    4096  2101247  2097152   1G Linux filesystem
/dev/zd0p3 2101248 33552383 31451136  15G Linux filesystem


Disk /dev/zd16: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disklabel type: gpt
Disk identifier: 54E4EB85-5A71-4540-9F14-6FF3FEA9B35B

Device        Start      End  Sectors Size Type
/dev/zd16p1    2048     4095     2048   1M BIOS boot
/dev/zd16p2    4096  2101247  2097152   1G Linux filesystem
/dev/zd16p3 2101248 33552383 31451136  15G Linux filesystem
 
1) About deduplication - can we really use it? I have read everywhere - do not use deduplication...
You can use it, but it will need a lot of RAM (around 5GB of additional RAM per 1TB of storage).
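If you want to estimate beforehand whether dedup would even pay off for your data, a rough sketch (pool/dataset names are placeholders):

Code:
# simulate deduplication on an existing pool and print the expected dedup ratio
zdb -S mpool
# if it looks worthwhile, enable it only on the datasets that benefit
zfs set dedup=on mpool/somedataset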
Now I have no budget to buy additional disks to make mirrors.
But the storage capacity should grow, so I will buy additional disks and add them to the pool as single disks.

As I wrote, I plan to store video and photo files on this storage. We have all these media files archived on different disks, so the backup question is not so pressing.
If you already have backups, then it's maybe best to just extend the raidz and sacrifice the missing performance gain when adding your disks. Or you stripe mirrors or raidz1/2s, but then you can't add single disks and need to add several at once.
2) But one thing worries me a lot - if we use a ZFS pool with single disks (not in mirror or RAIDZ mode), as I understand, they can run in striped mode only. For a stripe, fault tolerance is worse than for a single disk.

So if one disk fails, I think the whole pool will be inaccessible.
Is there any way to quickly rebuild the data pool from the good disks without data loss?
Yes, if a single disk dies in a striped pool all data is lost and there is no way to get it back or repair the pool.

Starting with a striped mirror of 2x 1TB disks and later adding another 2/4/6 disks would look like this:
Setup | Capacity | IOPS | Throughput (read/write) | Disks may fail
2 disk mirror | 40% = 0.8TB | 1x | 2x / 1x | 1
4 disk striped mirror | 40% = 1.6TB | 2x | 4x / 2x | 1-2
6 disk striped mirror | 40% = 2.4TB | 3x | 6x / 3x | 1-3
8 disk striped mirror | 40% = 3.2TB | 4x | 8x / 4x | 1-4

Starting with a 3x 1TB disk raidz1 and extending it by adding 1/2/3/4/5 1TB disks later would look like this:
Setup | Capacity | IOPS | Throughput (read/write) | Disks may fail
Raidz1 3 disks | 53% = 1.59 TB | 1x | 2x / 2x | 1
Raidz1 4 disks | 53% = 2.12 TB | 1x | 2x / 2x | 1
Raidz1 5 disks | 53% = 2.65 TB | 1x | 2x / 2x | 1
Raidz1 6 disks | 53% = 3.18 TB | 1x | 2x / 2x | 1
Raidz1 7 disks | 53% = 3.71 TB | 1x | 2x / 2x | 1
Raidz1 8 disks | 53% = 4.24 TB | 1x | 2x / 2x | 1

If you destroy and recreate the raidz1 each time you add a disk, it would look like this with 3x to 8x 1TB disks:
Setup | Capacity | IOPS | Throughput (read/write) | Disks may fail
Raidz1 3 disks | 53% = 1.59 TB | 1x | 2x / 2x | 1
Raidz1 4 disks | 60% = 2.4 TB | 1x | 3x / 3x | 1
Raidz1 5 disks | 64% = 3.2 TB | 1x | 4x / 4x | 1
Raidz1 6 disks | 67% = 4 TB | 1x | 5x / 5x | 1
Raidz1 7 disks | 69% = 4.8 TB | 1x | 6x / 6x | 1
Raidz1 8 disks | 70% = 5.6 TB | 1x | 7x / 7x | 1

So extending a raidz1 pool only makes sense if you start with a lot of disks, so you already have a good data-to-parity ratio and throughput performance to begin with, as they won't increase when adding new disks.

Another option would be to stripe multiple raidz1s. That way you can add disks similar to a striped mirror, but you always need to add at least the number of new disks that you started with.
Let's say for example you start with 4x 1TB disks as a raidz1. You could then buy another 4x 1TB disks and it would look like this, without needing to destroy or recreate the pool:
Setup | Capacity | IOPS | Throughput (read/write) | Disks may fail
Raidz1 4 disks | 60% = 2.4 TB | 1x | 3x / 3x | 1
Striped raidz1 of 4 disks each (8 disks total) | 60% = 4.8 TB | 2x | 6x / 6x | 1-2
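A rough sketch of that second step on the command line, assuming the pool is called spool and the disk IDs are placeholders:

Code:
# add a second raidz1 vdev made of the four new disks
zpool add spool raidz1 /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6 /dev/disk/by-id/ata-DISK7 /dev/disk/by-id/ata-DISK8
zpool status spool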

So in my opinion a striped mirror or striped raidz1/2 would make more sense than extending a raidz1/2 with single disks.

3) I see new disks zd0 and zd16 in PVE system.
What is it?
Those are your zvols. Each virtual disk you create is a zvol block device that will show up as zdX.
 
Those are your zvols. Each virtual disk you create is a zvol block device that will show up as zdX.
1) So can I mount them in the host and get access to their files in the host's system?

2) Which option concerning disks is better when creating a ZFS pool:
1. raw disks
2. disks with a GPT partition already created
3. something else?

3) Should I make any preparations to a disk before adding it to a ZFS pool?
For example,
1. filling the disk with zeros
2. any other initialization of the disk

4) Should I set ashift=12 (or another value) when creating a ZFS pool with common SATA disk(s)?
 
1) So can I mount them in the host and get access to their files in the host's system?
Yes, but only if the VM is shut down, or you will corrupt the data.
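A rough sketch, assuming the VM that owns zd0 is powered off and the partition you want is zd0p3 (device and mountpoint are placeholders):

Code:
mkdir -p /mnt/vmdisk
# mount read-only to be on the safe side
mount -o ro /dev/zd0p3 /mnt/vmdisk
# copy out what you need, then unmount again
umount /mnt/vmdisk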
2) Which option concerning disks is better when creating a ZFS pool:
1. raw disks
2. disks with a GPT partition already created
3. something else?
Both will work. If you don't need to put other stuff on it besides ZFS, I would use them raw and let ZFS partition them.
3) Should I make any preparations to a disk before adding it to a ZFS pool?
For example,
1. filling the disk with zeros
2. any other initialization of the disk
No, if you use a raw disk ZFS will do the partitioning. Otherwise you need to partition the disk yourself. If you want to create that pool using the WebUI, you need to wipe the disk first, because PVE won't allow you to create a pool with it if it already has partitions on it.
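If the disk already has partitions, a rough sketch of wiping it first (double-check the device name, this is destructive; /dev/sdX is a placeholder):

Code:
# remove all filesystem and partition-table signatures from the disk
wipefs -a /dev/sdX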
4) Should I set ashift=12 (or another value) when creating a ZFS pool with common SATA disk(s)?
That depends on your disks. For most setups ashift=12 should be fine. You can use fdisk -l /dev/yourDisk to check its logical/physical sector size. If the physical sector size of the HDD is 4096B, then ashift=12 should be used.
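A rough sketch of doing it on the command line (pool name and disk path are placeholders; using the /dev/disk/by-id path is generally preferable to /dev/sda):

Code:
# check the logical/physical sector sizes first
fdisk -l /dev/sda
# create the pool with 4K sectors in mind
zpool create -o ashift=12 mpool /dev/disk/by-id/ata-WDC_WD30EFZX-XXXX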
 
I created the ZFS pool mpool on /dev/sda via the web GUI.

There is a question about the size of the quota that should be set for the new pool.
fdisk -l shows the capacity as 2.7T, zfs list shows 2.63T - see below.

What quota size should I set for mpool?

Code:
root@pve:~# zfs list
NAME                         USED  AVAIL     REFER  MOUNTPOINT
mpool                        660K  2.63T       96K  /mpool
mpool/nfs_media               96K  2.63T       96K  /mpool/nfs_media
rpool                       26.6G   194G      104K  /rpool
rpool/ROOT                  17.5G  14.5G       96K  /rpool/ROOT
rpool/ROOT/pve-1            17.5G  14.5G     17.5G  /
rpool/data                  9.03G   154G       96K  /rpool/data
rpool/data/base-101-disk-0  3.02G   154G     3.02G  -
rpool/data/vm-100-disk-0    6.01G   154G     6.01G  -

root@pve:~# fdisk -l /dev/sda
Disk /dev/sda: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk model: WDC WD30EFZX-68A
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: AB5966AD-D96D-F94F-81BA-713F64DD9528

Device          Start        End    Sectors  Size Type
/dev/sda1        2048 5860515839 5860513792  2.7T Solaris /usr & Apple ZFS
/dev/sda9  5860515840 5860532223      16384    8M Solaris reserved 1
 
80% or 90% of your pool's capacity, so 2.1TiB or 2.367TiB. If you set your pool's quota to 2.367TiB, you should keep an eye on the pool and free up some space once it exceeds 2.1TiB.
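A rough sketch of setting that quota (using the numbers above):

Code:
# limit the whole pool to roughly 80% of its capacity
zfs set quota=2.1T mpool
zfs get quota mpool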
 
80% or 90% of your pool's capacity, so 2.1TiB or 2.367TiB. If you set your pool's quota to 2.367TiB, you should keep an eye on the pool and free up some space once it exceeds 2.1TiB.
So do you calculate the size for the quota from the 2.63T which we see in zfs list?

Now I see in zfs get all mpool that:
mpool  compression  on  local
What is compression = local?
Maybe it is better to switch the compression type to lzjb or lz4?
 
"local" means the "compression" attribute is directly set and not just inherited (which makes sense as the pools root can'T be a child of anything else). The value of the "compression" attribute is "on" which should default to lz4 compression. So actually lz4 is already used.
 
"local" means the "compression" attribute is directly set and not just inherited (which makes sense as the pools root can'T be a child of anything else). The value of the "compression" attribute is "on" which should default to lz4 compression. So actually lz4 is already used.

Hello,
I finally bought an additional HDD to enlarge my pool size. The new drive has a 6TB capacity.

I have a pool with a 3TB HDD now.
Is there any possibility to add this new drive to the existing pool to get a 9TB total pool size?
 
You could stripe them so you get something like a raid0, but that's not really recommended. It's best to get disks of the same size and use them in a setup with some kind of parity. Using 2 disks in a raid0 you double the chance of losing everything.
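Just for completeness, the striping itself would only be a one-liner (pool name and disk ID are placeholders; note that this adds no redundancy at all):

Code:
# add the 6TB disk as a second top-level vdev -> raid0-like stripe
zpool add mpool /dev/disk/by-id/ata-NEW_6TB_DISK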
 
Using 2 disks in a raid0 you double the chance of losing everything
Yes, sure!

It's best to get disks of the same size and use them in a setup with some kind of parity.
But my situation is that I already have only one 3TB disk and one 6TB disk.
Unfortunately I can't buy another disk (or disks) now.

I am going to try to make 9TB of storage with these two disks using LVM. Now I am reading the LVM manuals)

The Proxmox manual says the following about LVM:

Create a Physical Volume (PV) without confirmation and 250K metadatasize.
# pvcreate --metadatasize 250k -y -ff /dev/sdb1

1) Why without confirmation?
2) Why 250K metadatasize?

Is a 250K metadatasize enough? Maybe it is better to set the metadatasize to 1M, for example?
Or maybe it is better not to set the metadatasize at all, so pvcreate uses its default?
 