ZFS Raid 10 mirror and stripe or the opposite

ieronymous

Apr 1, 2019
Hi

In a ZFS RAID 10 of, let's say, 8 drives, what is the default way Proxmox configures the drives, assuming the process is done from the GUI?
Does it mirror vdevs and then stripe them, or the opposite?

Method 1
like 2 disks in mirror vdev1 / 2 disks in mirror vdev2 / these two mirrors striped in vdev3
2 disks in mirror vdev4 / 2 disks in mirror vdev5 / these two mirrors striped in vdev6 / mirror vdev3 and vdev6

Method 2
2 disks in mirror vdev1 / 2 disks in mirror vdev2 / 2 disks in mirror vdev3 / 2 disks in mirror vdev4 / stripe vdev1-4

Method 3
Stripe disks 1-2 in vdev1 / stripe disks 3-4 in vdev2 / stripe disks 5-6 in vdev3 / mirror vdev1, vdev2, vdev3

Probably there are other similar methods too, but which one does it use?
 
1.) YourNode -> Disks -> select each of the 8 disks and click the "wipe" button to wipe them (PVE will only allow you to use them if they are completely empty, without existing partitions)
2.) YourNode -> Disks -> ZFS -> Create: select "Raid10" there and select your 8 wiped disks
3.) Datacenter -> Storage -> YourZFSPool -> Edit: set the block size from 8K to 16K if you are using 8 disks.
4.) Optimize your ZFS pool (for example, enable relatime: zfs set relatime=on YourPool)
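For reference, a sketch of what steps 1, 2, and 4 do under the hood on the CLI; the disk names are placeholders, and everything on those disks is destroyed:

```shell
# Wipe old partition tables first (destructive!), one call per disk
sgdisk --zap-all /dev/disk/by-id/disk1

# "Raid10" in the GUI = four 2-disk mirrors striped together
zpool create -o ashift=12 YourPool \
    mirror /dev/disk/by-id/disk1 /dev/disk/by-id/disk2 \
    mirror /dev/disk/by-id/disk3 /dev/disk/by-id/disk4 \
    mirror /dev/disk/by-id/disk5 /dev/disk/by-id/disk6 \
    mirror /dev/disk/by-id/disk7 /dev/disk/by-id/disk8

# step 4: cut down the metadata writes caused by reads
zfs set relatime=on YourPool
```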
 
1.) YourNode -> Disks -> select each of the 8 disks and click the "wipe" button to wipe them (PVE will only allow you to use them if they are completely empty, without existing partitions)
2.) YourNode -> Disks -> ZFS -> Create: select "Raid10" there and select your 8 wiped disks
3.) Datacenter -> Storage -> YourZFSPool -> Edit: set the block size from 8K to 16K if you are using 8 disks.
4.) Optimize your ZFS pool (for example, enable relatime: zfs set relatime=on YourPool)
I feel like a kid you take by the hand and teach things... hahah. Thank you for the reply, but it is completely irrelevant to my initial question. It reads like an answer to a how-to guide from scratch.

Would you like to read my question again and see where the answer failed?

For instance, in a ZFS RAID 10 with 4 disks, the outcome of zpool status would be
mirror-0
disk1
disk2

mirror-1
disk3
disk4

So I am assuming that it first creates mirrors and then stripes over them? Where in the above example does the CLI output show the stripe? You just suppose so because someone chose RAID 10 and knows it. Shouldn't the output list the striping between mirror-0 and mirror-1 instead of me assuming it, since it is one pool consisting of 4 disks?

By the way about your last two steps
3.) Datacenter -> Storage -> YourZFSPool -> Edit: set the block size from 8K to 16K if you are using 8 disks.
4.) Optimize your ZFS pool (for example, enable relatime: zfs set relatime=on YourPool)
3. This isn't always the case. It depends on the needs of the user. Will he need the storage for VMs? For databases (small files)? Backups (large files)?
4. I can't remember what setting relatime=on helps with.

Thank you though for your answer.
 
Thank you for the reply, but it is completely irrelevant to my initial question. It reads like an answer to a how-to guide from scratch.

Would you like to read my question again and see where the answer failed?

For instance, in a ZFS RAID 10 with 4 disks, the outcome of zpool status would be
mirror-0
disk1
disk2

mirror-1
disk3
disk4

So I am assuming that it first creates mirrors and then stripes over them? Where in the above example does the CLI output show the stripe? You just suppose so because someone chose RAID 10 and knows it. Shouldn't the output list the striping between mirror-0 and mirror-1 instead of me assuming it, since it is one pool consisting of 4 disks?
A striped mirror (the ZFS term for RAID 10) is multiple mirrors (they don't need to be 2 disks; you can also have 3 or 4 disks in a mirror so that 2 or 3 disks of the mirror might fail without losing data) striped together. But striping works differently compared with traditional RAID. I believe it is more like a JBOD, just with mirrors (or could we call it "JBOM: just a bunch of mirrors"?^^). So if you only have 2 disks in a mirror, a striped mirror with 4/6/8/10/12/... disks results in 2/3/4/5/6 individual mirrors that are striped together. So with 8 disks created using "Raid10" in the WebUI, your pool would look like this:
Code:
mirror-0
 disk1
 disk2

mirror-1
 disk3
 disk4

mirror-2
 disk5
 disk6

mirror-3
 disk7
 disk8

If you use it in the installer, keep in mind that not the whole disk will be used for ZFS. Only one of the several partitions on each disk will be used for the pool.
3. This isnt always the case . It depends of the need of the user. Will he need the storage for VMs ? For Databases? (small files) Backups? (large files)
You can't choose the volblocksize as you like. It has to match the hardware. For a striped mirror you want a bigger volblocksize for large files. For small files you also want at least a 16K volblocksize, because otherwise you can't make use of all the IOPS your disks could handle (at least when using an ashift of 12, so a 4K blocksize for each mirror; as far as I know you want at least 4x 4K = 16K for your volblocksize).
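To make the arithmetic explicit: the minimum useful volblocksize here is just 4 sectors, where the sector size is 2^ashift. A tiny shell helper (the function name is mine, not a ZFS command) illustrates it:

```shell
# Minimum useful volblocksize for a 4-vdev striped mirror,
# following the "4 sectors" rule of thumb from the post above
min_volblocksize() {
    ashift="$1"
    sector=$((1 << ashift))   # sector size in bytes = 2^ashift
    echo $((4 * sector))      # four sectors, one per mirror vdev
}

min_volblocksize 12   # ashift=12 (4K sectors)   -> prints 16384 (16K)
min_volblocksize 9    # ashift=9  (512B sectors) -> prints 2048  (2K)
```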

With raidz1/z2/z3 it's even worse. No matter what blocksize your workload requires, it makes no sense to go below a 16K volblocksize, because otherwise you would always lose 50-75% of your raw capacity to the additional padding overhead, and even a striped mirror would be more space efficient. So there is no reason to use raidz if a striped mirror gets way better IOPS, is faster to resilver, and allows a lower volblocksize with the same or even more usable capacity.

So you need to choose the hardware setup according to your workload, and don't try to fit the workload to the hardware. A raidz1 will never be a good choice for a Postgres DB, no matter what you do or select as the blocksize. If you need an 8K blocksize, like Postgres does, use a (striped) mirror and not raidz1/2/3.
4. I can't remember what setting relatime=on helps with.
Without that, every read operation will cause an additional write operation to update the metadata. So you get better IOPS, better throughput, and your SSDs will live longer. The only negative effect would be that your access time isn't that accurate, but it should be accurate enough for nearly any application.
 
Now we are getting somewhere.
I probably missed the definition of ZFS RAID 10. If it is a stripe of mirrors, then that automatically answers my question as well.

So with 8 disks created using "raid10" in the WebUI your pool would look like this:
It is my described Method 2 then, Ok.

A striped mirror (the ZFS term for RAID 10) is multiple mirrors (they don't need to be 2 disks; you can also have 3 or 4 disks in a mirror so that 2 or 3 disks of the mirror might fail without losing data) striped together.
I am trying to reduce the risk of having 2 disks of the same mirror fail by combining disks of different brands (mostly HGST and WD or Seagate) in each mirror. What I mean is placing them in the server in a way that odd positions are covered by one brand and even positions by another.

Having, for instance, 3 mirrored drives is even better (assuming you can afford the storage loss), but I don't think you can accomplish that from the GUI. Am I right? It seems that I have to create the mirrors by myself. Then, can the last step, striping the mirrors, be done from the GUI, or is there no need, and since I started the procedure from the CLI it would be wise to finish it there?

If you use it in the installer, keep in mind that not the whole disk will be used for ZFS. Only one of the several partitions on each disk will be used for the pool.
Oh, new info here. I didn't know that. Why? So you are suggesting creating the raid from the CLI, right? Are any extra options necessary during creation to avoid having only one partition of each disk being used?

You've written (sorry for this way of presenting it, but after editing the post it wouldn't let me reply quote-style):
<<<<<<<<<<<<Without that, every read operation will cause an additional write operation to update the metadata. So you get better IOPS, better throughput, and your SSDs will live longer. The only negative effect would be that your access time isn't that accurate, but it should be accurate enough for nearly any application.>>>>>>>>>>>>>>

A little research on this option brought up... relatime: it only updates the atime when the previous atime is older than mtime (modification time) or ctime (change time), or when the atime is older than 24h (based on a setting). Does ZFS use this kernel feature to calculate times when using snapshots, so that it shouldn't be disabled?
Since the way to check that option is with the zfs command and not zpool, that means relatime is an option you can set for each individual dataset and not for the pool inside which the datasets reside, right? Or can you set it for both? Of course there is also the possibility that setting it on the pool automatically makes the datasets below that pool inherit it.
The most important question, though, is... can you set that option afterwards, when the pool/dataset already has data inside? Won't that affect anything?
Currently, none of the Proxmox nodes I have created has that option enabled. Since you brought it up, though, searching some older personal guides turned up the options below that I was aware of:
-You should stripe mirrors for the best IO.
-RAIDZ2 is not exactly fast; don't use it for pools larger than 10 disks.
-Make sure when you create your ZFS pool that you use an ashift value according to the physical block size of your disks.
-No gains from a read cache or an SLOG for specific configurations. ZFS loves RAM, so you can also tweak the ARC to a larger size for some performance gains.
-zfs set xattr=sa (pool) sets the Linux extended attributes like so; this will stop the file system from writing tiny files and write directly to the inodes.
-zfs set sync=disabled (pool) disables sync; this may seem dangerous, but do it anyway! You will get a huge performance gain.
-zfs set compression=lz4 (pool/dataset) sets the compression default; this is currently the best compression algorithm.
-zfs set atime=off (pool) disables the Accessed attribute on every file that is accessed; this can double IOPS.
-zfs set recordsize=(value) (pool/vdev) The recordsize value is determined by the type of data on the file system: 16K for VM images and databases (or an exact match), or 1M for collections of 5-9MB JPG files and GB+ movies etc. If you are unsure, the default of 128K is good enough for all-around mixes of file sizes.
-acltype=posixacl, default acltype=off Hm... I don't have any info on this one, but I have seen it like this on many occasions. Mine is set to off.
-primarycache=metadata (for VM storage) Interesting article with production-level examples:
https://www.ikus-soft.com/en/blog/2018-05-23-proxmox-primarycache-all-metadata/

Thank you for all that great info!!!!!! Really appreciated, since I am adding nice tips to my guides, making them more advanced.
I am really looking forward to your answer to my last post.

By the way, I am using SAS 10k 2.5-inch drives of 1.2TB capacity each, with 512B both logical and physical sector size, so the ashift value I am going to use is 9.
For Proxmox itself, which is on a mirrored SSD pair, those disks (even though they are SSDs and lie about that for some reason) are also 512B logical/physical, and ashift is 9.
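(The sector sizes and the matching ashift can be double-checked before pool creation; `ashift_for` below is a hypothetical helper, and the lsblk/smartctl commands in the comments are the real way to read the sizes.)

```shell
# Real commands to read the reported sector sizes (output depends on the disks):
#   lsblk -o NAME,LOG-SEC,PHY-SEC
#   smartctl -i /dev/sda | grep -i 'sector size'

# ashift is simply log2 of the sector size
ashift_for() {
    sector="$1"; a=0
    while [ $((1 << a)) -lt "$sector" ]; do a=$((a + 1)); done
    echo "$a"
}

ashift_for 512    # prints 9
ashift_for 4096   # prints 12
```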
 
Having, for instance, 3 mirrored drives is even better (assuming you can afford the storage loss), but I don't think you can accomplish that from the GUI. Am I right? It seems that I have to create the mirrors by myself. Then, can the last step, striping the mirrors, be done from the GUI, or is there no need, and since I started the procedure from the CLI it would be wise to finish it there?
Jup, that only works using the CLI. But it's a one-liner:
zpool create -f -o ashift=9 YourPool mirror /dev/disk/by-id/disk1 /dev/disk/by-id/disk2 /dev/disk/by-id/disk3 mirror /dev/disk/by-id/disk4 /dev/disk/by-id/disk5 /dev/disk/by-id/disk6 mirror /dev/disk/by-id/disk7 /dev/disk/by-id/disk8 /dev/disk/by-id/disk9 mirror /dev/disk/by-id/disk10 /dev/disk/by-id/disk11 /dev/disk/by-id/disk12

"-f" will force the creation even if the disks are not empty (so you can skip the wipe step)
"-o ashift=9" will set the blocksize for each disk. So 9 would mean each disk would operate with 512B blocks (because 2^9 bytes blocksize).
"YourPool" is the name your pool will get.
Each of the four "mirror /dev/disk/by-id/diskX /dev/disk/by-id/diskY /dev/disk/by-id/diskZ" blocks defines a 3-disk mirror, and if you define multiple mirrors they will automatically be striped together.

So this would result in something like:
Code:
mirror-0
 disk1
 disk2
 disk3

mirror-1
 disk4
 disk5
 disk6

mirror-2
 disk7
 disk8
 disk9

mirror-3
 disk10
 disk11
 disk12
So any 2 disks, or up to 8 disks in total (as long as no mirror loses all 3 of its members), may fail without losing data.

If you create your pool using the CLI, you need to tell PVE to actually use it as a storage. You can do that using the GUI: Datacenter -> Storage -> Add -> ZFS
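If you want to skip the GUI for that last step as well, pvesm can add the storage entry from the CLI (the storage ID here is just an example name):

```shell
# Register the existing pool as a PVE storage for VM disks and containers
pvesm add zfspool YourStorageID --pool YourPool --content images,rootdir --sparse 1
```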
oh, new info here. Didn t know that. Why? So you are suggesting creating the raid from cli right? Do any extra options are necessary during the creation to avoid having only one partition of each disk being used?
Because the installer needs to write the bootloader and boot partition somewhere. So it writes them to all drives and keeps them in sync, so in case the drive you are currently booting from fails, you can just select another drive. Or you select in the UEFI/BIOS to boot from all your drives, so it can always boot no matter which drive fails. Also, you don't want to have your swap on top of ZFS, because that can cause problems. So it's recommended to use a non-ZFS swap partition.
Also keep in mind that you can't just simply replace a failed disk if your system is on that pool. You need to partition the new disk yourself first, copy over the bootloader, and so on. It's described here (scroll down to "Changing a failed bootable device").

I like to install my PVE to a dedicated mirror (32GB disks will be fine) and then create another ZFS pool just as a VM storage. That way you get some benefits:
- the system has its own storage, so PVE should be able to keep hypervising even if the guests create so much IO load that the dedicated VM storage pool gets totally unresponsive
- you can back up your guests, destroy your complete VM storage pool, create a new pool, and restore the guests from backups to that new pool without needing to reinstall or set up PVE again. That makes it really easy if you want to extend your VM storage pool, or need to change stuff like the ashift that can only be set at pool creation.
- you can just replace a failed disk of that VM storage pool without needing to do all the stuff described in "Changing a failed bootable device". So it's easier if the disk has to be replaced by someone who doesn't know how to do all that correctly.
You've written (sorry for this way of presenting it, but after editing the post it wouldn't let me reply quote-style):
<<<<<<<<<<<<Without that, every read operation will cause an additional write operation to update the metadata. So you get better IOPS, better throughput, and your SSDs will live longer. The only negative effect would be that your access time isn't that accurate, but it should be accurate enough for nearly any application.>>>>>>>>>>>>>>

A little research on this option brought up... relatime: it only updates the atime when the previous atime is older than mtime (modification time) or ctime (change time), or when the atime is older than 24h (based on a setting). Does ZFS use this kernel feature to calculate times when using snapshots, so that it shouldn't be disabled?
No; not sure, but I don't think so. Snapshots work at the block level and atime is file-level.
Since the way to check that option is with the zfs command and not zpool, that means relatime is an option you can set for each individual dataset and not for the pool inside which the datasets reside, right?
Jup.
Or you can set it for both?
Of course there would be the possibility that by setting the pool automatically options are inherited to the Datasets below that pool.
ZFS inherits all attributes to children. So if you set, for example, relatime=on for the pool itself, all datasets will inherit and use it too, as long as you don't explicitly tell a dataset to use something else. If you then tell a dataset to use "relatime=off", it will ignore the inherited "relatime=on" and use the "relatime=off".
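The SOURCE column of zfs get shows this inheritance directly; a small sketch with placeholder pool/dataset names:

```shell
zfs set relatime=on YourPool                # children inherit this value
zfs set relatime=off YourPool/somedataset   # a local value beats the inherited one

# SOURCE reads "local" for the dataset and "inherited from YourPool" elsewhere
zfs get -r -o name,value,source relatime YourPool
```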
The most important question, though, is... can you set that option afterwards, when the pool/dataset already has data inside?
That really depends on the option. In general, ZFS will only use the new setting for new writes. So if you, for example, later decide you want to encrypt a dataset, all already-written data will stay unencrypted and only new data will be encrypted. And stuff like the volblocksize of a zvol can only be set once at creation; that attribute is read-only after that. But stuff like relatime should be unproblematic and changeable at any time.
-zfs set sync=disabled (pool) disable sync, this may seem dangerous, but do it anyway! You will get a huge performance gain.
That basically means you lie to your guests and handle all secure sync writes as insecure async writes. So the guest will think the data is securely stored, but it is just in volatile RAM. So in case of a power outage, hardware failure, or kernel crash you might kill the complete filesystem.
That is really an option I would never use. If you have problems with sync write performance, get an Intel Optane as a SLOG, but don't just completely disable sync writes. Programmers know that sync writes are slow, and they can choose whether they want a write to be a secure sync write or an insecure async write. If a programmer thinks data is so important that it shouldn't be lost under any circumstance, he will let the program write it as a sync write, knowing that it will be slow. By disabling sync writes you basically ignore how the programmer wants the program to operate and write everything as insecure async writes.
-zfs set atime=off (pool) this disables the Accessed attribute on every file that is accessed, this can double IOPS
If you want to use relatime you need to enable atime. Also keep in mind that some applications won't work when the storage isn't updating the access time. PBS, for example, needs the atime on the datastore to be able to work. In such a case you want "atime=on" and "relatime=on". If none of your applications need the atime, you can set "atime=off" and "relatime=off" for even better performance.
-zfs set recordsize=(value) (pool/vdev) The recordsize value will be determined by the type of data on the file system,
16K for VM images and databases or an exact match, or 1M for collections 5-9MB JPG files
and GB+ movies ETC. If you are unsure, the default of 128K is good enough for all around
mixes of file sizes.
Recordsize is ONLY used for datasets, not zvols. Zvols ignore the recordsize and use the volblocksize instead. In general, all VMs use zvols and all LXCs use datasets. So that recordsize will only affect LXCs, not VMs. If you want your VMs' virtual disks (zvols) to operate with a 16K blocksize, you need to set the volblocksize to 16K, not the recordsize.
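As a sketch of the difference (names are placeholders; in PVE you would normally just set the "Block size" field of the storage entry instead of creating zvols by hand):

```shell
# recordsize: datasets (LXC / file storage); changeable later, affects only new writes
zfs set recordsize=16K YourPool/subvol-100-disk-0

# volblocksize: zvols (VM disks); fixed at creation, read-only afterwards
zfs create -V 32G -o volblocksize=16K YourPool/vm-100-disk-0
```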
By the way, I am using SAS 10k 2.5-inch drives of 1.2TB capacity each, with 512B both logical and physical sector size, so the ashift value I am going to use is 9.
For Proxmox itself, which is on a mirrored SSD pair, those disks (even though they are SSDs and lie about that for some reason) are also 512B logical/physical, and ashift is 9.
In that case you have fewer problems with the minimum volblocksize, since your disks' physical blocksize is lower.
 
Thank you for your quick response and great insight into my considerations.
I'll add to what you have answered, beginning from the bottom and going to the top, since this way I'll end up at the main topic of this post, the raid.

In that case you have fewer problems with the minimum volblocksize, since your disks' physical blocksize is lower.
So you agree with me that in my use-case scenario ashift 9 is the best option. I know about the concern some people raise: what if you want to replace a disk with a native 4K blocksize afterwards? It will not only cause trouble, it will be prohibited, and you have to force it to ignore the status of the drive. Nevertheless, the only viable options for replacing SAS 2.5-inch drives are the same drives or(!) SSDs, and for the price of those to reach a reasonable level we'll have quantum PCs by then (joking), so this concern others have doesn't affect me. Finally, I noticed that even with ashift=9 or 12 the resulting default blocksize is 8K. I thought it changed according to the ashift value.

By the way, in my use-case scenario the SAS drives will be split into two pools (still deciding whether the second pool is going to be a raid 10 as well, or raidz1 or 2). So the bottom line for the configuration is:

-2 DC SSDs, 480GB, mirrored, for the Proxmox OS (already done), with ashift 9 and defaults for all other options.

-8 or 10 or 12 of the 16 total SAS drives for VM storage. Again with ashift 9, but feel free to intervene here and make suggestions.
The VMs will all be Win Server 2019 for AD / MSSQL / Remote Server, plus 2 Win 10 VMs for other purposes.

-The rest of the 8/6/4 drives in a raid configuration. I am still considering a separate storage for creating extra disks for the VMs (I currently have it set up this way on our current server, but after 2 years I can't remember why). Also, since up until now I didn't have an extra server for separate storage, I had a dataset on the same storage for VM backups (I know, single point of failure) in case I would like to restore a screwed VM. And writing this down, I just remembered why I wanted extra storage for each VM: I thought that a snapshot wouldn't take the second drive into consideration and that this would minimize the size of the backup (well, I couldn't have been more wrong, since it does take it into consideration and creates a backup file containing both drives of the VM). So the WinServ19 VM which acts as an AD has an extra drive coming from a second pool just to store shared files.
The second WinServ19 also has a second storage to hold the various MSSQL installation directories, keeping the root installation on C: (I know I have a very weird setup going on, and that is what I am trying to correct with the new server).

In case you want to use relatime you need to enable atime. Also keep in mind that some applications won't work when the storage isn't updating the access time. The PBS for example needs the atime for the datastore to be able to work.
You talk about applications, but PBS is some kind of OS, not an app. By the way, if you were to virtualize TrueNAS Scale, would that need atime/relatime or not? How am I supposed to know what each app needs? Do Windows Server OSes need it as well?
Since relatime is something like a flag for atime, shouldn't it automatically be set to off if you turn off the atime option? Can it stand by itself, and doing what?

Because the installer needs to write the bootloader and boot partition somewhere. So it writes them to all drives and keeps them in sync so in case that drive may fail, you are currently booting from, you can just select another drive. Or you select in the UEFI/BIOS to boot from all your drives, so it can always boot no matter what drive fails.
You are probably explaining about the drive someone uses for the OS installation, which is not my case here, since as I explained I'll use 2 dedicated SSDs for the OS. It writes the bootloader to all drives in the case of a UEFI installation, not legacy.

Also keep in mind that you can't just simple replace a failed disk in case your system is on that pool. You need to partition the new disk yourself first, copy over the bootloader and so on. Its described here (scroll down to "Changing a failed bootable device").
True. I have already done a failover scenario and had to first create the same partitions as on the original disk and then resilver. Afterwards a command is needed to sync the bootloader. I know the procedure for that.

Also you don't want to have your SWAP ontop of ZFS because that can cause problems. So its recommended to use a non-ZFS swap partition.
So what needs to be done? Are you talking about the OS installation disks?

only works using the CLI. But it's a one-liner:
zpool create -f -o ashift=9 YourPool mirror /dev/disk/by-id/disk1 /dev/disk/by-id/disk2 /dev/disk/by-id/disk3 mirror /dev/disk/by-id/disk4 /dev/disk/by-id/disk5 /dev/disk/by-id/disk6 mirror /dev/disk/by-id/disk7 /dev/disk/by-id/disk8 /dev/disk/by-id/disk9 mirror /dev/disk/by-id/disk10 /dev/disk/by-id/disk11 /dev/disk/by-id/disk12

"-f" will force the creation even if the disks are not empty (so you can skip the wipe step)
"-o ashift=9" will set the blocksize for each disk. So 9 would mean each disk would operate with 512B blocks (because 2^9 bytes blocksize).
"YourPool" is the name your pool will get.
Each of the four "mirror /dev/disk/by-id/diskX /dev/disk/by-id/diskY /dev/disk/by-id/diskZ" blocks defines a 3-disk mirror, and if you define multiple mirrors they will automatically be striped together.
So you don't specify the stripe anywhere, and it gets done by default because ZFS somehow understands that the pool is one, the named one, being created from a combination of mirrors. OK, it makes sense if it works this way.
Do you happen to know if the GUI takes the by-id naming scheme of the disks into consideration upon raid creation, or the sd[a-x] one? Is there a way to check it from the CLI?
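(A way to check this from the CLI, for the record: zpool status prints whatever names the pool was created with, and can resolve them both ways.)

```shell
zpool status YourPool      # shows the vdev member names as they were given
zpool status -P YourPool   # print full device paths (e.g. /dev/disk/by-id/...)
zpool status -L YourPool   # resolve symlinks down to the raw /dev/sdX devices
```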

I probably won't use mirrors of three, because that ends up with far less storage than I already have on the old server. I need 12 disks for 4.8TB (which is less once we take into consideration the 20% free space, padding overhead, etc.), and that leaves me 4 remaining disks to create an extra storage. Or I could use 15 disks in mirrors of 3 in order to have 6TB of storage only for VMs and have it store backups as well.

Closing with my final question, which will help me shrink my current setup. The extra disk that I use in each VM proved to be bigger than needed and leaves a lot of empty space. The GUI only expands storage and doesn't shrink it, so I thought to go inside the VM and shrink the disk, which means there will just be unpartitioned space. How do I inform Proxmox afterwards to reclaim that space, or does it still reside in the VM as unpartitioned?

Once more many thanks for your time!!!
 
Thank you for your quick response and great insight into my considerations.
I'll add to what you have answered, beginning from the bottom and going to the top, since this way I'll end up at the main topic of this post, the raid.


So you agree with me that in my use-case scenario ashift 9 is the best option. I know about the concern some people raise: what if you want to replace a disk with a native 4K blocksize afterwards? It will not only cause trouble, it will be prohibited, and you have to force it to ignore the status of the drive. Nevertheless, the only viable options for replacing SAS 2.5-inch drives are the same drives or(!) SSDs, and for the price of those to reach a reasonable level we'll have quantum PCs by then (joking), so this concern others have doesn't affect me. Finally, I noticed that even with ashift=9 or 12 the resulting default blocksize is 8K. I thought it changed according to the ashift value.
The volblocksize should be a multiple of the blocksize the drives are using (so of the chosen ashift). If you, for example, use a 3-disk raidz1, you want a volblocksize of at least 4 times the sector size. With ashift=12 each sector would be 4K, and with ashift=9 a sector would only be 512B. So with an ashift of 9 the minimum useful volblocksize would be 2K (4x 512B), but with an ashift of 12 the volblocksize should be at least 16K (4x 4K).
Using a bigger volblocksize always works, so you could also use a 16K volblocksize with ashift=9, but not the other way round.

Also, if you have a dedicated pool for your VMs, it would be no problem to destroy and recreate the pool with an ashift of 12 in only a few clicks (restoring the backups would of course take a while) if you later decide that you want to change drives and use 4K ones.
By the way, in my use-case scenario the SAS drives will be split into two pools (still deciding whether the second pool is going to be a raid 10 as well, or raidz1 or 2). So the bottom line for the configuration is:

-2 DC SSDs, 480GB, mirrored, for the Proxmox OS (already done), with ashift 9 and defaults for all other options.
You should benchmark them. SSDs always lie and report to be working with 512B or 4K sectors, but in reality they work internally with a much bigger blocksize like 8K or 16K or even more. So it might be faster to use ashift=12 or even ashift=13, no matter what the SSD is reporting.
You talk about applications, but PBS is some kind of OS, not an app.
It's only an OS if you install it as one from the PBS ISO. You can also install a normal Debian and then add the proxmox-backup-server package. Same with PVE.
By the way, if you were to virtualize TrueNAS Scale, would that need atime/relatime or not? How am I supposed to know what each app needs? Do Windows Server OSes need it as well?
You need to check each program that you want to run to see if it needs an access time or not. That's why atime is enabled by default. Better to lose some performance updating the access time even when it isn't needed than to not update it and run into strange problems, like a PBS that deletes backups that are actually needed. So relatime is a good compromise: the access time is still updated, but with far fewer additional writes.
Since relatime is something like a flag for atime, shouldn't it automatically be set to off if you turn off the atime option? Can it stand by itself, and doing what?
For ZFS, relatime needs atime to be enabled. If you disable atime, relatime won't work, no matter whether relatime is set to on or off.
You are probably explaining about the drive someone uses for the OS installation, which is not my case here, since as I explained I'll use 2 dedicated SSDs for the OS. It writes the bootloader to all drives in the case of a UEFI installation, not legacy.
jup
So what needs to be done? Are you talking about the OS installation disks?
jup
Closing with my final question, which will help me shrink my current setup. The extra disk that I use in each VM proved to be bigger than needed and leaves a lot of empty space. The GUI only expands storage and doesn't shrink it, so I thought to go inside the VM and shrink the disk, which means there will just be unpartitioned space. How do I inform Proxmox afterwards to reclaim that space, or does it still reside in the VM as unpartitioned?
Did you check the "thin" checkbox when creating the storage entry in the GUI? In that case your virtual disks should be thin-provisioned, and it doesn't really matter if you have unpartitioned space, because that unused space won't consume any capacity. For ZFS to be able to free up space, you need to use a protocol like VirtIO SCSI that supports TRIM, and also tell the VM (in the VM config) and the guest OS to use TRIM/discard.
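A hedged sketch of the TRIM/discard part (VMID 100 and the volume name are placeholders):

```shell
# Use the VirtIO SCSI controller and enable discard on the virtual disk
qm set 100 --scsihw virtio-scsi-pci
qm set 100 --scsi0 YourPool:vm-100-disk-0,discard=on

# Inside a Linux guest, release the freed blocks manually with:
#   fstrim -av
# (Windows issues TRIM via "Optimize Drives".)
```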
But bigger-than-needed virtual disks can still be annoying in case you want to use "stop" mode backups, because there dirty bitmaps can't be used and a backup job needs to read the complete virtual disk byte by byte, even if part of the virtual disk is actually unpartitioned, so the backups will take longer.

You might find this helpful to shrink the zvol: https://forum.proxmox.com/threads/shrink-a-attached-zfs-disk.46266/post-219599
But have backups, because shrinking can always destroy data.
 
You should benchmark them. SSDs always lie and report to be working with 512B or 4K sectors, but in reality they work internally with a much bigger blocksize like 8K or 16K or even more. So it might be faster to use ashift=12 or even ashift=13, no matter what the SSD is reporting.
Well, I've come back from your 15-page post with your fio benchmarks and have a terrible headache trying to figure out what is what.

Some tests I ran on the mirrored SSDs (I don't know, though, why I should care about those, since there will be separate storage on the SAS drives for the VMs):

Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75

Run status group 0 (all jobs):
   READ: bw=57.7MiB/s (60.5MB/s), 57.7MiB/s-57.7MiB/s (60.5MB/s-60.5MB/s), io=6141MiB (6440MB), run=106370-106370msec
  WRITE: bw=19.3MiB/s (20.2MB/s), 19.3MiB/s-19.3MiB/s (20.2MB/s-20.2MB/s), io=2051MiB (2150MB), run=106370-106370msec


fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=brisi --bs=8k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75

   READ: bw=116MiB/s (121MB/s), 116MiB/s-116MiB/s (121MB/s-121MB/s), io=6140MiB (6438MB), run=53111-53111msec
  WRITE: bw=38.6MiB/s (40.5MB/s), 38.6MiB/s-38.6MiB/s (40.5MB/s-40.5MB/s), io=2052MiB (2152MB), run=53111-53111msec


fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=brisi --bs=16k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75

   READ: bw=231MiB/s (242MB/s), 231MiB/s-231MiB/s (242MB/s-242MB/s), io=6136MiB (6434MB), run=26540-26540msec
  WRITE: bw=77.5MiB/s (81.2MB/s), 77.5MiB/s-77.5MiB/s (81.2MB/s-81.2MB/s), io=2056MiB (2156MB), run=26540-26540msec


fio --filename=test --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=posixaio --bsrange=4k-128k --rwmixread=70 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=test --size=8G
test: (g=0): rw=randrw, bs=(R) 4096B-128KiB, (W) 4096B-128KiB, (T) 4096B-128KiB, ioengine=posixaio, iodepth=16

   READ: bw=528MiB/s (554MB/s), 528MiB/s-528MiB/s (554MB/s-554MB/s), io=30.0GiB (33.2GB), run=60025-60025msec
  WRITE: bw=226MiB/s (237MB/s), 226MiB/s-226MiB/s (237MB/s-237MB/s), io=13.2GiB (14.2GB), run=60025-60025msec

I suppose before each test I should install Proxmox again and use a different block size, instead of just changing the option in the fio command from 512b to 4k to 8k to 16k.

These SSDs are DC ones with PLP and a SATA connection, and are rated for mixed-use environments (Kingston DC500M along with Seagate IronWolf Pro 125).

Best scores were achieved with the last test, but I can't even validate the results.

Did you check the "thin" checkbox when creating the pool using the GUI?
Nope. Neither for the VM storage (there was no option to check) nor for the data storage (the second disk for each VM comes from this storage).
I suppose enabling it afterwards only has meaning for new datasets, but for already created ones I could reset the ZFS property refreservation.
I've seen an example doing that, but it seems weird that it does it on the VM disk, zfs set refreservation=none rpool/data/vm-100-disk-1, and not on the storage the VM resides on, which would be zfs set refreservation=none rpool/data

If I do the refreservation method, will I still need to shrink the disk's size inside the VM afterwards?

PS: During VM creation I always check Discard and use VirtIO SCSI.
By the way, I've noticed that now, when you choose VirtIO SCSI as the SCSI controller during VM creation, it defaults to a drive format of qcow2 instead of a raw disk image. So it is thin-provisioned, right?

The volblocksize should be a multiple of the blocksize the drives are using (so of the used ashift). If you for example use a 3-disk raidz1, you want a volblocksize of at least 4 times the sector size. With ashift=12 each sector would be 4K, and with ashift=9 a sector would only be 512B. So with an ashift of 9 the minimum useful volblocksize would be 2K (4x 512B), but with an ashift of 12 the volblocksize should be at least 16K (4x 4K).
Does that multiple need to be a power of 2 because of the binary system? You are explaining an example with 3 drives, but 3 doesn't appear anywhere in the equation. You always use a constant of 4 (4x 512B for ashift=9 gives 2K, or 4x 4K for ashift=12 gives 16K). Where is the 3 (of the 3 used disks) helpful in the calculation?

Someone corrected you in your post with the benches, saying that it's not the number of mirrors that counts but the number of stripes. I didn't get it, since the stripe will always be one, like a roof:
2 mirrors of 3 drives and a stripe above them to make them act as one
8 mirrors of 3 drives and a stripe above them to make them act as one
So the stripe is always one. I thought your point was right.
 
Well, I come back from your 15-page post with your fio benches and have a terrible headache trying to figure out what is what.
Yes, I still need to summarize that in a less confusing way in a blog post.
I suppose before each test I should install Proxmox again and use a different block size, instead of just changing the option in the fio command from 512b to 4k to 8k to 16k.
Jep. But as you said, it's not that important if you just run the system on those SSDs. Then the SSDs are idling most of the time anyway.
The correct benchmark would be to run the same fio test with just "--bs=16K" on 4 newly created pools that use ashift=9/12/13/14, and then use the ashift that got the best results.
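That benchmark loop could look roughly like this (a sketch; /dev/sdX and /dev/sdY are placeholder devices, and zpool create -f destroys their contents):

```shell
# Create a throwaway mirror with each ashift, run the same fio job, tear down.
for a in 9 12 13 14; do
    zpool create -f -o ashift=$a testpool mirror /dev/sdX /dev/sdY
    fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
        --name=test --filename=/testpool/brisi --bs=16k \
        --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
    zpool destroy testpool
done
```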
Nope. Neither for the VM storage (there was no option to check) nor for the data storage (the second disk for each VM comes from this storage).
I suppose enabling it afterwards only has meaning for new datasets, but for already created ones I could reset the ZFS property refreservation.
I've seen an example doing that, but it seems weird that it does it on the VM disk, zfs set refreservation=none rpool/data/vm-100-disk-1, and not on the storage the VM resides on, which would be zfs set refreservation=none rpool/data

If I do the refreservation method, will I still need to shrink the disk's size inside the VM afterwards?
Under "Datacenter -> Storage -> YourPool -> Edit" there is a "thin provisioning" checkbox, but I also guess that only works for newly created virtual disks. Doing a backup + restore should also result in a "new" VM. That's how I change the volblocksize of already created virtual disks, which can also only be set at creation.
PS: During VM creation I always check Discard and use VirtIO SCSI.
By the way, I've noticed that now, when you choose VirtIO SCSI as the SCSI controller during VM creation, it defaults to a drive format of qcow2 instead of a raw disk image. So it is thin-provisioned, right?
Then it sounds like you are storing your VMs on file level and not on block level, because qcow2 can only be used on file level (block level, where only RAW is possible, should be faster with less overhead). So I guess you use a "Directory" on top of a ZFS pool instead of directly using the ZFS pool as a "ZFS" storage.
Does that multiple need to be a power of 2 because of the binary system?
The volblocksize always has to be a power of two (2^X). So something like a volblocksize of 12K won't work.
You are explaining an example with 3 drives, but 3 doesn't appear anywhere in the equation. You always use a constant of 4 (4x 512B for ashift=9 gives 2K, or 4x 4K for ashift=12 gives 16K). Where is the 3 (of the 3 used disks) helpful in the calculation?
It's because of the padding overhead described here. Look at that spreadsheet.
padding.png
So with just 1 or 2 sectors (512B/1K volblocksize for ashift=9, or 4K/8K volblocksize for ashift=12) you would always lose 50% of your total capacity when using zvols, so it wouldn't make any sense to use a raidz when you lose the same capacity as with a striped mirror, which would give better IOPS. Only at 4 or more sectors (so 2K or more for ashift=9, or 16K or more for ashift=12) do you only lose 33% of the capacity.
That's because of the padding overhead if you choose a too-small volblocksize. You always lose 33% to parity data (1 disk of parity and 2 disks of data in a 3-disk raidz1), but if you choose a too-small volblocksize you lose an additional 17% to padding overhead.
So basically: the more disks your raidz1/2/3 consists of, the bigger your volblocksize has to be, no matter what your workload actually needs. That's why I told you a raidz1 with ashift=12 will never be a good choice for a Postgres DB (writing 8K blocks), and why only a 3-disk raidz1 (with ashift=12) will be useful for a MySQL DB: for a raidz1 of 4+ disks the volblocksize would need to be at least 32K, and then MySQL would write 16K blocks to a 32K volblocksize, which is bad again, because you lose half of your IOPS/throughput and double the SSD wear.
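The role the disk count plays can be sketched with a rough model of the spreadsheet's rule: a raidz1 needs one parity sector per ndisks-1 data sectors, and each allocation is padded up to a multiple of parity+1 sectors (raidz1_cost is a hypothetical helper, not a real tool):

```shell
# Sectors consumed per 100 sectors of data on a raidz1, rough model.
raidz1_cost() {
    volblocksize=$1; ashift=$2; ndisks=$3
    sector=$((1 << ashift))
    data=$((volblocksize / sector))                  # data sectors per block
    parity=$(( (data + ndisks - 2) / (ndisks - 1) )) # ceil(data / (ndisks-1))
    total=$((data + parity))
    padded=$(( (total + 1) / 2 * 2 ))                # pad to multiple of 2
    echo $((padded * 100 / data))
}
raidz1_cost 4096 12 3    # 4K volblocksize:  prints 200 (50% lost)
raidz1_cost 16384 12 3   # 16K volblocksize: prints 150 (33% lost)
```

This reproduces the 50% vs. 33% numbers above, and shows where the disk count enters: more disks means more data sectors per parity sector are needed before the padding stops hurting.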
 
So I guess you use a "Directory" ontop of a ZFS pool instead of directly using the ZFS pool as a "ZFS" storage.
I don't think so.
Then it sounds like you are storing your VMs on file level and not on block level
Editing the conf file of the VM and checking its disk gives me
scsi0: HHproxVM:vm-101-disk-0,cache=writeback,discard=on,size=600G
instead of
scsi0: HHproxVM:vm-101-disk-0.qcow2,cache=writeback,discard=on,size=600G

Also, trying to create a VM using the VM storage pool defaults me to a raw image, so all my VMs are raw-based images. Weird, though, that it lets me snapshot them. Isn't that a qcow2-only feature? Somewhere I've read of a way of using raw images with a qcow2 overlay, but it had to be done manually. Probably Proxmox is using this method by default.
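One way to settle the block-level vs. file-level question is the storage config (a sketch; "HHproxVM" is taken from the config line above):

```shell
# "zfspool" entries are block level (raw zvols, ZFS snapshots);
# "dir" entries are file level (where qcow2 is possible):
grep -A 4 'HHproxVM' /etc/pve/storage.cfg
```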

Under "Datacenter -> Storage -> YourPool -> Edit" there is a "thin provisioning" checkbox
I know, that is why I mentioned it is checked. Do you have an opinion on whether the refreservation method works (I mentioned it above), and whether a disk shrink inside the VM is needed afterwards?
 
I don't think so.

Editing the conf file of the VM and checking its disk gives me
scsi0: HHproxVM:vm-101-disk-0,cache=writeback,discard=on,size=600G
instead of
scsi0: HHproxVM:vm-101-disk-0.qcow2,cache=writeback,discard=on,size=600G

Also, trying to create a VM using the VM storage pool defaults me to a raw image, so all my VMs are raw-based images. Weird, though, that it lets me snapshot them. Isn't that a qcow2-only feature? Somewhere I've read of a way of using raw images with a qcow2 overlay, but that had to be done manually. Probably Proxmox is using it by default.
If you use ZFS + RAW and take a snapshot, PVE will use ZFS's native snapshot functionality.
If you use qcow2 on top of a ZFS dataset, PVE should use qcow2's snapshot functionality.
So you can snapshot with both, but they behave differently. With ZFS snapshots you can, for example, only roll back but never redo, because while rolling back, ZFS will destroy everything that was done after the snapshot you rolled back to. So rolling back is a one-way road. With qcow2 snapshots you can freely jump back and forth between snapshots.
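The one-way nature of ZFS rollback can be sketched like this (example zvol name):

```shell
zfs snapshot rpool/data/vm-101-disk-0@before-change
# ...change things, maybe take newer snapshots...
# Roll back; -r destroys any snapshots taken after @before-change,
# so there is no going forward again afterwards:
zfs rollback -r rpool/data/vm-101-disk-0@before-change
```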
I know, that is why I mentioned it is checked. Do you have an opinion on whether the refreservation method works (I mentioned it above), and whether a disk shrink inside the VM is needed afterwards?
Not sure about the refreservation; I didn't test that myself. But refreservation will only affect the space that zvol consumes on the pool; it won't shrink the actual size of the zvol.
 
