Looking for advice on setting up a new PVE instance correctly

zecas

Member
Dec 11, 2019
Hi,

I'm in the process of finishing a server build based on proxmox.

At this time I have everything almost ready; I'm just doing some disk tests before moving forward with the installation.

My hardware config will be like this:

2x Samsung SSD 860 EVO 250GB (SATA3) - will be set up by the proxmox installation for OS boot, both configured as ZFS RAID1

6x HGST HUC109060CSS600, 600GB 10K SAS 2.5" HDD, for VM data, configured as 1 zpool with 2 vdevs:
- vdev-1 (mirror-0): 3x HGST HUC109060CSS600, 600GB
- vdev-2 (mirror-1): 3x HGST HUC109060CSS600, 600GB
(for a total usable pool size of 1.2TB)


Now some questions came to my mind, and I would very much appreciate it if someone could clarify them for me:

1- The PVE boot pool will be created by the proxmox installation. Will it use "/dev/sdX" disk identification or "/dev/disk/by-id/..." references?

2- For the ZFS zpool for the VMs, I was planning on creating it from the command line, to be sure of using disk by-id referencing and grouping the disks exactly as I want. Will I be able to see the ZFS pool immediately in the proxmox interface? Or do I need to do something to be able to see it?

3- Now the trickier question for me, the ashift selection:

What should be the right setting for each ZFS pool (boot and VMs)?

For instance, running "fdisk -l /dev/sdX" gave the following results for the HGST and Samsung SSD disks, respectively:

Code:
# fdisk -l /dev/sda
Disk /dev/sda: 558.8 GiB, 600000000000 bytes, 1171875000 sectors
Disk model: HUC10906 CLAR600
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

# fdisk -l /dev/sdg
Disk /dev/sdg: 232.9 GiB, 250059350016 bytes, 488397168 sectors
Disk model: Samsung SSD 860
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


So they all seem to report 512 bytes, even the newest SSD drives. Should I just assume 4K and use ashift=12? From what I've read around, it appears to be the safe choice.


Thank you very much.

zecas
 
1- The PVE boot pool will be created by the proxmox installation. Will it use "/dev/sdX" disk identification or "/dev/disk/by-id/..." references?
Here it was "/dev/sdX" and not "/dev/disk/by-id/...".
2- For the ZFS zpool for the VMs, I was planning on creating it from the command line, to be sure of using disk by-id referencing and grouping the disks exactly as I want. Will I be able to see the ZFS pool immediately in the proxmox interface? Or do I need to do something to be able to see it?
You can create the pool using the CLI, but proxmox will not pick it up by itself. You need to tell proxmox that there is a new pool, using the CLI or the GUI: Datacenter -> Storage -> Add -> ZFS
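In case it helps, a minimal sketch of how that could look from the CLI; the pool name ("tank"), the by-id device names and the storage ID ("vmdata") are placeholders for your own values:

Code:
# create the pool with two 3-way mirror vdevs, referencing the disks by-id
# (add "-o ashift=..." according to the discussion under question 3)
zpool create tank \
  mirror /dev/disk/by-id/scsi-DISK1 /dev/disk/by-id/scsi-DISK2 /dev/disk/by-id/scsi-DISK3 \
  mirror /dev/disk/by-id/scsi-DISK4 /dev/disk/by-id/scsi-DISK5 /dev/disk/by-id/scsi-DISK6

# make the pool known to proxmox from the CLI
# (equivalent to Datacenter -> Storage -> Add -> ZFS in the GUI)
pvesm add zfspool vmdata -pool tank -content images,rootdir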
3- Now the trickier question for me, the ashift selection:

What should be the right setting for each ZFS pool (boot and VMs)?

For instance, running "fdisk -l /dev/sdX" gave the following results for the HGST and Samsung SSD disks, respectively:

[fdisk output posted as an image attachment; the text version is in the first post above]

So they all seem to report 512 bytes, even the newest SSD drives. Should I just assume 4K and use ashift=12? From what I've read around, it appears to be the safe choice.
512B would be ashift=9; ashift=12 is 4K. SSDs are always reporting fake block sizes; internally they operate with much larger block sizes. I would use 4K or 8K as the blocksize for the SSDs.
For the HDDs it depends...
They are 512B, so an ashift of 9 would be best, but keep in mind that you can't replace a failed 512B LBA HDD later with a 4K LBA one if you choose ashift of 9.
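If you want to cross-check what the drives report before deciding, something along these lines works on the PVE host (the device name is just an example, and smartmontools has to be installed):

Code:
# logical vs. physical sector size as seen by the kernel
lsblk -o NAME,MODEL,LOG-SEC,PHY-SEC
# the same information as reported by the drive itself
smartctl -i /dev/sda | grep -i 'sector size'

Keep in mind that, as said above, many SSDs report 512B here regardless of their real internal page/erase sizes, so the ashift decision can't rely on this output alone.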
 
Personally I would have gone with 3 x 2-way mirrored pairs rather than 2 x 3-way, as it means you can expand the pool in the future by adding pairs of drives rather than triples, and it gives you more capacity. You can still tolerate up to 3 drive failures (provided they are all from different vdevs), and I think performance will be slightly better (no parity writes).
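For illustration, a rough sketch of that layout and how it grows; the pool name and the by-id device names are placeholders:

Code:
# 3 x 2-way mirrors: roughly 1.8TB usable from six 600GB disks
zpool create tank \
  mirror /dev/disk/by-id/scsi-DISK1 /dev/disk/by-id/scsi-DISK2 \
  mirror /dev/disk/by-id/scsi-DISK3 /dev/disk/by-id/scsi-DISK4 \
  mirror /dev/disk/by-id/scsi-DISK5 /dev/disk/by-id/scsi-DISK6

# later expansion only needs one more pair at a time
zpool add tank mirror /dev/disk/by-id/scsi-DISK7 /dev/disk/by-id/scsi-DISK8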
 
First of all, thank you for your answers, they really help me understand proxmox and zfs better.


Here it was "/dev/sdX" and not "/dev/disk/by-id/...".

You can create the pool using the CLI, but proxmox will not pick it up by itself. You need to tell proxmox that there is a new pool, using the CLI or the GUI: Datacenter -> Storage -> Add -> ZFS

512B would be ashift=9; ashift=12 is 4K. SSDs are always reporting fake block sizes; internally they operate with much larger block sizes. I would use 4K or 8K as the blocksize for the SSDs.
For the HDDs it depends...
They are 512B, so an ashift of 9 would be best, but keep in mind that you can't replace a failed 512B LBA HDD later with a 4K LBA one if you choose ashift of 9.

I will try to set the SSDs for Proxmox VE to ashift=12, even though they report 512B, since I do believe they work with 4096B internally. I believe I have that option in the setup wizard, under the advanced options.

For me it's strange that an SSD working internally with 4096B, but being reported as 512B (the value the OS knows of), causes no issues with ashift=12.

For the HDD disks, I'm only afraid that if I later replace disks I'll end up putting in 4096B drives and not taking advantage of them, since ashift cannot be changed ...

Should I expect that using ashift=12 on these disks instead of ashift=9 would cause performance issues?



Personally I would have gone with 3 x 2-way mirrored pairs rather than 2 x 3-way, as it means you can expand the pool in the future by adding pairs of drives rather than triples, and it gives you more capacity. You can still tolerate up to 3 drive failures (provided they are all from different vdevs), and I think performance will be slightly better (no parity writes).

Initially I was thinking of going for 2 x 2-way with brand-new Samsung SSD 860 Pro disks, but the cost was a bit high for the budget at the time.

I got these used HGST HDDs that already have some hours on them, and my thought was that, since I'm going with used disks (although I've tested them extensively), I would prefer 2 x 3-way, so the pool can handle 2 disk failures in the same vdev.

Because yes, you are right: now that I think of it, if I want to expand, I have to add a vdev at a time, and that would be 3 disks.

Is there still parity on a pool of mirrors? I thought parity only existed on RAID-Zx, due to the way it keeps data redundancy, instead of a pool of mirrors that just balances/splits data across vdevs and replicates it on their disks.

This is me learning every day about proxmox and zfs ...


Again, thank you for your help.
zecas
 
I will try to set the SSDs for Proxmox VE to ashift=12, even though they report 512B, since I do believe they work with 4096B internally. I believe I have that option in the setup wizard, under the advanced options.
For me it's strange that an SSD working internally with 4096B, but being reported as 512B (the value the OS knows of), causes no issues with ashift=12.
It will be much higher internally. To write something, an SSD needs to erase a complete row of cells and write it again, even for a minimal change. So SSDs can read single cells but can only modify big rows of cells, which means a row of some hundreds of KB or even some MB may need to be rewritten to change a single block. That's not a big problem with async writes, because the SSD can cache the write operations in RAM and only flush them once enough data has been collected to change a complete row of cells. But sync writes are a problem: if your SSD doesn't have power-loss protection (consumer SSDs like your 860 Pro don't have it), this RAM caching can't be used and your write amplification will be really high. Let's say you want to write 1000x 4KB as sync writes and your SSD needs to write 1000x 256KB to store that, if 256KB were the internal blocksize of a row of cells. In that case your SSD will wear out 64 times faster. Databases usually use small sync writes like 8K, and that can kill a consumer SSD within months even if nearly no data is written by the guest OS.
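If you want to get a feel for this on your own hardware, a rough sketch (assuming fio and smartmontools are installed; the test file path is just an example, and the SMART attribute name and units depend on the vendor):

Code:
# total host writes before the test (on Samsung SATA SSDs this is usually
# attribute 241 Total_LBAs_Written, counted in 512B units)
smartctl -A /dev/sdg | grep -i total_lbas_written

# write 1GB as 4K sync writes (every write followed by an fsync)
fio --name=sync4k --filename=/rpool/fio.test --ioengine=psync \
    --rw=write --bs=4k --size=1G --fsync=1
rm /rpool/fio.test

# check the attribute again; the delta compared to the 1GB actually written
# gives a feel for the write amplification of small sync writes
smartctl -A /dev/sdg | grep -i total_lbas_written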
For the HDD disks, I'm only afraid that if I later replace disks I'll end up putting in 4096B drives and not taking advantage of them, since ashift cannot be changed ...

Should I expect that using ashift=12 on these disks instead of ashift=9 would cause performance issues?
With a block size of 512B you will lose less capacity due to padding overhead. With virtualization there is a lot of padding overhead because of the mixed block sizes. For example, look at my setup:
My SSDs state that they use a physical blocksize of 512B (which they don't). I used ashift=12, so a 4K blocksize. My datasets on the ZFS pool use a recordsize of 128K. My zvols on the ZFS pool use a volblocksize of 32K. The guests' virtual HDDs show a 512B blocksize (that's how the virtio SCSI controller handles this). The virtual HDDs are formatted with an ext4 filesystem, which uses a blocksize of 4K. So it looks like this:
Unknown real blocksize of the SSD used as 4K <-- 32K zvol (virtual HDD) <-- 512B virtio SCSI <-- 4K ext4

Every time a blocksize on the right (higher abstraction) is smaller than a blocksize on the left (lower abstraction), you get overhead.
Let's say the virtio SCSI controller wants to write a single block (512B of data) to the zvol. The zvol has a 32K blocksize, so it can only do operations on 32K blocks. To store the 512B data block it needs to read 32K into RAM, change 512B of that 32K and write the 32K of data again. So 32K is read and written just to write 512B. That's a problem, and your IOs and the amount of data read/written will increase. That's your padding overhead.

Ext4, at the highest abstraction layer, can't use block sizes larger than 4K unless you are using huge pages. So everything at a lower abstraction level should use blocksizes of 4K or less to minimize overhead. Virtio SCSI could use 4K blocksizes too, but proxmox doesn't offer an option to change this, so it will always be 512B. But what you can do is lower the volblocksize of the zvols and the recordsize of the datasets as much as possible. Let's say my volblocksize were 4K and not 32K. In that case only 4K needs to be read and written to store 512B of data, not 32K. So that is better. But because of how ZFS works (checksums, compression, parity if using raidz, ...) the blocksize of the physical drives needs to be smaller than your volblocksize, or you get padding overhead again and lose up to 33% of your drives' capacity. So if you want to use 4K as the volblocksize for your zvols, you need an ashift of 9 so 512B is used there.

In short: if you use ashift=9 your volblocksize can be lower, and if that is lower you get less overhead, and because of that fewer read/write operations and less data read/written.

But if you use ashift=9 you are limited to 512B LBA HDDs and can't add/replace 4K LBA HDDs later. Your HDDs will work fine with ashift=12, and you could mix 512B and 4K LBA HDDs later, but in that case you need to increase the volblocksize and get more overhead. By the way, the volblocksize can't easily be changed later. You would need to destroy and recreate every virtual HDD to change it, because it needs to be set at the time of creation.
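If you want to check or tune those values on the PVE host, a few examples; the pool, dataset and storage names are placeholders, and defaults differ between PVE versions:

Code:
# volblocksize of existing zvols and recordsize of a dataset
zfs get -t volume volblocksize
zfs get recordsize tank

# recordsize can be changed at any time (it only affects newly written records)
zfs set recordsize=128K tank/iso

# the volblocksize used for newly created VM disks comes from the ZFS
# storage's "Block Size" setting; existing zvols keep their creation value
pvesm set vmdata --blocksize 4k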
 

Wow, very detailed info here. It took some time to digest it and search around for more info to try to understand it as best I can. If in the future I move on to 4K LBA HDDs, maybe I'll choose to create a new zpool with the proper ashift setting and move the data over there. But for now I think I'm better off going with ashift=9 as you recommend, to get the best out of these disks.

I saw some info around where someone partitioned a disk, setting the partition to start at 1MB (to ensure a correctly aligned start), and just used the partitions when constructing the zpool. I was thinking of using the entire disks ("by-id"), but this made me think: how does ZFS deal with the alignment of the start? Or, since the disk is fully managed by ZFS when using by-id, does it start right at the very beginning, so this question does not apply?
(link for reference: https://forums.freebsd.org/threads/single-disk-with-zfs-best-practices.70817/)



All this made me think about what I'm attempting to do and whether I'm going the correct route to achieve it:

At the moment I have a temporary proxmox machine with some Windows VMs. It has nothing special, no ZFS, no ECC, nothing. Just a simple machine with a single disk, where I installed proxmox, created the VMs and installed Windows.

On this new machine, it will be quite different ... I will create a ZFS zpool, and I was thinking about adding datasets like "pooldata/iso" and "pooldata/vm", setting them to hold "ISO image" and "Disk image, Container" content, respectively (I don't know if this is a best practice).

As far as I know, a zvol is a block device, much like "/dev/sda". Will it be created automatically when I add a hard disk to the VM? For example, I have several disks similar to "/dev/pve/vm-100-disk-0" (in proxmox, under local-lvm they are shown as raw format), one for each hard disk of every existing VM, but they cannot be zvols since I don't have ZFS defined for storage.

I'm also planning on making a backup of each VM on the current machine, copying it to the new server (after all is set and ready), and then restoring it there. Will I have to worry about the VM hard disks regarding any block alignment setting or definition? Or will it be transparent, as they are virtual?

VM hardware definition example:
Code:
SCSI Controller   | VirtIO SCSI
Hard Disk (scsi0) | local-lvm:vm-100-disk-0,size=40G


Basic questions, I know, just trying to jump into a moving wagon ...


Thank you.
zecas
 
I saw some info around where someone partitioned a disk, setting the partition to start at 1MB (to ensure a correctly aligned start), and just used the partitions when constructing the zpool. I was thinking of using the entire disks ("by-id"), but this made me think: how does ZFS deal with the alignment of the start? Or, since the disk is fully managed by ZFS when using by-id, does it start right at the very beginning, so this question does not apply?
(link for reference: https://forums.freebsd.org/threads/single-disk-with-zfs-best-practices.70817/)
If you just use entire disks, ZFS will partition them for you. sdX1 and sdX9 will be created, and sdX1 will be aligned to 1M.
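You can verify that on a pool member with fdisk (the device name is just an example); the data partition sdX1 should start at sector 2048, i.e. 1 MiB, followed by the small 8M reserved sdX9 partition:

Code:
# show the partitions ZFS created when given the whole disk
fdisk -l /dev/sda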
On this new machine, it will be quite different ... I will create a ZFS zpool, and I was thinking about adding datasets like "pooldata/iso" and "pooldata/vm", setting them to hold "ISO image" and "Disk image, Container" content, respectively (I don't know if this is a best practice).
Yes, that's useful. That way every type of content can have different ZFS options.
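A minimal sketch of that layout, using the dataset names from your example (the storage IDs are placeholders; ISOs are added as a directory storage on the dataset's mountpoint, VM disks as a ZFS storage):

Code:
zfs create pooldata/iso
zfs create pooldata/vm

# ISO images live on a plain dataset, exposed to proxmox as a directory storage
pvesm add dir iso-store -path /pooldata/iso -content iso

# VM and container disks go to a ZFS storage, so they get created as zvols/subvols
pvesm add zfspool vm-store -pool pooldata/vm -content images,rootdir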
As far as I know, a zvol is a block device, much like "/dev/sda". Will it be created automatically when I add a hard disk to the VM?
Yes, it will.
For example, I have several disks similar to "/dev/pve/vm-100-disk-0" (in proxmox, under local-lvm they are shown as raw format), one for each hard disk of every existing VM, but they cannot be zvols since I don't have ZFS defined for storage.
ZFS also uses the raw format if you use a zvol.
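Once the ZFS storage is in use you can see them both ways (pool and dataset names as in the example above):

Code:
# zvols created for the VM disks
zfs list -t volume
# and the corresponding block devices
ls -l /dev/zvol/pooldata/vm/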
I'm also planning on making a backup of each VM on the current machine, copying it to the new server (after all is set and ready), and then restoring it there. Will I have to worry about the VM hard disks regarding any block alignment setting or definition? Or will it be transparent, as they are virtual?
I'm not sure how the import will handle the blocksizes. Maybe the team can answer this.
 
In short: if you use ashift=9 your volblocksize can be lower, and if that is lower you get less overhead, and because of that fewer read/write operations and less data read/written.
Do you happen to know the formula that correlates the ashift number with the block size? Or is there a constant that dictates using 9 for 512-byte and 12 for 4K drives (which, by the way, are the most familiar numbers for HDDs; NetApp drives probably have a weird value as well)?

Edit: I probably just stumbled on it. ashift is the power of 2 that gives the block size, so the outcome equals 512B or 4K accordingly: 2^9 = 512 and 2^12 = 4096.
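That's it: blocksize = 2^ashift. If you want to confirm what an existing pool actually uses, something like this should work (the pool name is an example; depending on the OpenZFS version, zpool get ashift may report 0, which means auto-detected):

Code:
# 2^9 = 512, 2^12 = 4096, 2^13 = 8192
echo $((2**9)) $((2**12)) $((2**13))

# ashift in use by a pool
zpool get ashift rpool
# or read it from the pool configuration
zdb -C rpool | grep ashift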

Would there be a problem if a set of VMs configured on 512-byte/ashift=9 HDDs were restored (after, of course, those VMs had been backed up on the 512B drives) to a new set of 4K/ashift=12 drives? Does the backup hold that kind of info because it is needed afterwards, or during restoration is the data simply placed on the new drives according to their configuration, and that's it?

PS: I happen to have 4x 2.5" HGST SAS 10K RPM drives that have 512B (logical/physical) sectors. The intention is to create a ZFS RAID10 on them. Should I use ashift=9? Those VMs are going to have extra drives inside the VMs to save data, and those drives are already configured with ashift=12 since they are 4K drives. Is there going to be a problem with a configuration like that?
 
