Proxmox install - ZFS on NVMe, does RAID1 make sense?

Some notes:
1) There is good documentation on how to replace a failed disk. See chapter "Replacing a failed bootable device": https://pve.proxmox.com/wiki/ZFS_on_Linux#_zfs_administration
But yes, it would be really nice if that could be part of the webUI. I don't see why this couldn't be easily automated, which would heavily reduce human error by not making people type everything in or get confused about which partitions to choose for the commands. I could even script a UI for that myself. All that's needed is some UI for selecting the pool name and its failed vdev, the healthy disk and the new disk, plus a warning that data on the new disk will be wiped. I bet the pool could even be pre-selected by looking at "zpool status", a healthy disk by looking at the partition layout, and a new disk by looking at the size and looking for unpartitioned or factory-prepartitioned NTFS/exFAT formatted disks, so in most cases replacing a disk would be a single click (a rough sketch of that detection is shown after this point).
I really don't get why such an important feature is still missing. Especially in cases like this, where a few typos or a lack of understanding can so easily cause massive data loss, everything possible should be done to reduce human error.
The only problem I see would be people like me not sticking to the defaults and doing manual partitioning, e.g. encrypting the rpool, adding encrypted mdadm swap partitions and so on. But these people probably know what they are doing and are fine replacing the disk manually via the CLI like it is done now. You could even exclude disks from the UI that don't match the default partitioning scheme, by looking for partitions 1-3 or 1+9 and checking whether any other partitions exist.
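Just to illustrate the idea, a rough sketch of what that pre-selection could look like on the CLI (standard ZFS and gdisk/util-linux tools; the device name /dev/nvme1n1 is only a placeholder):

# show only pools that are degraded or have errors, to pre-select the pool and failed vdev
zpool status -x

# compare sizes and partition layouts to find a healthy reference disk and blank or NTFS/exFAT candidates
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT

# a disk installed by the PVE installer shows the default 3-partition layout (BIOS boot, ESP, ZFS)
sgdisk -p /dev/nvme1n1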

2) While only the 3rd ZFS partition is actually mirrored, I don't see why this should be a problem. The only other things on those disks are the first partition, which is there for legacy reasons and not used at all, and the second partition, the ESP with the bootloader. But that one doesn't need to be mirrored because it is automatically kept in sync with the other disk via proxmox-boot-tool. So there is always a working and recent copy of the bootloader on the other disk in case either disk fails (see the commands sketched at the end of this point).
I've had a lot of failed mirrored system disks and once you understand how to replace a disk there is really no problem at all. Once a disk fails, everything will continue running as usual and you can hot-swap the failed disk and fix the mirror without any downtime. And even if the whole server crashes or you reboot it with a failed disk, as long as you set up both disks as the first and second boot options in the BIOS it will boot as normal from the remaining disk.
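If you want to verify the ESP syncing yourself, proxmox-boot-tool can show and refresh the registered ESPs (a minimal example; the output depends on which ESPs are registered on your system):

# list the ESPs proxmox-boot-tool keeps in sync and which bootloader they use
proxmox-boot-tool status

# copy the current kernels and bootloader config to all registered ESPs again
proxmox-boot-tool refresh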

3) Those FireCudas aren't enterprise/datacenter grade SSDs, they are only prosumer grade. They are missing the power-loss protection that is needed to not totally suck at sync writes. Once you do sync writes (like ZFS does when storing metadata), performance will be orders of magnitude worse (think factor 100) and write amplification (and therefore wear) will be much higher, because sync writes then can't be cached in DRAM (a quick way to see this for yourself is sketched below).
SSDs without built-in power-loss protection behave like HW RAID cards without a BBU+cache.
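A sketch of a small sync-write test with fio, if you want to measure the difference (the test file path, size and runtime are just example values; run it on a scratch dataset, not on production data):

# 4k random writes with an fdatasync after every write; compare the resulting IOPS
# between a consumer SSD and a datacenter SSD with PLP
fio --name=synctest --filename=/rpool/data/fio-test --size=1G \
    --bs=4k --rw=randwrite --ioengine=sync --fdatasync=1 \
    --runtime=60 --time_based --group_reporting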
 
Well, it is never said anywhere that ZFS RAID1 magically fixes all problems.

Sure, but one naturally assumes behavior similar to a traditional HW RAID.

But since this also seems to be true:
And even if the whole server crashes or you reboot it with a failed disk, as long as you set up both disks as the first and second boot options in the BIOS it will boot as normal from the remaining disk.
I think the "problem" is mitigated and I was worried for nothing :)
 
I think the "problem" is mitigated and I was worried for nothing
I disagree.


Scenario:

Let's say disk 1 fails.

Technician hot replaces it - it gets resilvered, everything is working and chugging along.

Then after a period (especially a long time), the other disk 2 fails.

Technician hot replaces it - it gets resilvered, everything is working and chugging along?

You think so?

Try rebooting now!

And chances are nobody will actually try and reboot for a very long time - so surprise, surprise when it happens. No telling what an onsite technician does when there's a no-boot situation.

Yes, it's easily fixable. But this must be incorporated somewhere.
 
Let's say disk 1 fails.

Technician hot replaces it - it gets resilvered, everything is working and chugging along.

Then after a period (especially a long time), the other disk 2 fails.

Technician hot replaces it - it gets resilvered, everything is working and chugging along?
If that technician does it the proper way, cloning the partition table and syncing the bootloader instead of just doing a simple "zpool replace", both disks will still contain a working bootloader and you could replace disks as often as you like.
 
If that technician does it the proper way, cloning the partition table and syncing the bootloader instead of just doing a simple "zpool replace", both disks will still contain a working bootloader.
It is possible he'll do what you're expecting - but it is also possible he won't.
Just read some of the posts on this forum - many from "senior" (I hate that word) technicians.
At least this OP tried out the scenario - to discover the wheel.
 
It is possible he'll do what you're expecting - but it is also possible he won't.
Just read some of the posts on this forum - many from "senior" (I hate that word) technicians.
At least this OP tried out the scenario - to discover the wheel.
Yes, I know how many people simply do a "zpool replace" because they think they know ZFS but forget about the other partitions and bootloaders. I stopped counting how many times I had to explain to people how to fix a wrong disk replacement in the rpool. That's exactly why this should be part of the webUI, so people won't do it the wrong way in the first place.

But it isn't that hard. Have a look at the partition layout. If it is only partitions 1+9, you do a simple "zpool replace" using the whole disk. If it is partitions 1+2+3, you have to clone the partition table and sync the bootloader before doing the "zpool replace" (and only do that with the 3rd partition, not the whole disk).
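For reference, the documented sequence from the wiki chapter linked above looks roughly like this for the default 1+2+3 layout (device names and the pool name rpool are placeholders for your actual setup):

# 1. clone the partition table from the healthy disk to the new disk, then randomize the GUIDs
sgdisk /dev/nvme0n1 -R /dev/nvme1n1
sgdisk -G /dev/nvme1n1

# 2. replace the failed vdev, using only the 3rd (ZFS) partition of the new disk;
#    the old vdev name is the one shown in "zpool status"
zpool replace -f rpool <old zfs partition> /dev/nvme1n1p3

# 3. format the new disk's ESP and register it, so the bootloader gets synced to it as well
proxmox-boot-tool format /dev/nvme1n1p2
proxmox-boot-tool init /dev/nvme1n1p2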
 
Another simple option would be to do it the way TrueNAS does.
Their TrueNAS Mini systems come with a 16GB SSD. TrueNAS itself is installed on that small SSD. The SSD is not really used for anything else, thus it should hopefully not fail. And even if it fails, it is not a huge problem, because you can create one-click config saves that you can reimport. OPNsense and pfSense do the same thing for their firewalls.

Maybe this could also be an option for Proxmox?
 
Another simple option would be to do it the way TrueNAS does.
Their TrueNAS Mini systems come with a 16GB SSD. TrueNAS itself is installed on that small SSD. The SSD is not really used for anything else, thus it should hopefully not fail. And even if it fails, it is not a huge problem, because you can create one-click config saves that you can reimport. OPNsense and pfSense do the same thing for their firewalls.

Maybe this could also be an option for Proxmox?
The problem is that PVE is not an appliance like TrueNAS or OPNsense. You are allowed to hack the heck out of the underlying Debian, install additional packages or edit any config file as you like, without the PVE software knowing anything about it or accounting for it. It's a full-fledged Linux distribution. With TrueNAS and OPNsense this is easy, as you aren't supposed to do anything the webUI isn't offering. And if you still change anything outside of the webUI, it might be wiped with the next update.
If I install Steam for gaming, an Apache webserver, some mail server, a Docker environment, an SMB server, a desktop environment or whatever on my PVE node, then PVE can't cover all of that.
But they are working on host backups to PBS... the question is how they will implement this.
 
IronWolf 125 ZA2000NM1A002 / 2TB
STA022 Pn. 3R1108-570
Why disk size question?
Because, AFAIK, SSDs with PLP come in sizes like 480GB, 960GB, 1.92TB, 3.84TB, 7.68TB.
IronWolf 125 isn't recommended for ZFS.
There is IronWolf 110 for ZFS.
 
Because, AFAIK, SSDs with PLP come in sizes like 480GB, 960GB, 1.92TB, 3.84TB, 7.68TB.
IronWolf 125 isn't recommended for ZFS.
There is IronWolf 110 for ZFS.
Damn, I needed the 125 Pro or the 110. On the other hand, the system is running on an APC Smart-UPS 750.
I found many explanations for PLP (https://www.makeuseof.com/what-is-ssd-power-loss-protection-how-does-it-work/) and I must say I had never heard of it, but I did notice the strange sizes in HPE SSDs (1.92TB, 3.84TB) and never came across PLP as the reason. Thanks for drawing attention to this point. It means even more that I have to test and document a recovery process for replacing a defective drive.
 
On the other hand, the system is running on an APC Smart-UPS 750.
I found many explanations for PLP
An APC UPS doesn't really help here.
PLP allows faster sync IOPS / fsync because the SSD cache is always used, as it's protected by PLP - like a flash-backed write cache (FBWC) / BBU on a HW RAID controller.
Without PLP, ZFS bypasses the internal DRAM SSD cache and waits until the data hits the slower NAND. Only write speed is impacted.

Edit: you can compare fsync performance with the pveperf command: https://pve.proxmox.com/pve-docs/pveperf.1.html
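For example (the argument is the mountpoint/path you want to test, here simply the root filesystem; the line to compare between SSDs is FSYNCS/SECOND):

# run the benchmark against the filesystem you want to check
pveperf /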
 
An APC UPS doesn't really help here.
PLP allows faster sync IOPS / fsync because the SSD cache is always used, as it's protected by PLP - like a flash-backed write cache (FBWC) / BBU on a HW RAID controller.
Without PLP, ZFS bypasses the internal DRAM SSD cache and waits until the data hits the slower NAND. Only write speed is impacted.

Edit: you can compare fsync performance with the pveperf command: https://pve.proxmox.com/pve-docs/pveperf.1.html
Thanks for the explanation, I wish I had known sooner. I guess I have to work with what I have. On the other hand, Proxmox is now running on 16GB with spinning disks and I'm impressed with the performance of the W11 installation.
 
Most modern NVMe disks require UEFI to boot. I'm doing tests on a new Supermicro motherboard with an i9 14th gen CPU, and only UEFI is supported.
There could be an option to install Proxmox on the 1st disk and do a ZFS mirror on disks 2 and 3, but then the 1st disk is still a weak point.
Are you speaking of an M.2 slot or a U.2 drive (NVMe boot)?

I'm using SAS HDD/SSD.
 
Ummm...

An APC UPS doesn't really help here.
PLP allows faster sync IOPS / fsync because the SSD cache is always used, as it's protected by PLP - like a flash-backed write cache (FBWC) / BBU on a HW RAID controller.

PLP is greatly limited. It can only handle a few tens of ms of data transfer, which should be enough to stabilize what was already sent to the device. In fact, modern NVRAM SSDs claim to provide equivalent protection, and the OpenZFS team believes it (see link below). Beyond on-device cache protection, without a UPS a system loses majorly in a power outage. A UPS protects the entire system, allowing for proper flushing and shutdown without any issues.

Without PLP, ZFS bypasses the internal DRAM SSD cache and waits until the data hits the slower NAND. Only write speed is impacted.

Do you have any documentation of this? I find no documentation that ZFS knows anything about hardware at all.
Get a UPS and protect everything that needs it. PLP can be a nice bit of extra peace of mind, and enterprises love peace of mind... but it is in no way a panacea.

And BTW, an "enterprise" SSD is not a panacea either. It does have higher speed than a consumer drive and possibly things like PLP. That comes at a cost: worse bitrot parameters, both for write-bitrot and read-bitrot. Check the data retention specs on your favorite enterprise SSD. Yes, it's a power-off spec, and typically only 3 months vs 1 year for consumer drives. Nobody wants to talk about power-on bitrot, which is a real thing. The data retention spec is as close as I can find to a published spec on bit rot.

I don't have the URL at hand for the IBM researcher who first discovered and proved the reality of read-bitrot, but it is quite real. Best practice: do a full disk rewrite 3-4x a year (we use DiskFresh for that on Windows). Partition tables and the like tend to never be rewritten otherwise. From experience, it is quite painful when such things degrade. ;)
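On the ZFS side, the closest built-in mechanism is a scrub, which reads and checksums every allocated block and repairs anything that fails verification from the other mirror half (unlike a DiskFresh-style rewrite it does not touch healthy data):

# start a scrub of the root pool and check its progress
zpool scrub rpool
zpool status rpool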

Here's a link to the OpenZFS information on hardware caching, PLP, etc:
https://openzfs.github.io/openzfs-docs/Performance and Tuning/Hardware.html#power-failure-protection
 