Proxmox install - ZFS on NVMe, RAID1: does it make sense?

RafalO

Mar 18, 2024
Hello everyone,

I've been testing ZFS on NVMe drives.

I installed Proxmox on two NVMe drives with these serial numbers:
7VQ09JY9
7VQ09H02

Filesystem: ZFS RAID1 (mirroring)


Reboot into the installed Proxmox and check:

zpool status -L
pool: rpool
state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme1n1p3  ONLINE       0     0     0
            nvme0n1p3  ONLINE       0     0     0

errors: No known data errors

To show which device corresponds to which S/N:
ls -l /dev/disk/by-id/ | grep nvme
- truncated----
nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09H02 -> ../../nvme0n1
nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09JY9 -> ../../nvme1n1
---------------
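
(If nvme-cli is installed, this gives the same serial-to-device mapping in one table - just an alternative way to double-check:)

nvme list (prints the node, serial number and model for every NVMe namespace)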

Now let's simulate a failure of one drive. Let's say
(7VQ09H02 -> ../../nvme0n1) is dead now!
Not a problem, I got a new one:
7VQ09HSX

Power off, replace the disk, boot Proxmox.

Check again:
root@smvm:~# zpool status -L
pool: rpool
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
config:

        NAME                     STATE     READ WRITE CKSUM
        rpool                    DEGRADED     0     0     0
          mirror-0               DEGRADED     0     0     0
            nvme0n1p3            ONLINE       0     0     0
            4351272091168832311  UNAVAIL      0     0     0  was /dev/disk/by-id/nvme-eui.6479a7823f0042a1-part3

errors: No known data errors
root@smvm:~#

The new disk (serial 7VQ09HSX) shows up as:
nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09HSX -> ../../nvme1n1
nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09HSX_1 -> ../../nvme1n1
nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09HSX_1-part1 -> ../../nvme1n1p1
nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09HSX-part1 -> ../../nvme1n1p1

We can replace the UNAVAIL device with:

zpool replace -f rpool 4351272091168832311 /dev/disk/by-id/nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09HSX

and we get:
root@smvm:~# zpool status
pool: rpool
state: ONLINE
scan: resilvered 1.33G in 00:00:02 with 0 errors on Mon Mar 18 14:49:30 2024
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 ONLINE       0     0     0
          mirror-0                                            ONLINE       0     0     0
            nvme-eui.6479a7823f004368-part3                   ONLINE       0     0     0
            nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09HSX  ONLINE       0     0     0

errors: No known data errors
root@smvm:~#

Seems fine, but here is the problem:
By default the pool was using partition 3 of the old disk (7VQ09H02). We replaced it with nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09HSX,
which points to the whole physical disk, because the new disk had no matching partitions.
The new disk in the pool, nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09HSX, is now ONLINE, but the system doesn't boot from it.


How is this supposed to work?
Do we first need to dd if=oldremainingworkingdisk of=newblankdisk?
And then attach partition 3 of the new disk as the replacement, so the data is resilvered and the new disk becomes bootable?

What is the point of installing Proxmox on a ZFS RAID1 if it doesn't provide full mirroring out of the box
and causes extra problems for everyone who has never faced a drive failure?
I believe many users will be surprised when an NVMe disk dies, even with RAID1.

Can someone explain?
 
Not a problem, I got a new one:
7VQ09JY9
Just to be accurate, I believe you meant "another one" here. (The other one is still installed.)


Now on to your main problem:

Firstly, from the following it looks like you already have partitions on the "new" 7VQ09HSX. It looks like it has been used in the past.

The new disk (serial 7VQ09HSX) shows up as:
nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09HSX -> ../../nvme1n1
nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09HSX_1 -> ../../nvme1n1
nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09HSX_1-part1 -> ../../nvme1n1p1
nvme-Seagate_FireCuda_530_ZP1000GM30023_7VQ09HSX-part1 -> ../../nvme1n1p1

Secondly, and this is going to hurt: I believe that with the ZFS RAID1 install on Proxmox, you are going to need to create the boot/EFI partitions on the replacement drive yourself, otherwise you will have to keep booting from the other "good" drive in the future.

Anyway, this was the situation when I last checked (a long time ago), and for this reason I don't do a ZFS RAID1 install. Check this - there's more out there.

In my opinion this makes this type of install pretty much "useless", apart from the automatic setup of a ZFS pool.
I agree this should be pointed out to users, because in my humble opinion it's not AIOTL (as it is on the label!).
Sorry to disappoint; somebody please correct me if I'm wrong.

EDIT: I forgot to include the official docs on the subject.
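
Roughly, the replacement procedure from those docs goes like this - a sketch only, the device names follow this thread's example and anything in <> is a placeholder, so check against your own layout:

sgdisk /dev/nvme0n1 -R /dev/nvme1n1 (copy the partition table from the healthy disk to the replacement)
sgdisk -G /dev/nvme1n1 (randomize the GUIDs on the replacement disk)
zpool replace -f rpool <old-guid-or-id> /dev/disk/by-id/<new-disk>-part3 (resilver onto partition 3, not the whole disk)
proxmox-boot-tool format /dev/nvme1n1p2 (create a fresh ESP on partition 2)
proxmox-boot-tool init /dev/nvme1n1p2 (register it so the bootloader and kernels get copied over)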
 
Just to be accurate, I believe you meant "another one" here. (The other one is still installed.)
Right - 7VQ09JY9 there was a paste mistake; the new drive's serial is 7VQ09HSX.

Firstly, from the following it looks like you already have partitions on the "new" 7VQ09HSX. It looks like it has been used in the past.
Yes, the new disk 7VQ09HSX had a Windows partition on it, but that makes no difference.

Secondly, and this is going to hurt: I believe that with the ZFS RAID1 install on Proxmox, you are going to need to create the boot/EFI partitions on the replacement drive yourself, otherwise you will have to keep booting from the other "good" drive in the future.



So yes, that's not a good idea for a Proxmox install on software RAID1.
I would say it's a fundamental flaw in the concept if we expect it to work out of the box.


Anyway, here is the fix for the disk replacement - after the dead disk has been physically replaced in the machine:

zpool status -L (check status by physical disk path)
zpool status (check status by disk id)

zpool detach rpool deaddiskid (remove the dead disk from the pool)
ls -l /dev/disk/by-id/ | grep nvme

dd if=/dev/nvme0n1 of=/dev/nvme1n1 bs=100M status=progress (copy from the old working disk to the fresh one)
sgdisk -G /dev/nvme1n1 (new drive, new GUIDs)
partprobe /dev/nvme1n1 (new drive, reload the partition table)
ls -l /dev/disk/by-id/ | grep nvme
nvme-eui.6479a7823f004246-part3 -> ../../nvme1n1p3 (verify the new disk id that now carries the cloned data)

zpool labelclear -f nvme-eui.6479a7823f004246-part3 (clear the ZFS label so it isn't identical to the working disk's after the clone)
zpool attach rpool nvme-eui.6479a7823f004368-part3 nvme-eui.6479a7823f004246-part3 -f (attach the new disk to the old disk as a mirror)
zpool scrub rpool (check for any errors)
zpool status (verify status by disk id)
zpool status -L (verify status by physical disk name)

This seems to have worked.


root@smvm:~# zpool status
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 00:00:01 with 0 errors on Mon Mar 18 19:13:59 2024
remove: Removal of vdev 2 copied 776K in 0h0m, completed on Mon Mar 18 18:42:51 2024
3.07K memory used for removed device mappings
config:

        NAME                                 STATE     READ WRITE CKSUM
        rpool                                ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            nvme-eui.6479a7823f004368-part3  ONLINE       0     0     0
            nvme-eui.6479a7823f004246-part3  ONLINE       0     0     0

errors: No known data errors

All partitions are restored, and ZFS mirrors them as it should.
 
To truly test your system now, take out the other NVMe and see if it still boots.

You don't include any info on GRUB / ESP / /boot/efi (partition 2) in your post. See the relevant docs above.
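
For the dd-cloned disk, something along these lines should confirm the ESP is registered and keep it in sync - a sketch only, the p2 partition number assumes the default Proxmox layout:

proxmox-boot-tool status (lists the ESPs currently registered for syncing)
proxmox-boot-tool format /dev/nvme1n1p2 (re-create the cloned ESP so it gets its own identity)
proxmox-boot-tool init /dev/nvme1n1p2 (register it for syncing)
proxmox-boot-tool refresh (copy the current kernel and boot config onto every registered ESP)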
 
that's not a good idea for a Proxmox install on software RAID1
Of course it is - it's recommended to use RAID1 at least for the Proxmox OS.

If the VM disk datastore is on ZFS, then use datacenter SSDs, as ZFS is slow without them and eats TBW.
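
(To keep an eye on the TBW side, something like this works on most NVMe drives - the exact field names vary a bit between models:)

smartctl -a /dev/nvme0n1 | grep -i -e "percentage used" -e "data units written" (rough view of wear level and total writes)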
 
Of course it is - it's recommended to use RAID1 at least for the Proxmox OS.
I don't think RafalO meant not to use RAID1. He just meant that one could assume it's an automatic mirror solution for the complete boot. Which it is not.
And in my opinion this should be corrected - it wouldn't take a whole lot of trickery to accomplish this.
 
I don't think RafalO meant not to use RAID1. He just meant that one could assume it's an automatic mirror solution for the complete boot. Which it is not.
And in my opinion this should be corrected - it wouldn't take a whole lot of trickery to accomplish this.
This is confusing because during installation you choose RAID1, which by default is supposed to be a mirror, but in reality it's only a mirror for part of it. I would expect a solution that creates a mirror for the entire installation in such a situation.
I am convinced that many people who installed in a RAID1 configuration expected exactly that functionality.
 
There wouldn't be anything unusual about that, but professional multi-drive solutions mostly rely on hardware disk controllers, which operate by mirroring the entirety of the disks.
 
Why use EFI/UEFI for the Proxmox host?

It doesn't add anything, just complexity - that's why I use a simple legacy/BIOS boot with mdadm RAID1 for the Proxmox host in most installations.
Booting a GPT disk with legacy/BIOS is also possible (gptmbr + extlinux/syslinux, no need for GRUB).
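
Very roughly it looks like this - a sketch only, the gptmbr.bin path differs per distro and the partition layout here is just an example:

sgdisk -A 1:set:2 /dev/sda (mark partition 1 with the "legacy BIOS bootable" attribute)
dd if=/usr/lib/syslinux/mbr/gptmbr.bin of=/dev/sda bs=440 count=1 conv=notrunc (write the gptmbr boot code into the protective MBR)
extlinux --install /boot/syslinux (install extlinux into the directory that holds its config)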
 
Of course it is - it's recommended to use RAID1 at least for the Proxmox OS.

If the VM disk datastore is on ZFS, then use datacenter SSDs, as ZFS is slow without them and eats TBW.
First of all, this is not good news. I would have assumed that replacing the disk and then performing a resync (not a Linux guy...) would do the trick.

As for Gabriel's remark: I'm not sure what you're saying. I've got Proxmox PVE on 2x SSD with ZFS RAID1. Is that an issue for durability?
 
Why use EFI/UEFI for the Proxmox host?
Most modern NVMe disks require UEFI to boot. Doing tests on a new Supermicro motherboard with an i9 14th-gen CPU, only UEFI is supported.
There could be an option to install Proxmox on the 1st disk and make disks 2 and 3 a ZFS mirror, but then the 1st disk is still a weak point.
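
(In that layout the data mirror on disks 2 and 3 could be created afterwards with something like the line below - "tank" and the by-id paths are placeholders:)

zpool create -o ashift=12 tank mirror /dev/disk/by-id/<disk2> /dev/disk/by-id/<disk3> (separate mirrored pool for VM storage, independent of the boot disk)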
 
This is confusing because during installation you choose RAID1, which by default is supposed to be a mirror, but in reality it's only a mirror for part of it. I would expect a solution that creates a mirror for the entire installation in such a situation.
The ZFS part is a true mirror.
Only the EFI/ESP boot partitions are created separately on each disk, with the same boot configuration.
proxmox-boot-tool will sync the EFI/ESP boot partitions on change, as the docs explain.
 
Wow, this thread really caught me off guard.
So with a default install, the mirror will not always protect us from total failure if a disk dies?

I am fine with using the commands provided in the docs to replace a failed drive. But does this only work with a system that is up and running? What if the power goes out or the system crashes, and the drive with the bootloader fails at the same time? Could it be that Proxmox will not get back up and running because of this?

If that is true, this should really be mentioned in the docs.
 
But does this only work with a system that is up and running? What if the power goes out or the system crashes, and the drive with the bootloader fails at the same time? Could it be that Proxmox will not get back up and running because of this?
You can always boot a live system, e.g. the Proxmox VE installer. It features a rescue boot mode under Advanced Options > Rescue Boot, where all the necessary tools are available to replace a failed boot disk as described in our documentation.

Booting a live system to repair your boot drive is pretty standard practice, for obvious reasons.
 
You can always boot a live system, e.g. the Proxmox VE installer. It features a rescue boot mode under Advanced Options > Rescue Boot, where all the necessary tools are available to replace a failed boot disk as described in our documentation.
Your response is detailed and accurate; it should really appear somewhere in the official documentation. I know some people only discover this the hard way.
 
Thank you for your response! That sounds like a good solution.

Still, it can cause problems in edge cases and thus should probably be mentioned. If the VPN itself runs on Proxmox, or if the Proxmox host offers no IPMI, it could get a little tricky to remotely advise non-IT people on how to create a bootable Proxmox disk and how to repair the system, if one is not prepared for this because he/she assumes that PVE can withstand a drive failure ;)
 
if one is not prepared for this because he/she assumes that PVE can withstand a drive failure
Well, it is never said anywhere that ZFS RAID1 magically fixes all problems. And the absence of explicit warnings does not imply that it just works, either.

But the documentation already makes it pretty obvious that you might need to intervene manually to properly replace a ZFS boot drive.

Just for reference: you can e.g. run ext4 on an mdadm-based RAID, with separate mirrors for the boot partition and the root device. This is how I run one of my own servers. But mdadm will refuse to open a degraded RAID (for very good reason!), so that has to be done explicitly if one wants the server to continue booting. So - if no special preparations are made - you still need to intervene manually.
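
Just as an illustration of that layout (not my actual config), assuming two disks that are already partitioned identically, something like:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1 (small mirror for /boot)
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2 (mirror for the ext4 root)
mkfs.ext4 /dev/md1 (root filesystem on top of the mirror)
mdadm --detail --scan >> /etc/mdadm/mdadm.conf (so both arrays are assembled at boot)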

tl;dr: There will always be pros and cons for each solution; there simply is no one-size-fits-all.
For production systems, one always has to be prepared for drive failures, including the worst case, anyway.
 