ZFS reserved and boot partition

pixelpoint

Member
Mar 25, 2021
21
2
8
35
Hello dear Proxmox Backup users!

One of the 4 HDDs in our Proxmox Backup Server failed.
We replaced it with a new HDD and then resilvered the whole pool.

Now, after resilvering, I realised I may have made a mistake.

See this screenshot:
[screenshot: disk and partition overview]

You can see here that every HDD has the following partitions:
  • ZFS
  • EFI
  • BIOS boot
All except one: the one we replaced.
The replacement disk shows only 2 partitions: ZFS and ZFS reserved.

I am not sure if this is a problem or not.
Does ZFS reserved mean that the EFI and BIOS boot partitions are inside the ZFS reserved partition or are they missing?
If they are inside the ZFS reserved partition, will they still work when rebooting and resilvering?

I guess ZFS, as a RAID manager, volume manager and filesystem, knows what it is doing when resilvering, and therefore everything should be fine, right?
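
For reference, a quick way to double-check which disks actually carry the boot partitions (a minimal check, assuming the disks show up as /dev/sda through /dev/sdd):

Bash:
# list size, filesystem and GPT partition type per disk
lsblk -o NAME,SIZE,FSTYPE,PARTTYPENAME /dev/sd[a-d]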

Best regards
pixelpoint
 
Thank you for your answer.

Seeing as I am replacing a disk with itself, I did the following:
Code:
# the wrongly partitioned disk: /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT-part1

# taking the old ZFS member offline and clearing its label
zpool offline rpool /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT-part1
zpool labelclear -f /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT-part1

# from the wiki
sgdisk /dev/disk/by-id/ata-ST16000NM003G-2KH113_ZL2CANF3 -R /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT
sgdisk -G /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT
zpool replace -f rpool /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT-part1 /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT

I am now waiting for the resilver to finish (probably going to take another 7 - 14 days), so I can execute the following command:
Bash:
# from the wiki
proxmox-boot-tool format <new disk's ESP>

The ESP in question should then be partition 2 as far as I understand, correct?
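
If so, the bootloader part after the resilver would presumably look something like this (a sketch based on the wiki, using the new disk's by-id name):

Bash:
proxmox-boot-tool format /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT-part2
proxmox-boot-tool init /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT-part2   # add "grub" at the end if the system boots via GRUB/BIOS
proxmox-boot-tool status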

Right now zpool status -v shows:
Bash:
Every 1.0s: zpool status -v          backup: Thu Oct 10 11:16:05 2024

  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Oct 10 10:51:20 2024
        5.04G / 48.1T scanned at 3.48M/s, 0B / 48.1T issued
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                                               STATE     READ WRITE CKSUM
        rpool                                              DEGRADED     0     0     0
          raidz1-0                                         DEGRADED     0     0     0
            ata-ST16000NM003G-2KH113_ZL2C9L19-part3        ONLINE       0     0     0
            replacing-1                                    DEGRADED     0     0     0
              ata-ST16000NM001J-2TW113_ZRS1QMPT-part1/old  OFFLINE      0     0     0
              ata-ST16000NM001J-2TW113_ZRS1QMPT            ONLINE       0     0     0
            ata-ST16000NM003G-2KH113_ZL2CABRZ-part3        ONLINE       0     0     0
            ata-ST16000NM003G-2KH113_ZL2CANF3-part3        ONLINE       0     0     0

errors: No known data errors

Thank you for your help, it's very appreciated.
I have never used ZFS before PVE / PBS.

Best regards,
pixelpoint
 
No, this is wrong:
zpool replace -f rpool /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT-part1 /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT

Look at the partition tables of all the disks; one of them is different!

Check with cfdisk <device-name>
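
A non-interactive way to compare all the partition tables at once (a sketch, assuming the disks enumerate as sda through sdd):

Bash:
# print each disk's GPT so the odd one out is easy to spot
for d in /dev/sd[a-d]; do echo "== $d"; sgdisk -p "$d"; done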
 
Small improvement:

Delete the UUIDs and the label-id while copying, so new ones get created and there are no duplicates:

sfdisk -d /dev/WORKING | sed 's/, uuid.*//; /label-id/d;' |sfdisk /dev/REPLACEMENT
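
The same pipeline written out step by step (a sketch; WORKING and REPLACEMENT are placeholders as above):

sfdisk -d /dev/WORKING > /tmp/parttable.dump     # dump the source GPT as text
sed -i 's/, uuid.*//' /tmp/parttable.dump        # drop per-partition UUIDs, sfdisk generates new ones
sed -i '/label-id/d' /tmp/parttable.dump         # drop the disk GUID so a new one is generated too
sfdisk /dev/REPLACEMENT < /tmp/parttable.dump    # write the cleaned table to the new disk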


---

My whole recipe:

Zpool replace disk
==================

Get disk IDs
ls -l /dev/disk/by-id/*

If you replace sdb in an sda/sdb mirror:

ls -l /dev/disk/by-id/* |grep sdb

I use the /dev/disk/by-id names, e.g.:

ls -l /dev/disk/by-id/* |grep sda
lrwxrwxrwx 1 root root 9 Oct 10 09:23 /dev/disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNX0H638608 -> ../../sda


Get Zpool Status:

zpool status

this assumes the following disk layout:

Part 1: BIOS Boot
Part 2: EFI
Part 3: ZFS

Copy partitions from the working disk to the new disk, without copying the label-id and UUIDs:

sfdisk -d /dev/WORKING | sed 's/, uuid.*//; /label-id/d;' |sfdisk /dev/REPLACEMENT

Replace the disk, giving ZFS the new ZFS partition (part3):
zpool replace zp_pve /dev/disk/by-id/nvme-OLD-part3 /dev/disk/by-id/nvme-REPLACEMENT-part3


Check status; it should be resilvering:
zpool status


Rewrite the bootloader:
proxmox-boot-tool format /dev/disk/by-id/nvme-REPLACEMENT-part2
proxmox-boot-tool init /dev/disk/by-id/nvme-REPLACEMENT-part2
proxmox-boot-tool status

Clean old entries out of /etc/kernel/proxmox-boot-uuids:

proxmox-boot-tool status
proxmox-boot-tool refresh
proxmox-boot-tool clean
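
Hedged addition to the bootloader block above: on hosts that boot GRUB from a BIOS boot partition rather than via UEFI, the init step needs the extra grub argument:

proxmox-boot-tool format /dev/disk/by-id/nvme-REPLACEMENT-part2
proxmox-boot-tool init /dev/disk/by-id/nvme-REPLACEMENT-part2 grub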
 
No, this is wrong:
zpool replace -f rpool /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT-part1 /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT

Look at the partition tables of all the disks; one of them is different!

Check with cfdisk <device-name>
So, after checking with cfdisk, I can see that /dev/disk/by-id/ata-ST16000NM001J-2TW113_ZRS1QMPT still has only 2 partitions instead of 3 and there's no BIOS boot partition in sight.

But what else am I to do?
I took a healthy disk, gave its partition table to the unhealthy disk (using sgdisk).
I thought the correct partition table might appear after the resilvering is done?

As far as I understand, I cannot stop the resilvering, correct?
As it is already running, I guess I'll need to wait another 25 days until the resilvering has finished.

Does the recipe above also work if you replace a disk with itself?
I have no physical access to the server, so I cannot just swap disks on the fly.
The hosting provider already changed the disk, so we DO have 4 healthy disks right now.
 
Ok, the last resilvering is done and, again, it resulted in there being no boot partitions on the replaced disk.

This is the current setup:

Bash:
# lsblk -o +FSTYPE,PARTTYPENAME
sda      8:0    0  14.6T  0 disk
├─sda1   8:1    0  1007K  0 part                               BIOS boot
├─sda2   8:2    0   512M  0 part                    vfat       EFI System
└─sda3   8:3    0  14.6T  0 part                    zfs_member Solaris /usr & Apple ZFS
sdb      8:16   0  14.6T  0 disk
├─sdb1   8:17   0  14.6T  0 part                    zfs_member Solaris /usr & Apple ZFS
└─sdb9   8:25   0     8M  0 part                               Solaris reserved 1
sdc      8:32   0  14.6T  0 disk
├─sdc1   8:33   0  1007K  0 part                               BIOS boot
├─sdc2   8:34   0   512M  0 part                    vfat       EFI System
└─sdc3   8:35   0  14.6T  0 part                    zfs_member Solaris /usr & Apple ZFS
sdd      8:48   0  14.6T  0 disk
├─sdd1   8:49   0  1007K  0 part                               BIOS boot
├─sdd2   8:50   0   512M  0 part                    vfat       EFI System
└─sdd3   8:51   0  14.6T  0 part                    zfs_member Solaris /usr & Apple ZFS

Bash:
# fdisk -l
Disk /dev/sda: 14.55 TiB, 16000900661248 bytes, 31251759104 sectors
Disk model: ST16000NM003G-2K
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: E866DAA2-58D2-4291-8858-28DEDE991CF8

Device       Start         End     Sectors  Size Type
/dev/sda1       34        2047        2014 1007K BIOS boot
/dev/sda2     2048     1050623     1048576  512M EFI System
/dev/sda3  1050624 31251759070 31250708447 14.6T Solaris /usr & Apple ZFS

Partition 1 does not start on physical sector boundary.


Disk /dev/sdb: 14.55 TiB, 16000900661248 bytes, 31251759104 sectors
Disk model: ST16000NM001J-2T
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: F32452ED-45DD-DF4B-B620-19ECB248C8FF

Device           Start         End     Sectors  Size Type
/dev/sdb1         2048 31251740671 31251738624 14.6T Solaris /usr & Apple ZFS
/dev/sdb9  31251740672 31251757055       16384    8M Solaris reserved 1


Disk /dev/sdc: 14.55 TiB, 16000900661248 bytes, 31251759104 sectors
Disk model: ST16000NM003G-2K
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: BBD95D51-E904-4843-9743-DFCD9B803615

Device       Start         End     Sectors  Size Type
/dev/sdc1       34        2047        2014 1007K BIOS boot
/dev/sdc2     2048     1050623     1048576  512M EFI System
/dev/sdc3  1050624 31251759070 31250708447 14.6T Solaris /usr & Apple ZFS

Partition 1 does not start on physical sector boundary.


Disk /dev/sdd: 14.55 TiB, 16000900661248 bytes, 31251759104 sectors
Disk model: ST16000NM003G-2K
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: D9A721DB-32B7-495A-86E6-518E50B89DF3

Device       Start         End     Sectors  Size Type
/dev/sdd1       34        2047        2014 1007K BIOS boot
/dev/sdd2     2048     1050623     1048576  512M EFI System
/dev/sdd3  1050624 31251759070 31250708447 14.6T Solaris /usr & Apple ZFS

Partition 1 does not start on physical sector boundary.

Bash:
# zpool status -v
  pool: rpool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: resilvered 10.3T in 22 days 18:28:32 with 0 errors on Sat Nov  2 04:19:52 2024
config:

        NAME                                         STATE     READ WRITE CKSUM
        rpool                                        ONLINE       0     0     0
          raidz1-0                                   ONLINE       0     0     0
            ata-ST16000NM003G-2KH113_ZL2C9L19-part3  ONLINE       0     0     0
            ata-ST16000NM001J-2TW113_ZRS1QMPT        ONLINE       0     0     0
            ata-ST16000NM003G-2KH113_ZL2CABRZ-part3  ONLINE       0     0     0
            ata-ST16000NM003G-2KH113_ZL2CANF3-part3  ONLINE       0     0     0

errors: No known data errors

As one can see in the lsblk output, the disk in question (sdb) is still lacking the EFI System (and BIOS boot) partition.
I have resilvered 3 times now; it always takes 20+ days, just for me to realize I did it the wrong way.
I have no way to physically unmount and/or exchange the disk (the server is in a datacenter somewhere).

Could somebody please tell me what exactly I need to do to get rid of this mess? I really don't want to try and wait again for another 20+ days.

If more information is necessary, please don't hesitate to ask, I will do my best to answer any questions.

Best regards
pixelpoint
 
Would these be the next correct steps?
Is there something I missed?

Bash:
# set disk offline
zpool offline rpool /dev/WRONG_PARTITION_DISK

# copy partition layout and erase UUID + label (thanks ubu)
sfdisk -d /dev/SOME_WORKING_DISK | sed 's/, uuid.*//; /label-id/d;' | sfdisk /dev/WRONG_PARTITION_DISK

After this, the disk in question should be showing 3 partitions again, correct?
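
One extra check after the copy (same placeholders as above): confirm the new disk got its own partition UUIDs rather than copies of the source disk's, since the sed strips them from the dump.

Bash:
# PARTUUIDs on the two disks should all differ
lsblk -o NAME,PARTUUID,PARTTYPENAME /dev/SOME_WORKING_DISK /dev/WRONG_PARTITION_DISK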

Bash:
# replace disk with newly formed partition 3
# note: I might need the -f flag here
zpool replace rpool /dev/WRONG_PARTITION_DISK /dev/FIXED_DISK-part3

# fix EFI partition on FIXED_DISK (thanks ubu)
proxmox-boot-tool format /dev/FIXED_DISK-part2
proxmox-boot-tool init /dev/FIXED_DISK-part2

# check
proxmox-boot-tool status

All disk references in these code segments will actually be replaced by their corresponding /dev/disk/by-id/ paths.
 
So, once again, start over from the beginning ... treat your new disk as if it had "just died": take it offline and remove it from the pool.
Then I (personally) would dd the whole disk with /dev/zero to remove any ZFS content, and after that ...
See: https://pve.proxmox.com/wiki/ZFS_on_Linux
Section: Changing a failed bootable device
I would only run the zpool replace command from the docs (against the newly generated part3) once the pre-steps of partitioning and proxmox-boot-tool have completed successfully (before waiting another 20 days again ...) :)
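
As a hedged side note (not from the wiki): zeroing an entire 16 TB HDD with dd can itself take a very long time. Wiping just the partition table and the filesystem/ZFS signatures is usually sufficient; a minimal sketch, with /dev/WRONG_PARTITION_DISK as a placeholder and after the disk has been taken offline:

Bash:
# clear the ZFS labels of the old data partition (may complain if none are left)
zpool labelclear -f /dev/WRONG_PARTITION_DISK-part1
# wipe remaining filesystem signatures on the partitions and the disk itself
wipefs -a /dev/WRONG_PARTITION_DISK-part9 /dev/WRONG_PARTITION_DISK
# destroy the GPT and protective MBR so the disk looks blank
sgdisk --zap-all /dev/WRONG_PARTITION_DISK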
 
Just to be clear, the steps would then be:

Bash:
# set disk offline
zpool offline rpool /dev/WRONG_PARTITION_DISK

# force-remove all possibly remaining data-fragments
dd if=/dev/zero of=/dev/WRONG_PARTITION_DISK

# copy partition table from healthy disk and write new UUIDs
sgdisk /dev/SOME_HEALTHY_DISK -R /dev/WRONG_PARTITION_DISK
sgdisk -G /dev/WRONG_PARTITION_DISK

# fix bootloader on disk
proxmox-boot-tool format /dev/FIXED_DISK-part2
proxmox-boot-tool init /dev/FIXED_DISK-part2

# replace disk in zfs pool
zpool replace -f rpool /dev/WRONG_PARTITION_DISK /dev/FIXED_DISK

As far as I understand (please correct me if I'm wrong), I don't NEED to start the resilvering first, so I can do the "fix bootloader" steps before it, correct?
The partition containing the ESP (DISK-part2) is not the one used by ZFS (DISK-part3), so the resilvering can be done after fixing the bootloader, right?

Also: efibootmgr -v tells me I am using systemd-boot, while the wiki only mentions grub-install or proxmox-boot-tool with grub as the mode of operation. So with systemd-boot, there is nothing to do here?
Bash:
Boot0005* Linux Boot Manager    HD(2,GPT,7389137b-4398-4b65-85e3-0380b05a1243,0x800,0x100000)/File(\EFI\systemd\systemd-bootx64.efi)
Boot0007* Linux Boot Manager    HD(2,GPT,4e3335a4-3f68-4a17-85ff-732dd2470468,0x800,0x100000)/File(\EFI\systemd\systemd-bootx64.efi)
Boot0008* Linux Boot Manager    HD(2,GPT,35ef2712-9a87-4fdc-8677-b4031e96abb4,0x800,0x100000)/File(\EFI\systemd\systemd-bootx64.efi)
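
As far as I understand, proxmox-boot-tool status reports, for each ESP registered in /etc/kernel/proxmox-boot-uuids, whether it is set up for uefi (systemd-boot) or for grub, so that might be a way to double-check rather than relying on efibootmgr alone:

Bash:
# shows the registered ESPs and their configured mode
proxmox-boot-tool status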

Best regards

pixelpoint
 
I did the steps above (with slight adjustments, listed below) and this should be solved now.
Resilvering is currently running, will take another ~25 days to finish.

Current partition layout

Code:
sda      8:0    0  14.6T  0 disk
├─sda1   8:1    0  1007K  0 part                               BIOS boot
├─sda2   8:2    0   512M  0 part                    vfat       EFI System
└─sda3   8:3    0  14.6T  0 part                    zfs_member Solaris /usr & Apple ZFS
sdb      8:16   0  14.6T  0 disk
├─sdb1   8:17   0  1007K  0 part                               BIOS boot
├─sdb2   8:18   0   512M  0 part                    vfat       EFI System
└─sdb3   8:19   0  14.6T  0 part                    zfs_member Solaris /usr & Apple ZFS
sdc      8:32   0  14.6T  0 disk
├─sdc1   8:33   0  1007K  0 part                               BIOS boot
├─sdc2   8:34   0   512M  0 part                    vfat       EFI System
└─sdc3   8:35   0  14.6T  0 part                    zfs_member Solaris /usr & Apple ZFS
sdd      8:48   0  14.6T  0 disk
├─sdd1   8:49   0  1007K  0 part                               BIOS boot
├─sdd2   8:50   0   512M  0 part                    vfat       EFI System
└─sdd3   8:51   0  14.6T  0 part                    zfs_member Solaris /usr & Apple ZFS

Current zpool status
Code:
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov 18 15:59:54 2024
        1.32T / 49.2T scanned at 22.8M/s, 1.07T / 49.2T issued at 18.4M/s
        266G resilvered, 2.17% done, 31 days 17:04:17 to go
config:

        NAME                                           STATE     READ WRITE CKSUM
        rpool                                          DEGRADED     0     0     0
          raidz1-0                                     DEGRADED     0     0     0
            ata-ST16000NM003G-2KH113_ZL2C9L19-part3    ONLINE       0     0     0
            replacing-1                                DEGRADED     0     0     0
              ata-ST16000NM001J-2TW113_ZRS1QMPT        REMOVED      0     0     0
              ata-ST16000NM001J-2TW113_ZRS1QMPT-part3  ONLINE       0     0     0  (resilvering)
            ata-ST16000NM003G-2KH113_ZL2CABRZ-part3    ONLINE       0     0     0
            ata-ST16000NM003G-2KH113_ZL2CANF3-part3    ONLINE       0     0     0

Things I did differently than outlined in my post above
Code:
# What I wrote above
proxmox-boot-tool init /dev/FIXED_DISK-part2

# What I needed to do instead
proxmox-boot-tool init /dev/FIXED_DISK-part2 grub

Though efibootmgr -v mentions systemd rather than GRUB, systemd-boot is not the bootloader actually being used here.
The first command (without the grub argument) told me that bootctl is not installed and therefore did not do anything.
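
For completeness, once this resilver has finished, a final round of checks might look roughly like this (nothing here is destructive):

Bash:
# pool should be ONLINE again, with all four part3 devices
zpool status rpool
# all four ESPs should be listed and configured for grub
proxmox-boot-tool status
# re-copy kernels and bootloader config to all registered ESPs
proxmox-boot-tool refresh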

Thank you all for your help, it was very much appreciated.

Best regards
pixelpoint
 
