[SOLVED] Root zpool degraded

ServerWrecker

New Member
Sep 12, 2023
Hi guys, we have a small server at home running Proxmox 7.4-16 with root on a mirrored ZFS pool with two SSDs.
One SSD died, resulting in a degraded zpool; the server is still running and can boot anyway.
I've read the Proxmox ZFS documentation about how to restore the mirror, but we still have some doubts... this is why I'm asking here :)

As a first step we took the faulty disk offline with zpool offline rpool ata-SanDisk_SSD_PLUS_480GB_21020S451215
and then physically removed it from the server.

Now the rpool status is:
Code:
pool: rpool
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0B in 00:11:42 with 0 errors on Sat Sep  9 13:12:26 2023
config:

        NAME                                           STATE     READ WRITE CKSUM
        rpool                                          DEGRADED     0     0     0
          mirror-0                                     DEGRADED     0     0     0
            ata-SanDisk_SDSSDA480G_173518803521-part3  ONLINE       0     0     0
            ata-SanDisk_SSD_PLUS_480GB_21020S451215    OFFLINE      0     0     0

The remaining disk is partitioned as follows:
Code:
Disk /dev/sdc: 447.13 GiB, 480103981056 bytes, 937703088 sectors
Disk model: SanDisk SDSSDA48
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: BE4DA726-6FD3-43D2-83E7-D3AA9A1AE628

Device       Start       End   Sectors   Size Type
/dev/sdc1       34      2047      2014  1007K BIOS boot
/dev/sdc2     2048   1050623   1048576   512M EFI System
/dev/sdc3  1050624 937703054 936652431 446.6G Solaris /usr & Apple ZFS


The output of proxmox-boot-tool status is:
Code:
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
6739-50F1 is configured with: uefi (versions: 5.15.107-1-pve, 5.15.108-1-pve)
WARN: /dev/disk/by-uuid/6739-E63F does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping

What is the right way to proceed now to add a new 480 GB disk to the pool?
I think I have to remove the old faulty one from the pool using zpool remove rpool ata-SanDisk_SSD_PLUS_480GB_21020S451215 and then plug the new disk into the server.

Now what? Do I have to copy the partition scheme from the good old disk to the new one?

Last question: I have backups of the VMs and of the Proxmox /etc folder, but I also want to clone the boot disk before proceeding with the ZFS restore. Do you have any bootable USB utility to recommend that can clone (and restore, if needed) the existing disk to an external USB one?
 
Copy Partition Table:

sfdisk -d /dev/SOURCEDISK > part_table.txt

sfdisk /dev/TARGETDISK < part_table.txt
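Note that, as far as I know, the sfdisk dump also carries over the disk and partition UUIDs of the source, so after writing it to the new disk you may want to randomize them to avoid duplicate identifiers (sgdisk -G does exactly that); a minimal sketch, assuming /dev/TARGETDISK is the new disk:

sgdisk -G /dev/TARGETDISK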


Example:

Get Zpool Status:

zpool status


Copy Partitions from working to new disk:

sfdisk -d /dev/nvme2n1 | sfdisk /dev/nvme3n1

Replace the disk, giving ZFS the partition:
zpool replace zp_pve 14142759960921488282 /dev/disk/by-id/nvme-SAMSUNG_MZVL21T0HCLR-00B00_S676NF0R309587-part3
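If you are not sure which /dev/disk/by-id/ name belongs to the new disk, you can look it up first; nvme3n1 here is just the example device from above, adjust to your own:

ls -l /dev/disk/by-id/ | grep nvme3n1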


Check status, should resilver:
zpool status
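If you want to follow the resilver progress, just re-run the status periodically, e.g. (purely optional):

watch -n 10 zpool status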


Rewrite Bootloader:
proxmox-boot-tool format /dev/disk/by-id/nvme-SAMSUNG_MZVL21T0HCLR-00B00_S676NF0R309587-part2
proxmox-boot-tool init /dev/disk/by-id/nvme-SAMSUNG_MZVL21T0HCLR-00B00_S676NF0R309587-part2
proxmox-boot-tool status

Clean /etc/kernel/proxmox-boot-uuids

proxmox-boot-tool status
proxmox-boot-tool refresh
proxmox-boot-tool clean
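If you are curious what gets cleaned up, as far as I know that file is just a plain-text list of ESP UUIDs, one per line, so you can look at it before and after:

cat /etc/kernel/proxmox-boot-uuids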



2. You can clone the whole disk with dd:

Copy directly to disk:

dd if=/dev/SOURCEDISK of=/dev/TARGETDISK

Copy to Image File:

dd if=/dev/SOURCEDISK of=/PATH/TO/FILE.image

Copy to compressed Image:

dd if=/dev/SOURCEDISK | gzip > /PATH/TO/FILE.image.gz
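To restore later, just reverse the direction (double-check the target device before running this; status=progress is optional and only shows progress):

dd if=/PATH/TO/FILE.image of=/dev/TARGETDISK status=progress

gunzip -c /PATH/TO/FILE.image.gz | dd of=/dev/TARGETDISK status=progress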
 
Code:
# sgdisk /dev/<working disk> -R /dev/<new disk>
sgdisk /dev/sda -R /dev/sdc

sgdisk -G /dev/sdc

# use IDs here for more stability in the zpool
zpool replace -f rpool \
  /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1-part3 \
  /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi2-part3

proxmox-boot-tool format /dev/sdc2
proxmox-boot-tool init /dev/sdc2
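Afterwards a quick sanity check never hurts: the pool should show the new partition resilvering/online and the boot tool should list both ESPs.
Code:
zpool status rpool
proxmox-boot-tool status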
 
And if you want your SSDs not to fail that fast, you could buy proper enterprise SSDs with power-loss protection and a higher TBW/DWPD rating, as is highly recommended when using ZFS. And especially avoid QLC SSDs.
 
Hi guys, the pool has been backed up and then restored with the new disk, thank you all for your guidance.

About the suggestion to use enterprise SSDs: I know that using server-grade disks would be better, but this is a low-cost, low-activity server and it makes no sense to spend hundreds of bucks on storage for this machine.
The good old SSD has been running for almost 2 years now and its wear indicator is at 9%.
The failed one was newer; it died after only 3 months with an electronic failure (the disk is no longer recognized), so I think it was a case of 'infant mortality' that sometimes happens with cheap electronics.
 
