[SOLVED] help requested- zpool replace- mirrored disk unavailable

thelonghop

I just noticed this and I'm not sure what I need to do. Does the disk physically need to be replaced? I'm still learning Linux and not sure how to resolve this. Help appreciated!

(attached screenshot: 1627694659705.png, showing rpool in a degraded state)
 
Can you do a zpool status rpool on the Proxmox console? That gives a little more information (and you'll need the console to perform actions on this anyway). If you can, please use ssh and copy the text instead of using the Shell button and posting screenshots.
 
Am I reading this right? Are the numbers/labels identical? Did you mirror one vdev with itself? In that case removing one part of that mirror would fix it. The problem is that I don't know how to tell ZFS to only remove the second one as they both have the same path.

EDIT: What is the output of zpool status rpool -g? This might show how to select the second part of the mirror by number instead.
EDIT2: Turns out there is a one-digit difference between the two drives' long chains of digits.
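For reference, selecting a vdev by GUID instead of by path would look roughly like this (a sketch; <GUID> is a placeholder for a number from the -g output):

Code:
zpool status rpool -g      # show vdev GUIDs instead of device names
zpool detach rpool <GUID>  # ZFS also accepts a vdev GUID in place of a device path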
 
Thanks for the response!

Code:
# zpool status rpool
  pool: rpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 11.1G in 00:01:19 with 0 errors on Fri Jul 30 23:36:15 2021
config:

    NAME                                                 STATE     READ WRITE CKSUM
    rpool                                                DEGRADED     0     0     0
      mirror-0                                           DEGRADED     0     0     0
        nvme-eui.0000000001000000e4d25cc4f32c5401-part3  ONLINE       0     0     0
        nvme-eui.0000000001000000e4d25cc6f32c5401-part3  UNAVAIL      6  173K     0

errors: No known data errors


I had to stare at them for a bit as well, but the numbers are different. I have two physically separate but identical NVMe drives installed.

Code:
# zpool status rpool -g
  pool: rpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 11.1G in 00:01:19 with 0 errors on Fri Jul 30 23:36:15 2021
config:

    NAME                      STATE     READ WRITE CKSUM
    rpool                     DEGRADED     0     0     0
      7450737692324333241     DEGRADED     0     0     0
        1820375655777362786   ONLINE       0     0     0
        12526451830035177050  UNAVAIL      6  173K     0

errors: No known data errors
 
Ah right, 1 digit is different in the names of the drives. You can remove the faulty drive using zpool detach rpool /dev/disk/by-id/nvme-eui.0000000001000000e4d25cc6f32c5401-part3.
If there are indeed 173000 write errors, it does sound like something is wrong with the drive or the connection to it. I can't tell from here what is wrong with it. Maybe remove it from the mirror and do a long SMART test to have the drive check itself? Or just erase it and put it back, to see if it was just a bad connection?
You can add a replacement by using the attach command but make sure to specify the drive that you want to mirror, otherwise you get a RAID0 (no safety) instead of RAID1: zpool attach rpool /dev/disk/by-id/nvme-eui.0000000001000000e4d25cc4f32c5401-part3 /dev/disk/by-id/YOUR-NEW-DRIVE
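Put together, a minimal sketch of that whole sequence might look like this (the /dev/nvme1 device name is an assumption, and NVMe self-tests via smartctl need a reasonably recent smartmontools plus drive support):

Code:
# Detach the unavailable half of the mirror (path from the status output above)
zpool detach rpool /dev/disk/by-id/nvme-eui.0000000001000000e4d25cc6f32c5401-part3

# Optionally let the drive test itself before reusing it (/dev/nvme1 is an example name)
smartctl -t long /dev/nvme1
smartctl -a /dev/nvme1

# Attach a replacement, mirroring it against the healthy device
zpool attach rpool /dev/disk/by-id/nvme-eui.0000000001000000e4d25cc4f32c5401-part3 /dev/disk/by-id/YOUR-NEW-DRIVE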
 
I ran that command and have it removed from the zpool. The second disk doesn't even show up in lsblk. I'm not sure where else to look for it to do a SMART test.
 
Sounds like it is the connector, or the drive is completely dead. Shut down the system and try removing the drive and putting it back? Do you see the drive in the system's BIOS? Or do you have a motherboard where using a particular PCIe slot disables an M.2 slot and you just added a PCIe add-in card? Any other changes made before the drive disappeared?
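From within Proxmox itself, a few places to check whether the kernel sees the second NVMe at all (a sketch; /dev/nvme1 is an example name and nvme list needs the nvme-cli package):

Code:
lspci | grep -i 'non-volatile'   # is the NVMe controller visible on the PCIe bus at all?
ls -l /dev/nvme*                 # device nodes created by the nvme driver
nvme list                        # needs the nvme-cli package
dmesg | grep -i nvme             # probe errors or timeouts during boot
smartctl -a /dev/nvme1           # SMART data, once the device node exists again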
 
Alright. Did a self-test in the BIOS for each drive and both passed. Removed and reinstalled the drives, did the self-test with each individually installed and the other removed, all passed. So there doesn't seem to be anything physically wrong with the drives.
 
I left a monitor connected to the server. After boot it shows this:

[52.766252] nvme nvme1: failed to set APST feature (-19)

Also, not sure if it's related, but during boot there is this message:

[FAILED] Failed to start Load Kernel Modules.
See 'systemctl status systemd-modules-load.service' for details.

Here's that output:
Code:
# systemctl status systemd-modules-load.service
● systemd-modules-load.service - Load Kernel Modules
   Loaded: loaded (/lib/systemd/system/systemd-modules-load.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sat 2021-07-31 15:29:47 CDT; 3min 32s ago
     Docs: man:systemd-modules-load.service(8)
           man:modules-load.d(5)
  Process: 1400 ExecStart=/lib/systemd/systemd-modules-load (code=exited, status=1/FAILURE)
 Main PID: 1400 (code=exited, status=1/FAILURE)


Jul 31 15:29:47 pve systemd-modules-load[1400]: Inserted module 'vfio'
Jul 31 15:29:47 pve systemd-modules-load[1400]: Inserted module 'vfio_pci'
Jul 31 15:29:47 pve systemd-modules-load[1400]: Failed to insert module 'kvmgt': No such device
Jul 31 15:29:47 pve systemd-modules-load[1400]: Failed to find module 'exngt'
Jul 31 15:29:47 pve systemd-modules-load[1400]: Inserted module 'iscsi_tcp'
Jul 31 15:29:47 pve systemd-modules-load[1400]: Inserted module 'ib_iser'
Jul 31 15:29:47 pve systemd-modules-load[1400]: Inserted module 'vhost_net'
Jul 31 15:29:47 pve systemd[1]: systemd-modules-load.service: Main process exited, code=exited, status=1/FAILURE
Jul 31 15:29:47 pve systemd[1]: systemd-modules-load.service: Failed with result 'exit-code'.
Jul 31 15:29:47 pve systemd[1]: Failed to start Load Kernel Modules.
 
Here's the content of /etc/modules-load.d/modules.conf

Code:
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
# Modules required for PCI passthrough
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

# Modules required for Intel GVT
kvmgt
exngt
vfio-mdev

I have an 11th-gen Intel CPU and I've heard some of its features haven't been added to the kernel yet, so that might be the issue here. I'm thinking it's not related to the NVMe issue, but I'm obviously not fully informed.
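To confirm whether those modules even exist for the running kernel, something like this should tell you (a sketch; it only inspects the module tree and changes nothing):

Code:
modinfo kvmgt                                  # prints module details, or an error if it is absent
modinfo exngt
find /lib/modules/$(uname -r) -name 'kvmgt*'   # look for the .ko file directly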
 
Were you experimenting with PCI passthrough at the time? This can cause the Proxmox host to lose access to devices such as NVMe drives (M.2 via PCIe).

Jul 31 15:29:47 pve systemd-modules-load[1400]: Failed to insert module 'kvmgt': No such device
Jul 31 15:29:47 pve systemd-modules-load[1400]: Failed to find module 'exngt'
The kvmgt and exngt modules report that you have no supported hardware (which could be because it is too new), and this would explain why systemd-modules-load.service failed, but not the NVMe.

Is the second NVME again (or still) part of the rpool mirror? Did you clear the old errors and did a resilver or scrub not show any new errors?
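For reference, clearing the old counters and re-checking the pool is just the standard commands (a sketch):

Code:
zpool clear rpool     # reset the old read/write/checksum error counters
zpool scrub rpool     # verify all data on the pool's devices
zpool status rpool    # check the result once the scrub has finished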
 
No, wasn't experimenting with anything. Yeah, one NVME is still in the rpool and it doesn't show any errors.
 
I can only explain this if you are doing PCI(e) passthrough and the second NVMe is in the same IOMMU group, or if you are hiding the PCIe device with pci-stub or vfio-pci. Otherwise, I'm at a loss. Can you show an lspci -k (on the Proxmox host, or on both the host and the VM)? Both NVMe drives should appear as PCI devices using the nvme driver.
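To see which devices share an IOMMU group, a loop like this is commonly used (a sketch; it assumes the IOMMU is enabled in the BIOS and on the kernel command line):

Code:
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=$(basename "$(dirname "$(dirname "$d")")")
    printf 'IOMMU group %s: ' "$g"
    lspci -nns "$(basename "$d")"
done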
 
That might be it? If I'm reading this right, the kernel driver in use for one of them is vfio-pci.

Code:
# lspci -k
00:00.0 Host bridge: Intel Corporation Device 4c53 (rev 01)
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
00:02.0 VGA compatible controller: Intel Corporation Device 4c8b (rev 04)
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
00:06.0 PCI bridge: Intel Corporation Device 4c09 (rev 01)
    Kernel driver in use: pcieport
00:08.0 System peripheral: Intel Corporation Device 4c11 (rev 01)
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
00:14.0 USB controller: Intel Corporation Device 43ed (rev 11)
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
    Kernel driver in use: xhci_hcd
    Kernel modules: xhci_pci
00:14.2 RAM memory: Intel Corporation Device 43ef (rev 11)
00:14.3 Network controller: Intel Corporation Device 43f0 (rev 11)
    Subsystem: Intel Corporation Device 0074
    Kernel driver in use: iwlwifi
    Kernel modules: iwlwifi
00:16.0 Communication controller: Intel Corporation Device 43e0 (rev 11)
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
00:17.0 SATA controller: Intel Corporation Device 43d2 (rev 11)
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
    Kernel driver in use: ahci
    Kernel modules: ahci
00:1b.0 PCI bridge: Intel Corporation Device 43c4 (rev 11)
    Kernel driver in use: pcieport
00:1c.0 PCI bridge: Intel Corporation Device 43bc (rev 11)
    Kernel driver in use: pcieport
00:1f.0 ISA bridge: Intel Corporation Device 4387 (rev 11)
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
00:1f.3 Audio device: Intel Corporation Device 43c8 (rev 11)
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 9d18
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
00:1f.4 SMBus: Intel Corporation Device 43a3 (rev 11)
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
00:1f.5 Serial bus controller [0c80]: Intel Corporation Device 43a4 (rev 11)
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a8 (rev 03)
    Subsystem: Intel Corporation Device 390d
    Kernel driver in use: nvme
02:00.0 Non-Volatile memory controller: Intel Corporation Device f1a8 (rev 03)
    Subsystem: Intel Corporation Device 390d
    Kernel driver in use: vfio-pci
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. Device 8125 (rev 04)
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
    Kernel driver in use: r8125
    Kernel modules: r8169, r8125
 
When the driver is vfio-pci, the device is not accessible to the Proxmox host.
Is there any VM that has a hostpci setting with 02:00.0? Can you show us an lspci -nn, so we can check the numeric ID that might be used in a vfio-pci.ids=... option? Check your kernel command line (cat /proc/cmdline, which is determined by either /etc/default/grub or /etc/kernel/cmdline) and all files in /etc/modprobe.d/.
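To make that concrete, something along these lines covers the usual places (a sketch; 02:00.0 is the address from the lspci output above, and the paths are the Proxmox defaults):

Code:
lspci -nn -s 02:00.0                    # numeric [vendor:device] ID of the second NVMe
cat /proc/cmdline                       # any vfio-pci.ids=... or pci-stub.ids=... entries?
grep -r . /etc/modprobe.d/              # "options vfio-pci ids=..." lines
grep -r hostpci /etc/pve/qemu-server/   # VM configs that pass a PCI device through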
 
That's it! It looks like I selected that device when I was trying to pass through hardware acceleration for Jellyfin. I removed that PCI device from the VM, rebooted PVE, and now it's available. Ran zpool attach rpool and now it's attached and resilvered with no issues. Thanks for the help @avw !
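For anyone hitting the same thing, the fix boiled down to roughly this (a sketch; the VM ID 100 and the hostpci0 index are hypothetical, the device paths are the ones from this thread):

Code:
qm set 100 --delete hostpci0   # remove the passthrough entry from the VM config
reboot                         # hand the NVMe back to the host
zpool attach rpool /dev/disk/by-id/nvme-eui.0000000001000000e4d25cc4f32c5401-part3 \
    /dev/disk/by-id/nvme-eui.0000000001000000e4d25cc6f32c5401-part3
zpool status rpool             # watch the resilver complete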
 
