Can you show the output of
zpool status rpool
on the Proxmox console? That gives a little more information (and you'll need the console to perform actions on this anyway). If you can, please use ssh and copy the text (instead of using the Shell button and posting screenshots). Can you also show a
zpool status rpool -g
? This might show how to select the second part of the mirror by number instead.

# zpool status rpool -g
  pool: rpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 11.1G in 00:01:19 with 0 errors on Fri Jul 30 23:36:15 2021
config:

        NAME                       STATE     READ WRITE CKSUM
        rpool                      DEGRADED     0     0     0
          7450737692324333241      DEGRADED     0     0     0
            1820375655777362786    ONLINE       0     0     0
            12526451830035177050   UNAVAIL      6  173K     0

errors: No known data errors
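(With -g, zpool status prints GUIDs instead of device names, and ZFS accepts those GUIDs anywhere a device name is expected. So the failed member could also be targeted directly by the GUID from the output above, e.g.:

# zpool detach rpool 12526451830035177050

This is just an illustration; the by-id path used below works equally well.)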
Ah right, 1 digit is different in the names of the drives. You can remove the faulty drive using
zpool detach rpool /dev/disk/by-id/nvme-eui.0000000001000000e4d25cc6f32c5401-part3

If there are indeed 173000 write errors, it does sound like something is wrong with the drive or the connection to it. I can't tell from here what is wrong with it. Maybe remove it from the mirror and do a long SMART test to have the drive check itself? Or just erase it and put it back, to see if it was just a bad connection?

You can add a replacement by using the attach command, but make sure to specify the drive that you want to mirror, otherwise you get a RAID0 (no safety) instead of a RAID1:
zpool attach rpool /dev/disk/by-id/nvme-eui.0000000001000000e4d25cc4f32c5401-part3 /dev/disk/by-id/YOUR-NEW-DRIVE

I ran that command and have it removed from the zpool. The second disk doesn't even show up in lsblk. I'm not sure where else to look for it to do a SMART test.
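(If the drive reappears, a long self-test can be started with smartctl; a minimal sketch, assuming smartmontools 7.x with NVMe self-test support and that the missing disk would come back as /dev/nvme1, which is a guess:

# smartctl -t long /dev/nvme1
# smartctl -a /dev/nvme1

The second command shows the health attributes and the self-test result once it finishes.)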
# systemctl status systemd-modules-load.service
● systemd-modules-load.service - Load Kernel Modules
Loaded: loaded (/lib/systemd/system/systemd-modules-load.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Sat 2021-07-31 15:29:47 CDT; 3min 32s ago
Docs: man:systemd-modules-load.service(8)
man:modules-load.d(5)
Process: 1400 ExecStart=/lib/systemd/systemd-modules-load (code=exited, status=1/FAILURE)
Main PID: 1400 (code=exited, status=1/FAILURE)
Jul 31 15:29:47 pve systemd-modules-load[1400]: Inserted module 'vfio'
Jul 31 15:29:47 pve systemd-modules-load[1400]: Inserted module 'vfio_pci'
Jul 31 15:29:47 pve systemd-modules-load[1400]: Failed to insert module 'kvmgt': No such device
Jul 31 15:29:47 pve systemd-modules-load[1400]: Failed to find module 'exngt'
Jul 31 15:29:47 pve systemd-modules-load[1400]: Inserted module 'iscsi_tcp'
Jul 31 15:29:47 pve systemd-modules-load[1400]: Inserted module 'ib_iser'
Jul 31 15:29:47 pve systemd-modules-load[1400]: Inserted module 'vhost_net'
Jul 31 15:29:47 pve systemd[1]: systemd-modules-load.service: Main process exited, code=exited, status=1/FAILURE
Jul 31 15:29:47 pve systemd[1]: systemd-modules-load.service: Failed with result 'exit-code'.
Jul 31 15:29:47 pve systemd[1]: Failed to start Load Kernel Modules.
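(To double-check which of the configured modules actually loaded, something like this works:

# lsmod | grep -e vfio -e kvmgt

Modules that failed to insert simply won't appear in the list.)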
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
# Modules required for PCI passthrough
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
# Modules required for Intel GVT
kvmgt
exngt
vfio-mdev
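(If this hardware doesn't support Intel GVT-g, one way to stop the unit from failing, assuming GVT passthrough isn't actually needed, is to comment out the two offending modules in /etc/modules and re-run the unit:

# sed -i -e 's/^kvmgt/#kvmgt/' -e 's/^exngt/#exngt/' /etc/modules
# systemctl restart systemd-modules-load.service
# systemctl status systemd-modules-load.service

Note this is unrelated to the NVME problem; it only cleans up the failing service.)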
Alright. Did a self-test in the BIOS for each drive and both passed. Removed and reinstalled the drives, did the self-test with each individually installed and the other removed, all passed. So there doesn't seem to be anything physically wrong with the drives.

Were you experimenting with PCI passthrough at the time? This can cause the Proxmox host to lose connection to devices such as NVME (M.2 via PCIe).
The
Jul 31 15:29:47 pve systemd-modules-load[1400]: Failed to insert module 'kvmgt': No such device
Jul 31 15:29:47 pve systemd-modules-load[1400]: Failed to find module 'exngt'
lines mean that the kvmgt and exngt modules claim that you have no supported hardware (which could be because it is too new). This would explain why systemd-modules-load.service failed, but not the NVME.
Is the second NVME again (or still) part of the rpool mirror? Did you clear the old errors, and did a resilver or scrub not show any new errors?

No, wasn't experimenting with anything. Yeah, one NVME is still in the rpool and it doesn't show any errors.
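(For reference, clearing the old error counters and checking the pool again would look like this; a minimal sketch, assuming the pool is rpool:

# zpool clear rpool
# zpool scrub rpool
# zpool status rpool

The scrub rereads all data in the pool and will surface any new read/write/checksum errors in the status output.)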
I can only explain this if you are doing PCI(e) passthrough and the second NVME is in the same IOMMU group, or if you are hiding the PCIe lanes with pci-stub or vfio-pci. Otherwise, I'm at a loss. Can you show us a
lspci -k
(in Proxmox or both)? Both NVME drives should appear as PCI devices using the nvme driver.

That might be it? If I'm reading this right, the kernel driver in use for one of them is vfio-pci:
# lspci -k
00:00.0 Host bridge: Intel Corporation Device 4c53 (rev 01)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
00:02.0 VGA compatible controller: Intel Corporation Device 4c8b (rev 04)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
00:06.0 PCI bridge: Intel Corporation Device 4c09 (rev 01)
Kernel driver in use: pcieport
00:08.0 System peripheral: Intel Corporation Device 4c11 (rev 01)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
00:14.0 USB controller: Intel Corporation Device 43ed (rev 11)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
00:14.2 RAM memory: Intel Corporation Device 43ef (rev 11)
00:14.3 Network controller: Intel Corporation Device 43f0 (rev 11)
Subsystem: Intel Corporation Device 0074
Kernel driver in use: iwlwifi
Kernel modules: iwlwifi
00:16.0 Communication controller: Intel Corporation Device 43e0 (rev 11)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
00:17.0 SATA controller: Intel Corporation Device 43d2 (rev 11)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
Kernel driver in use: ahci
Kernel modules: ahci
00:1b.0 PCI bridge: Intel Corporation Device 43c4 (rev 11)
Kernel driver in use: pcieport
00:1c.0 PCI bridge: Intel Corporation Device 43bc (rev 11)
Kernel driver in use: pcieport
00:1f.0 ISA bridge: Intel Corporation Device 4387 (rev 11)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
00:1f.3 Audio device: Intel Corporation Device 43c8 (rev 11)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 9d18
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
00:1f.4 SMBus: Intel Corporation Device 43a3 (rev 11)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
00:1f.5 Serial bus controller [0c80]: Intel Corporation Device 43a4 (rev 11)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a8 (rev 03)
Subsystem: Intel Corporation Device 390d
Kernel driver in use: nvme
02:00.0 Non-Volatile memory controller: Intel Corporation Device f1a8 (rev 03)
Subsystem: Intel Corporation Device 390d
Kernel driver in use: vfio-pci
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. Device 8125 (rev 04)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7d18
Kernel driver in use: r8125
Kernel modules: r8169, r8125
If the kernel driver in use is vfio-pci, the device is not accessible to the Proxmox host. Do you have a VM with a hostpci setting with 02:00.0? Can you show us a lspci -nn, so we can check the numeric ID that might be used in a vfio-pci.ids=... setting? Check your kernel command line (cat /proc/cmdline, determined by either /etc/default/grub or /etc/kernel/cmdline) and all files in /etc/modprobe.d/.
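(A quick way to run those checks; the grep targets are assumptions based on the usual Proxmox layout, where VM configs live in /etc/pve/qemu-server/:

# cat /proc/cmdline
# grep -rn vfio /etc/modprobe.d/
# grep -rn hostpci /etc/pve/qemu-server/

Per the lspci output above, both NVME controllers are Intel device f1a8, so an entry like vfio-pci.ids=8086:f1a8 on the kernel command line, or options vfio-pci ids=8086:f1a8 in /etc/modprobe.d/, would bind a matching controller to vfio-pci at boot.)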