Replacing physical NVMe disks in Proxmox VE 8.2.2

wowagsm2

New Member
Mar 22, 2024
We have a Supermicro PIO-620U-TNR-FT019 server with the NVMe drives connected directly to the motherboard:

Code:
rpool mirror-0
nvme-INTEL_SSDPE2KX010T8
nvme-INTEL_SSDPE2KX010T8

rpool3.2 mirror-0
nvme-INTEL_SSDPE2KE032T8
nvme-INTEL_SSDPE2KE032T8

rpool3.2_2 mirror-0
nvme-INTEL_SSDPE2KE032T8
nvme-INTEL_SSDPE2KE032T8

rpool3.2_3 mirror-0
nvme-INTEL_SSDPE2KE032T8
nvme-INTEL_SSDPE2KE032T8

rpool6.4 mirror-0
nvme-INTEL_SSDPE2KE064T8
nvme-INTEL_SSDPE2KE064T8

We needed to retire the rpool3.2_2 array and replace it with new disks. The procedure was:

1. All VMs were migrated off this pool via the web interface.
2. rpool3.2_2 was removed under Datacenter - Storage, and then destroyed under Datacenter - node02 - Disks - ZFS (destroy with the two default flags).
3. The physical disks were pulled from bays 2 and 3 (the disks are labeled with their serial numbers, so they could not be confused with the ones shown in the Proxmox interface).

Immediately after that, a flood of errors appeared on the physical server's console, and in the GUI under Datacenter - node02 - Disks all six 3.2 TB NVMe disks disappeared. At the same time, the pools rpool3.2 and rpool3.2_3 went into the "suspended" state. Trying to list their contents in the console (ls /rpool3.2/) simply hung.
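For reference, this is roughly the sequence we followed, plus an explicit software detach before physically pulling the disks, which we did not do. The pool name and the PCI address here are placeholders for this system:

Code:
```shell
# Destroy the pool first (we did this via the GUI):
zpool destroy rpool3.2_2

# Map each disk to its PCI address; the resolved path ends in the
# address (0000:67:00.0 is only an example, as is nvme4n1):
readlink -f /sys/block/nvme4n1/device/device

# Detach the endpoint from the kernel before pulling the disk,
# so pciehp does not see a surprise Link Down:
echo 1 > /sys/bus/pci/devices/0000:67:00.0/remove
```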

Syslog:

Code:
Apr 08 17:28:45 node02 kernel: pcieport 0000:64:02.0: pciehp: Slot(0-3): Link Down
Apr 08 17:28:46 node02 kernel: pcieport 0000:64:04.0: pciehp: Slot(0-5): Link Down
Apr 08 17:28:46 node02 kernel: zio pool=rpool3.2_3 vdev=/dev/disk/by-id/nvme-INTEL_SSDPE2KE032T8_PHLN234201893P2FGN_1-part1 error=5 type=1 offset=270336 size=8192 flags=721601
Apr 08 17:28:46 node02 kernel: zio pool=rpool3.2_3 vdev=/dev/disk/by-id/nvme-INTEL_SSDPE2KE032T8_PHLN234201893P2FGN_1-part1 error=5 type=1 offset=3200621486080 size=8192 flags=721601
Apr 08 17:28:46 node02 zed[64833]: eid=221 class=statechange pool='rpool3.2_3' vdev=nvme-INTEL_SSDPE2KE032T8_PHLN234201893P2FGN_1-part1 vdev_state=REMOVED
Apr 08 17:28:46 node02 zed[64838]: eid=222 class=removed pool='rpool3.2_3' vdev=nvme-INTEL_SSDPE2KE032T8_PHLN234201893P2FGN_1-part1 vdev_state=REMOVED
Apr 08 17:28:46 node02 zed[64843]: eid=223 class=config_sync pool='rpool3.2_3'
Apr 08 17:28:46 node02 kernel: pcieport 0000:30:02.0: pciehp: Slot(1): Link Down
Apr 08 17:28:46 node02 kernel: pcieport 0000:30:02.0: pciehp: Slot(1): Card not present
Apr 08 17:28:46 node02 kernel: zio pool=rpool3.2 vdev=/dev/disk/by-id/nvme-nvme.8086-50484c4e323433313030474c33503246474e-494e54454c205353445045324b453033325438-00000001-part1 error=5 type=1 offset=270336 size=8192 flags=721601
Apr 08 17:28:46 node02 zed[64927]: eid=224 class=statechange pool='rpool3.2' vdev=nvme-nvme.8086-50484c4e323433313030474c33503246474e-494e54454c205353445045324b453033325438-00000001-part1 vdev_state=REMOVED
Apr 08 17:28:46 node02 zed[64931]: eid=225 class=removed pool='rpool3.2' vdev=nvme-nvme.8086-50484c4e323433313030474c33503246474e-494e54454c205353445045324b453033325438-00000001-part1 vdev_state=REMOVED
Apr 08 17:28:46 node02 zed[64934]: vdev nvme-nvme.8086-50484c4e323433313030474c33503246474e-494e54454c205353445045324b453033325438-00000001-part1 set '/sys/bus/pci/slots/1/attention' LED to 1
Apr 08 17:28:46 node02 zed[64940]: eid=226 class=config_sync pool='rpool3.2'
Apr 08 17:28:47 node02 kernel: pcieport 0000:64:02.0: pciehp: Slot(0-3): Card present
Apr 08 17:28:47 node02 kernel: pcieport 0000:64:04.0: pciehp: Slot(0-5): Card present
Apr 08 17:28:47 node02 kernel: pci 0000:67:00.0: [8086:0a54] type 00 class 0x010802 PCIe Endpoint
Apr 08 17:28:47 node02 kernel: pci 0000:67:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]
Apr 08 17:28:47 node02 kernel: pci 0000:67:00.0: ROM [mem 0x00000000-0x0000ffff pref]
Apr 08 17:28:47 node02 kernel: pci 0000:67:00.0: Max Payload Size set to 512 (was 128, max 512)
Apr 08 17:28:47 node02 kernel: pci 0000:67:00.0: enabling Extended Tags
Apr 08 17:28:47 node02 kernel: pci 0000:67:00.0: Adding to iommu group 4
Apr 08 17:28:47 node02 kernel: pcieport 0000:64:04.0: bridge window [io  0x1000-0x0fff] to [bus 67] add_size 1000
Apr 08 17:28:47 node02 kernel: pcieport 0000:64:04.0: bridge window [io  size 0x1000]: can't assign; no space
Apr 08 17:28:47 node02 kernel: pcieport 0000:64:04.0: bridge window [io  size 0x1000]: failed to assign
Apr 08 17:28:47 node02 kernel: pcieport 0000:64:04.0: bridge window [io  size 0x1000]: can't assign; no space

Apr 08 17:28:47 node02 kernel: pcieport 0000:64:04.0: bridge window [io  size 0x1000]: failed to assign
Apr 08 17:28:47 node02 kernel: pci 0000:67:00.0: ROM [mem 0xc5c00000-0xc5c0ffff pref]: assigned
Apr 08 17:28:47 node02 kernel: pci 0000:67:00.0: BAR 0 [mem 0xc5c10000-0xc5c13fff 64bit]: assigned
Apr 08 17:28:47 node02 kernel: pcieport 0000:64:04.0: PCI bridge to [bus 67]
Apr 08 17:28:47 node02 kernel: pcieport 0000:64:04.0:   bridge window [mem 0xc5c00000-0xc5cfffff]
Apr 08 17:28:47 node02 kernel: pcieport 0000:64:04.0:   bridge window [mem 0x204000400000-0x2040005fffff 64bit pref]
Apr 08 17:28:47 node02 kernel: nvme nvme4: pci function 0000:67:00.0
Apr 08 17:28:47 node02 kernel: nvme 0000:67:00.0: enabling device (0140 -> 0142)
Apr 08 17:28:47 node02 kernel: pcieport 0000:30:02.0: pciehp: Slot(1): Card present
Apr 08 17:28:47 node02 kernel: pcieport 0000:30:02.0: pciehp: Slot(1): Link Up
Apr 08 17:28:48 node02 kernel: pcieport 0000:64:02.0: broken device, retraining non-functional downstream link at 2.5GT/s
Apr 08 17:28:49 node02 kernel: pcieport 0000:64:02.0: retraining failed
Apr 08 17:28:49 node02 kernel: nvme nvme4: Shutdown timeout set to 15 seconds
Apr 08 17:28:49 node02 kernel: nvme nvme4: 128/0/0 default/read/poll queues
Apr 08 17:28:49 node02 kernel: nvme nvme4: Ignoring bogus Namespace Identifiers

We also noticed that after pulling those disks, the fault (attention) LEDs lit up on completely different disks, apparently triggered by zed (set '/sys/bus/pci/slots/1/attention' LED to 1).
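The slot-to-disk mapping and LED state can be inspected from sysfs, which may show why the wrong bays lit up (slot number 1 is just the one from our log):

Code:
```shell
# Which PCI address sits behind each hotplug slot:
grep . /sys/bus/pci/slots/*/address

# Current state of the attention LED that zed switched on (1 = on):
cat /sys/bus/pci/slots/1/attention

# Turn it back off manually:
echo 0 > /sys/bus/pci/slots/1/attention
```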

If the disks are removed while the server is powered off and it is then switched on, it boots normally and the remaining disks and pools are shown correctly. This situation worries us, because it suggests that the failure of a single disk could take down other pools.
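If it happens again, our understanding is that a suspended pool can sometimes be resumed without a reboot once the devices reappear (pool name from above; untested on our side):

Code:
```shell
# Show the suspended state and the affected vdevs:
zpool status rpool3.2

# Resume I/O once the device paths are back:
zpool clear rpool3.2

# If a vdev came back under a different by-id path, a re-import may help:
zpool export rpool3.2
zpool import -d /dev/disk/by-id rpool3.2
```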