ZFS issue - a device was removed

Loïc LM

Hello,

I recently deployed Proxmox 8 on 2 Minisforum workstations.
Here is the hardware config:
Minisforum Mini Workstation MS-01 Core i5-12600H
2 x Crucial P3 1TB M.2 PCIe Gen3 NVMe SSD
2 x Crucial 48GB DDR5 5600MHz RAM

Software:
Proxmox 8
kernel: 6.8.12-2-pve
pve-manager: 8.2.7

Proxmox is installed on a ZFS pool (RAID1) using the 2 Crucial NVMe SSDs.

No cluster config, each node is independent.

After a few weeks of running, I received the alert below from one PVE server:
ZFS has detected that a device was removed.

impact: Fault tolerance of the pool may be compromised.
eid: 18
class: statechange
state: REMOVED
host: rescue1
time: 2024-08-20 00:29:40+0200
vpath: /dev/disk/by-id/nvme-CT1000P3SSD8_231645EF8557-part3
vguid: 0x9BE317680434AEC5
pool: rpool (0x18AE03D40E302B68)


I rebooted the PVE server, but the SSD was still shown as REMOVED.
So I decided to replace it with a brand new one, which I did successfully, assuming it was a hardware failure.
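
For reference, I followed roughly the standard procedure for replacing a disk in a Proxmox ZFS root mirror; from memory it looked like the steps below (device names, partition numbers and the new disk's by-id name are placeholders, not my exact ones):

# copy the partition layout from the healthy disk to the new one
sgdisk /dev/nvme1n1 -R /dev/nvme0n1
# give the new disk its own random GUIDs
sgdisk -G /dev/nvme0n1
# let ZFS resilver onto the new disk's third partition
zpool replace -f rpool nvme-CT1000P3SSD8_231645EF8557-part3 /dev/disk/by-id/nvme-NEWDISK-part3
# re-create and register the ESP so the new disk stays bootable
proxmox-boot-tool format /dev/nvme0n1p2
proxmox-boot-tool init /dev/nvme0n1p2

The resilver completed without errors.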

Now I have recently received the alert again, not from just one PVE server but from both of my PVE servers, about 24 hours apart!
I cannot imagine it being an SSD hardware failure on both at the same time!
And I cannot believe that I also have a hardware issue on both of my Minisforum workstations at the same time!

Alert from PVE server 1:
ZFS has detected that a device was removed.

impact: Fault tolerance of the pool may be compromised.
eid: 18
class: statechange
state: REMOVED
host: rescue1
time: 2024-10-21 20:49:17+0200
vpath: /dev/disk/by-id/nvme-CT1000P3SSD8_231645EF8557-part3
vguid: 0x9BE317680434AEC5
pool: rpool (0x18AE03D40E302B68)


Alert from PVE server 2:
ZFS has detected that a device was removed.

impact: Fault tolerance of the pool may be compromised.
eid: 18
class: statechange
state: REMOVED
host: rescue2
time: 2024-10-22 19:11:16+0200
vpath: /dev/disk/by-id/nvme-CT1000P3SSD8_231645EF75C6-part3
vguid: 0xCB81D508174CE412
pool: rpool (0xC88FA9B89DABF1F7)


So my conclusion is that it could be related to a Proxmox and/or ZFS issue?

Can you help me find the root cause?

Some outputs:
root@rescue1:~# zpool status -v rpool
  pool: rpool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0B in 00:00:08 with 0 errors on Sun Oct 13 00:24:09 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        rpool                                     DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            nvme-CT1000P3SSD8_231645EF8557-part3  REMOVED      0     0     0
            nvme-CT1000P3SSD8_242749BF81B8-part3  ONLINE       0     0     0
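
For reference, the 'zpool online' command that the action line suggests would be, in my case, something like:

zpool online rpool nvme-CT1000P3SSD8_231645EF8557-part3

though a reboot did not bring the disk back the first time this happened.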

root@rescue2:~# zpool status -v rpool
  pool: rpool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0B in 00:00:07 with 0 errors on Sun Oct 13 00:24:08 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        rpool                                     DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            nvme-CT1000P3SSD8_231645EF75C6-part3  REMOVED      0     0     0
            nvme-CT1000P3SSD8_231645EF80A6-part3  ONLINE       0     0     0
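
To help find the root cause, I can also post the output of the commands below from both nodes (assuming smartmontools and nvme-cli are installed; the exact device node may differ per node):

# ZFS event history around the REMOVED state change
zpool events -v rpool
# kernel messages related to the NVMe controller / PCIe link
journalctl -k | grep -iE 'nvme|pcie'
# does the OS still see the disk at all?
nvme list
# SMART / health data for the drive (adjust the device node if needed)
smartctl -a /dev/nvme0

Just tell me which ones are useful.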


Thanks
 