Hello,
I recently deployed Proxmox 8 on 2 Minisforum workstations.
Here is the hardware config:
Minisforum Mini Workstation MS-01 Core i5-12600H
2 x Crucial P3 1TB M.2 PCIe Gen3 NVMe SSD
2 x Crucial 48GB DDR5 5600MHz RAM
Software:
Proxmox 8
kernel: 6.8.12-2-pve
pve-manager: 8.2.7
Proxmox is installed on a ZFS pool (RAID1 mirror) using the two Crucial NVMe SSDs.
No cluster config; each node is independent.
After a few weeks of running, I received the alert below from one PVE server:
ZFS has detected that a device was removed.
impact: Fault tolerance of the pool may be compromised.
eid: 18
class: statechange
state: REMOVED
host: rescue1
time: 2024-08-20 00:29:40+0200
vpath: /dev/disk/by-id/nvme-CT1000P3SSD8_231645EF8557-part3
vguid: 0x9BE317680434AEC5
pool: rpool (0x18AE03D40E302B68)
I rebooted the PVE server, but the SSD was still reported as REMOVED.
So I decided to replace it, which I did successfully with a brand-new drive, assuming it was a hardware failure.
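For reference, the replacement followed the standard Proxmox procedure for a disk in a bootable ZFS mirror, roughly like this (just a sketch; the device names below are examples, not my exact paths):

# copy the partition layout from the remaining healthy disk to the new disk
sgdisk /dev/nvme1n1 -R /dev/nvme0n1
# give the new disk its own random GUIDs
sgdisk -G /dev/nvme0n1
# swap the REMOVED vdev for partition 3 of the new disk
zpool replace -f rpool nvme-CT1000P3SSD8_231645EF8557-part3 /dev/nvme0n1p3
# re-create the ESP and make the new disk bootable (partition 2 on a default PVE install)
proxmox-boot-tool format /dev/nvme0n1p2
proxmox-boot-tool init /dev/nvme0n1p2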
Now I have received the alert again, not just from one PVE server but from both of them, about 24 hours apart!
I cannot imagine both SSDs suffering a hardware failure at the same time!
And I cannot believe that both Minisforum workstations have a hardware issue at the same time either!
Alert from PVE server 1:
ZFS has detected that a device was removed.
impact: Fault tolerance of the pool may be compromised.
eid: 18
class: statechange
state: REMOVED
host: rescue1
time: 2024-10-21 20:49:17+0200
vpath: /dev/disk/by-id/nvme-CT1000P3SSD8_231645EF8557-part3
vguid: 0x9BE317680434AEC5
pool: rpool (0x18AE03D40E302B68)
Alert from PVE server 2:
ZFS has detected that a device was removed.
impact: Fault tolerance of the pool may be compromised.
eid: 18
class: statechange
state: REMOVED
host: rescue2
time: 2024-10-22 19:11:16+0200
vpath: /dev/disk/by-id/nvme-CT1000P3SSD8_231645EF75C6-part3
vguid: 0xCB81D508174CE412
pool: rpool (0xC88FA9B89DABF1F7)
So my conclusion is that it could be related to a Proxmox and/or ZFS issue?
Can you help me find the root cause?
Some outputs:
root@rescue1:~# zpool status -v rpool
pool: rpool
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 0B in 00:00:08 with 0 errors on Sun Oct 13 00:24:09 2024
config:
NAME                                      STATE     READ WRITE CKSUM
rpool                                     DEGRADED     0     0     0
  mirror-0                                DEGRADED     0     0     0
    nvme-CT1000P3SSD8_231645EF8557-part3  REMOVED      0     0     0
    nvme-CT1000P3SSD8_242749BF81B8-part3  ONLINE       0     0     0
root@rescue2:~# zpool status -v rpool
pool: rpool
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 0B in 00:00:07 with 0 errors on Sun Oct 13 00:24:08 2024
config:
NAME                                      STATE     READ WRITE CKSUM
rpool                                     DEGRADED     0     0     0
  mirror-0                                DEGRADED     0     0     0
    nvme-CT1000P3SSD8_231645EF75C6-part3  REMOVED      0     0     0
    nvme-CT1000P3SSD8_231645EF80A6-part3  ONLINE       0     0     0
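If it helps, I can also collect more diagnostics. This is what I plan to run on both nodes (a sketch; the device paths are examples and may differ per node):

# kernel messages around the time the NVMe device dropped out
dmesg -T | grep -i nvme
journalctl -k | grep -iE 'nvme|pcie'
# controller health data (needs smartmontools / nvme-cli)
smartctl -a /dev/nvme0
nvme smart-log /dev/nvme0
# ZFS event history for the pool
zpool events -v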
Thanks