zfs issue - a device was removed

Loïc LM

Hello,

I recently deployed Proxmox 8 on 2 Minisforum workstations.
Here is the hardware config:
Minisforum Mini Workstation MS-01, Core i5-12600H
2 x Crucial P3 1TB M.2 PCIe Gen3 NVMe SSD
2 x Crucial 48GB DDR5 5600MHz RAM

Software:
Proxmox 8
kernel: 6.8.12-2-pve
pve-manager: 8.2.7

Proxmox is installed on a ZFS pool (RAID1 mirror) using the 2 Crucial NVMe SSDs.

No cluster config; each node is independent.

After a few weeks of running, I received the alert below from one PVE server:
ZFS has detected that a device was removed.

impact: Fault tolerance of the pool may be compromised.
eid: 18
class: statechange
state: REMOVED
host: rescue1
time: 2024-08-20 00:29:40+0200
vpath: /dev/disk/by-id/nvme-CT1000P3SSD8_231645EF8557-part3
vguid: 0x9BE317680434AEC5
pool: rpool (0x18AE03D40E302B68)


I tried to reboot the PVE server but the SSD was still shown as REMOVED.
So I decided to replace it, which I did successfully with a brand new one, supposing it was a hardware failure.
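(For reference, on a ZFS-mirrored PVE install the replacement procedure is roughly the sketch below; the device names are placeholders, not my exact ones, so double-check everything against ls /dev/disk/by-id/ first.)

Code:
# Rough sketch with placeholder names: /dev/nvme0n1 = healthy disk, /dev/nvme1n1 = new disk
sgdisk /dev/nvme0n1 -R /dev/nvme1n1        # copy the partition layout from the healthy disk
sgdisk -G /dev/nvme1n1                     # give the new disk fresh partition GUIDs
proxmox-boot-tool format /dev/nvme1n1p2    # recreate the ESP on the new disk
proxmox-boot-tool init /dev/nvme1n1p2      # reinstall the bootloader there
zpool replace rpool nvme-CT1000P3SSD8_231645EF8557-part3 /dev/disk/by-id/<new-disk-id>-part3
# then wait for the resilver shown by 'zpool status' to finish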

Now I have received the alert again, not just from one PVE server, but from both of my PVE servers, about 24 hours apart!
I cannot imagine it is an SSD hardware failure on both at the same time!
And I cannot believe that I also have a hardware issue on both of my Minisforum workstations at the same time!

Alert from PVE server 1:
ZFS has detected that a device was removed.

impact: Fault tolerance of the pool may be compromised.
eid: 18
class: statechange
state: REMOVED
host: rescue1
time: 2024-10-21 20:49:17+0200
vpath: /dev/disk/by-id/nvme-CT1000P3SSD8_231645EF8557-part3
vguid: 0x9BE317680434AEC5
pool: rpool (0x18AE03D40E302B68)


Alert from PVE server 2:
ZFS has detected that a device was removed.

impact: Fault tolerance of the pool may be compromised.
eid: 18
class: statechange
state: REMOVED
host: rescue2
time: 2024-10-22 19:11:16+0200
vpath: /dev/disk/by-id/nvme-CT1000P3SSD8_231645EF75C6-part3
vguid: 0xCB81D508174CE412
pool: rpool (0xC88FA9B89DABF1F7)


So my conclusion is that it could be related to a Proxmox and/or ZFS issue?

Can you help me find the root cause?

Some outputs:
root@rescue1:~# zpool status -v rpool
  pool: rpool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0B in 00:00:08 with 0 errors on Sun Oct 13 00:24:09 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        rpool                                     DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            nvme-CT1000P3SSD8_231645EF8557-part3  REMOVED      0     0     0
            nvme-CT1000P3SSD8_242749BF81B8-part3  ONLINE       0     0     0

root@rescue2:~# zpool status -v rpool
  pool: rpool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0B in 00:00:07 with 0 errors on Sun Oct 13 00:24:08 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        rpool                                     DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            nvme-CT1000P3SSD8_231645EF75C6-part3  REMOVED      0     0     0
            nvme-CT1000P3SSD8_231645EF80A6-part3  ONLINE       0     0     0
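For completeness, the action the status output points to would be something like this (rescue1's device name used as the example):

Code:
zpool online rpool nvme-CT1000P3SSD8_231645EF8557-part3   # try to bring the REMOVED vdev back
zpool clear rpool                                         # clear the error counters
zpool status -v rpool                                     # check whether a resilver starts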


Thanks
 
You've got 2 machines with the same bug. Maybe check for a BIOS upgrade? This smells like a firmware bug.

You're using consumer SSDs... Maybe forget ZFS and only use LVM on this kind of configuration.
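To see what you are currently on, something like this shows the BIOS version and the SSD firmware revision (assuming dmidecode and nvme-cli are installed):

Code:
dmidecode -s bios-version     # current mainboard BIOS version
nvme list                     # model, serial and firmware revision of each NVMe drive
smartctl -i /dev/nvme0        # per-drive identity info via smartmontools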
 
Have you checked the SMART values?

By definition, if these NVMes do not keep disappearing in the same manner in a non-ZFS setup, it has nothing to do with them.

BTW I also think ZFS is a poor choice for most use cases, especially since this is a P3 with only 220 TBW endurance, and ZFS as well as PVE suffer from heavy write amplification.
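A quick way to pull them for both drives, e.g.:

Code:
smartctl -a /dev/nvme0        # full SMART/health report, first NVMe
smartctl -a /dev/nvme1        # second NVMe
nvme smart-log /dev/nvme0     # alternative via nvme-cli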
 
I'm also experiencing this problem on my Minisforum MS-01. Same processor as the OP.
Kernel Version: Linux 6.8.12-5-pve (2024-12-03T10:26Z)
Boot Mode: EFI
Manager Version: pve-manager/8.3.2/3e76eec21c4a14a7
2 x Lexar 1TB NM790 NVMe.
These are the only disks in this node. I tried populating them in different slots. No difference.
I only have VMs on this node.

Exact same zpool status as the OP:
root@pve-infra:~# zpool status -v rpool
  pool: rpool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 2.37G in 00:00:04 with 0 errors on Thu Jan 16 19:28:37 2025
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          DEGRADED     0     0     0
          mirror-0     DEGRADED     0     0     0
            nvme1n1p3  REMOVED      0     0     0
            nvme0n1p3  ONLINE       0     0     0

Symptoms and potential clues I found:
  1. If I reboot the Proxmox host, the removed drive comes back. A resilver is auto-triggered and everything goes back to green afterwards. SMART is all good for both drives.
  2. I have a daily scheduled backup of my VMs to PBS running on another machine. Based on the email alert notifications, I noticed the disk is ALWAYS dropped while finishing backup jobs. I tried triggering the backup job manually but can't reproduce it. Below is the system log from last night's scheduled backup. There seems to be something special about VM 120. I actually already moved that VM to another node a few days ago; I kept it here as a hot spare. I'm going to remove it to see if that remedies the problem.


Code:
Jan 16 23:50:02 pve-infra pvescheduler[94887]: INFO: starting new backup job: vzdump --mailto [email]@gmail.com --all 1 --quiet 1 --mode snapshot --fleecing 0 --storage PBS --notes-template '{{guestname}}' --mailnotification always
Jan 16 23:50:02 pve-infra pvescheduler[94887]: INFO: Starting Backup of VM 100 (qemu)
Jan 16 23:51:07 pve-infra pvescheduler[94887]: INFO: Finished Backup of VM 100 (00:01:05)
Jan 16 23:51:07 pve-infra pvescheduler[94887]: INFO: Starting Backup of VM 101 (qemu)
Jan 16 23:51:47 pve-infra pvescheduler[94887]: INFO: Finished Backup of VM 101 (00:00:40)
Jan 16 23:51:47 pve-infra pvescheduler[94887]: INFO: Starting Backup of VM 102 (qemu)
Jan 16 23:52:49 pve-infra pvescheduler[94887]: INFO: Finished Backup of VM 102 (00:01:02)
Jan 16 23:52:49 pve-infra pvescheduler[94887]: INFO: Starting Backup of VM 110 (qemu)
Jan 16 23:53:45 pve-infra pvescheduler[94887]: INFO: Finished Backup of VM 110 (00:00:56)
Jan 16 23:53:45 pve-infra pvescheduler[94887]: INFO: Starting Backup of VM 120 (qemu)
Jan 16 23:53:45 pve-infra kernel: VFIO - User Level meta-driver version: 0.3
Jan 16 23:53:45 pve-infra kernel: zio pool=rpool vdev=/dev/nvme1n1p3 error=5 type=1 offset=270336 size=8192 flags=721601
Jan 16 23:53:45 pve-infra kernel: zio pool=rpool vdev=/dev/nvme1n1p3 error=5 type=1 offset=1022200651776 size=8192 flags=721601
Jan 16 23:53:45 pve-infra zed[96738]: eid=13 class=statechange pool='rpool' vdev=nvme1n1p3 vdev_state=REMOVED
Jan 16 23:53:45 pve-infra zed[96743]: eid=14 class=removed pool='rpool' vdev=nvme1n1p3 vdev_state=REMOVED
Jan 16 23:53:45 pve-infra postfix/pickup[73003]: AA70B19405: uid=0 from=<root>
Jan 16 23:53:45 pve-infra zed[96782]: eid=15 class=config_sync pool='rpool'
Jan 16 23:53:45 pve-infra postfix/cleanup[96773]: AA70B19405: message-id=<20250117075345.AA70B19405@pve-infra.[email].com>
Jan 16 23:53:45 pve-infra postfix/qmgr[1482]: AA70B19405: from=<root@pve-infra.[email].com>, size=762, nrcpt=1 (queue active)
Jan 16 23:53:45 pve-infra postfix/pickup[73003]: AFD3F18167: uid=65534 from=<root>
Jan 16 23:53:45 pve-infra proxmox-mail-forward[96784]: notified via target `mail-to-root`
Jan 16 23:53:45 pve-infra postfix/cleanup[96773]: AFD3F18167: message-id=<20250117075345.AA70B19405@pve-infra.[email].com>
Jan 16 23:53:45 pve-infra postfix/local[96783]: AA70B19405: to=<root@pve-infra.[email].com>, orig_to=<root>, relay=local, delay=0.03, delays=0.01/0/0/0.01, dsn=2.0.0, status=sent (delivered to command: /usr/bin/proxmox-mail-forward)
Jan 16 23:53:45 pve-infra postfix/qmgr[1482]: AA70B19405: removed
Jan 16 23:53:45 pve-infra postfix/qmgr[1482]: AFD3F18167: from=<root@pve-infra.[email].com>, size=945, nrcpt=1 (queue active)
Jan 16 23:53:45 pve-infra systemd[1]: Started 120.scope.
Jan 16 23:53:46 pve-infra kernel: tap120i0: entered promiscuous mode
Jan 16 23:53:46 pve-infra kernel: vmbr2: port 6(fwpr120p0) entered blocking state
Jan 16 23:53:46 pve-infra kernel: vmbr2: port 6(fwpr120p0) entered disabled state
Jan 16 23:53:46 pve-infra kernel: fwpr120p0: entered allmulticast mode
Jan 16 23:53:46 pve-infra kernel: fwpr120p0: entered promiscuous mode
Jan 16 23:53:46 pve-infra kernel: vmbr2: port 6(fwpr120p0) entered blocking state
Jan 16 23:53:46 pve-infra kernel: vmbr2: port 6(fwpr120p0) entered forwarding state
Jan 16 23:53:46 pve-infra kernel: fwbr120i0: port 1(fwln120i0) entered blocking state
Jan 16 23:53:46 pve-infra kernel: fwbr120i0: port 1(fwln120i0) entered disabled state
Jan 16 23:53:46 pve-infra kernel: fwln120i0: entered allmulticast mode
Jan 16 23:53:46 pve-infra kernel: fwln120i0: entered promiscuous mode
Jan 16 23:53:46 pve-infra kernel: fwbr120i0: port 1(fwln120i0) entered blocking state
Jan 16 23:53:46 pve-infra kernel: fwbr120i0: port 1(fwln120i0) entered forwarding state
Jan 16 23:53:46 pve-infra kernel: fwbr120i0: port 2(tap120i0) entered blocking state
Jan 16 23:53:46 pve-infra kernel: fwbr120i0: port 2(tap120i0) entered disabled state
Jan 16 23:53:46 pve-infra kernel: tap120i0: entered allmulticast mode
Jan 16 23:53:46 pve-infra kernel: fwbr120i0: port 2(tap120i0) entered blocking state
Jan 16 23:53:46 pve-infra kernel: fwbr120i0: port 2(tap120i0) entered forwarding state
Jan 16 23:53:47 pve-infra pvescheduler[94887]: VM 120 started with PID 96805.
Jan 16 23:54:15 pve-infra postfix/smtp[96787]: connect to smtp.gmail.com[2607:f8b0:400e:c0a::6d]:587: Connection timed out
Jan 16 23:54:16 pve-infra postfix/smtp[96787]: AFD3F18167: replace: header From: root <root@pve-infra.[email].com>: From: pve-infra <pve1-alert@something.com>
Jan 16 23:54:17 pve-infra postfix/smtp[96787]: AFD3F18167: to=<[email]@gmail.com>, relay=smtp.gmail.com[172.253.117.109]:587, delay=32, delays=0.01/0.01/31/1.3, dsn=2.0.0, status=sent (250 2.0.0 OK  1737100457 d9443c01a7336-21c2d3deb37sm10431995ad.187 - gsmtp)
Jan 16 23:54:17 pve-infra postfix/qmgr[1482]: AFD3F18167: removed
Jan 16 23:54:29 pve-infra kernel:  zd16: p1 p2
Jan 16 23:54:29 pve-infra kernel: tap120i0: left allmulticast mode
Jan 16 23:54:29 pve-infra kernel: fwbr120i0: port 2(tap120i0) entered disabled state
Jan 16 23:54:29 pve-infra kernel: fwbr120i0: port 1(fwln120i0) entered disabled state
Jan 16 23:54:29 pve-infra kernel: vmbr2: port 6(fwpr120p0) entered disabled state
Jan 16 23:54:29 pve-infra kernel: fwln120i0 (unregistering): left allmulticast mode
Jan 16 23:54:29 pve-infra kernel: fwln120i0 (unregistering): left promiscuous mode
Jan 16 23:54:29 pve-infra kernel: fwbr120i0: port 1(fwln120i0) entered disabled state
Jan 16 23:54:29 pve-infra kernel: fwpr120p0 (unregistering): left allmulticast mode
Jan 16 23:54:29 pve-infra kernel: fwpr120p0 (unregistering): left promiscuous mode
Jan 16 23:54:29 pve-infra kernel: vmbr2: port 6(fwpr120p0) entered disabled state
Jan 16 23:54:29 pve-infra qmeventd[1063]: read: Connection reset by peer
Jan 16 23:54:29 pve-infra pvescheduler[94887]: INFO: Finished Backup of VM 120 (00:00:44)
Jan 16 23:54:29 pve-infra pvescheduler[94887]: INFO: Backup job finished successfully
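One way to catch the drop live during the next backup window would be something like:

Code:
journalctl -kf | grep -Ei 'nvme|zio|vdev'   # follow kernel messages for NVMe/ZFS errors
zpool events -f rpool                       # follow ZFS events (statechange, removed, ...)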
 
Check system temps and the syslog for SMART temperature values changing; the NVMe might be overheating.

I have 2x NM790s in 2 different boxes (Qotom firewall appliance and Beelink) and 0 issues for the past year - but I have both of them heatsinked.
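A quick way to keep an eye on the drive temperatures, e.g.:

Code:
nvme smart-log /dev/nvme1 | grep -i temperature   # composite and sensor temperatures
smartctl -a /dev/nvme1 | grep -i temp             # same via smartmontools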
 
Temps should be fine. They are directly under a fan.
Last night went by without a disk drop. Removing VM 120 in my case seems to have done the trick. Will observe more.
Here is the log from last night. Note the big difference: all the vmbr2-related activity is gone. VM 120 was off at the time of those backups.

Code:
Jan 17 23:50:03 pve-infra pvescheduler[298449]: INFO: starting new backup job: vzdump --storage PBS --all 1 --notes-template '{{guestname}}' --mode snapshot --quiet 1 --mailto [email]@gmail.com --fleecing 0 --mailnotification always
Jan 17 23:50:03 pve-infra pvescheduler[298449]: INFO: Starting Backup of VM 100 (qemu)
Jan 17 23:51:22 pve-infra pvescheduler[298449]: INFO: Finished Backup of VM 100 (00:01:19)
Jan 17 23:51:22 pve-infra pvescheduler[298449]: INFO: Starting Backup of VM 101 (qemu)
Jan 17 23:52:10 pve-infra pvescheduler[298449]: INFO: Finished Backup of VM 101 (00:00:48)
Jan 17 23:52:10 pve-infra pvescheduler[298449]: INFO: Starting Backup of VM 102 (qemu)
Jan 17 23:53:15 pve-infra pvescheduler[298449]: INFO: Finished Backup of VM 102 (00:01:05)
Jan 17 23:53:15 pve-infra pvescheduler[298449]: INFO: Starting Backup of VM 110 (qemu)
Jan 17 23:54:28 pve-infra pvescheduler[298449]: INFO: Finished Backup of VM 110 (00:01:13)
Jan 17 23:54:28 pve-infra pvescheduler[298449]: INFO: Backup job finished successfully
 
Do you have any news?

I got the same on 2 of 3 Minisforum machines.

Code:
Mar 26 09:45:49 pve252 kernel: nvme nvme1: I/O tag 19 (0013) opcode 0x2 (I/O Cmd) QID 7 timeout, aborting req_op:READ(0) size:8192
Mar 26 09:45:49 pve252 kernel: nvme nvme1: I/O tag 634 (727a) opcode 0x2 (I/O Cmd) QID 15 timeout, aborting req_op:READ(0) size:131072
Mar 26 09:45:52 pve252 kernel: nvme nvme1: I/O tag 774 (f306) opcode 0x1 (I/O Cmd) QID 5 timeout, aborting req_op:WRITE(1) size:126976
Mar 26 09:45:52 pve252 kernel: nvme nvme1: I/O tag 775 (9307) opcode 0x1 (I/O Cmd) QID 5 timeout, aborting req_op:WRITE(1) size:126976
Mar 26 09:45:52 pve252 kernel: nvme nvme1: I/O tag 20 (1014) opcode 0x1 (I/O Cmd) QID 7 timeout, aborting req_op:WRITE(1) size:126976
Mar 26 09:45:52 pve252 kernel: nvme nvme1: I/O tag 123 (f07b) opcode 0x1 (I/O Cmd) QID 12 timeout, aborting req_op:WRITE(1) size:126976
Mar 26 09:45:52 pve252 kernel: nvme nvme1: I/O tag 41 (c029) opcode 0x1 (I/O Cmd) QID 13 timeout, aborting req_op:WRITE(1) size:118784
Mar 26 09:45:52 pve252 kernel: nvme nvme1: I/O tag 42 (a02a) opcode 0x1 (I/O Cmd) QID 13 timeout, aborting req_op:WRITE(1) size:126976
Mar 26 09:46:19 pve252 kernel: nvme nvme1: I/O tag 19 (0013) opcode 0x2 (I/O Cmd) QID 7 timeout, reset controller
Mar 26 09:47:43 pve252 kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Mar 26 09:47:43 pve252 kernel: nvme nvme1: Abort status: 0x371
Mar 26 09:47:43 pve252 kernel: nvme nvme1: Abort status: 0x371
Mar 26 09:47:43 pve252 kernel: nvme nvme1: Abort status: 0x371
Mar 26 09:47:43 pve252 kernel: nvme nvme1: Abort status: 0x371
Mar 26 09:47:43 pve252 kernel: nvme nvme1: Abort status: 0x371
Mar 26 09:47:43 pve252 kernel: nvme nvme1: Abort status: 0x371
Mar 26 09:47:43 pve252 kernel: nvme nvme1: Abort status: 0x371
Mar 26 09:47:43 pve252 kernel: nvme nvme1: Abort status: 0x371
Mar 26 09:48:03 pve252 kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Mar 26 09:48:03 pve252 kernel: nvme nvme1: Disabling device after reset failure: -19
Mar 26 09:48:03 pve252 kernel: I/O error, dev nvme1n1, sector 1551801392 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Mar 26 09:48:03 pve252 kernel: I/O error, dev nvme1n1, sector 1355429960 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=1 offset=793447522304 size=8192 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=1 offset=692905349120 size=131072 flags=1573248
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=2 offset=352195444736 size=126976 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=2 offset=825747046400 size=4096 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=2 offset=352730890240 size=118784 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=2 offset=352195969024 size=126976 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=2 offset=352195575808 size=126976 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=2 offset=352196624384 size=126976 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=2 offset=352195706880 size=126976 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=2 offset=352195837952 size=126976 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=2 offset=825747054592 size=4096 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=2 offset=352196100096 size=126976 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=2 offset=352196231168 size=126976 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=2 offset=352196755456 size=126976 flags=1572992
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=5 offset=0 size=0 flags=1049728
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=5 offset=0 size=0 flags=1049728
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=5 offset=0 size=0 flags=1049728
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=5 offset=0 size=0 flags=1049728
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=5 offset=0 size=0 flags=1049728
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=5 offset=0 size=0 flags=1049728
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=5 offset=0 size=0 flags=1049728
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=5 offset=0 size=0 flags=1049728
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=5 offset=0 size=0 flags=1049728
Mar 26 09:48:03 pve252 kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384c31b09ce0-part3 error=5 type=5 offset=0 size=0 flags=1049728
Mar 26 09:48:03 pve252 zed[3129411]: eid=110 class=statechange pool='rpool' vdev=nvme-eui.0025384c31b09ce0-part3 vdev_state=REMOVED
Mar 26 09:48:03 pve252 zed[3129418]: eid=111 class=removed pool='rpool' vdev=nvme-eui.0025384c31b09ce0-part3 vdev_state=REMOVED
Mar 26 10:04:47 pve252 smartd[917]: Device: /dev/nvme1, removed NVMe device: Resource temporarily unavailable

A few minutes later, the SMART data from the other SSD looks like this:

Code:
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        48 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    19,150,739 [9.80 TB]
Data Units Written:                 13,151,274 [6.73 TB]
Host Read Commands:                 190,213,181
Host Write Commands:                134,668,376
Controller Busy Time:               513
Power Cycles:                       7
Power On Hours:                     90
Unsafe Shutdowns:                   2
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               48 Celsius
Temperature Sensor 2:               50 Celsius

So I think it is not a heat problem?

Last time, a shutdown and cold boot brought back the SSD.
 
Hello,
I've just updated the BIOS of my 2 Minisforum Mini Workstation MS-01 units to version 1.26, and the SSD that was marked as REMOVED is now online again!
An automatic resilvering operation ran and the ZFS pool health is OK again :)
Now I need to wait a few days/weeks to see if everything is stable, but I suspect the BIOS update was what was needed.
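For anyone checking afterwards, something like this should confirm the mirror is healthy again (the scrub is optional):

Code:
zpool status -v rpool    # both mirror members should show ONLINE with no errors
zpool scrub rpool        # optional: re-verify all data after the resilver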