Hello,
I recently set up a PVE node with a Z90M-ITX motherboard. The board has two M.2 slots, which I have populated with two WD_BLACK SN850X drives. These two drives make up my "vmpool", meant to be used as fast VM storage.
Code:
  pool: vmpool
 state: ONLINE
  scan: resilvered 1.08G in 00:00:06 with 0 errors on Sat Jul 6 13:32:02 2024
config:

        NAME                                           STATE     READ WRITE CKSUM
        vmpool                                         ONLINE       0     0     0
          mirror-0                                     ONLINE       0     0     0
            nvme-eui.e8238fa6bf530001001b448b47968231  ONLINE       0     0     0
            nvme-eui.e8238fa6bf530001001b448b479682ed  ONLINE       0     0     0

errors: No known data errors
I also have a mirror pool for the OS on two Samsung SSD 870 drives, which works just fine, plus a 4x18TB RAIDZ pool on an LSI 9300-8i in IT mode, which also works flawlessly.
My problem is that after every damn reboot, PVE sends me an e-mail notifying me that the vmpool is in a degraded state.
Code:
ZFS has detected that a device was removed.

 impact: Fault tolerance of the pool may be compromised.
    eid: 6
  class: statechange
  state: UNAVAIL
   host: pve
   time: 2024-07-04 13:37:21+0200
  vpath: /dev/nvme1n1p1
  vphys: pci-0000:03:00.0-nvme-1
  vguid: 0xE6A25AD4A10495B4
  devid: nvme-WD_BLACK_SN850X_1000GB_24040E803557-part1
   pool: vmpool (0x6C84CDE73BEB0949)
And the reported device path changes between /dev/nvme1n1p1 and /dev/nvme0n1p1 from one reboot to the next.
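For reference, the pool itself refers to the disks by their nvme-eui.* IDs, so the /dev/nvmeXn1 names swapping between boots should not matter by itself. The current mapping can be checked like this:
Code:
# show which /dev/nvmeXn1 device each nvme-eui.* id currently points to
ls -l /dev/disk/by-id/ | grep -i nvme-eui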
In addition, PVE reports an error for the first VM that has "Start at boot" enabled:
Code:
zfs error: cannot open 'vmpool': no such pool
So it looks like the machine thinks one of the NVMe M.2 disks is unavailable at boot, and because the zpool is then unavailable, PVE can't start the VM that should start first. It's worth noting that the second VM in the start order always starts, so the problem resolves itself after a few seconds (though the first VM that should have started never does start).
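As a possible workaround (untested on my side), I was thinking of delaying the guest autostart so the pool gets a few extra seconds to import before the first VM starts. If I remember correctly there is a datacenter-wide onboot delay option; the option name below is from memory, so please correct me if it is wrong:
Code:
# append the delay to the datacenter config
# (option name from memory -- double-check it against the datacenter.cfg docs before using)
echo "startall-onboot-delay: 60" >> /etc/pve/datacenter.cfg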
When I SSH into the machine immediately after a reboot, zpool status vmpool always shows the pool as healthy.
Does anyone have any ideas on how I could debug this further?
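For what it's worth, this is roughly what I was planning to run right after the next reboot to capture the timing (standard journalctl and zpool commands; the systemd unit names are the ones I see on my node and might differ on other setups):
Code:
# when did the NVMe namespaces, the pool import and the guest autostart happen?
journalctl -b -u zfs-import-cache.service -u zfs-import-scan.service -u pve-guests.service
journalctl -b | grep -iE 'nvme|vmpool'

# recent ZFS events and pool history (shows the removal/resilver timeline)
zpool events -v | tail -n 60
zpool history vmpool | tail -n 20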