ZFS issue after upgrade to PVE 8

magicfinger

Apr 25, 2024
After upgrading my 3 Proxmox hosts from PVE 7 to PVE 8, I have an issue with ZFS during boot that leads to approx. 4-8 email notifications from each host:
Code:
ZFS has detected that a device was removed.
 impact: Fault tolerance of the pool may be compromised.
    eid: 9
  class: statechange
  state: UNAVAIL
...

It happens on all 3 Proxmox hosts and started exactly with the upgrade to PVE 8, so it cannot be hardware related and is probably some configuration issue.

The relevant system log lines that lead to these emails are:
Code:
29.631480+0200 systemd[1]: Starting zfs-import-cache.service - Import ZFS pools by cache file...
29.631522+0200 systemd[1]: zfs-import-scan.service - Import ZFS pools by device scanning was skipped because of an unmet condition check (ConditionFileNotEmpty=!/etc/zfs/zpool.cache>
33.138671+0200 zpool[1434]: cannot import 'vmpool': one or more devices is currently unavailable
33.142203+0200 zpool[1434]: The devices below are missing or corrupted, use '-m' to import the pool anyway:
33.142203+0200 zpool[1434]:             mirror-1 [log]
33.142203+0200 zpool[1434]:               nvme0n1p1
33.142203+0200 zpool[1434]:               nvme1n1p1
33.142203+0200 zpool[1434]: cachefile import failed, retrying
33.143130+0200 systemd[1]: Finished zfs-import-cache.service - Import ZFS pools by cache file.
33.143214+0200 systemd[1]: Reached target zfs-import.target - ZFS pool import target.
33.160434+0200 systemd[1]: Starting zfs-mount.service - Mount ZFS filesystems...
33.161613+0200 systemd[1]: Starting zfs-volume-wait.service - Wait for ZFS Volume (zvol) links in /dev...
33.297713+0200 zvol_wait[2026]: Testing 19 zvol links
33.365155+0200 zvol_wait[2026]: All zvol links are now present.
33.365530+0200 systemd[1]: Finished zfs-volume-wait.service - Wait for ZFS Volume (zvol) links in /dev.
33.365608+0200 systemd[1]: Reached target zfs-volumes.target - ZFS volumes are ready.
33.415172+0200 systemd[1]: Finished zfs-mount.service - Mount ZFS filesystems.
33.415230+0200 systemd[1]: Reached target local-fs.target - Local File Systems.
...
34.887409+0200 zed[2140]: ZFS Event Daemon 2.2.3-pve2 (PID 2140)
34.888256+0200 zed[2140]: Processing events since eid=0
35.051202+0200 zed[2221]: eid=7 class=statechange pool='vmpool' vdev=nvme1n1p1 vdev_state=UNAVAIL
35.051203+0200 zed[2222]: eid=2 class=config_sync pool='rpool'
35.051203+0200 zed[2220]: eid=5 class=config_sync pool='rpool'
35.051221+0200 zed[2224]: eid=3 class=pool_import pool='rpool'
35.051222+0200 zed[2223]: eid=6 class=statechange pool='vmpool' vdev=nvme0n1p1 vdev_state=UNAVAIL
35.095549+0200 zed[2241]: eid=8 class=vdev.no_replicas pool='vmpool'
35.095915+0200 zed[2243]: eid=9 class=statechange pool='vmpool' vdev=nvme0n1p1 vdev_state=UNAVAIL
35.097410+0200 zed[2254]: eid=10 class=statechange pool='vmpool' vdev=nvme1n1p1 vdev_state=UNAVAIL
35.099247+0200 zed[2268]: eid=12 class=zpool pool='vmpool'
35.099441+0200 zed[2266]: eid=11 class=vdev.no_replicas pool='vmpool'
35.099910+0200 zed[2275]: eid=13 class=statechange pool='vmpool' vdev=nvme1n1p1 vdev_state=UNAVAIL

The issue seems to affect only the ZIL/SLOG partitions of the vmpool: it is always about the partitions "nvme0n1p1" and "nvme1n1p1", which are used in ZFS as "mirror-1 [log]".

After boot, the vmpool works normally and I see read and write operations even on the ZIL/SLOG devices.
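
If anyone wants to double-check this on their own setup, the per-vdev activity (including the log mirror) can be watched with "zpool iostat"; only the pool name from above is assumed here:
Code:
# Per-vdev I/O statistics for vmpool, refreshed every 5 seconds.
# The mirror-1 entry under "logs" should show write ops whenever sync writes hit the SLOG.
zpool iostat -v vmpool 5
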
"zpool status" also shows the pool as ONLINE without errors:
Code:
  pool: vmpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:18 with 0 errors on Sun Apr 14 00:25:15 2024
config:

    NAME           STATE     READ WRITE CKSUM
    vmpool         ONLINE       0     0     0
      mirror-0     ONLINE       0     0     0
        sda5       ONLINE       0     0     0
        sdb5       ONLINE       0     0     0
    logs   
      mirror-1     ONLINE       0     0     0
        nvme0n1p1  ONLINE       0     0     0
        nvme1n1p1  ONLINE       0     0     0

errors: No known data errors

The number of these final log lines (the ones with "vdev_state=UNAVAIL") that result in emails varies between reboots (approx. 4-10).
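
For reference, the events that ZED turned into those emails can still be listed after boot; this is just the standard zpool tooling, nothing specific to my setup:
Code:
# List the ZFS events (eid, class, pool) that ZED has processed since boot.
zpool events
# Same list, but with full details for each event.
zpool events -v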

Any ideas how to get rid of these errors?
 
One out of 13 reboots did not produce these errors.
In the logs of that boot, there is no "zpool" line like this:
Code:
zpool[1434]: cannot import 'vmpool': one or more devices is currently unavailable

Could this be some kind of race condition depending on which service or device comes up first?
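
If it really is a race with the NVMe devices showing up a bit late, one crude workaround I can think of would be to delay the cache-file import slightly via a systemd drop-in. This is only a sketch of the idea (the 5 second value is an arbitrary guess), not something I have verified:
Code:
# /etc/systemd/system/zfs-import-cache.service.d/override.conf
# (e.g. created with "systemctl edit zfs-import-cache.service")
[Service]
# Give late-appearing NVMe partitions a few extra seconds before the import runs.
ExecStartPre=/bin/sleep 5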
 
Just in case someone else is running into similar boot error messages, I am posting my solution here:

The boot errors about the ZFS pool "vmpool" disappeared after I upgraded the ZFS root pool ("rpool") with "zfs upgrade rpool". Upgrading only the "vmpool" was not enough to get rid of the errors, even though the errors specifically concern the "vmpool". I had previously not upgraded the root pool because of the various warnings about this here in the forum. --> WARNING: Check your boot configuration before upgrading the root pool to make sure it is safe to do so!
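
For anyone in the same situation: before touching the root pool, it helps to check how the host actually boots and which pools/datasets are behind. A short sketch with standard Proxmox/OpenZFS commands (the remark about GRUB reflects my understanding of the forum warnings, so double-check it for your setup):
Code:
# Show whether the host boots via proxmox-boot-tool (UEFI/systemd-boot or GRUB managed by it)
# or via legacy GRUB reading the ZFS pool directly; the latter is the risky case
# for enabling new pool features on rpool.
proxmox-boot-tool status

# Without arguments, list pools that do not have all supported feature flags enabled.
zpool upgrade

# Without arguments, list datasets whose on-disk filesystem version is not current.
zfs upgrade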

Anyway, for me this solved the issue on all 3 Proxmox hosts. I still don't understand how the non-upgraded root pool could cause these boot errors about a different pool.
 
