[SOLVED] ZFS unmounts after reboot - and proposed fix

Psilospiral

Greetings Forum:

I am running PVE 5.4-11 on an R720xd with a single CT for testing: TKL File Server.

While copying files from another NAS to the TKL File Server CT via a Windows client on the LAN, I noticed that the shares would strangely become unavailable. After a little digging and research in the forums, I issued:
Code:
root@pve-r720xd1:~# zfs list -r -o name,mountpoint,mounted
NAME                       MOUNTPOINT                  MOUNTED
r720xd1                    /r720xd1                         no
r720xd1/subvol-108-disk-0  /r720xd1/subvol-108-disk-0       no

and realized the ZFS filesystem was no longer mounted. After issuing:
Code:
zfs mount -O -a

the ZFS pool came back online, was available to the TKL File Server, and the shares were again reachable from the Windows client with no reboot needed. Excellent! (For reference, -a mounts all datasets and -O allows mounting over a non-empty directory.)

I originally thought this occurred because I had ejected one of the hot-swap drives and reinserted it to test hot-swap functionality. After successfully remounting the pool, I decided to rebuild it with wwn- designations instead of /dev/sd[x]. For some time I thought that had solved the issue, but after a power blink this weekend the exact same situation occurred: the ZFS pool was unmounted after the reboot. (I still haven't moved the server to my UPS.)

So, I am not sure why the ZFS filesystem occasionally becomes unmounted in PVE after a reboot. But I do know that issuing
Code:
zfs mount -O -a

after a reboot takes care of the problem every time.

Where can I include
Code:
zfs mount -O -a

in the PVE config so that I can make 100% sure my ZFS filesystem will be (re)mounted on every reboot? (Or is there another place I can probe to discover why the ZFS filesystem is becoming unmounted in the first place?)
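
If it helps clarify what I'm after, this is the kind of brute-force oneshot unit I was imagining (the unit name and contents are just my untested sketch - I'd still rather fix the real cause):
Code:
# /etc/systemd/system/zfs-overlay-mount.service  (hypothetical name/path)
[Unit]
Description=Fallback overlay-mount of all ZFS datasets at boot
After=zfs-mount.service

[Service]
Type=oneshot
ExecStart=/sbin/zfs mount -O -a

[Install]
WantedBy=zfs.target

# enable with: systemctl enable zfs-overlay-mount.service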
 
* Hmm - please post the journal from the reboot - or rather the portion mentioning ZFS (usually you'll also have a failed unit in that case, since the ZFS pools failed to mount).
* What's the status of the various ZFS import services:
** `systemctl status -l zfs-import-cache.service`
** `systemctl status -l zfs-import-scan.service`
** `systemctl status -l zfs-import.service`
** `systemctl status -l zfs-mount.service`

else try to update the cache file (usually `zpool set cachefile=/etc/zfs/zpool.cache <poolname>`) and update your initramfs afterwards (`update-initramfs -k all -u`).
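For the pool named in the first post, that would be:
Code:
zpool set cachefile=/etc/zfs/zpool.cache r720xd1
update-initramfs -k all -u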

This normally takes care of these problems.

Hope this helps!
 
Stoiko:

Thank you for the quick reply.

Journal during the reboot:
Code:
root@pve-r720xd1:~# dmesg|grep zfs
[   42.171357] audit: type=1400 audit(1566225268.314:12): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="/usr/bin/lxc-start" name="/r720xd1/subvol-108-disk-0/" pid=3328 comm="mount.zfs" fstype="zfs" srcname="r720xd1/subvol-108-disk-0" flags="rw, strictatime"

Status of the various ZFS import services:
** `systemctl status -l zfs-import-cache.service`
Code:
root@pve-r720xd1:~# systemctl status -l zfs-import-cache.service
● zfs-import-cache.service - Import ZFS pools by cache file
   Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2019-08-19 10:34:12 EDT; 4h 14min ago
     Docs: man:zpool(8)
  Process: 1473 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN (code=exited, status=1/FAILURE)
 Main PID: 1473 (code=exited, status=1/FAILURE)
      CPU: 5ms

Aug 19 10:34:12 pve-r720xd1 systemd[1]: Starting Import ZFS pools by cache file...
Aug 19 10:34:12 pve-r720xd1 zpool[1473]: invalid or corrupt cache file contents: invalid or missing cache file
Aug 19 10:34:12 pve-r720xd1 systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Aug 19 10:34:12 pve-r720xd1 systemd[1]: Failed to start Import ZFS pools by cache file.
Aug 19 10:34:12 pve-r720xd1 systemd[1]: zfs-import-cache.service: Unit entered failed state.
Aug 19 10:34:12 pve-r720xd1 systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.

** `systemctl status -l zfs-import-scan.service`
Code:
root@pve-r720xd1:~# systemctl status -l zfs-import-scan.service
● zfs-import-scan.service - Import ZFS pools by device scanning
   Loaded: loaded (/lib/systemd/system/zfs-import-scan.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:zpool(8)

** `systemctl status -l zfs-import.service`
Code:
root@pve-r720xd1:~# systemctl status -l zfs-import.service
Unit zfs-import.service could not be found.

** `systemctl status -l zfs-mount.service`
Code:
root@pve-r720xd1:~# systemctl status -l zfs-mount.service
● zfs-mount.service - Mount ZFS filesystems
   Loaded: loaded (/lib/systemd/system/zfs-mount.service; enabled; vendor preset: enabled)
   Active: active (exited) since Mon 2019-08-19 10:34:12 EDT; 4h 17min ago
     Docs: man:zfs(8)
  Process: 1493 ExecStart=/sbin/zfs mount -a (code=exited, status=0/SUCCESS)
 Main PID: 1493 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 7372)
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/zfs-mount.service

Aug 19 10:34:12 pve-r720xd1 systemd[1]: Starting Mount ZFS filesystems...
Aug 19 10:34:12 pve-r720xd1 systemd[1]: Started Mount ZFS filesystems.

else try to update the cache-file (usually `zpool set cachefile=/etc/zfs/zpool.cache <poolname>`)
Code:
root@pve-r720xd1:~# zpool get cachefile
NAME     PROPERTY   VALUE      SOURCE
r720xd1  cachefile  none       local

and update your initramfs afterwards (`update-initramfs -k all -u`
Code:
root@pve-r720xd1:~# update-initramfs -k all -u
update-initramfs: Generating /boot/initrd.img-4.15.18-18-pve
update-initramfs: Generating /boot/initrd.img-4.15.18-12-pve

For reference, all I did to create the ZFS pool was run:
Code:
zpool create r720xd1 -o ashift=12 raidz2 -f wwn-0x50000394b8ca446c wwn-0x5000039608cada4d wwn-0x5000cca01ad40654 wwn-0x5000cca01ade665c wwn-0x5000cca01ade8ea8 wwn-0x5000cca01adf83f0 wwn-0x5000cca01adfd29c wwn-0x5000cca01ae0270c wwn-0x5000cca01ae0506c wwn-0x5000cca01ae060cc spare wwn-0x5000039608c93991 wwn-0x5000039608ca8f99
and then immediately began configuring the CT with TKL File Server by adding the /srv/storage mountpoint. I am guessing I have to install a ZFS import service on PVE after creating the pool? And what should the contents of zpool.cache be? When I cat mine, it is empty.
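
Side note: from what I've read since posting, zpool.cache is a binary nvlist rather than plain text, which would explain why cat isn't helpful here; if I understand zdb correctly, it can dump whatever a cache file contains:
Code:
# dump the cached pool configuration from a given cache file
# (-C prints the cached config, -U selects the cache file to read)
zdb -C -U /etc/zfs/zpool.cache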

Thank you for all your help.
 
Stoiko:

Also:
Code:
root@pve-r720xd1:/etc/zfs# journalctl |grep zfs
Aug 19 15:04:55 pve-r720xd1 systemd-modules-load[717]: Inserted module 'zfs'
Aug 19 15:05:05 pve-r720xd1 systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Aug 19 15:05:05 pve-r720xd1 systemd[1]: zfs-import-cache.service: Unit entered failed state.
Aug 19 15:05:05 pve-r720xd1 systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Aug 19 15:05:21 pve-r720xd1 audit[3467]: AVC apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="/usr/bin/lxc-start" name="/r720xd1/subvol-108-disk-0/" pid=3467 comm="mount.zfs" fstype="zfs" srcname="r720xd1/subvol-108-disk-0" flags="rw, strictatime"
Aug 19 15:05:21 pve-r720xd1 kernel: audit: type=1400 audit(1566241521.990:12): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="/usr/bin/lxc-start" name="/r720xd1/subvol-108-disk-0/" pid=3467 comm="mount.zfs" fstype="zfs" srcname="r720xd1/subvol-108-disk-0" flags="rw, strictatime"
You may have guessed that my TKL File Server appliance is CT108.
 
Journal during the reboot:
`dmesg` is not the journal - it only contains kernel messages (use `journalctl` to read the actual journal).
Aug 19 10:34:12 pve-r720xd1 zpool[1473]: invalid or corrupt cache file contents: invalid or missing cache file
Seems this might be the problem.

root@pve-r720xd1:~# zpool get cachefile
NAME     PROPERTY   VALUE      SOURCE
r720xd1  cachefile  none       local
You have no cache file set - set one with:
`zpool set cachefile=/etc/zfs/zpool.cache <poolname>`

and then regenerate the initramfs

This should fix the issue
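
To double-check that it took effect, something along these lines should work (the lsinitramfs check is just a suggestion; adjust the initrd name to your kernel):
Code:
zpool get cachefile r720xd1        # should now report /etc/zfs/zpool.cache
lsinitramfs /boot/initrd.img-$(uname -r) | grep zpool.cache   # confirm it was copied into the initramfs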

Hope this helps!
 
Stoiko:

You are correct. The missing cache file was the root of the issue preventing my ZFS pool from mounting at boot. I entered:
Code:
zpool set cachefile=/etc/zfs/zpool.cache r720xd1
update-initramfs -k all -u
then rebooted. I noticed the pool's root dataset /r720xd1 was not mounted, but the subvol tied to my TKL File Server was - partial success!

After setting the cache file, I decided to check the systemctl status of each service you originally mentioned, to investigate further...

All were good except two:
Code:
root@pve-r720xd1:~# systemctl status -l zfs-import-scan.service
● zfs-import-scan.service - Import ZFS pools by device scanning
   Loaded: loaded (/lib/systemd/system/zfs-import-scan.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:zpool(8)

and
Code:
root@pve-r720xd1:~# systemctl status -l zfs-mount.service
● zfs-mount.service - Mount ZFS filesystems
   Loaded: loaded (/lib/systemd/system/zfs-mount.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2019-08-20 18:34:00 EDT; 3h 9min ago
     Docs: man:zfs(8)
  Process: 10320 ExecStart=/sbin/zfs mount -a (code=exited, status=1/FAILURE)
 Main PID: 10320 (code=exited, status=1/FAILURE)
      CPU: 23ms

Aug 20 18:34:00 pve-r720xd1 systemd[1]: Starting Mount ZFS filesystems...
Aug 20 18:34:00 pve-r720xd1 zfs[10320]: cannot mount '/r720xd1': directory is not empty
Aug 20 18:34:00 pve-r720xd1 systemd[1]: zfs-mount.service: Main process exited, code=exited, status=1/FAILURE
Aug 20 18:34:00 pve-r720xd1 systemd[1]: Failed to start Mount ZFS filesystems.
Aug 20 18:34:00 pve-r720xd1 systemd[1]: zfs-mount.service: Unit entered failed state.
Aug 20 18:34:00 pve-r720xd1 systemd[1]: zfs-mount.service: Failed with result 'exit-code'.

After some more forum probing, I learned that the ZFS 'overlay' property allows a dataset to be mounted on a mountpoint directory that is not empty. And sure enough:
Code:
zfs get overlay r720xd1
NAME     PROPERTY  VALUE    SOURCE
r720xd1  overlay   off      default
So I:
Code:
zfs set overlay=on r720xd1
and rebooted... Then checking:
Code:
root@pve-r720xd1:~# systemctl status -l zfs-mount.service
● zfs-mount.service - Mount ZFS filesystems
   Loaded: loaded (/lib/systemd/system/zfs-mount.service; enabled; vendor preset: enabled)
   Active: active (exited) since Tue 2019-08-20 21:49:30 EDT; 3min 36s ago
     Docs: man:zfs(8)
  Process: 1998 ExecStart=/sbin/zfs mount -a (code=exited, status=0/SUCCESS)
 Main PID: 1998 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 7372)
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/zfs-mount.service

Aug 20 21:49:30 pve-r720xd1 systemd[1]: Starting Mount ZFS filesystems...
Aug 20 21:49:30 pve-r720xd1 systemd[1]: Started Mount ZFS filesystems.

SUCCESS!
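
For later readers who would rather not overlay-mount: as far as I can tell, overlay=on simply mounts over the non-empty directory and hides whatever was underneath. The alternative would be to clear the stale mountpoint with the pool exported - a sketch, and only if you are certain the leftovers are stale:
Code:
# stop any CTs using the pool first, then:
zpool export r720xd1      # unmount/export; /r720xd1 becomes a plain directory
ls -la /r720xd1           # inspect what was hiding under the mountpoint
#rm -ri /r720xd1/*        # uncomment ONLY if the contents are confirmed stale
zpool import r720xd1      # re-import; zfs-mount should now succeed without overlay=on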

At this point, I have only one service that is loaded, but dead:
Code:
root@pve-r720xd1:~# systemctl status -l zfs-import-scan.service
● zfs-import-scan.service - Import ZFS pools by device scanning
   Loaded: loaded (/lib/systemd/system/zfs-import-scan.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:zpool(8)

Is this anything I need to address?
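
Digging a little on my own: the upstream zfs-import-scan unit appears to ship with a condition that skips it whenever a cache file exists, so it staying disabled looks intentional - though I'd welcome confirmation. One way to check:
Code:
# show the unit's start conditions; upstream sets
# ConditionFileNotEmpty=!/etc/zfs/zpool.cache (if I read it right)
systemctl cat zfs-import-scan.service | grep -i '^Condition'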

THANK YOU for addressing the root of the problem instead of the patch I was attempting with `zfs mount -O -a`!
 
You don't know how much you have helped me - I have been stuck on this for 10 days now.
Thank you very, very much!