[SOLVED] ZFS unmounts after reboot - and proposed fix

Psilospiral

Member
Jun 25, 2019
Greetings Forum:

I am running PVE 5.4-11 on an R720xd with a single CT for testing: TKL File Server.

I noticed while copying files from another NAS to the TKL File Server CT from a Win client on the LAN that the shares would strangely become unavailable. After a little digging and research in the forums, I issued:
Code:
root@pve-r720xd1:~# zfs list -r -o name,mountpoint,mounted
NAME                       MOUNTPOINT                  MOUNTED
r720xd1                    /r720xd1                         no
r720xd1/subvol-108-disk-0  /r720xd1/subvol-108-disk-0       no

and realized the ZFS filesystem was no longer mounted. After issuing:
Code:
zfs mount -O -a

the ZFS file systems were mounted again, were available to TKL File Server, and the shares were again available to the Win client with no reboots needed. Excellent!
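
For anyone who wants to script the "is everything mounted?" check, here is a minimal sketch. It assumes the tab-separated output of `zfs list -H`; the function name `unmounted_datasets` is made up for illustration. Datasets with `canmount=off` or a `legacy`/`none` mountpoint also report `no`, so treat the result as a first pass only.

```shell
# Hypothetical helper: print datasets whose "mounted" column is "no".
# Reads tab-separated `zfs list -H -o name,mounted` output on stdin.
unmounted_datasets() {
    awk -F '\t' '$2 == "no" { print $1 }'
}

# Typical use on a host with ZFS installed:
#   zfs list -H -r -o name,mounted | unmounted_datasets
```

An empty result means every listed dataset reports itself as mounted.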

I thought this originally occurred because I ejected one of the hot swap drives and reinserted it for testing hot swap functionality. After successfully remounting the pool, I then decided to rebuild the pool with wwn- designations instead of /dev/sd[x]. For some time I thought this solved the issue, but after a reboot this weekend the exact same situation occurred again: ZFS became unmounted after a power blink. (I still haven't moved the server to my UPS).

SO, I am not sure why the ZFS filesystem becomes unmounted in PVE occasionally after reboot. But I do know that issuing
Code:
zfs mount -O -a

after a reboot takes care of the problem every time.

Where can I include
Code:
zfs mount -O -a

in PVE config so that I can make 100% sure my ZFS file system will be (re)mounted with each reboot??? (OR is there another place I can probe to discover why the ZFS file system is becoming unmounted in the first place?)
 
* Hmm - please post the journal during the reboot, or rather the portion mentioning ZFS (usually you'll also have a failed unit in that case, since the ZFS pools failed to mount)
* What's the status of the various ZFS import services:
** `systemctl status -l zfs-import-cache.service`
** `systemctl status -l zfs-import-scan.service`
** `systemctl status -l zfs-import-cache.service`
** `systemctl status -l zfs-import.service`
** `systemctl status -l zfs-mount.service`

else try to update the cache-file (usually `zpool set cachefile=/etc/zfs/zpool.cache <poolname>`) and update your initramfs afterwards (`update-initramfs -k all -u`)

This normally takes care of these problems

Hope this helps!
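
The unit checks above can also be collected in one pass. This is a rough sketch, not an official tool: `zfs_unit_states` and the `SYSTEMCTL` override variable are hypothetical names, and it relies only on `systemctl is-active`, which prints a one-word state per unit.

```shell
# Hypothetical helper: print "<unit> <state>" for the ZFS units discussed here.
# SYSTEMCTL can be overridden (e.g. for testing); it defaults to systemctl.
zfs_unit_states() {
    for unit in zfs-import-cache.service zfs-import-scan.service zfs-mount.service; do
        # is-active exits non-zero for inactive or failed units, so ignore the status.
        state=$("${SYSTEMCTL:-systemctl}" is-active "$unit" 2>/dev/null) || true
        printf '%s %s\n' "$unit" "${state:-unknown}"
    done
}

# On the host:
#   zfs_unit_states
```

Any unit shown as `failed` is worth a closer look with `systemctl status -l <unit>`.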
 
Stoiko:

Thank you for the quick reply.

Journal during the reboot:
Code:
root@pve-r720xd1:~# dmesg|grep zfs
[   42.171357] audit: type=1400 audit(1566225268.314:12): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="/usr/bin/lxc-start" name="/r720xd1/subvol-108-disk-0/" pid=3328 comm="mount.zfs" fstype="zfs" srcname="r720xd1/subvol-108-disk-0" flags="rw, strictatime"

Status of the various ZFS import services:
** `systemctl status -l zfs-import-cache.service`
Code:
root@pve-r720xd1:~# systemctl status -l zfs-import-cache.service
● zfs-import-cache.service - Import ZFS pools by cache file
   Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2019-08-19 10:34:12 EDT; 4h 14min ago
     Docs: man:zpool(8)
  Process: 1473 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN (code=exited, status=1/FAILURE)
 Main PID: 1473 (code=exited, status=1/FAILURE)
      CPU: 5ms

Aug 19 10:34:12 pve-r720xd1 systemd[1]: Starting Import ZFS pools by cache file...
Aug 19 10:34:12 pve-r720xd1 zpool[1473]: invalid or corrupt cache file contents: invalid or missing cache file
Aug 19 10:34:12 pve-r720xd1 systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Aug 19 10:34:12 pve-r720xd1 systemd[1]: Failed to start Import ZFS pools by cache file.
Aug 19 10:34:12 pve-r720xd1 systemd[1]: zfs-import-cache.service: Unit entered failed state.
Aug 19 10:34:12 pve-r720xd1 systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.

** `systemctl status -l zfs-import-scan.service`
Code:
root@pve-r720xd1:~# systemctl status -l zfs-import-scan.service
● zfs-import-scan.service - Import ZFS pools by device scanning
   Loaded: loaded (/lib/systemd/system/zfs-import-scan.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:zpool(8)

** `systemctl status -l zfs-import-cache.service`
Code:
root@pve-r720xd1:~# systemctl status -l zfs-import-cache.service
● zfs-import-cache.service - Import ZFS pools by cache file
   Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2019-08-19 10:34:12 EDT; 4h 16min ago
     Docs: man:zpool(8)
  Process: 1473 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN (code=exited, status=1/FAILURE)
 Main PID: 1473 (code=exited, status=1/FAILURE)
      CPU: 5ms

Aug 19 10:34:12 pve-r720xd1 systemd[1]: Starting Import ZFS pools by cache file...
Aug 19 10:34:12 pve-r720xd1 zpool[1473]: invalid or corrupt cache file contents: invalid or missing cache file
Aug 19 10:34:12 pve-r720xd1 systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Aug 19 10:34:12 pve-r720xd1 systemd[1]: Failed to start Import ZFS pools by cache file.
Aug 19 10:34:12 pve-r720xd1 systemd[1]: zfs-import-cache.service: Unit entered failed state.
Aug 19 10:34:12 pve-r720xd1 systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.

** `systemctl status -l zfs-import.service`
Code:
root@pve-r720xd1:~# systemctl status -l zfs-import.service
Unit zfs-import.service could not be found.

** `systemctl status -l zfs-mount.service`
Code:
root@pve-r720xd1:~# systemctl status -l zfs-mount.service
● zfs-mount.service - Mount ZFS filesystems
   Loaded: loaded (/lib/systemd/system/zfs-mount.service; enabled; vendor preset: enabled)
   Active: active (exited) since Mon 2019-08-19 10:34:12 EDT; 4h 17min ago
     Docs: man:zfs(8)
  Process: 1493 ExecStart=/sbin/zfs mount -a (code=exited, status=0/SUCCESS)
 Main PID: 1493 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 7372)
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/zfs-mount.service

Aug 19 10:34:12 pve-r720xd1 systemd[1]: Starting Mount ZFS filesystems...
Aug 19 10:34:12 pve-r720xd1 systemd[1]: Started Mount ZFS filesystems.

else try to update the cache-file (usually `zpool set cachefile=/etc/zfs/zpool.cache <poolname>`)
Code:
root@pve-r720xd1:~# zpool get cachefile
NAME     PROPERTY   VALUE      SOURCE
r720xd1  cachefile  none       local

and update your initramfs afterwards (`update-initramfs -k all -u`)
Code:
root@pve-r720xd1:~# update-initramfs -k all -u
update-initramfs: Generating /boot/initrd.img-4.15.18-18-pve
update-initramfs: Generating /boot/initrd.img-4.15.18-12-pve

I only created the zfs pool with:
Code:
zpool create r720xd1 -o ashift=12 raidz2 -f wwn-0x50000394b8ca446c wwn-0x5000039608cada4d wwn-0x5000cca01ad40654 wwn-0x5000cca01ade665c wwn-0x5000cca01ade8ea8 wwn-0x5000cca01adf83f0 wwn-0x5000cca01adfd29c wwn-0x5000cca01ae0270c wwn-0x5000cca01ae0506c wwn-0x5000cca01ae060cc spare wwn-0x5000039608c93991 wwn-0x5000039608ca8f99
and then immediately began configuring the CT with TKL File Server by adding the /srv/storage mountpoint. I am guessing I have to install a zfs import service on PVE after creating the ZFS pool??? What should be the contents of zpool.cache? When I cat mine, it is empty.

Thank you for all your help.
 
Stoiko:

Also:
Code:
root@pve-r720xd1:/etc/zfs# journalctl |grep zfs
Aug 19 15:04:55 pve-r720xd1 systemd-modules-load[717]: Inserted module 'zfs'
Aug 19 15:05:05 pve-r720xd1 systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Aug 19 15:05:05 pve-r720xd1 systemd[1]: zfs-import-cache.service: Unit entered failed state.
Aug 19 15:05:05 pve-r720xd1 systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Aug 19 15:05:21 pve-r720xd1 audit[3467]: AVC apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="/usr/bin/lxc-start" name="/r720xd1/subvol-108-disk-0/" pid=3467 comm="mount.zfs" fstype="zfs" srcname="r720xd1/subvol-108-disk-0" flags="rw, strictatime"
Aug 19 15:05:21 pve-r720xd1 kernel: audit: type=1400 audit(1566241521.990:12): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="/usr/bin/lxc-start" name="/r720xd1/subvol-108-disk-0/" pid=3467 comm="mount.zfs" fstype="zfs" srcname="r720xd1/subvol-108-disk-0" flags="rw, strictatime"
You may have guessed that my TKL File Server appliance is CT108.
 
Journal during the reboot:
`dmesg` is not the journal - it only contains kernel messages
Aug 19 10:34:12 pve-r720xd1 zpool[1473]: invalid or corrupt cache file contents: invalid or missing cache file
seems this might be the problem
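
A possible way to pull the relevant lines out of the journal is sketched below. Note that a plain `grep zfs` misses messages logged by `zpool` itself, such as the cache file error quoted above, so the filter matches both words and ignores case. The function name `zfs_journal_lines` is only illustrative.

```shell
# Hypothetical filter: keep journal lines that mention zfs or zpool (case-insensitive).
zfs_journal_lines() {
    grep -iE 'zfs|zpool'
}

# On the host, for the current boot:
#   journalctl -b --no-pager | zfs_journal_lines
# or limited to the two units in question:
#   journalctl -b --no-pager -u zfs-import-cache.service -u zfs-mount.service
```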

root@pve-r720xd1:~# zpool get cachefile
NAME     PROPERTY   VALUE      SOURCE
r720xd1  cachefile  none       local
You have no cache file set - set one with:
`zpool set cachefile=/etc/zfs/zpool.cache <poolname>`

and then regenerate the initramfs

This should fix the issue

Hope this helps!
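
As a rough sketch of the check and fix described above: the helper below only prints the suggested command for each pool whose `cachefile` is `none`; it changes nothing by itself. It assumes tab-separated `zpool get -H -o name,value cachefile` output, and `suggest_cachefile_fix` is a hypothetical name.

```shell
# Hypothetical helper: read tab-separated `zpool get -H -o name,value cachefile`
# output on stdin and print the suggested command for each pool whose
# cachefile is "none". It only prints; nothing is changed.
suggest_cachefile_fix() {
    awk -F '\t' '$2 == "none" { print "zpool set cachefile=/etc/zfs/zpool.cache " $1 }'
}

# On the host:
#   zpool get -H -o name,value cachefile | suggest_cachefile_fix
# After running the printed command(s), regenerate the initramfs:
#   update-initramfs -k all -u
```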
 
Stoiko:

You are correct. My lack of proper cache file was the root of the issue causing problems with mounting of my ZFS pool at boot. I entered:
Code:
zpool set cachefile=/etc/zfs/zpool.cache r720xd1
update-initramfs -k all -u
then rebooted. I noticed the root /r720xd1 was not mounted, but the subvol tied to my TKL File Server was - partial success!

I decided to check the systemctl status of each service you mentioned originally after setting the cache file to investigate more....

All were good except two:
Code:
systemctl status -l zfs-import-scan.service        inactive (dead)

root@pve-r720xd1:~# systemctl status -l zfs-import-scan.service
● zfs-import-scan.service - Import ZFS pools by device scanning
   Loaded: loaded (/lib/systemd/system/zfs-import-scan.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:zpool(8)

and
Code:
root@pve-r720xd1:~# systemctl status -l zfs-mount.service
● zfs-mount.service - Mount ZFS filesystems
   Loaded: loaded (/lib/systemd/system/zfs-mount.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2019-08-20 18:34:00 EDT; 3h 9min ago
     Docs: man:zfs(8)
  Process: 10320 ExecStart=/sbin/zfs mount -a (code=exited, status=1/FAILURE)
 Main PID: 10320 (code=exited, status=1/FAILURE)
      CPU: 23ms

Aug 20 18:34:00 pve-r720xd1 systemd[1]: Starting Mount ZFS filesystems...
Aug 20 18:34:00 pve-r720xd1 zfs[10320]: cannot mount '/r720xd1': directory is not empty
Aug 20 18:34:00 pve-r720xd1 systemd[1]: zfs-mount.service: Main process exited, code=exited, status=1/FAILURE
Aug 20 18:34:00 pve-r720xd1 systemd[1]: Failed to start Mount ZFS filesystems.
Aug 20 18:34:00 pve-r720xd1 systemd[1]: zfs-mount.service: Unit entered failed state.
Aug 20 18:34:00 pve-r720xd1 systemd[1]: zfs-mount.service: Failed with result 'exit-code'.

After some more forum probing, I learned the ZFS property 'overlay' allows a file system to be mounted on a directory that is not empty. And sure enough:
Code:
zfs get overlay r720xd1
NAME     PROPERTY  VALUE    SOURCE
r720xd1  overlay   off      default
So I:
Code:
zfs set overlay=on r720xd1
and rebooted... Then checking:
Code:
root@pve-r720xd1:~# systemctl status -l zfs-mount.service
● zfs-mount.service - Mount ZFS filesystems
   Loaded: loaded (/lib/systemd/system/zfs-mount.service; enabled; vendor preset: enabled)
   Active: active (exited) since Tue 2019-08-20 21:49:30 EDT; 3min 36s ago
     Docs: man:zfs(8)
  Process: 1998 ExecStart=/sbin/zfs mount -a (code=exited, status=0/SUCCESS)
 Main PID: 1998 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 7372)
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/zfs-mount.service

Aug 20 21:49:30 pve-r720xd1 systemd[1]: Starting Mount ZFS filesystems...
Aug 20 21:49:30 pve-r720xd1 systemd[1]: Started Mount ZFS filesystems.

SUCCESS!
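
For anyone hitting the same "directory is not empty" error, it may be worth looking at what actually sits in the mountpoint before (or instead of) enabling overlay. A minimal sketch follows; `mountpoint_blockers` is a made-up name. In this thread the leftover is presumably the directory created for the child subvol mountpoint, but that is an assumption, not something verified here.

```shell
# Hypothetical helper: list whatever sits in a mountpoint directory.
# Run it while the dataset is NOT mounted, otherwise you see the dataset itself.
mountpoint_blockers() {
    dir=$1
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 2; }
    ls -A "$dir"
}

# Example for the pool in this thread:
#   mountpoint_blockers /r720xd1
```

If the listing shows only empty directories left behind by child datasets, overlay=on simply mounts over them; if it shows real files, they will be hidden while the dataset is mounted.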

At this point, I have only one service that is loaded, but dead:
Code:
root@pve-r720xd1:~# systemctl status -l zfs-import-scan.service
● zfs-import-scan.service - Import ZFS pools by device scanning
   Loaded: loaded (/lib/systemd/system/zfs-import-scan.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:zpool(8)

Is this anything I need to address?

THANK YOU for addressing the root of the problem instead of the patch I was attempting with 'zfs mount -O -a' !!!
 
You don't know how much you have helped me. I have been stuck on this matter for 10 days now.
Thank you very, very much.
 
