[SOLVED] zfs: cannot import rpool after reboot

M4dMike · Oct 22, 2017

Hey guys!

Did an update last night:

Code:

apt-get update
apt-get dist-upgrade

after that, ZFS wanted a

Code:

zpool upgrade -a

which required a reboot.

But after the reboot, i was stuck at busybox:

i tried:

Code:

zpool import -c /etc/zfs/zpool.cache -N rpool
exit

and it worked.

Then I performed the

Code:

zpool upgrade -a

and it succeeded.

But the reboot problem still exists, i have to manually import the pool each time.

Additional Info:
PVE-Version: pve-manager/5.0-34/b325d69e (running kernel: 4.13.4-1-pve)

ZFS-Version:
SPL: Loaded module v0.7.2-1
ZFS: Loaded module v0.7.2-1, ZFS pool version 5000, ZFS filesystem version 5

rpool:

Code:

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h5m with 0 errors on Sun Oct  8 00:29:12 2017
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sds2    ONLINE       0     0     0
            sdt2    ONLINE       0     0     0

errors: No known data errors

Would be awesome if someone could give me a clue.
Thanks guys!

Nemesiz · Oct 22, 2017

have you tried
#update-grub /

M4dMike · Oct 22, 2017

yes, i tried it multiple times. no success.

Cup 'O Joe · Oct 22, 2017

Same here, I have two servers that have the same issue. On one of them when I went to manually import the rpool it notified me that I needed to upgrade the pool - being an idiot, I did (zpool upgrade rpool). So after importing the pool manually and continuing the boot process, the boot hangs (see screenshot).

After looking through some other threads, I tried booting off an old kernel - which worked actually except that the old kernel of course doesn't support the newer pool version... =_=

So... I went to the other server that I had not done anything to yet (but which has the same issue) booted off the previous kernel until the system reached the BusyBox, manually imported the rpool, resumed the boot process, and then logged in. I ran the following:

apt-get update
apt-get upgrade
apt-get dist-upgrade

None of which returned anything new, as a dist-upgrade is what caused the break to begin with. Following that, I also ran:

update-initramfs -u

Which operated without errors against the new kernel (4.13.4-1-pve). I rebooted with the new kernel and the problem was still there.

For the record, I've also tried specifying:

rootdelay=30

In grub just to see if that would remedy the issue, however that came back negative as well. Help is greatly appreciated, and Mike at least now you know you aren't the only one having this issue.

Cup 'O Joe · Oct 22, 2017

Mike, have you tried running update-initramfs -u?

While it didn't work for me, considering you are able to completely boot after manually importing the pool this may be worth a try for you, as well as the rootdelay option in grub (hit "e" when grub shows, manually append "rootdelay=30" to the boot entry, F10 to boot).
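If the one-off "rootdelay=30" test at the grub prompt helps, the parameter could be made persistent. This is only a rough sketch, assuming a stock Debian-style /etc/default/grub with a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line and GNU sed; the helper name add_kernel_param is made up for illustration.

```shell
#!/bin/sh
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in a
# grub defaults file. add_kernel_param is a hypothetical helper name.
add_kernel_param() {
    file=$1 param=$2
    # Loose check: skip if the parameter already appears on that line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*${param}" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${param}\"|" "$file"
}

# Intended use on the real system (as root, after backing up the file):
#   add_kernel_param /etc/default/grub rootdelay=30
#   update-grub
```

Running update-grub afterwards is what actually writes the change into the boot menu.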

Cup 'O Joe · Oct 22, 2017

I've given up for the moment, I have 24 hours before I'll need to use these servers.

My solution has been to have the system always boot with the old kernel for the time being, at least until the devs have had time to look into the issues this is causing some people and cook up the appropriate patches for it...

To start, backup the file we'll be editing:
sudo cp /etc/default/grub /etc/default/grub.bak

Now use nano, vim, or your preferred text editor:
sudo nano /etc/default/grub

Inside that file, at the end write:
GRUB_DEFAULT="Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 4.10.17-2-pve"
*Make sure the version shown above matches the kernel you want to revert to by running "pveversion -v"*

Save and close the file, then run:
update-grub
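The GRUB_DEFAULT string has to match the menu titles exactly (submenu title, then ">", then the entry title). A small sketch for listing the titles so they can be copied rather than typed; it assumes the generated config lives at /boot/grub/grub.cfg and that titles are single-quoted, and list_grub_entries is just an illustrative name.

```shell
#!/bin/sh
# Sketch: print the titles of all menuentry/submenu lines in a grub.cfg.
# list_grub_entries is a hypothetical helper, not a grub tool.
list_grub_entries() {
    # The title is the first single-quoted string on each matching line.
    grep -E "^[[:space:]]*(menuentry|submenu) " "$1" | cut -d "'" -f 2
}

# Intended use:
#   list_grub_entries /boot/grub/grub.cfg
```

Join the submenu title and the kernel entry title with ">" to build the GRUB_DEFAULT value.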

For me I can now boot with the old kernel without a hanging boot process or the rpool import hang. This of course will only work on the server that I did not run the pool upgrade on - otherwise by default those upgraded pools with the old kernel will result in the pools mounting readonly. I am going to look for a temporary workaround for that - while downgrading said pools doesn't seem feasible, I may be able to disable the new features enabled by the upgrade to get RW access to the pools. If I do come up with results on that front, I'll report them here.

# EDIT #
Did a bit of reading up and checking around: ZFS features cannot be disabled after they have been enabled ("zpool upgrade <pool>" enables features). The specific feature keeping the pool from being mountable in my case is org.zfsonlinux:userobj_accounting. My only recourse at this point is to recover from backups for the one server that I did the pool upgrade on.
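For anyone wanting to check this on their own pool: "zpool get all <pool>" lists each feature as a feature@... property with a value of disabled, enabled, or active. A minimal sketch for filtering that output, assuming the usual NAME / PROPERTY / VALUE / SOURCE column layout; features_in_state is a made-up helper name.

```shell
#!/bin/sh
# Sketch: filter `zpool get all <pool>` output (read from stdin) down to
# the feature flags that are in a given state (disabled, enabled, active).
# features_in_state is a hypothetical helper name.
features_in_state() {
    awk -v state="$1" '$2 ~ /^feature@/ && $3 == state { print $2 }'
}

# Intended use:
#   zpool get all rpool | features_in_state active
```

Features shown as active are the ones already in use on disk, so those are the ones an older ZFS module has to know about.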

fabian · Oct 23, 2017

ZFS features should not affect bootability unless you then also switch your root dataset to use them (e.g., for stuff like checksum algorithms which Grub might not yet support).

if the rpool is not automatically imported in the initramfs, but a manual import works, it is most likely a timing issue (disks are not yet ready when it attempts to import). if setting the rootdelay kernel option does not work, adding a sleep to the initrd phase before the zfs script attempts to import the pool might:

edit '/etc/default/zfs', and change the 'ZFS_INITRD_PRE_MOUNTROOT_SLEEP' and/or 'ZFS_INITRD_POST_MODPROBE_SLEEP' settings
run 'update-initramfs -u' to update the initramfs (add '-k KERNELVERSION' if you are not currently running the kernel you actually want to boot)
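The two steps above could be scripted roughly as follows. This is a sketch, not a tested procedure: it assumes GNU sed and that the settings appear in /etc/default/zfs as shell-style KEY='VALUE' lines (possibly commented out). The helper name set_zfs_default is hypothetical and the value 10 is only an example.

```shell
#!/bin/sh
# Sketch: set (or add) a KEY='VALUE' line in a shell-style defaults file.
# set_zfs_default is a hypothetical helper, not part of the ZFS tooling.
set_zfs_default() {
    file=$1 key=$2 value=$3
    if grep -q "^#\{0,1\}[[:space:]]*${key}=" "$file"; then
        # Replace the existing (possibly commented-out) assignment in place.
        sed -i "s|^#\{0,1\}[[:space:]]*${key}=.*|${key}='${value}'|" "$file"
    else
        # Variable not present yet: append it.
        printf "%s='%s'\n" "$key" "$value" >> "$file"
    fi
}

# Intended use on the real system (as root); 10 seconds is only an example:
#   set_zfs_default /etc/default/zfs ZFS_INITRD_PRE_MOUNTROOT_SLEEP 10
#   set_zfs_default /etc/default/zfs ZFS_INITRD_POST_MODPROBE_SLEEP 10
#   update-initramfs -u        # or: update-initramfs -u -k KERNELVERSION
```

The sleep values only take effect after the initramfs has been regenerated and the machine rebooted.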

fabian · Oct 23, 2017

Cup 'O Joe said:
Same here, I have two servers that have the same issue. On one of them when I went to manually import the rpool it notified me that I needed to upgrade the pool - being an idiot, I did (zpool upgrade rpool). So after importing the pool manually and continuing the boot process, the boot hangs (see screenshot).

View attachment 6059

the (potentially) interesting part of that trace is cut off..

guletz · Oct 23, 2017

fabian said:
edit '/etc/default/zfs', and change the 'ZFS_INITRD_PRE_MOUNTROOT_SLEEP' and/or 'ZFS_INITRD_POST_MODPROBE_SLEEP' settings

Hello,

In my case, a 5-second delay for both (PRE... and POST...) solved the problem at boot. Thanks a lot, @fabian!

Cup 'O Joe · Oct 23, 2017

Thanks for your reply Fabian, I'm sorry that I've been unclear. :E

On either of the servers, under the new kernel I cannot boot successfully even if I manually import the pool. It always hangs at the point shown in the screenshot - I've left it running for over an hour and nothing changed.

The problem I mentioned with upgrading the rpool is that now it's no longer possible to mount it as read/write under an old kernel, as the newer pool version isn't supported under the older kernel. So the new kernel won't boot, and while the old one will, I cannot mount rpool as RW on that server due to the pool upgrade. Thanks again for your reply.

fabian · Oct 23, 2017

Cup 'O Joe said:
Thanks for your reply Fabian, I'm sorry that I've been unclear. :E

On either of the servers, under the new kernel I cannot boot successfully even if I manually import the pool. It always hangs at the point shown in the screenshot - I've left it running for over an hour and nothing changed.

The problem I mentioned with upgrading the rpool is that now it's no longer possible to mount it as read/write under an old kernel, as the newer pool version isn't supported under the older kernel. So the new kernel won't boot, and while the old one will, I cannot mount rpool as RW on that server due to the pool upgrade. Thanks again for your reply.

the question is - why can't you boot under the new kernel? we'd need a full error message / stack trace to proceed further..

M4dMike · Oct 23, 2017

Thanks for all the help guys!

@Cup 'O Joe good to know that i am not alone

@fabian i will try that as soon as possible.

cipwurzel · Oct 23, 2017

Setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=10 solved the problem.

It seems that my SAS HBA comes up too slowly. The system SSDs for the rpool are connected directly to the mainboard.

Just attached a bootlog (serial console)

Cup 'O Joe · Oct 23, 2017

Okay, even though I've already settled for just restoring from backup, I'd like to get you what you need to look at in case it may be of help to others. What would be the best way to do that? Sorry for the noob question, would a boot.log or a journalctl dump be appropriate?

fabian · Oct 23, 2017

Cup 'O Joe said:
Okay, even though I've already settled for just restoring from backup, I'd like to get you what you need to look at in case it maybe of help to others. What would be the best way to do that? Sorry for the noob question, would a boot.log or a journalctl dump be appropriate?

if the stack trace is contained in the journal, then yes, a full journal dump might be helpful. otherwise, you can try getting console output via a serial console or try the systemd debug shell (https://freedesktop.org/wiki/Software/systemd/Debugging)

M4dMike · Oct 26, 2017

hey guys,

as others already reported, ZFS_INITRD_PRE_MOUNTROOT_SLEEP did the trick.
thanks for all your help!

Cup 'O Joe · Oct 28, 2017

Sorry for the wait there. So what I ended up doing was to run a fresh install on the server - then upgrade it through apt-get, reboot, and then run dist-upgrade to pull in the new kernel again. Interestingly enough, I still have to manually import the rpool and it still hangs at the same point. I enabled multiple boot logs for journalctl and then rebooted. Here is the report from journalctl where the stop happens:

Oct 28 22:10:33 SAMSON systemd-journald[1573]: Journal started
Oct 28 22:10:33 SAMSON systemd-journald[1573]: Runtime journal (/run/log/journal/bc2f684e31ee4daf95e45c62410a95b1) is 8.0M, max 321.3M, 313.3M free.
Oct 28 22:10:33 SAMSON systemd-modules-load[1564]: Inserted module 'iscsi_tcp'
Oct 28 22:10:33 SAMSON systemd-modules-load[1564]: Inserted module 'ib_iser'
Oct 28 22:10:33 SAMSON systemd-modules-load[1564]: Inserted module 'vhost_net'
Oct 28 22:10:33 SAMSON systemd-udevd[1657]: Process '/bin/mount -t fusectl fusectl /sys/fs/fuse/connections' failed with exit code 32.
Oct 28 22:10:33 SAMSON systemd[1]: Starting Flush Journal to Persistent Storage...
Oct 28 22:10:33 SAMSON systemd[1]: Started Set the console keyboard layout.
Oct 28 22:10:33 SAMSON systemd-journald[1573]: Time spent on flushing to /var is 24.323ms for 877 entries.

Running with "systemd.debug-shell=1" specified in the kernel entry in grub resulted in:

Error #1: Failed to start udev Wait for Complete Device Initialization.

Error #2: Depend Dependency failed for Import ZFS pools by cache file.
Starting Activation of LVM2 logical volumes...
Starting Mount ZFS filesystems...
(3 of 3) A start job is running for Mount ZFS filesystems (3min / no limit)
(1 of 3) A start job is running for /dev/zvol/rpool/swap (3min 15s / 4min 30s)
(2 of 3) A start job is running for Activation of LVM2 logical volumes (3min 16s / no limit)
(3 of 3) A start job is running for Mount ZFS filesystems (3min 14s / no limit)

In case it may help, I'm running the OS off of 2 SSDs in a mirror, connected via an LSI SAS 9211-8i HBA.

fabian · Oct 30, 2017

Cup 'O Joe said:
Sorry for the wait there. So what I ended up doing was to run a fresh install on the server - then upgrade it through apt-get, reboot, and then run dist-upgrade to pull in the new kernel again. Interestingly enough, I still have to manually import the rpool and it still hangs at the same point. I enabled multiple boot logs for journalctl and then rebooted. Here is the report from journalctl where the stop happens:

Oct 28 22:10:33 SAMSON systemd-journald[1573]: Journal started
Oct 28 22:10:33 SAMSON systemd-journald[1573]: Runtime journal (/run/log/journal/bc2f684e31ee4daf95e45c62410a95b1) is 8.0M, max 321.3M, 313.3M free.
Oct 28 22:10:33 SAMSON systemd-modules-load[1564]: Inserted module 'iscsi_tcp'
Oct 28 22:10:33 SAMSON systemd-modules-load[1564]: Inserted module 'ib_iser'
Oct 28 22:10:33 SAMSON systemd-modules-load[1564]: Inserted module 'vhost_net'
Oct 28 22:10:33 SAMSON systemd-udevd[1657]: Process '/bin/mount -t fusectl fusectl /sys/fs/fuse/connections' failed with exit code 32.
Oct 28 22:10:33 SAMSON systemd[1]: Starting Flush Journal to Persistent Storage...
Oct 28 22:10:33 SAMSON systemd[1]: Started Set the console keyboard layout.
Oct 28 22:10:33 SAMSON systemd-journald[1573]: Time spent on flushing to /var is 24.323ms for 877 entries.

Running with "systemd.debug-shell=1" specified in the kernel entry in grub resulted in:

Error #1: Failed to start udev Wait for Complete Device Initialization.

Error #2: Depend Dependency failed for Import ZFS pools by cache file.
Starting Activation of LVM2 logical volumes...
Starting Mount ZFS filesystems...
(3 of 3) A start job is running for Mount ZFS filesystems (3min / no limit)
(1 of 3) A start job is running for /dev/zvol/rpool/swap (3min 15s / 4min 30s)
(2 of 3) A start job is running for Activation of LVM2 logical volumes (3min 16s / no limit)
(3 of 3) A start job is running or Mount ZFS filesystems (3min 14s / no limit)

In case it may help, I'm running the OS off of 2 SSD's in RAIDZ-1, connected via an LSI SAS 9211-8i HBA.

thanks, that is the first halfway decent log.. so there is apparently a udev issue. can you try to boot with an added "udev.log-priority=debug" kernel parameter and then provide the journal of the unit "systemd-udevd"? e.g., find the boot index with "journalctl --list-boots" (0 is the current boot, -1 the one before, and so on), and then run "journalctl -bBOOTID -u systemd-udevd", where BOOTID is the index (note: no space between -b and the ID).
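The journal lookup described above might be wrapped like this. A minimal sketch, assuming persistent journaling is enabled so earlier boots are still available; show_udev_log is a made-up helper name and the output file name is only an example.

```shell
#!/bin/sh
# Sketch: show the systemd-udevd journal for a given boot index
# (0 = current boot, -1 = previous boot, and so on).
# show_udev_log is a hypothetical helper name.
show_udev_log() {
    case $1 in
        ''|*[!0-9-]*)
            echo "usage: show_udev_log BOOT_INDEX (e.g. 0 or -1)" >&2
            return 2
            ;;
    esac
    # No space between -b and the index, as noted above.
    journalctl -b"$1" -u systemd-udevd --no-pager
}

# Intended use:
#   journalctl --list-boots              # find the index of the failed boot
#   show_udev_log -1 > udev-journal.txt  # dump it for attaching to the thread
```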

Cup 'O Joe · Nov 1, 2017

Okay so when running with the udev.log-priority tag in the kernel entry, the system boots to:

"Unloaded link configuration context. Unloaded link configuration context. Unloaded link configuration context..."

Repeated over and over, and then,

"worker [403] exited"

With all of the numbers below present in the same message:
398
399
400
401
402
404
405

I performed a hard shutdown, rebooted under the older kernel, then ran "journalctl -b-1 -u systemd-udevd" which resulted in a single line:

Nov 01 07:50:21 GOLIATH systemd-udevd[1727]: Could not generate persistent MAC address for vmbr0: No such file or directory

For reference, this is after a fresh install. The only configuration made to a network adapter was that done via the PVE installer at initial installation, and no issues exist when running under the older kernel. The two network adapters present in the system are onboard Intel Corporation I350 Gigabit connections, on a Supermicro X9DR3-F motherboard running the latest BIOS version.

fabian · Nov 6, 2017

that message is unrelated (and harmless). I still cannot reproduce this issue on any of our machines, but it seems like at least one Debian user is affected as well. could you try commenting out the swap line in /etc/fstab, regenerating the initramfs, and rebooting to obtain a udev log again?
