[SOLVED] zfs: cannot import rpool after reboot

M4dMike

Member
Oct 10, 2017
8
0
21
30
Vienna, Austria
Hey guys!

Did an update last night:
Code:
apt-get update
apt-get dist-upgrade

After that, ZFS asked for a
Code:
zpool upgrade -a
which required a reboot.

But after the reboot, I was stuck at the BusyBox prompt:
20171021_230732rysb8.jpg

I tried:
Code:
zpool import -c /etc/zfs/zpool.cache -N rpool
exit
and it worked.

Then I ran
Code:
zpool upgrade -a
and it succeeded.

But the reboot problem still exists: I have to manually import the pool each time.

Additional Info:
PVE-Version: pve-manager/5.0-34/b325d69e (running kernel: 4.13.4-1-pve)

ZFS-Version:
SPL: Loaded module v0.7.2-1
ZFS: Loaded module v0.7.2-1, ZFS pool version 5000, ZFS filesystem version 5

rpool:
Code:
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h5m with 0 errors on Sun Oct  8 00:29:12 2017
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sds2    ONLINE       0     0     0
            sdt2    ONLINE       0     0     0

errors: No known data errors

Would be awesome if someone could give me a clue.
Thanks guys!
 

Cup 'O Joe

Member
Oct 22, 2017
11
0
6
30
Same here, I have two servers that have the same issue. On one of them when I went to manually import the rpool it notified me that I needed to upgrade the pool - being an idiot, I did (zpool upgrade rpool). So after importing the pool manually and continuing the boot process, the boot hangs (see screenshot).

IMG_20171022_171622.jpg

After looking through some other threads, I tried booting off an old kernel - which worked actually except that the old kernel of course doesn't support the newer pool version... =_=

So... I went to the other server that I had not done anything to yet (but which has the same issue) booted off the previous kernel until the system reached the BusyBox, manually imported the rpool, resumed the boot process, and then logged in. I ran the following:

apt-get update
apt-get upgrade
apt-get dist-upgrade

None of which returned anything new, as a dist-upgrade is what caused the break to begin with. Following that, I also ran:

update-initramfs -u

Which operated without errors against the new kernel (4.13.4-1-pve). I rebooted with the new kernel and the problem was still there.

For the record, I've also tried specifying:

rootdelay=30

In grub just to see if that would remedy the issue, however that came back negative as well. Help is greatly appreciated, and Mike at least now you know you aren't the only one having this issue. :p
 

Cup 'O Joe

Mike, have you tried running update-initramfs -u?

While it didn't work for me, considering you are able to completely boot after manually importing the pool this may be worth a try for you, as well as the rootdelay option in grub (hit "e" when grub shows, manually append "rootdelay=30" to the boot entry, F10 to boot).
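To confirm that a parameter appended at the GRUB prompt actually reached the kernel, the booted command line can be checked. This is only a small sketch: the helper name has_kernel_param is made up for illustration, and it reads /proc/cmdline unless a command line string is passed as the second argument.

```shell
#!/bin/sh
# has_kernel_param NAME [CMDLINE]
# Succeeds if NAME or NAME=value appears as a word on the kernel command line.
# has_kernel_param is a hypothetical helper name, not part of any package.
has_kernel_param() (
    name=$1
    # Use the second argument if given, otherwise the running kernel's command line.
    cmdline=${2-$(cat /proc/cmdline)}
    set -f    # no glob expansion while splitting the command line into words
    for word in $cmdline; do
        case $word in
            "$name"|"$name"=*) exit 0 ;;
        esac
    done
    exit 1
)
```

After booting with the edited entry, `has_kernel_param rootdelay && echo "rootdelay is set"` shows whether the option took effect for that boot.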
 

Cup 'O Joe

I've given up for the moment; I have 24 hours before I'll need to use these servers.

My solution has been to have the system always boot with the old kernel for the time being, at least until the devs have had time to realize the issues this is causing some people and cook up the appropriate patches for it...

To start, back up the file we'll be editing:
sudo cp /etc/default/grub /etc/default/grub.bak

Now use nano, vim, or your preferred text editor:
sudo nano /etc/default/grub

Inside that file, at the end write:
GRUB_DEFAULT="Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 4.10.17-2-pve"
*Make sure the version shown above matches the kernel you want to revert to; "pveversion -v" lists the installed kernel packages.*

Save and close the file, then run:
update-grub
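The GRUB_DEFAULT value is the submenu title and the entry title joined by ">". A small sketch that builds the line for a given kernel version, assuming the menu titles on your system match the ones quoted above (the function name grub_default_entry is made up for illustration):

```shell
#!/bin/sh
# grub_default_entry KERNEL_VERSION
# Prints a GRUB_DEFAULT line in "submenu title>entry title" form.
# ASSUMPTION: both titles are copied from the post above; verify them
# against your own /boot/grub/grub.cfg before using the result.
grub_default_entry() {
    printf 'GRUB_DEFAULT="%s>%s"\n' \
        'Advanced options for Proxmox Virtual Environment GNU/Linux' \
        "Proxmox Virtual Environment GNU/Linux, with Linux $1"
}
```

For example, `grub_default_entry 4.10.17-2-pve` prints the line to paste into /etc/default/grub. The titles GRUB actually knows about can be checked with `grep -E "menuentry |submenu " /boot/grub/grub.cfg`.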

I can now boot with the old kernel without a hanging boot process or the rpool import hang. This of course only works on the server where I did not run the pool upgrade; with the old kernel, the upgraded pools can only be mounted read-only. I am going to look for a temporary workaround for that. While downgrading said pools doesn't seem feasible, I may be able to disable the new features enabled by the upgrade to get RW access to the pools. If I do come up with results on that front, I'll report them here.

# EDIT #
Did a bit of reading up and checking around: ZFS features cannot be disabled after they have been enabled ("zpool upgrade <pool>" enables features). The specific feature keeping the pool from being mountable read/write under the old kernel in my case is org.zfsonlinux:userobj_accounting. My only recourse at this point is to recover from backups for the one server that I did the pool upgrade on.
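For anyone who wants to see which feature flags a pool has before or after an upgrade, here is a rough sketch that filters the output of "zpool get all". The helper name enabled_features is made up, and the column layout (NAME, PROPERTY, VALUE, SOURCE) is assumed from what zpool normally prints.

```shell
#!/bin/sh
# enabled_features: read `zpool get all POOL` output on stdin and print the
# feature flags whose value is "enabled" or "active", one per line.
# ASSUMPTION: input columns are NAME PROPERTY VALUE SOURCE.
enabled_features() {
    awk '$2 ~ /^feature@/ && ($3 == "enabled" || $3 == "active") { print $2, $3 }'
}

# Real use needs an imported pool, for example:
#   zpool get all rpool | enabled_features
```

As far as I understand the zpool-features documentation, a feature that is only "enabled" has not changed the on-disk format yet, while an "active" one has, so the "active" entries are the ones an older kernel module may not be able to handle.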
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
7,609
1,431
164
ZFS features should not affect bootability unless you then also switch your root dataset to use them (e.g., for stuff like checksum algorithms which Grub might not yet support).

if the rpool is not automatically imported in the initramfs, but a manual import works, it is most likely a timing issue (disks are not yet ready when it attempts to import). if setting the rootdelay kernel option does not work, adding a sleep to the initrd phase before the zfs script attempts to import the pool might:

  • edit '/etc/default/zfs', and change the 'ZFS_INITRD_PRE_MOUNTROOT_SLEEP' and/or 'ZFS_INITRD_POST_MODPROBE_SLEEP' settings
  • run 'update-initramfs -u' to update the initramfs (add '-k KERNELVERSION' if you are not currently running the kernel you actually want to boot)
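The two steps above can be sketched as follows. This is only a sketch, assuming '/etc/default/zfs' uses plain VAR='value' shell assignments; the helper name set_zfs_sleep is made up for illustration, and it takes the file path as an argument so it can be tried on a copy first.

```shell
#!/bin/sh
# set_zfs_sleep FILE VAR SECONDS
# Replaces any existing (possibly commented-out) VAR= line in FILE with
# VAR='SECONDS', appending the new assignment at the end of the file.
# set_zfs_sleep is a hypothetical helper, not part of the ZFS packages.
set_zfs_sleep() (
    file=$1 var=$2 secs=$3
    tmp=$(mktemp) || exit 1
    # Keep every line except old assignments of VAR (commented or not).
    grep -v "^[#[:space:]]*${var}=" "$file" > "$tmp"
    printf "%s='%s'\n" "$var" "$secs" >> "$tmp"
    # Write back through the original path so ownership and mode are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
)

# Example (as root), followed by the initramfs rebuild from the second step:
#   set_zfs_sleep /etc/default/zfs ZFS_INITRD_PRE_MOUNTROOT_SLEEP 10
#   update-initramfs -u
```

The value 10 is just a starting point; a slower controller may need more, and a value that is too high only costs boot time.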
 

fabian

regarding the boot hang screenshot in Cup 'O Joe's post: the (potentially) interesting part of that trace is cut off..
 

Cup 'O Joe

Thanks for your reply Fabian, I'm sorry that I've been unclear. :E

On either of the servers, under the new kernel I cannot boot successfully even if I manually import the pool. It always hangs at the point shown in the screenshot - I've left it running for over an hour and nothing changed.

The problem I mentioned with upgrading the rpool is that now it's no longer possible to mount it as read/write under an old kernel, as the newer pool version isn't supported under the older kernel. So the new kernel won't boot, and while the old one will, I cannot mount rpool as RW on that server due to the pool upgrade. Thanks again for your reply.
 

fabian

the question is - why can't you boot under the new kernel? we'd need a full error message / stack trace to proceed further..
 

cipwurzel

Member
Sep 8, 2017
15
4
23
Setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=10 solved the problem.

It seems that my SAS HBA comes up too slowly. The system SSDs for the rpool are connected directly to the mainboard.

I have attached a boot log (serial console).
 

Attachments

  • start.log
    15.7 KB · Views: 21

Cup 'O Joe

Okay, even though I've already settled for just restoring from backup, I'd like to get you what you need to look at in case it may be of help to others. What would be the best way to do that? Sorry for the noob question, but would a boot.log or a journalctl dump be appropriate?
 

fabian

if the stack trace is contained in the journal, then yes, a full journal dump might be helpful. otherwise, you can try getting console output via a serial console or try the systemd debug shell (https://freedesktop.org/wiki/Software/systemd/Debugging)
 

Cup 'O Joe

Sorry for the wait there. So what I ended up doing was to run a fresh install on the server - then upgrade it through apt-get, reboot, and then run dist-upgrade to pull in the new kernel again. Interestingly enough, I still have to manually import the rpool and it still hangs at the same point. I enabled multiple boot logs for journalctl and then rebooted. Here is the report from journalctl where the stop happens:

Oct 28 22:10:33 SAMSON systemd-journald[1573]: Journal started
Oct 28 22:10:33 SAMSON systemd-journald[1573]: Runtime journal (/run/log/journal/bc2f684e31ee4daf95e45c62410a95b1) is 8.0M, max 321.3M, 313.3M free.
Oct 28 22:10:33 SAMSON systemd-modules-load[1564]: Inserted module 'iscsi_tcp'
Oct 28 22:10:33 SAMSON systemd-modules-load[1564]: Inserted module 'ib_iser'
Oct 28 22:10:33 SAMSON systemd-modules-load[1564]: Inserted module 'vhost_net'
Oct 28 22:10:33 SAMSON systemd-udevd[1657]: Process '/bin/mount -t fusectl fusectl /sys/fs/fuse/connections' failed with exit code 32.
Oct 28 22:10:33 SAMSON systemd[1]: Starting Flush Journal to Persistent Storage...
Oct 28 22:10:33 SAMSON systemd[1]: Started Set the console keyboard layout.
Oct 28 22:10:33 SAMSON systemd-journald[1573]: Time spent on flushing to /var is 24.323ms for 877 entries.

Running with "systemd.debug-shell=1" specified in the kernel entry in grub resulted in:

Error #1: Failed to start udev Wait for Complete Device Initialization.

Error #2: Depend Dependency failed for Import ZFS pools by cache file.
Starting Activation of LVM2 logical volumes...
Starting Mount ZFS filesystems...
(3 of 3) A start job is running for Mount ZFS filesystems (3min / no limit)
(1 of 3) A start job is running for /dev/zvol/rpool/swap (3min 15s / 4min 30s)
(2 of 3) A start job is running for Activation of LVM2 logical volumes (3min 16s / no limit)
(3 of 3) A start job is running for Mount ZFS filesystems (3min 14s / no limit)

In case it may help, I'm running the OS off of two SSDs in a mirror, connected via an LSI SAS 9211-8i HBA.
 

fabian

thanks, that is the first halfway decent log.. so there apparently is a udev issue. can you try to boot with an added "udev.log-priority=debug" and then provide the journal of the unit "systemd-udevd"? e.g., find the boot index with "journalctl --list-boots" (0 is the current boot, -1 the one before, and so on), and then run "journalctl -bBOOTID -u systemd-udevd", where BOOTID is that index (note: no space between -b and the index).
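As a small illustration of the command form described above, this sketch only builds and prints the journalctl command for a given boot index, so it can be checked before it is run. The helper name udev_journal_cmd is made up for illustration.

```shell
#!/bin/sh
# udev_journal_cmd INDEX
# Prints the journalctl command for the systemd-udevd unit of boot INDEX
# (0 = current boot, -1 = previous boot, ...), with no space after -b.
# udev_journal_cmd is a hypothetical helper name.
udev_journal_cmd() {
    case ${1#-} in
        ''|*[!0-9]*)
            echo "usage: udev_journal_cmd INDEX (for example 0 or -1)" >&2
            return 1
            ;;
    esac
    printf 'journalctl -b%s -u systemd-udevd\n' "$1"
}
```

Running `journalctl --list-boots` first shows which index belongs to the failed boot; `udev_journal_cmd -1` then prints `journalctl -b-1 -u systemd-udevd`, which can be run as is or redirected to a file for attaching here.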
 

Cup 'O Joe

Okay so when running with the udev.log-priority tag in the kernel entry, the system boots to:

"Unloaded link configuration context. Unloaded link configuration context. Unloaded link configuration context..."

Repeated over and over, and then,

"worker [403] exited"

The same message also appeared for worker numbers 398, 399, 400, 401, 402, 404, and 405.

I performed a hard shutdown, rebooted under the older kernel, then ran "journalctl -b-1 -u systemd-udevd" which resulted in a single line:

Nov 01 07:50:21 GOLIATH systemd-udevd[1727]: Could not generate persistent MAC address for vmbr0: No such file or directory

For reference, this is after a fresh install. The only configuration made to a network adapter was the one done via the PVE installer at initial installation, and no issues exist when running under the older kernel. The two network adapters in the system are onboard Intel Corporation I350 Gigabit ports, on a Supermicro X9DR3-F motherboard running the latest BIOS version.
 

fabian

that message is unrelated (and harmless). I still cannot reproduce this issue on any of our machines, but it seems like at least one Debian user is affected as well. could you try commenting out the swap line in /etc/fstab, regenerating the initramfs, and rebooting to obtain a udev log again?
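A rough sketch of the fstab step, assuming the swap entry is a normal six-field line with "swap" as the filesystem type (for example the /dev/zvol/rpool/swap device seen in the boot messages earlier in the thread). The helper name comment_out_swap is made up for illustration; it only prints the modified file so the result can be reviewed before anything is overwritten.

```shell
#!/bin/sh
# comment_out_swap FSTAB
# Prints the contents of FSTAB with every active swap entry commented out.
# Lines that are already comments, and all non-swap lines, are printed unchanged.
# comment_out_swap is a hypothetical helper name.
comment_out_swap() {
    awk '$1 !~ /^#/ && $3 == "swap" { print "#" $0; next } { print }' "$1"
}

# Example (as root), keeping a backup of the original file:
#   cp /etc/fstab /etc/fstab.bak
#   comment_out_swap /etc/fstab.bak > /etc/fstab
#   update-initramfs -u
```

Restoring the original file afterwards is just a matter of copying /etc/fstab.bak back.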
 
