[SOLVED] Unable to boot from ZFS rpool after upgrading to PVE 4.2

May 3, 2016
Dear Proxmox community,
Dear Proxmox support team,

Today I upgraded to PVE 4.2 and I'm now having boot issues because of a device mapping mismatch. I installed the root file system on ZFS during the initial installation back in 2015. Now, when trying to boot PVE, I receive:
Code:
Message: cannot import `rpool`: one or more devices is currently unavailable
This is the output of the zpool status command before I started the upgrade:
Code:
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h23m with 0 errors on Sat Nov 21 11:42:06 2015
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
        logs
          sdh1      ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
        logs
          sdh2      ONLINE       0     0     0

errors: No known data errors
Now, the problem seems to be that for some reason my SSD cache device, which previously appeared as /dev/sdh, now appears as /dev/sdb. This in turn leads to the error message above, because what was previously /dev/sdb now appears as /dev/sdc. Please note that I added the SSD only recently, shortly before upgrading to 4.2, and not during the initial PVE 4.0 installation.
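To see which kernel name each disk currently has, the stable /dev/disk/by-id links can be listed, e.g.:
Code:
ls -l /dev/disk/by-id/ | grep -v part   # whole-disk links and where they point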

To get the issue resolved I first tried to simply remove the SSD cache device, but this led to another problem: I was dropped to a GRUB rescue prompt telling me:
Code:
error: no such device: 73d765c196502f2c.
Entering rescue mode...
I'm not sure, but it seems that GRUB has been written to the SSD cache device during the upgrade or during a kernel update before the upgrade to 4.2.
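A rough way to check which disks actually have GRUB boot code in their MBR (assuming BIOS/MBR boot, as in my setup) would be something like:
Code:
for d in /dev/sd?; do
    printf '%s: ' "$d"
    dd if="$d" bs=512 count=1 2>/dev/null | strings | grep -c GRUB
done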

Any ideas on how to resolve the issue described?

Any help will be greatly appreciated!

Best regards,

Anymemm
 
GRUB unfortunately does not handle missing devices gracefully. Is this message
Message: cannot import `rpool`: one or more devices is currently unavailable
followed by an initramfs prompt where you can enter commands?

If yes, you should be able to run "zpool import" to see what ZFS detects at that stage, and if it is just the log device that is missing, you should be able to import the rpool using "zpool import -m rpool", then "exit" and continue booting.
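In other words, at the initramfs prompt something like this should work (a sketch, assuming only the log device is missing):
Code:
zpool import            # shows which pools ZFS detects at this stage
zpool import -m rpool   # import the root pool despite the missing log device
exit                    # leave the shell and continue booting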

If not, please describe exactly at what point in the boot process this message occurs (i.e., before or after the GRUB menu, before or after the initramfs, ...).
 
Hello Fabian,
GRUB unfortunately does not handle missing devices gracefully. Is this message followed by an initramfs prompt where you can enter commands?

I was receiving this message after the initramfs, but that was before I had removed the SSD cache device. Because the device mappings were messed up at that point, I wasn't able to import the rpool, and I didn't want to force the import. In hindsight, I think the reason for the messed-up device mappings was that I had hot-added the SSD cache. During normal operation the SSD received the device name /dev/sdh, and I was incautious enough to use the /dev/sdh1 and /dev/sdh2 partitions as log devices for both zpools. But then, after the reboot, the SSD received another device name. I'm almost sure this problem would never have occurred if I had used the /dev/disk/by-id/<ID>-part1 and /dev/disk/by-id/<ID>-part2 equivalents when adding the log devices to the zpools.
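For illustration, adding the log devices with stable names would have looked something like this (the device ID below is just a placeholder, not my real SSD):
Code:
zpool add rpool log /dev/disk/by-id/ata-EXAMPLE_SSD_SERIAL-part1
zpool add tank log /dev/disk/by-id/ata-EXAMPLE_SSD_SERIAL-part2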

Anyway, this is how I managed to recover from this "I have a shiny new PVE 4.2 installed, but I'm not able to boot..." situation:

First I removed the SSD cache device and tried to boot the PVE 4.2 installer in debug mode in order to remove the log devices from both zpools and then re-install GRUB. For some unknown reason, after reaching the first shell prompt it was simply impossible to input anything via keyboard, no matter whether I used the HP iLO Java console or a directly attached USB keyboard; I was simply stuck at that prompt. Pretty strange and discouraging, but I had to find some way out of this...

After a short search on the Internet I found a SystemRescueCD with ZFS 0.6.5 built into it. Thanks to the Funtoo Linux folks out there for providing it to the community. After booting the SystemRescueCD I proceeded as follows, and finally managed to do what I had hoped to do with the PVE 4.2 installer:
Code:
root@sysresccd /root % zpool import -a
cannot import 'rpool': pool may be in use from other system, it was last accessed by pve (hostid: 0xa8c02302) on Tue May  3 08:49:50 2016
use '-f' to import anyway
cannot import 'tank': pool may be in use from other system, it was last accessed by pve (hostid: 0xa8c02302) on Tue May  3 08:47:44 2016
use '-f' to import anyway
root@sysresccd /root % zpool import -a -f
The devices below are missing, use '-m' to import the pool anyway:
    sdh1 [log]

cannot import 'rpool': one or more devices is currently unavailable
The devices below are missing, use '-m' to import the pool anyway:
    sdh2 [log]

cannot import 'tank': one or more devices is currently unavailable
root@sysresccd /root % zpool import -a -f -m
cannot mount '/': directory is not empty
root@sysresccd /root % zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h23m with 0 errors on Sat Nov 21 10:42:06 2015
config:

        NAME                      STATE     READ WRITE CKSUM
        rpool                     DEGRADED     0     0     0
          mirror-0                ONLINE       0     0     0
            sda2                  ONLINE       0     0     0
            sdb2                  ONLINE       0     0     0
        logs
          10008429943656705133    UNAVAIL      0     0     0  was /dev/sdh1

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      DEGRADED     0     0     0
          mirror-0                ONLINE       0     0     0
            sdc                   ONLINE       0     0     0
            sdd                   ONLINE       0     0     0
          mirror-1                ONLINE       0     0     0
            sde                   ONLINE       0     0     0
            sdf                   ONLINE       0     0     0
        logs
          4585512667122472357     UNAVAIL      0     0     0  was /dev/sdh2

errors: No known data errors
So, both zpools have been imported, but they are in a degraded state because of the missing SSD log devices. Removing them is pretty simple:
Code:
root@sysresccd /root % zpool remove rpool 10008429943656705133
root@sysresccd /root % zpool remove tank 4585512667122472357
root@sysresccd /root % zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h23m with 0 errors on Sat Nov 21 10:42:06 2015
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0

errors: No known data errors
OK, now both zpools seem to be doing fine. Let's try to re-install GRUB on both rpool disks as per onlime's post:
Code:
root@sysresccd /root % mkdir /mnt/pve
root@sysresccd /root % zfs set mountpoint=/mnt/pve rpool/ROOT/pve-1
root@sysresccd /root % zfs mount rpool/ROOT/pve-1
root@sysresccd /root % mount -t proc /proc /mnt/pve/proc
root@sysresccd /root % mount --rbind /dev /mnt/pve/dev
root@sysresccd /root % mount --rbind /sys /mnt/pve/sys
root@sysresccd /root % chroot /mnt/pve /bin/bash
root@sysresccd:/# source /etc/profile
root@sysresccd:/# grub-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.
root@sysresccd:/# grub-install /dev/sdb
Installing for i386-pc platform.
Installation finished. No error reported.
root@sysresccd:/# update-grub
update-grub  update-grub2
root@sysresccd:/# update-grub2
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.4.6-1-pve
Found initrd image: /boot/initrd.img-4.4.6-1-pve
Found linux image: /boot/vmlinuz-4.2.8-1-pve
Found initrd image: /boot/initrd.img-4.2.8-1-pve
Found linux image: /boot/vmlinuz-4.2.6-1-pve
Found initrd image: /boot/initrd.img-4.2.6-1-pve
Found linux image: /boot/vmlinuz-4.2.3-2-pve
Found initrd image: /boot/initrd.img-4.2.3-2-pve
Found linux image: /boot/vmlinuz-4.2.2-1-pve
Found initrd image: /boot/initrd.img-4.2.2-1-pve
Found memtest86+ image: /ROOT/pve-1@/boot/memtest86+.bin
Found memtest86+ multiboot image: /ROOT/pve-1@/boot/memtest86+_multiboot.bin
done
root@sysresccd:/# update-initramfs -u
update-initramfs: Generating /boot/initrd.img-4.4.6-1-pve
root@sysresccd:/# exit
root@sysresccd /root % umount /mnt/pve/sys/fs/fuse/connections
root@sysresccd /root % umount /mnt/pve/sys/kernel/config
root@sysresccd /root % umount /mnt/pve/sys/kernel/debug
root@sysresccd /root % umount /mnt/pve/sys/kernel/security
root@sysresccd /root % umount /mnt/pve/sys
root@sysresccd /root % umount /mnt/pve/dev/shm
root@sysresccd /root % umount /mnt/pve/dev/pts
root@sysresccd /root % umount /mnt/pve/dev/mqueue
root@sysresccd /root % umount /mnt/pve/dev
It is very important to move the /etc/zfs/zpool.cache file out of the way, because otherwise subsequent attempts to boot PVE 4.2 would fail with a kernel stack trace and systemd waiting forever for the zpools to be imported:
Code:
root@sysresccd /root % mkdir /mnt/pve/root/backup && mv /etc/zfs/zpool.cache /mnt/pve/root/backup/zpool.cache
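If a cache file is wanted again later, it can be regenerated with the then-correct device paths once the system has booted successfully, e.g.:
Code:
zpool set cachefile=/etc/zfs/zpool.cache rpool
zpool set cachefile=/etc/zfs/zpool.cache tank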
After unmounting rpool/ROOT/pve-1 it is important to set the mountpoint back to /, as it was before:
Code:
root@sysresccd /root % umount /mnt/pve
root@sysresccd /root % zfs set mountpoint=/ rpool/ROOT/pve-1
Now I was finally able to boot PVE 4.2, and after running a scrub on both zpools they look like this:
Code:
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Tue May  3 17:23:00 2016
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 2h9m with 0 errors on Tue May  3 19:29:49 2016
config:

        NAME                                          STATE     READ WRITE CKSUM
        tank                                          ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WMC123456789  ONLINE       0     0     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WMC223456789  ONLINE       0     0     0
          mirror-1                                    ONLINE       0     0     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WMC323456789  ONLINE       0     0     0
            ata-WDC_WD30EFRX-68AX9N0_WD-WMC423456789  ONLINE       0     0     0

errors: No known data errors
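(For reference, the scrubs were started with the usual commands:)
Code:
zpool scrub rpool
zpool scrub tank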

So, after about 6 hours of downtime I finally managed to recover from this silly device mapping mismatch, and I'm now enjoying the new PVE 4.2, whose web UI, BTW, looks pretty modern now! :)

Any comments and ideas on how to avoid such a scenario in the future will be highly appreciated!

Also, it would be good if the PVE installation ISO/CD included some sort of installer-independent rescue system, just in case of emergency. As experience shows, Murphy is out there to get you!

Best regards,

Anymemm
 