[SOLVED] ZFS-related tasks hang after update

Deller

Active Member
Oct 22, 2016
Hello,

Since I updated my Proxmox server (Proxmox on a ZFS mirror with 2 disks) yesterday evening, it no longer boots properly.

For example, the unit zfs-import-scan.service hangs with:
Code:
pve kernel: INFO: task zpool:2858 blocked for more than 120 seconds.
pve kernel:  Tainted: P  O  4.4.21-1-pve #1
When I edit the kernel command line at startup and append a "1" to boot into the systemd rescue target, everything is fine at first.
Once I execute zpool import, the process hangs again. If I continue booting from the rescue target to the default target, the Web UI is started normally and the server is accessible via SSH.
But everything related to ZFS storage literally "hangs" again.
Booting an older kernel does not help either.

In the meantime I have installed Proxmox 4.3-e7cdc165-2 on a USB stick. If I boot Proxmox from it, I can access the ZFS pool without problems, which is why I suspect the update is to blame, especially as it included many ZFS-related packages.

The following packages were installed/updated:

Install:
Code:
libzfs2linux:amd64 (0.6.5.8-pve11~bpo80, automatic),
pve-kernel-4.4.21-1-pve:amd64 (4.4.21-70, automatic),
zfs-zed:amd64 (0.6.5.8-pve11~bpo80, automatic),
libuutil1linux:amd64 (0.6.5.8-pve11~bpo80, automatic),
zfsutils-linux:amd64 (0.6.5.8-pve11~bpo80, automatic),
libzpool2linux:amd64 (0.6.5.8-pve11~bpo80, automatic),
libnvpair1linux:amd64 (0.6.5.8-pve11~bpo80, automatic)
Upgrade:
Code:
libpve-common-perl:amd64 (4.0-75, 4.0-76),
zfs-initramfs:amd64 (0.6.5.7-pve10~bpo80, 0.6.5.8-pve11~bpo80),
libpve-storage-perl:amd64 (4.0-66, 4.0-67),
pve-manager:amd64 (4.3-3, 4.3-7),
smartmontools:amd64 (6.3+svn4002-2+b2, 6.5+svn4324-1~pve80),
qemu-server:amd64 (4.0-91, 4.0-92),
zfsutils:amd64 (0.6.5.7-pve10~bpo80, 0.6.5.8-pve11~bpo80),
spl:amd64 (0.6.5.7-pve6~bpo80, 0.6.5.8-pve7~bpo80),
lxcfs:amd64 (2.0.4-pve1, 2.0.4-pve2),
pve-qemu-kvm:amd64 (2.6.2-2, 2.7.0-3),
libnvpair1:amd64 (0.6.5.7-pve10~bpo80, 0.6.5.8-pve11~bpo80),
libuutil1:amd64 (0.6.5.7-pve10~bpo80, 0.6.5.8-pve11~bpo80),
libzpool2:amd64 (0.6.5.7-pve10~bpo80, 0.6.5.8-pve11~bpo80),
proxmox-ve:amd64 (4.3-66, 4.3-70),
pve-docs:amd64 (4.3-5, 4.3-12),
libzfs2:amd64 (0.6.5.7-pve10~bpo80, 0.6.5.8-pve11~bpo80),
pve-firmware:amd64 (1.1-9, 1.1-10)

Any hints on how to get out of this mess? Or at least on how I can back up my VMs via the USB stick Proxmox?

Thank you!
 
Hi Dietmar, thank you for the quick reply!

Unfortunately the suggested "zpool import -N rpool" did not help: the zpool task still hangs after a few seconds of activity on the HDDs.
The scrub of the pool completed without errors, though.
"zpool status" still reports the pool as online and without errors, and "zfs list" still shows all datasets.

But once I issue a "zpool import", nothing ZFS-related is working anymore.

I remember that a "zpool import -R /mnt/zfs rpool" worked from the USB stick Proxmox installation yesterday; after that I could navigate through the datasets and inspect their contents. But even that is not working anymore today.

Right now I am running "zdb -e -p /dev/disk/by-id rpool"; perhaps this will give me some insight.

Any more hints on what I could try?
 
Maybe a hardware problem with the disks? Can you see any disk-related error messages in the logs?
 
In the logs I see no disk-related problems.
I also ran a quick memtest, which showed no problems either.
The server is merely 9 months old and the HDDs are even newer, but you never know...
zdb also reported no errors, as far as I could see.
 
Have you checked with SMART? Are there any errors on the disks?

smartctl -a /dev/XXX
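
A minimal sketch of such a check for both disks of the mirror. The device names /dev/sda and /dev/sdb and the DRY_RUN switch are assumptions for illustration only, not taken from this thread; by default the commands are just printed, not executed.

```shell
#!/bin/sh
# DRY_RUN is a hypothetical safety switch for this example:
# 1 (default) only prints each command, 0 executes it (needs root).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Query one disk: overall health verdict first, then the full SMART report.
check_disk() {
    run smartctl -H "$1"
    run smartctl -a "$1"
}

# Assumed device names; replace them with the disks of your mirror.
DISKS="/dev/sda /dev/sdb"

for disk in $DISKS; do
    check_disk "$disk"
done
```

A short self-test can be started with "smartctl -t short /dev/XXX", and its result read afterwards with "smartctl -l selftest /dev/XXX".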
 
Hi fireon,

I just checked, there are no SMART errors whatsoever, both HDDs report that they are healthy.
I even conducted a self-test, which also showed no errors.
 
Strange. I upgraded a ZFS box here to the current "nosubscriptionrepo" packages, and it is working fine. Did you use "pveupgrade" (apt dist-upgrade) to update your server?
 
I am currently trying to reproduce this and narrow down the cause. Do you have VMs with ZFS pools inside? If yes, are any of the vdevs of those pools ZFS zvols on the host?
 
@fireon: I did the update via the GUI, which ran an apt-get dist-upgrade, also against the nosubscriptionrepo. After the reboot the boot process hung with timeouts....

@fabian: I just saw the other thread "Failed To Import Pool 'rpool'", which made me curious, as I also experienced the "A start job is running for Import ZFS pools by devic...", but my server runs into the "task hangs" problem afterwards.

But I do indeed have a VM with FreeNAS, and it should use a zvol from the Proxmox ZFS, although I do not mount the FreeNAS pool inside Proxmox.

Unfortunately I won't be at home until Thursday evening, so I can't try anything before then :(
 
There are basically two ways to fix this:
- Boot from a live environment with ZFS support, chroot into your PVE installation, disable the zfs-import-scan service, reboot into PVE, install updates, re-enable the zfs-import-scan service, and reboot again.
- Boot into emergency mode (in the GRUB menu, hit "e", add " emergency" to the end of the line starting with "linux", then press ctrl+x), enter your root password when prompted, and disable the zfs-import-scan service ("systemctl disable zfs-import-scan"). Continue the boot ("exit" or ctrl+d). Now you should be able to install the update with the fix, then re-enable the zfs-import-scan service ("systemctl enable zfs-import-scan") and reboot the host.

The second is easier (I just described it in more detail).
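
The second way could be sketched roughly as below, split into the part done in emergency mode and the part done after the normal boot. The DRY_RUN switch and the function names are hypothetical additions for this example, and "apt-get update" followed by "apt-get dist-upgrade" is assumed as the way to install the fixed packages; by default nothing is executed, the commands are only printed.

```shell
#!/bin/sh
# DRY_RUN is a hypothetical safety switch for this example:
# 1 (default) only prints each command, 0 executes it (needs root).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Phase 1: run inside the emergency shell, then leave it with "exit" or ctrl+d.
in_emergency_mode() {
    run systemctl disable zfs-import-scan
}

# Phase 2: run after the host has booted normally.
# The steps are chained with && so a failed upgrade never leads to a reboot.
after_normal_boot() {
    run apt-get update &&
    run apt-get dist-upgrade &&
    run systemctl enable zfs-import-scan &&
    run reboot
}

# Preview both phases. With DRY_RUN=0, call only the function that
# matches the phase you are currently in instead of this preview.
if [ "$DRY_RUN" = "1" ]; then
    in_emergency_mode
    after_normal_boot
fi
```

Keeping the two phases in separate functions mirrors the fact that they happen in different boots: the disable step belongs to the emergency shell, everything else to the normally booted system.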
 
@fabian: That did it, thanks a lot! So was it just the import via "-d /dev/disk/by-id" or was there something else involved?
 
The two forum threads that popped up because of the missing "-d /dev/disk/by-id" actually led to the discovery of an Ubuntu kernel bug (https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1636517), which will be fixed in the next round of updates. It's a rare enough setup, and a command that makes little sense on a running system, so it seems nobody noticed ;) The "-d /dev/disk/by-id" works around the issue because there are no zvols with ZFS on them in that directory, but it also makes device scanning on boot potentially faster (and usually you want to import pools with /dev/disk/by-id names for readability).
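
As an illustration, an import restricted to /dev/disk/by-id might look like the sketch below, e.g. from a rescue or USB stick environment where the pool is not yet imported. The pool name rpool and the -N flag (import without mounting datasets) are taken from earlier in this thread; the DRY_RUN switch and the function names are hypothetical, and by default the commands are only printed.

```shell
#!/bin/sh
# DRY_RUN is a hypothetical safety switch for this example:
# 1 (default) only prints each command, 0 executes it (needs root).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Only devices in this directory are scanned, so zvols on the host are skipped.
SCAN_DIR="/dev/disk/by-id"

# Without a pool name, "zpool import" only lists the pools it could import.
list_importable() {
    run zpool import -d "$SCAN_DIR"
}

# Import the named pool without mounting any of its datasets (-N).
import_pool() {
    run zpool import -d "$SCAN_DIR" -N "$1"
}

list_importable
import_pool rpool
```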
 
