[SOLVED] ZFS rpool unavailable - message: cannot import pool no such pool or dataset

Joris L.

This thread documents what happened and how it was fixed, in case it helps someone.

Running pve-manager/8.4.1/2a5fa54a8503f96d (running kernel: 6.11.11-2-pve) in a single-host set-up for a lab environment.
The system is typically stable, until it is not. After a few 'hangs' this resulted in a non-bootable system.

As a precaution I've now switched back to booting the default 6.8.x kernel; somehow the system kept reverting to the 6.11 kernel, which should now be fixed.

The on-screen message "cannot import pool no such pool or dataset" was followed by a suggestion to restore from backup or to try a manual import.

Status

Right now, the system is functioning well again and no data was lost.
What made the system bootable again is not 100% determined, nor is the root cause.
I will spare you the few hours of sweating and trial and error.

Observations
  • zpool status (obviously) failed because no pool was found
  • manually executing zpool import did not result in an online rpool; a single partition was reported as offline (see the command sketch after this list)
  • the pool was reported as UNAVAILABLE
  • GAP: while not observed initially, one of the disks was simply not showing up anymore
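For reference, a rough sketch of what the diagnosis looked like; this is not the exact session, and the flags actually used may have differed:

Code:
# Inspect pool state (failed here: no pool found)
zpool status

# Scan for importable pools; rpool showed up as UNAVAILABLE
zpool import

# Attempt a manual import of the root pool without mounting its datasets
zpool import -N -f rpool

# Check which disks and partitions the kernel actually sees
lsblk -o NAME,SIZE,TYPE,MODEL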
Solution
  • booted the system with a simple live system
  • used gdisk to restore the GPT partition table backup (sketched after this list)
  • powered off the machine and the PSU (mentioned for completeness)
    • left it powered off for 10-20 seconds (to clear memory state; optional)
  • powered on the machine
  • boot is okay now
    • the missing disk is visible again and the unreachable partition is available again
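For completeness, a minimal sketch of that restore step from the live system; the device name and backup file path below are placeholders, not the actual ones used:

Code:
# Non-interactive restore of a GPT backup created with 'sgdisk -b' (paths are placeholders)
sgdisk --load-backup=/path/to/gptbackup/HOST_DATE_sda /dev/sda

# Or interactively with gdisk: 'r' (recovery menu), 'l' (load backup file), 'w' (write)
gdisk /dev/sda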
Script

To make sure any damage to the GPT partition tables can be recovered, I wrote a simple script to back them up.

Create the target directory first: mkdir gptbackup

Code:
#!/bin/bash

# Back up the GPT partition table of every physical disk with sgdisk.
# zvols (zdNNN) are skipped; backups land in ./gptbackup as HOST_DATE_DISK.

disks=$(lsblk -p --nodeps --noheadings | grep -v zd | cut -d" " -f1)
h=$(hostname -s)
p="gptbackup"
dat=$(date --iso-8601)

mkdir -p "$p"

for d in $disks
do
        x=$(basename "$d")
        sgdisk -b "$p"/"$h"_"$dat"_"$x" "$d"
done

There should now be a backup of the GPT table of each physical disk in the gptbackup folder.
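To keep those backups current, a cron entry along these lines could work; the script location /root/bin/gptbackup.sh and the schedule are just examples:

Code:
# /etc/cron.d/gptbackup -- daily GPT backup (example path and schedule)
# The script writes to ./gptbackup, so first cd to the directory that holds it.
0 3 * * * root cd /root && /root/bin/gptbackup.sh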

Concerns and considerations

Unless there is malicious activity on this system (not impossible, but considered unlikely), running the 6.11 kernel carries real-world risks; better not to.
This system is not (yet) running with ECC RAM and suffered from multiple "freeze" situations, which can 'mangle' partition tables.
The fact that one disk was simply 'not visible' across multiple reboots remains a cause for concern, for which I have no answer.
In case the 6.8 kernel shows a similar situation I'll update this thread.
For now I assume this is resolved by switching to the default kernel again.
 
UPDATE: no improvement; a similar system "semi-hang" has occurred twice already, using proxmox-default-kernel (6.8) (KSM enabled).

Only TTY1 responded; the other TTYs were neither visible nor accessible. SysRq keys did respond.
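For reference, the SysRq interface can be checked and driven from a still-responsive TTY roughly like this (a generic sketch, not necessarily the exact keys used here):

Code:
# Check whether the magic SysRq key is enabled (non-zero means at least partly enabled)
cat /proc/sys/kernel/sysrq

# Emergency sync, remount read-only, then reboot (equivalent to Alt+SysRq+s, u, b)
echo s > /proc/sysrq-trigger
echo u > /proc/sysrq-trigger
echo b > /proc/sysrq-trigger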

The error message "ZFS rpool unavailable - message: cannot import pool no such pool or dataset" reappeared.

Oddly, this appears memory-related, since leaving the system off for some time fixes the issue and boot resumes.

Another odd issue with a VM, reported as "Jun 17 12:16:14 pve kernel: Buffer I/O error on dev zd202, logical block 10501392, async page read".

Oddly, this is not just visible on Proxmox VE; it is also visible inside the VM on its storage devices as "kernel: I/O error, dev vda, sector NNNNNNNN at 0x0:(READ) flags 0x0 phys_seg 1 prio class 2" for the /boot partition of the VM.

Just as odd: running fsck /boot/efi results in a prompt to recover the boot sector due to "differences between boot sector and its backup", suggesting this is mostly harmless (Differences: (offset:original/backup) 65:01/00).
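A sketch of how that check can be reproduced; the EFI partition device inside the VM is an assumption (often /dev/vda2 or similar on a virtio disk):

Code:
# Read-only check of the EFI system partition (no changes are written)
fsck.fat -n /dev/vda2

# Interactive repair; fsck.fat offers to copy the backup boot sector over the primary one
fsck.fat -r /dev/vda2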

Though I lack the time and resources to pin it all down, this is a first for me:
  • Proxmox VE semi-hang (exact cause still unknown)
  • Recovered from the ZFS pool break by restoring the partition table from backup for a single partition (or so I thought)
  • Out of precaution, kept the machine powered off for 20 seconds
  • Reboot showed a functional Proxmox VE again (the disk which was 'gone' is back again)
Now I learn
  • leaving the machine off for some time (30 seconds or more) and booting it again fixes the same as above, no other actions required
  • a VM is showing I/O errors for a boot partition
  • this error is visible outside of the VM
  • running fsck suggests the boot/efi partition is defective
    • the odd part is that the partition repair requires restoring the backup of the partition table (ironically the same as for the first repair attempt)
As I vaguely remember prior issues with KSM, I wonder if it is best to disable this feature or to tweak it to be less aggressive / risky.
To me, the above hints at something 'funky' in the memory management, likely a bug or weakness at play.
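In case someone wants to try the same, a sketch of disabling or toning down KSM on a Proxmox VE host; the KSM_THRES_COEF value is just an example:

Code:
# Disable KSM entirely: stop the tuning daemon and unmerge already-shared pages
systemctl disable --now ksmtuned
echo 2 > /sys/kernel/mm/ksm/run    # 2 = stop KSM and unmerge all merged pages

# Or make it less aggressive: lower KSM_THRES_COEF in /etc/ksmtuned.conf
# so KSM only kicks in under higher memory pressure, e.g.
# KSM_THRES_COEF=10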
 
UPDATE: based on researching logs and error messages (journalctl -b all -p warning -f), I've come to the "educated assumption" that the issue is either entirely unknown at this time or caused by the Debian VM's /dev/sda I/O errors replicating onto the Proxmox VE ZFS pool, leading to system instability.

The main reason is that the I/O errors for the /boot partition are 'refracted' onto the Proxmox ZFS side; to me this is demonstrable because the error only shows up after decrypting the VM for boot.

Furthermore, when leaving the Debian VM enabled, later when the 'Buffer I/O error ... async ...' message repeats, there is also mention of zed as a source for err=52, which is a class=data error. To my understanding this can potentially be resolved by means of a resilver.
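For the record, a minimal sketch of how such a data error can be inspected and, since all devices are present, cleaned up with a scrub (which serves the same self-healing purpose as a resilver when no device needs replacing); pool name rpool as used above:

Code:
# Show detailed pool status, including any files affected by data errors
zpool status -v rpool

# Inspect the ZFS event log that zed reports from (the err=... class=data entries)
zpool events -v

# Re-read all data and repair what the pool's redundancy allows
zpool scrub rpool

# Once the scrub comes back clean, clear the logged error counters
zpool clear rpool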
 