ZFS: "no pools available", yet pool shows ONLINE in import status; I/O error on import attempt

colinstu

New Member
Jul 29, 2023
4
1
3
I have a nearly two-year-old Proxmox install (installed with the latest version at the time, and kept up to date since).
It was built from the beginning with two NVMe SSDs in a ZFS mirror / RAID1, with root on ZFS as well. This was all set up through the Proxmox installer and I haven't had any issues.
About a week ago I enabled MSISupported (in the registry) for a Windows 11 VM, for an AMD GPU and audio device that have been successfully passed through since near the beginning. I only gave this new setting a shot because I noticed occasional audio dropouts and the internet seemed to suggest this would be a fix.
Things seemed to be going fine, up until last night.

I discovered that all my other VMs/CTs were no longer accessible from the internet or the network, and I could no longer reach the Proxmox web UI.
So I checked the local console on the host and found a stack trace had been output to the screen and the machine was hard-locked. It would not respond to the keyboard or even a press of the power button, so I held power until it powered off.

Upon booting back up, my ZFS pool (named rpool) can no longer be imported by Proxmox. It loads the ZFS module, runs the command /sbin/zpool import -c /etc/zfs/zpool.cache -N 'rpool', and returns right away with: cannot import 'rpool': I/O error, Destroy and re-create the pool from a backup source. It says the same for the cachefile attempt too.
GREAT.
At this point I don't know if this is due to the change above, the (random?) lockup (which may have been due to the AMD GPU reset bug? I was not aware of this or its fix), or the forced power-off after that lockup.

I'm dumped at the initramfs prompt.
zpool list and zpool status (with and without -v) both return 'no pools available'.
'zpool import' returns the name of my pool (rpool) and even says it's ONLINE; it lists the mirror and both NVMe devices (appearing as nvme-eui....), and all of them appear ONLINE.
I checked 'ls -l' under /dev/disk/by-id/ and verified that symlinks still exist for all the corresponding entries; everything seems to line up. (Note that in addition to my nvme-eui entries there are also nvme-Samsung entries, both with and without a trailing _1. All of them have corresponding _part1, _part2, _part3 entries too; three partitions is the stock layout I believe, with the third being the main/biggest one.)

From a separate live CD (Ubuntu Server 24.04 LTS, switching from the installer to a new TTY), I get exactly the same output for zpool list, status, and import. 'lsblk' still shows my NVMe devices and their partitions (though they are not mounted there, and I can't run lsblk from initramfs).
I've run 'smartctl' and verified that all the SMART output there looks fine/correct, and it does. 'nvme list' returns correct-looking information, and 'nvme smart-log' shows nothing alarming in its SMART output either. 'zdb -C -e -p /dev/nvme0n1 rpool' returns a configuration that appears to be in order as well (both children/disks listed, paths matching the earlier output, nothing else alarming, etc.).
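For reference, the checks I ran from the live environment looked roughly like this (device names are from my box and will differ elsewhere):

smartctl -a /dev/nvme0n1         # full SMART report for the first NVMe
smartctl -a /dev/nvme1n1         # and the second mirror member
nvme list                        # controller/namespace overview
nvme smart-log /dev/nvme0n1      # NVMe-native SMART log
zdb -C -e -p /dev/nvme0n1 rpool  # pool configuration read from the on-disk labels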

I tried booting FreeBSD 14.1 earlier but ran into unrelated startup issues (BSD stuff). I can share more, but it doesn't seem related to any of this.
GParted (live CD) showed both devices just fine too.

I've updated the BIOS on the motherboard and set everything back to match the settings from before the update; no change.

I've removed the GPU card entirely and tested all this as well, no change.

I've removed all but one stick of RAM; no change. I've run memtest; no errors were found.

I've been chronicling all/most of this over here so far, along with comments and help from other folks; for more pics/output please review these threads.
Initial call for help:
https://furry.engineer/@colinstu/113315894821471446
Next (this) morning after reviewing responses / current status:
https://furry.engineer/@colinstu/113317908970470143
More replies/comments follow in those threads.

I am at a loss at this point. What can I do? Is there still any way I can repair or fix my existing ZFS pool 'rpool'?

All of my VMs and LXCs are backed up to an 8TB WD SATA drive that is also present in the machine. My Proxmox config / ZFS itself, however, was not backed up.

1) If I can recover w/o reinstalling, this would be the most ideal. What can I try?
2) If I can't recover the pool itself, how can I backup or save all/most of any relevant proxmox config/data that's not included in the guests themselves (host data)? I very much appreciate the steps to do this, or linking to guides/steps elsewhere if they exist.
3) Once #2 is complete above, then I suppose I would feel safe reinstalling Proxmox at that point. Any specific steps on this being different vs a fresh install w/no intent on restorations? Just want to make 100% sure that I don't somehow blow away or mess up my backup drive(s). Eventually I should be able to restore all my VM/LXC backups?

Also, is there anything I can check on the old pool to find out what caused the lockup in the first place, or what caused this ZFS pool issue?

I appreciate any/all help. Thank you.
 

Attachments

  • 2024-10-17 01.00.53r.jpg
  • 2024-10-17 01.03.24.jpg
I believe this is the output for the one I ran a few hours ago. (The command was cut off... if this isn't it, I'll gather that too.)

Also including it for 'zdb -C -e -p'.

(Also, apologies for the pics; some sort of PiKVM is definitely coming in my future.)

I see you have tried quite a lot, but I have not seen you attempt -o readonly=on for any of the imports, correct?

Obviously, all this now only makes sense on, e.g., that live Ubuntu, so forget the zpool.cache, etc.
 
(Also, apologies for the pics; some sort of PiKVM is definitely coming in my future.)

Also, you don't need this. From any live system, I suppose you have network (or can dump it onto a USB drive), so just do:

command | tee output.txt

That way you both get to see what it did and have it saved in a file you can paste here.
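For example (the output file names here are just placeholders):

zpool import 2>&1 | tee zpool-import.txt      # what zpool sees, without importing anything
zdb -C -e rpool 2>&1 | tee zdb-C-rpool.txt    # the label/config dump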
 
I just popped into your linked discussions; you already got some good pieces of advice there, and I am not sure which of it you ran and which you did not.

I would completely disregard the attempt to get it working from the initramfs if that pool is not normally importable even on a separate system.

As I have no idea in what ways the ZFS shipped with PVE might differ from stock Ubuntu, what I would do is make myself a tiny install of PVE somewhere on the side, just basic ext4. In my past experience the installer's rescue mode was useless, but having a small extra PVE install helps; boot into that. (BTW, this is why I would prefer NOT to have root on ZFS even if you use ZFS for everything else: you just don't have to do this step.)

This is now better than the live Ubuntu in that your tooling is exactly as it should be.

Try importing the rpool there. (This is why I suggested an ext4 install for that rescue system: you won't mix things up.)

You can get increasingly aggressive with importing that pool (setting aside the forensics-preserving approach), but in the end you may just be happy to be able to mount it at all, even in some inconsistent state, to copy out what you cannot get from backups (more on that below).

So when I look at your priorities:

1) If I can recover w/o reinstalling, this would be the most ideal. What can I try?

You are already past that (this is not an initramfs issue), and reinstalling the host (when you can just implant configs and VMs from backups) is fast. There's nothing valuable on the root that is not in a fresh ISO install, except the configs.

2) If I can't recover the pool itself, how can I backup or save all/most of any relevant proxmox config/data that's not included in the guests themselves (host data)? I very much appreciate the steps to do this, or linking to guides/steps elsewhere if they exist.

I would try (all on the rescue system now) to go down the importing rabbit hole [1], so first and foremost:

zpool import -o readonly=on rpool

It will probably ask you to add -f on its own if it thinks the pool has not been exported.

If you got lucky, you will already have access to copy the configs out with this.
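A minimal sketch of that, assuming you import under an altroot (e.g. /mnt) so the pool's datasets don't mount over the rescue system's own paths; rpool/ROOT/pve-1 is the usual dataset name on a stock PVE root-on-ZFS install, but check zfs list on your pool:

zpool import -o readonly=on -R /mnt rpool
zfs list -o name,mountpoint,mounted    # see what mounted where
zfs mount rpool/ROOT/pve-1             # only if the root dataset did not mount on import
ls /mnt/var/lib/pve-cluster/           # config.db should be sitting here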

If you are getting error 5, that's EIO [2]. What would be worth checking is dmesg, but I think you already did and it was rubbish, and it will be rubbish as long as you don't have a kernel with debug symbols, which Proxmox does not ship (and is not planning to). Without a debug kernel it's really useless (for me at least) to try to pinpoint what that EIO is really caused by.

So if no luck at this point: since this was a mirror, I would consider going on with only one of the drives connected when importing; in case something goes terribly wrong, you should have an identical copy on the second. A more cautious approach would be to dd [3] an image of the drive away and work on that instead, but I think you do not care for that (with respect to data loss), especially since you have backups, and it's not very practical.
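If you do go the imaging route, a minimal sketch, assuming some scratch disk with enough space is mounted at /mnt/scratch (that path is just an example):

dd if=/dev/nvme0n1 of=/mnt/scratch/nvme0n1.img bs=1M conv=noerror,sync status=progress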

I have not seen anything broken in that zdb -d -e rpool of yours, but I am not sure it was complete (it would normally show inconsistencies that you would be able to read out easily).

So assuming the readonly import alone did not help, step it up (check the man pages from [1] for what these switches do):

zpool import -F rpool

If no luck, you can press on (you might damage things, but this is why you are importing it degraded):

zpool import -FX rpool

Failing all that, my last resort would be [4]:

echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata
echo 0 > /sys/module/zfs/parameters/spa_load_verify_data
zpool import -FX rpool

Hopefully you got lucky at one of the points above.

WHAT TO GET OUT:

Well, apart from files that make a fresh install easier for you (e.g. /etc/network/interfaces), you want to get out the config (something you should have been backing up anyway):

https://forum.proxmox.com/threads/backup-cluster-config-pmxcfs-etc-pve.154569/
https://forum.proxmox.com/threads/r...ffline-extract-configurations-etc-pve.155374/
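Assuming the readonly import mounted the old root somewhere like /mnt (as sketched above), copying the useful bits out could look roughly like this (the destination directory is just an example):

mkdir -p /root/rescue
cp -a /mnt/var/lib/pve-cluster/config.db /root/rescue/   # the pmxcfs database (VM/CT configs, storage.cfg, ...)
cp -a /mnt/etc/network/interfaces /root/rescue/
cp -a /mnt/etc/hosts /mnt/etc/hostname /root/rescue/
tar czf /root/rescue/old-etc.tar.gz -C /mnt etc          # the rest of /etc, just in case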

3) Once #2 is complete above, then I suppose I would feel safe reinstalling Proxmox at that point. Any specific steps on this being different vs a fresh install w/no intent on restorations? Just want to make 100% sure that I don't somehow blow away or mess up my backup drive(s).

Install without the backup drives connected.

Eventually I should be able to restore all my VM/LXC backups?

Implant your config.db as per the links above, and happy days, as long as you set the drives up the same way as before.
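Roughly, on the fresh install (the linked threads have the details and caveats):

systemctl stop pve-cluster
cp /root/rescue/config.db /var/lib/pve-cluster/config.db   # the copy saved earlier
systemctl start pve-cluster
# keep the old hostname on the new install, otherwise the guest configs
# end up under the old node name in /etc/pve/nodes/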

Also, is there anything I can check on the old pool to find out what caused the lockup in the first place, or what caused this ZFS pool issue?

Not to my knowledge, but as mentioned, you can keep a dd copy of the drive (if you have the capacity) for further forensics. This is valuable anecdotal experience.

I appreciate any/all help. Thank you.

So I kind of dumped all of this on you, but I have to run; hopefully some of it helps. Proceed carefully and read the linked docs before each step.



[1] https://openzfs.github.io/openzfs-docs/man/master/8/zpool-import.8.html
[2] https://www.kernel.org/doc/Documentation/i2c/fault-codes
[3] https://manpages.debian.org/bookworm/coreutils/dd.1.en.html
[4] https://openzfs.github.io/openzfs-docs/Performance and Tuning/Module Parameters.html#spa-load-verify-data
 
