[SOLVED] Second ZFS pool failed to import on boot

Hi,

might be the same issue as mentioned here: https://forum.proxmox.com/threads/128738/post-588785
Do you also see the "Device not ready; aborting initialisation, CSTS=0x0" message?
Hi, it could very well be, but to be honest I do not know if I saw that specific message, and I cannot reproduce it anymore to verify, as I am back at 7.4 for the time being. I thought I'd mention my finding in this thread to maybe help someone else who is searching too. It is not a Proxmox issue, it's a kernel issue. Kind regards, Bas
 
I think there is a known bug in the Linux kernel that affects Lexar NM790 (4 TB) SSDs:
https://bugzilla.kernel.org/show_bug.cgi?id=217863

For now, one would probably have to patch and recompile the kernel.

Otherwise, the bug was reported to be fixed in kernel 6.5.5.
The commit: https://git.kernel.org/pub/scm/linu.../?id=6cc834ba62998c65c42d0c63499bdd35067151ec
is CC-ing stable@vger.kernel.org, so there is a good chance it might come in via the Ubuntu tree for 6.2. Also, the next Proxmox VE point release in Q4 2023 is planned to use kernel 6.5, and there should be testing versions of a 6.5 kernel package released ahead of that (likely in the coming weeks).
 
I'm a Proxmox newbie and relatively new to really using Linux as well. I did my first real fresh install of PVE a couple of days ago, and while looking around I noticed that I seemed to have the exact same issue as the original poster here (@mstefan) described: my second ZFS pool was failing to import, yet it was still showing as ONLINE!

The effect was a failed status on zfs-import@POOLNAME.service, which in turn made my overall system state (shown by running "systemctl status" with no service name) display as "degraded".
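For anyone retracing this: the degraded state and the failing unit can be confirmed as below. The systemctl/journalctl commands need a live systemd host, so they are shown as comments; POOLNAME stands in for the real pool name.

```shell
# Live-system triage (commented out since it needs systemd + ZFS):
#   systemctl is-system-running                    # prints "degraded" when any unit has failed
#   systemctl --failed                             # lists the failed units, e.g. zfs-import@POOLNAME.service
#   journalctl -b -u zfs-import@POOLNAME.service   # this boot's log for the failing unit
#
# zfs-import@.service is a systemd template unit; the pool name after "@"
# is the instance name, so each pool gets its own unit:
pool="POOLNAME"
unit="zfs-import@${pool}.service"
echo "$unit"
```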

I think I have a couple of clues as to what is going on. After reading a number of threads, articles, and documentation pages, I reached the conclusion that since the second pool is already in the cache file, and there is another service named "zfs-import-cache.service", the "zfs-import@POOLNAME.service" unit is redundant and should not exist. zfs-import-cache.service already imports every pool listed in the cache file, so when zfs-import@POOLNAME.service tries to import the second pool again, it fails because the pool has already been imported. Notice how there is no zfs-import@rpool.service for the root pool.
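A sketch of how to see the overlap between the two import paths on a live host. The zpool/systemctl commands are stock ZFS/systemd tooling; the error wording at the end is my recollection of zpool's usual message, not a log from this thread.

```shell
# Where each import path gets its pool list (live-system commands in comments):
#   zpool get cachefile POOLNAME              # "-" means the default cache file below
#   systemctl cat zfs-import-cache.service    # one unit importing *all* cached pools
#   systemctl cat zfs-import@POOLNAME.service # per-pool unit importing the same pool again
#
# The second import attempt is what fails; zpool reports something like:
#   cannot import 'POOLNAME': a pool with that name already exists
cachefile="/etc/zfs/zpool.cache"
echo "default cache file: $cachefile"
```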

This would explain why removing the second pool from the cache file, placing it in another cache file, or removing the symlink all seem to resolve the issue. In my case, I tested this with "systemctl disable zfs-import@POOLNAME.service" and a reboot. This also worked.
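The two workarounds above, written out as commands. This is a sketch only: both need root on the affected host, and POOLNAME is a placeholder for the real pool name.

```shell
pool="POOLNAME"   # placeholder; substitute the actual pool name
# Option 1 (tested in this post): disable the redundant per-pool unit.
#   systemctl disable "zfs-import@${pool}.service"
#   reboot
# Option 2 (also reported to work): keep the unit, but take the pool out of
# the shared cache file so only one import path remains:
#   zpool set cachefile=none "$pool"
echo "would disable: zfs-import@${pool}.service"
```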

Finally, I looked around the Web GUI and retraced my steps in setting up this pool. I had first gone to pve -> disks -> zfs -> create: zfs to create the pool, but then I also went to datacenter -> storage -> add -> zfs and added my new pool in there as well. I need to read the actual correct procedure for this to see if I did something wrong but I'm out of time at the moment. For now I removed my pool from the storage list and this removed the zfs-import@POOLNAME.service from the list of services.
 
Hi @axes, "datacenter -> storage -> add -> zfs" is probably meant to add a ZFS pool that is not part of an existing node, but rather an "independent" storage that can be used by any node. I did not add anything under "datacenter -> storage"; my ZFS pool is completely inside the "pve" node. However, it is still listed under "datacenter -> storage".
 
Hi @adams13, that's odd. If I hadn't added my second pool in "datacenter -> storage", I don't think it would have shown up in there. I need to read more of the documentation and articles on configuring storage in PVE. I still think that zfs-import@POOLNAME.service is redundant, though, if the pool is already in the cache file (as I believe it is by default). The pool only needs to be imported by one path: either via the cache file (by the cache import service) or by its own per-pool service, but not both.
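One way to check that premise, i.e. whether the pool really is recorded in the cache file, is to dump the cached pool configs with zdb (part of the standard ZFS tools; the path below is the stock cache file location):

```shell
# Read-only check of which pools the default cache file records:
#   zdb -C -U /etc/zfs/zpool.cache | grep "name:"
# A pool listed there that ALSO has its own zfs-import@<pool>.service gets two
# import attempts at boot; the second attempt is the failure seen in this thread.
expected_cache="/etc/zfs/zpool.cache"
echo "checking cache file: $expected_cache"
```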
 
Yes, the 4 TB Lexar NM790 does work with the new Proxmox 8.1 and its 6.5.11 kernel!
:)
 
