I'm having serious issues with my ZFS pool: after a power failure it won't come back online, and I'm struggling to troubleshoot it properly, so any pointers would be highly appreciated.
For a brief period after the power failure, zfs list returned all datasets, but trying to access any data made the console unresponsive. After rebooting, nothing worked anymore.
Now the first thing I see after logging into the console is a kernel panic message, something along the lines of zfs: adding existent segment to range tree.
Running ps aux | grep zpool shows a zpool import -d /dev/disk/by-id/ -o cachefile=none tank that has been running since boot and cannot be killed, even with kill -9.
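For reference, this is roughly how I've been checking where that process is stuck (the PID lookup and paths are just what I used on my box; reading /proc/<pid>/stack needs root):
Code:
# find the stuck importer and its full command line
pgrep -fa 'zpool import'
# dump its kernel stack to see where it is blocked (as root)
cat /proc/$(pgrep -f 'zpool import')/stack
# blocked-task messages also end up in the kernel log
dmesg | tail -n 50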
/etc/zfs/zpool.cache is a 0-byte file, last modified at the time of the power failure.
The 3x 2 TB HDDs all show up fine in the BIOS, return healthy SMART data, and their partitions are listed correctly by ls -la /dev/sd*.
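For completeness, this is the kind of per-drive check I ran (device names are just examples from my setup; smartctl comes from smartmontools):
Code:
# overall health verdict plus the full SMART attribute table
smartctl -H /dev/sda
smartctl -a /dev/sda
# partitions as the kernel currently sees them
lsblk -o NAME,SIZE,TYPE,FSTYPE /dev/sda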
Unfortunately zfs list hangs forever, and so does zpool status -v.
I tried unplugging one drive at a time, hoping to at least see a degraded pool, but no matter which drive is unplugged, the commands above still make the console unresponsive.
I also tried swapping the RAM and the SATA cables, but to no avail.
Hopefully I managed to summarize the issue properly, but of course feel free to ask for any other details about my environment.
Really hoping to get some pointers on the right commands to run in order to properly troubleshoot the issue.
Thanks a lot
The cache file can be recreated, and zpool status only shows imported pools (IIRC); since the automatic import is no longer working, the pool simply isn't imported yet, so maybe all is not lost.
What does zpool import -a show you?
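For example, something along these lines (adjust the device directory to whatever was used when the pool was created; the read-only import is just a cautious first attempt):
Code:
# list pools that are visible on the disks but not imported, without importing anything
zpool import -d /dev/disk/by-id/
# if tank shows up, try a read-only import first
zpool import -d /dev/disk/by-id/ -o readonly=on tank
# once it imports cleanly read-write, the cache file can be regenerated
zpool set cachefile=/etc/zfs/zpool.cache tank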
After buying an 8 TB drive and running ddrescue on the original disks, I started poking around a bit.
I realized I can easily import the pool read-only with zpool import -o readonly=on tank, and it shows 1 disk as DEGRADED.
If I instead import it with zpool import -o readonly=on -F tank, the pool comes up fine with all 3 drives showing no errors.
When I try to import it without the readonly flag, though, I get a kernel panic and any subsequent zpool command hangs indefinitely. dmesg outputs the following
I think drive failure can be ruled out at this point, but I'm still suspicious of the motherboard/SATA controller and the RAM.
The latter has been tested extensively, and in the end I got a 100% stable system with 6 GB installed.
I've been reading a lot on the subject, and dedup=on is not recommended with so little RAM, but I didn't know that when I set up the datasets. In any case, after investigating this potential misconfiguration further, I can confirm that the DDT currently takes up only about 300 MB, so it should not be the issue either.
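For reference, this is roughly how I estimated the DDT size while the pool was imported read-only (zdb may need -e if the pool is exported, since my cache file is empty):
Code:
# dedup table summary (entries, on-disk and in-core size)
zpool status -D tank
# more detailed DDT histogram
zdb -DD tank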
Do I have any options to restore the pool?
At this point, with my data safely backed up, I just find it quite frustrating to be looking at a healthy pool that I cannot import read-write.
To me it sounds like the pool's metadata is not healthy, but ZFS does not detect this (until it crashes). Can you zpool scrub a read-only pool, and does it find no errors? Since you can read the files, you can at least copy them to another pool (better not to use send/receive), e.g. with something like the rsync sketch below. Maybe someone more knowledgeable/experienced than me can weigh in?
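Something along these lines, assuming the read-only pool is mounted under /mnt/tank and the new pool under /mnt/backup (both paths are just examples):
Code:
# plain file-level copy preserving hard links, ACLs and xattrs
rsync -aHAX --info=progress2 /mnt/tank/ /mnt/backup/tank/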
Unfortunately scrub cannot be run on a readonly pool, so it's not an option.
Running zdb -e -d tank seems to report no problems:
Dataset mos [META], ID 0, cr_txg 4, 1.91G, 1139 objects
Dataset tank/data@202104010100 [ZPL], ID 85, cr_txg 3036110, 24.6G, 178007 objects
...
Dataset tank/data [ZPL], ID 597, cr_txg 2208197, 25.1G, 179747 objects
...
Dataset tank [ZPL], ID 54, cr_txg 1, 192K, 14 objects
Verified large_blocks feature refcount of 0 is correct
Verified large_dnode feature refcount of 0 is correct
Verified sha512 feature refcount of 0 is correct
Verified skein feature refcount of 0 is correct
Verified edonr feature refcount of 0 is correct
Verified userobj_accounting feature refcount of 141 is correct
Verified encryption feature refcount of 0 is correct
Verified project_quota feature refcount of 141 is correct
Verified device_removal feature refcount of 0 is correct
Verified indirect_refcount feature refcount of 0 is correct
while zdb -e -bcsvL tank is turning out to be quite a long task...
Traversing all blocks to verify checksums ...
65.5M completed ( 5MB/s) estimated time remaining: 135hr 13min 07sec
In truth, as soon as I got it imported read-only, I rushed to zfs send dataset > /path/to/backup/dataset.zfs and successfully exported about 1.5 TB of data that way. After that I also ran a couple of hours of rsync operations, and the pool/drives showed no issues at all.
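In case it helps anyone, the commands looked roughly like this (the snapshot name is taken from the zdb listing above; the target paths are just placeholders):
Code:
# stream an existing snapshot to a file on the new 8 TB drive
zfs send tank/data@202104010100 > /path/to/backup/tank-data.zfs
# file-level copy of the read-only mounted datasets as a second safety net
rsync -aHAX --info=progress2 /mnt/tank/ /path/to/backup/files/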
After a lot more poking around, I managed to narrow the problem down a bit more:
Code:
$ zdb -e -bc pool1
Traversing all blocks to verify checksums and verify nothing leaked ...
loading concrete vdev 0, metaslab 12 of 349 ...error: zfs: removing nonexistent segment from range tree (offset=31e59c4000 size=2000)
Aborted
Hopefully someone with more knowledge will be able to weigh in; from what I've been reading about metaslabs and space maps, they should not be anything crucial for the pool. At this point I'm just hoping there's some command that can trigger a rebuild of sorts?
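For what it's worth, one thing I keep seeing mentioned for this exact class of panic is the zfs_recover module parameter, which is supposed to turn these fatal range tree errors into warnings; I can't vouch for it and would only ever try it against the ddrescue copies, but roughly:
Code:
# OpenZFS on Linux: let the import continue past otherwise-fatal recoverable errors
echo 1 > /sys/module/zfs/parameters/zfs_recover
# optionally skip ZIL replay during the import as well
echo 1 > /sys/module/zfs/parameters/zil_replay_disable
# then attempt the read-write import and watch the kernel log for warnings
zpool import -d /dev/disk/by-id/ tank
dmesg | tail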