zfs troubleshooting

nxet

Member
Sep 13, 2020
I'm having serious issues with my ZFS pool: after a power failure there is no way to bring it back online, and I'm struggling to troubleshoot the problem properly, so any pointers would be highly appreciated.

For a brief period after the power failure, zfs list returned all datasets, but trying to access any data made the console unresponsive. After rebooting, nothing worked anymore.
Now the first thing I see after logging into the console is a kernel panic message, something along the lines of zfs: adding existent segment to range tree.
Running ps aux | grep zpool shows a zpool import -d /dev/disk/by-id/ -o cachefile=none tank process that has been running since boot and is impossible to kill, even with kill -9.
/etc/zfs/zpool.cache is a 0-byte file whose last modification time matches the power failure.
The 3x 2 TB HDDs all show up fine in the BIOS, return healthy SMART data, and their partitions are listed correctly by ls -la /dev/sd*.
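
For completeness, the disk-health checks I ran were nothing fancy, roughly along these lines (device names are just examples):
Code:
# overall SMART health verdict and attribute table for each member disk
smartctl -H /dev/sda
smartctl -A /dev/sda
# confirm the partitions and the by-id links are still present
ls -la /dev/disk/by-id/ | grep -i ata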

Unfortunately zfs list hangs forever, and so does zpool status -v.
I tried unplugging one drive at a time, hoping to at least see a degraded pool, but regardless of which drive was unplugged the commands above still make the console unresponsive.
I also tried swapping the RAM and the SATA cables, but that was to no avail either.

Hopefully I managed to properly summarize the issue here, but of course feel free to ask for any other detail about my environment.
Really hoping to get some pointers on the right commands to run in order to properly troubleshoot the issue.
Thanks a lot
 
The cache file can be recreated, and zpool status only shows pools that are already imported (IIRC), which is exactly the step that no longer happens automatically, so maybe all is not lost.
What does zpool import -a show you?
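
Once you get it imported again, the cache file can be regenerated with something like this (pool name taken from your post):
Code:
# re-create /etc/zfs/zpool.cache for an already-imported pool
zpool set cachefile=/etc/zfs/zpool.cache tank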
 
The cache file can be recreated, and zpool status only shows pools that are already imported (IIRC), which is exactly the step that no longer happens automatically, so maybe all is not lost.
What does zpool import -a show you?
no pools available to import
 
After buying an 8 TB drive and ddrescue'ing the disks, I started poking around a bit.
I realized I can easily import the pool read-only with zpool import -o readonly=on tank, and it will show 1 disk as DEGRADED.
If I instead import it with zpool import -o readonly=on -F tank, the pool mounts fine with all 3 drives showing no errors.
When I try to import it without the readonly flag, though, I get a kernel panic and any subsequent zpool command hangs indefinitely. dmesg outputs the following:

[ 665.496320] PANIC: zfs: adding existent segment to range tree (offset=31e59c4000 size=2000)
[ 665.496329] Showing stack for process 5498
[ 665.496333] CPU: 3 PID: 5498 Comm: dmu_objset_find Tainted: P O 5.4.106-1-pve #1
[ 665.496334] Hardware name: System manufacturer System Product Name/M3A78-CM, BIOS 2801 08/23/2010
[ 665.496335] Call Trace:
[ 665.496346] dump_stack+0x6d/0x8b
[ 665.496361] spl_dumpstack+0x29/0x2b [spl]
[ 665.496369] vcmn_err.cold.1+0x60/0x94 [spl]
[ 665.496548] ? arc_read+0x4fb/0x1210 [zfs]
[ 665.496605] ? bmov+0x17/0x20 [zfs]
[ 665.496665] ? zfs_btree_insert_leaf_impl.isra.9+0x4c/0x60 [zfs]
[ 665.496721] ? zfs_btree_insert_into_leaf+0x1c4/0x230 [zfs]
[ 665.496793] zfs_panic_recover+0x6f/0x90 [zfs]
[ 665.496863] range_tree_add_impl+0x2b5/0x1030 [zfs]
[ 665.496925] ? dbuf_read+0x1eb/0x510 [zfs]
[ 665.496995] range_tree_add+0x11/0x20 [zfs]
[ 665.497062] space_map_load_callback+0x59/0x90 [zfs]
[ 665.497134] space_map_iterate+0x1df/0x3e0 [zfs]
[ 665.497143] ? __switch_to_asm+0x40/0x70
[ 665.497210] ? spa_stats_destroy+0x1c0/0x1c0 [zfs]
[ 665.497282] space_map_load_length+0x64/0xe0 [zfs]
[ 665.497351] metaslab_load+0x15f/0x830 [zfs]
[ 665.497359] ? del_timer_sync+0x29/0x40
[ 665.497362] ? __wake_up_common_lock+0x8c/0xc0
[ 665.497427] metaslab_activate+0x4c/0x230 [zfs]
[ 665.497497] metaslab_claim_concrete+0x8b/0x390 [zfs]
[ 665.497566] metaslab_claim_impl+0x8b/0xa0 [zfs]
[ 665.497637] ? spa_config_enter+0xba/0x110 [zfs]
[ 665.497702] metaslab_claim+0xe0/0x230 [zfs]
[ 665.497775] zio_dva_claim+0x27/0x40 [zfs]
[ 665.497843] zio_wait+0xe1/0x280 [zfs]
[ 665.497917] zil_claim_log_block+0x87/0xa0 [zfs]
[ 665.497990] zil_parse+0x217/0x8c0 [zfs]
[ 665.498063] ? zil_free_lwb.isra.16+0xb0/0xb0 [zfs]
[ 665.498137] ? zil_claim_log_block+0xa0/0xa0 [zfs]
[ 665.498212] zil_check_log_chain+0x104/0x1c0 [zfs]
[ 665.498276] dmu_objset_find_dp_impl+0x145/0x3f0 [zfs]
[ 665.498340] dmu_objset_find_dp_cb+0x2a/0x40 [zfs]
[ 665.498351] taskq_thread+0x2f7/0x4e0 [spl]
[ 665.498356] ? wake_up_q+0x80/0x80
[ 665.498359] kthread+0x120/0x140
[ 665.498365] ? task_done+0xb0/0xb0 [spl]
[ 665.498366] ? kthread_park+0x90/0x90
[ 665.498368] ret_from_fork+0x22/0x40

and then
[ 846.981941] INFO: task zpool:5369 blocked for more than 120 seconds.
[ 846.981949] Tainted: P O 5.4.106-1-pve #1
[ 846.981952] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 846.981955] zpool D 0 5369 2535 0x80004002
[ 846.981958] Call Trace:
[ 846.981967] __schedule+0x2e6/0x700
[ 846.981970] schedule+0x33/0xa0
[ 846.981982] taskq_wait+0x80/0xd0 [spl]
[ 846.981985] ? wait_woken+0x80/0x80
[ 846.982119] dmu_objset_find_dp+0x127/0x230 [zfs]
[ 846.982188] ? zil_claim+0x290/0x290 [zfs]
[ 846.982260] spa_load+0xe3a/0x13e0 [zfs]
[ 846.982327] ? __dprintf+0xe5/0x160 [zfs]
[ 846.982398] spa_load_best+0x57/0x2d0 [zfs]
[ 846.982465] spa_import+0x1ea/0x810 [zfs]
[ 846.982481] ? nvpair_value_common.part.13+0x14d/0x170 [znvpair]
[ 846.982550] zfs_ioc_pool_import+0x12d/0x150 [zfs]
[ 846.982623] zfsdev_ioctl_common+0x5b2/0x820 [zfs]
[ 846.982631] ? __kmalloc_node+0x267/0x330
[ 846.982633] ? lru_cache_add_active_or_unevictable+0x39/0xb0
[ 846.982701] zfsdev_ioctl+0x54/0xe0 [zfs]
[ 846.982708] do_vfs_ioctl+0xa9/0x640
[ 846.982712] ? handle_mm_fault+0xc9/0x1f0
[ 846.982713] ksys_ioctl+0x67/0x90
[ 846.982715] __x64_sys_ioctl+0x1a/0x20
[ 846.982718] do_syscall_64+0x57/0x190
[ 846.982721] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 846.982723] RIP: 0033:0x7fbcb9565427
[ 846.982729] Code: Bad RIP value.
[ 846.982730] RSP: 002b:00007fff151be678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 846.982732] RAX: ffffffffffffffda RBX: 00007fff151be6f0 RCX: 00007fbcb9565427
[ 846.982733] RDX: 00007fff151be6f0 RSI: 0000000000005a02 RDI: 0000000000000003
[ 846.982734] RBP: 00007fff151c25e0 R08: 000056268a443130 R09: 00007fbcb9631440
[ 846.982735] R10: 000056268a416010 R11: 0000000000000246 R12: 000056268a417530
[ 846.982735] R13: 000056268a41a890 R14: 0000000000000000 R15: 0000000000000000
[ 846.982753] INFO: task dmu_objset_find:5498 blocked for more than 120 seconds.
[ 846.982756] Tainted: P O 5.4.106-1-pve #1
[ 846.982759] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 846.982761] dmu_objset_find D 0 5498 2 0x80004000
[ 846.982763] Call Trace:
[ 846.982765] __schedule+0x2e6/0x700
[ 846.982768] schedule+0x33/0xa0
[ 846.982774] vcmn_err.cold.1+0x92/0x94 [spl]
[ 846.982833] ? bmov+0x17/0x20 [zfs]
[ 846.982894] ? zfs_btree_insert_leaf_impl.isra.9+0x4c/0x60 [zfs]
[ 846.982949] ? zfs_btree_insert_into_leaf+0x1c4/0x230 [zfs]
[ 846.983022] zfs_panic_recover+0x6f/0x90 [zfs]
[ 846.983092] range_tree_add_impl+0x2b5/0x1030 [zfs]
[ 846.983149] ? dbuf_read+0x1eb/0x510 [zfs]
[ 846.983220] range_tree_add+0x11/0x20 [zfs]
[ 846.983291] space_map_load_callback+0x59/0x90 [zfs]
[ 846.983363] space_map_iterate+0x1df/0x3e0 [zfs]
[ 846.983370] ? __switch_to_asm+0x40/0x70
[ 846.983442] ? spa_stats_destroy+0x1c0/0x1c0 [zfs]
[ 846.983510] space_map_load_length+0x64/0xe0 [zfs]
[ 846.983580] metaslab_load+0x15f/0x830 [zfs]
[ 846.983587] ? del_timer_sync+0x29/0x40
[ 846.983589] ? __wake_up_common_lock+0x8c/0xc0
[ 846.983654] metaslab_activate+0x4c/0x230 [zfs]
[ 846.983724] metaslab_claim_concrete+0x8b/0x390 [zfs]
[ 846.983793] metaslab_claim_impl+0x8b/0xa0 [zfs]
[ 846.983865] ? spa_config_enter+0xba/0x110 [zfs]
[ 846.983930] metaslab_claim+0xe0/0x230 [zfs]
[ 846.983998] zio_dva_claim+0x27/0x40 [zfs]
[ 846.984071] zio_wait+0xe1/0x280 [zfs]
[ 846.984145] zil_claim_log_block+0x87/0xa0 [zfs]
[ 846.984219] zil_parse+0x217/0x8c0 [zfs]
[ 846.984292] ? zil_free_lwb.isra.16+0xb0/0xb0 [zfs]
[ 846.984366] ? zil_claim_log_block+0xa0/0xa0 [zfs]
[ 846.984441] zil_check_log_chain+0x104/0x1c0 [zfs]
[ 846.984500] dmu_objset_find_dp_impl+0x145/0x3f0 [zfs]
[ 846.984564] dmu_objset_find_dp_cb+0x2a/0x40 [zfs]
[ 846.984575] taskq_thread+0x2f7/0x4e0 [spl]
[ 846.984581] ? wake_up_q+0x80/0x80
[ 846.984584] kthread+0x120/0x140
[ 846.984591] ? task_done+0xb0/0xb0 [spl]
[ 846.984593] ? kthread_park+0x90/0x90
[ 846.984595] ret_from_fork+0x22/0x40

I think that drive failure can be ruled out at this point, but I'm still skeptical about the mainboard/SATA controller and RAM.

The latter has been tested extensively, and in the end I managed to get a 100% stable system with 6 GB installed.
I've been reading a lot on the subject, and it is recommended to avoid dedup=on with so little RAM, but I didn't know that when I set up the datasets. In any case, after further investigating this potential misconfiguration, I can confirm that the DDT currently takes up only about 300 MB, so it should not be the issue either.
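
For reference, the DDT size estimate comes from the dedup statistics; since the pool isn't imported here, I read them with something along these lines:
Code:
# print dedup table (DDT) statistics for the exported/unimported pool
zdb -e -DD tank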

Do I have any options to restore the pool?
At this point, with my data safely backed up, I just find it quite frustrating to see a healthy-looking yet unmountable pool.
 
To me, it sounds like the metadata of the pool is not healthy, but ZFS does not detect this (until it crashes). Can you run zpool scrub on a readonly pool, and if so, does it come back clean? Since you can read the files, you can at least copy them (better not to use send/receive) to another pool. Maybe someone more knowledgeable/experienced than me can say something about it?
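
By copying I mean a plain file-level copy, e.g. something along these lines (paths are only placeholders):
Code:
# file-level copy from the readonly pool to a new pool, preserving hard links, ACLs and xattrs
rsync -aHAX --progress /tank/data/ /newpool/data/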
 
Unfortunately scrub cannot be run on a readonly pool, so it's not an option.

Running zdb -e -d tank seems to report no problems

Dataset mos [META], ID 0, cr_txg 4, 1.91G, 1139 objects
Dataset tank/data@202104010100 [ZPL], ID 85, cr_txg 3036110, 24.6G, 178007 objects
...
Dataset tank/data [ZPL], ID 597, cr_txg 2208197, 25.1G, 179747 objects
...
Dataset tank [ZPL], ID 54, cr_txg 1, 192K, 14 objects
Verified large_blocks feature refcount of 0 is correct
Verified large_dnode feature refcount of 0 is correct
Verified sha512 feature refcount of 0 is correct
Verified skein feature refcount of 0 is correct
Verified edonr feature refcount of 0 is correct
Verified userobj_accounting feature refcount of 141 is correct
Verified encryption feature refcount of 0 is correct
Verified project_quota feature refcount of 141 is correct
Verified device_removal feature refcount of 0 is correct
Verified indirect_refcount feature refcount of 0 is correct

while zdb -e -bcsvL tank happens to be quite a long task...
Traversing all blocks to verify checksums ...
65.5M completed ( 5MB/s) estimated time remaining: 135hr 13min 07sec


In truth, as soon as I got it mounted read-only I rushed to zfs send dataset > /path/to/backup/dataset.zfs and successfully exported about 1.5 TB of data with that method. After that I also ran a couple of hours of rsync operations, and the pool/drives showed no issues at all.
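
(In case it helps anyone else: those stream files can later be restored with zfs receive, roughly like this; names are just examples.)
Code:
# replay a saved send stream into a dataset on the new pool
zfs receive newpool/data < /path/to/backup/dataset.zfs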
 
After a lot more poking around, I managed to narrow the problem a bit more:
Code:
$ zdb -e -bc pool1
Traversing all blocks to verify checksums and verify nothing leaked ...
loading concrete vdev 0, metaslab 12 of 349 ...error: zfs: removing nonexistent segment from range tree (offset=31e59c4000 size=2000)
Aborted

Hopefully someone with more knowledge will be able to weigh in. From what I've been reading about metaslabs and space maps, they should not contain anything crucial for the pool, so at this point I'm just hoping there's some command which can trigger a rebuild of sorts?
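
The only lead I've found so far, and which I haven't dared to try yet, is the zfs_recover module parameter: judging by the stack trace the panic comes from zfs_panic_recover(), which should be downgraded to a warning when that tunable is set. A rough, untested sketch of what the attempt would look like (on the ddrescue copies, not the originals):
Code:
# untested idea: turn zfs_panic_recover() panics into warnings, then retry the import
echo 1 > /sys/module/zfs/parameters/zfs_recover
zpool import -d /dev/disk/by-id/ -o cachefile=none tank
# if the import succeeds, scrub to check the pool for errors
zpool scrub tank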
 
