[SOLVED] Can't recover data from a pool

darylnhan

New Member
Dec 24, 2023
Hello,
I have a pool called ZFSDATACENTER composed of 3 disks of 16TB each, plus a 900GB cache disk (/dev/sdf). It contained the VMs of my Proxmox 8. Since a crash yesterday, the zfs-mount service loops after zfs-import@ZFSDATACENTER.service, and I have to import the pool manually. However, I can't mount it on the /ZFSDATACENTER directory and access the data.
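Roughly, what I try by hand looks like this (from memory, so the exact mount command may differ; the mount step is the part that fails):

# zpool import ZFSDATACENTER
# zfs mount ZFSDATACENTER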

This is the output:

# zpool status -v ZFSDATACENTER
pool: ZFSDATACENTER
state: ONLINE
scan: scrub in progress since Sun Dec 24 14:38:08 2023
20.9T / 26.1T scanned at 9.44G/s, 968G / 26.1T issued at 438M/s
0B repaired, 3.62% done, 16:45:12 to go
config:

NAME                                    STATE     READ WRITE CKSUM
ZFSDATACENTER                           ONLINE       0     0     0
  raidz1-0                              ONLINE       0     0     0
    ata-ST16000NE000-2WX103_ZR505NC0    ONLINE       0     0     0
    ata-ST16000NE000-2WX103_ZR50R98Q    ONLINE       0     0     0
    ata-ST16000NE000-2WX103_ZR700ASE    ONLINE       0     0     0
cache
  sdf                                   ONLINE       0     0     0

errors: No known data errors


Thanks for your help.
 
# zpool list -v ZFSDATACENTER
NAME                                   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
ZFSDATACENTER                         43.7T  26.1T  17.5T        -         -    43%    59%  1.00x  ONLINE  -
  raidz1-0                            43.7T  26.1T  17.5T        -         -    43%  59.9%      -  ONLINE
    ata-ST16000NE000-2WX103_ZR505NC0  14.6T      -      -        -         -      -      -      -  ONLINE
    ata-ST16000NE000-2WX103_ZR50R98Q  14.6T      -      -        -         -      -      -      -  ONLINE
    ata-ST16000NE000-2WX103_ZR700ASE  14.6T      -      -        -         -      -      -      -  ONLINE
cache                                     -      -      -        -         -      -      -      -       -
  sdf                                  838G   837G   962M        -         -     0%  99.9%      -  ONLINE
 
Your outputs are hard to read without CODE-tags. Maybe you could share the actual error message you get when you do a zpool import ZFSDATACENTER? Maybe try importing/mounting it somewhere else (temporarily)?
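Something like this, for example (just a sketch; /mnt/recovery is an arbitrary example path, and readonly=on avoids writing anything to a pool that is in doubt):

Bash:
zpool export ZFSDATACENTER                                   # in case a half-finished import is still lingering
zpool import -o readonly=on -R /mnt/recovery ZFSDATACENTER   # read-only import under a temporary altroot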
 
It loops! Yesterday I tried to export and import it... I saw some ugly things in the logs:

Code:
prox zpool[2101]: cannot import 'ZFSDATACENTER': no such pool available
 prox systemd[1]: zfs-import@ZFSDATACENTER.service: Main process exited, code=exited, status=1/FAILURE
 prox systemd[1]: zfs-import@ZFSDATACENTER.service: Failed with result 'exit-code'.
 prox systemd[1]: Failed to start zfs-import@ZFSDATACENTER.service - Import ZFS pool ZFSDATACENTER.
 prox systemd[1]: Finished zfs-import-cache.service - Import ZFS pools by cache file.
 prox systemd[1]: Reached target zfs-import.target - ZFS pool import target.
 prox systemd[1]: Starting zfs-mount.service - Mount ZFS filesystems...
 prox systemd[1]: Starting zfs-volume-wait.service - Wait for ZFS Volume (zvol) links in /dev...
 prox zvol_wait[2569]: Testing 69 zvol links
 prox zvol_wait[2569]: All zvol links are now present.
 prox systemd[1]: Finished zfs-volume-wait.service - Wait for ZFS Volume (zvol) links in /dev.
 prox systemd[1]: Reached target zfs-volumes.target - ZFS volumes are ready.
 prox kernel: INFO: task txg_sync:2461 blocked for more than 120 seconds.
 prox kernel:       Tainted: P           O       6.5.11-7-pve #1
 prox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 prox kernel: task:txg_sync        state:D stack:0     pid:2461  ppid:2      flags:0x00004000
 prox kernel: Call Trace:
 prox kernel:  <TASK>
 prox kernel:  __schedule+0x3fd/0x1450
 prox kernel:  schedule+0x63/0x110
 prox kernel:  schedule_timeout+0x95/0x170
 prox kernel:  ? __pfx_process_timeout+0x10/0x10
 prox kernel:  io_schedule_timeout+0x51/0x80
 prox kernel:  __cv_timedwait_common+0x140/0x180 [spl]
 prox kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
 prox kernel:  __cv_timedwait_io+0x19/0x30 [spl]
 prox kernel:  zio_wait+0x13a/0x2c0 [zfs]
 prox kernel:  ? bplist_iterate+0xe7/0x110 [zfs]
 prox kernel:  spa_sync+0x5c9/0x1030 [zfs]
 prox kernel:  ? spa_txg_history_init_io+0x120/0x130 [zfs]
 prox kernel:  txg_sync_thread+0x1fd/0x390 [zfs]
 prox kernel:  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
 prox kernel:  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
 prox kernel:  thread_generic_wrapper+0x5c/0x70 [spl]
 prox kernel:  kthread+0xef/0x120
 prox kernel:  ? __pfx_kthread+0x10/0x10
 prox kernel:  ret_from_fork+0x44/0x70
 prox kernel:  ? __pfx_kthread+0x10/0x10
 
Code:
# mount |grep zfs
rpool/ROOT/pve-1 on / type zfs (rw,relatime,xattr,noacl,casesensitive)
rpool on /rpool type zfs (rw,noatime,xattr,noacl,casesensitive)
rpool/ROOT on /rpool/ROOT type zfs (rw,noatime,xattr,noacl,casesensitive)
rpool/data on /rpool/data type zfs (rw,noatime,xattr,noacl,casesensitive)

-----------------------------------------------------
[16:13:22] root@prox:/dev/disk
-----------------------------------------------------
# zpool import  ZFSDATACENTER

It seems to be hung!
 
Code:
Dec 24 16:16:43 prox kernel: INFO: task zpool:18821 blocked for more than 120 seconds.
Dec 24 16:16:43 prox kernel:       Tainted: P           O       6.5.11-7-pve #1
Dec 24 16:16:43 prox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 24 16:16:43 prox kernel: task:zpool           state:D stack:0     pid:18821 ppid:10196  flags:0x00004002
Dec 24 16:16:43 prox kernel: Call Trace:
Dec 24 16:16:43 prox kernel:  <TASK>
Dec 24 16:16:43 prox kernel:  __schedule+0x3fd/0x1450
Dec 24 16:16:43 prox kernel:  schedule+0x63/0x110
Dec 24 16:16:43 prox kernel:  io_schedule+0x46/0x80
Dec 24 16:16:43 prox kernel:  cv_wait_common+0xac/0x140 [spl]
Dec 24 16:16:43 prox kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
Dec 24 16:16:43 prox kernel:  __cv_wait_io+0x18/0x30 [spl]
Dec 24 16:16:43 prox kernel:  txg_wait_synced_impl+0xe1/0x130 [zfs]
Dec 24 16:16:43 prox kernel:  txg_wait_synced+0x10/0x60 [zfs]
Dec 24 16:16:43 prox kernel:  zil_replay+0xa1/0x150 [zfs]
Dec 24 16:16:43 prox kernel:  zfsvfs_setup+0x244/0x280 [zfs]
Dec 24 16:16:43 prox kernel:  zfs_domount+0x4b2/0x610 [zfs]
Dec 24 16:16:43 prox kernel:  ? dbuf_rele_and_unlock+0x3a8/0x570 [zfs]
Dec 24 16:16:43 prox kernel:  zpl_mount+0x286/0x300 [zfs]
Dec 24 16:16:43 prox kernel:  legacy_get_tree+0x28/0x60
Dec 24 16:16:43 prox kernel:  vfs_get_tree+0x27/0xe0
Dec 24 16:16:43 prox kernel:  path_mount+0x4e3/0xb20
Dec 24 16:16:43 prox kernel:  ? putname+0x5b/0x80
Dec 24 16:16:43 prox kernel:  __x64_sys_mount+0x127/0x160
Dec 24 16:16:43 prox kernel:  do_syscall_64+0x58/0x90
Dec 24 16:16:43 prox kernel:  ? irqentry_exit_to_user_mode+0x17/0x20
Dec 24 16:16:43 prox kernel:  ? irqentry_exit+0x43/0x50
Dec 24 16:16:43 prox kernel:  ? exc_page_fault+0x94/0x1b0
Dec 24 16:16:43 prox kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Dec 24 16:16:43 prox kernel: RIP: 0033:0x7f94f6282b7a
Dec 24 16:16:43 prox kernel: RSP: 002b:00007f94f4812748 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
Dec 24 16:16:43 prox kernel: RAX: ffffffffffffffda RBX: 00007f94f4815ae0 RCX: 00007f94f6282b7a
Dec 24 16:16:43 prox kernel: RDX: 00007f94f6559151 RSI: 00007f94f4816b40 RDI: 000055bf4b731780
Dec 24 16:16:43 prox kernel: RBP: 00007f94f4814810 R08: 00007f94f48137c0 R09: 0000000000000073
Dec 24 16:16:43 prox kernel: R10: 0000000000200000 R11: 0000000000000246 R12: 000055bf4b731770
Dec 24 16:16:43 prox kernel: R13: 00007f94f4816b40 R14: 000055bf4b731780 R15: 00007f94f48137c0
Dec 24 16:16:43 prox kernel:  </TASK>
Dec 24 16:16:43 prox kernel: INFO: task txg_sync:18696 blocked for more than 120 seconds.
Dec 24 16:16:43 prox kernel:       Tainted: P           O       6.5.11-7-pve #1
Dec 24 16:16:43 prox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 24 16:16:43 prox kernel: task:txg_sync        state:D stack:0     pid:18696 ppid:2      flags:0x00004000
Dec 24 16:16:43 prox kernel: Call Trace:
Dec 24 16:16:43 prox kernel:  <TASK>
Dec 24 16:16:43 prox kernel:  __schedule+0x3fd/0x1450
Dec 24 16:16:43 prox kernel:  schedule+0x63/0x110
Dec 24 16:16:43 prox kernel:  schedule_timeout+0x95/0x170
Dec 24 16:16:43 prox kernel:  ? __pfx_process_timeout+0x10/0x10
Dec 24 16:16:43 prox kernel:  io_schedule_timeout+0x51/0x80
Dec 24 16:16:43 prox kernel:  __cv_timedwait_common+0x140/0x180 [spl]
Dec 24 16:16:43 prox kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
Dec 24 16:16:43 prox kernel:  __cv_timedwait_io+0x19/0x30 [spl]
Dec 24 16:16:43 prox kernel:  zio_wait+0x13a/0x2c0 [zfs]
Dec 24 16:16:43 prox kernel:  ? bplist_iterate+0xe7/0x110 [zfs]
Dec 24 16:16:43 prox kernel:  spa_sync+0x5c9/0x1030 [zfs]
Dec 24 16:16:43 prox kernel:  ? spa_txg_history_init_io+0x120/0x130 [zfs]
Dec 24 16:16:43 prox kernel:  txg_sync_thread+0x1fd/0x390 [zfs]
Dec 24 16:16:43 prox kernel:  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
Dec 24 16:16:43 prox kernel:  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Dec 24 16:16:43 prox kernel:  thread_generic_wrapper+0x5c/0x70 [spl]
Dec 24 16:16:43 prox kernel:  kthread+0xef/0x120
Dec 24 16:16:43 prox kernel:  ? __pfx_kthread+0x10/0x10
Dec 24 16:16:43 prox kernel:  ret_from_fork+0x44/0x70
Dec 24 16:16:43 prox kernel:  ? __pfx_kthread+0x10/0x10
Dec 24 16:16:43 prox kernel:  ret_from_fork_asm+0x1b/0x30
Dec 24 16:16:43 prox kernel:  </TASK>
 
Bash:
# systemctl cat zfs-import@ZFSDATACENTER
# /lib/systemd/system/zfs-import@.service
[Unit]
Description=Import ZFS pool %i
Documentation=man:zpool(8)
DefaultDependencies=no
After=systemd-udev-settle.service
After=cryptsetup.target
After=multipathd.target
Before=zfs-import.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/zpool import -N -d /dev/disk/by-id -o cachefile=none %I

[Install]
WantedBy=zfs-import.target
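If I read the unit correctly, the -N flag only imports the pool without mounting any datasets; the mounting itself is done later by zfs-mount.service, which matches where the traces above hang (zil_replay during the mount). To check whether the pool is imported but simply not mounted, something like this should be enough:

Bash:
zpool list ZFSDATACENTER                   # is the pool imported at all?
zfs get mounted,mountpoint ZFSDATACENTER   # is the root dataset mounted, and where should it go?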
 
# zpool import ZFSDATACENTER

It seems to be hung!
Yes, the drives seem to be very slow, which matches the logs you showed. Maybe don't import it for now, check the SMART status of your drives using smartctl -a, and run a long test with smartctl -t long on each of them. Or wait for the scrub that you were running to finish.
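For example, something like this (rough sketch; the extended self-test runs in the background and can take many hours on 16TB drives):

Bash:
for d in /dev/disk/by-id/ata-ST16000NE000-2WX103_ZR505NC0 \
         /dev/disk/by-id/ata-ST16000NE000-2WX103_ZR50R98Q \
         /dev/disk/by-id/ata-ST16000NE000-2WX103_ZR700ASE; do
    smartctl -a "$d"        # current health, attributes and error log
    smartctl -t long "$d"   # start an extended (long) self-test
done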
But if I run
Bash:
systemctl start zfs-import@ZFSDATACENTER.service

the pool can be imported.
No, I don't think so, because it does not show up as mounted:
Code:
# mount |grep zfs
rpool/ROOT/pve-1 on / type zfs (rw,relatime,xattr,noacl,casesensitive)
rpool on /rpool type zfs (rw,noatime,xattr,noacl,casesensitive)
rpool/ROOT on /rpool/ROOT type zfs (rw,noatime,xattr,noacl,casesensitive)
rpool/data on /rpool/data type zfs (rw,noatime,xattr,noacl,casesensitive)
Must I repair something?
I guess one or more of the drives is having problems, or the drive controller or connections are bad. Check/test them with SMART (and wait a long time).
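When the long self-tests are done, something like this shows the results and the scrub progress:

Bash:
smartctl -l selftest /dev/disk/by-id/ata-ST16000NE000-2WX103_ZR505NC0   # self-test log (repeat for each disk)
zpool status ZFSDATACENTER                                              # scrub progress and any errors found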
 
The smartd daemon doesn't reveal anything, but I agree with you. I will check/test my hard disks. Thank you and Merry Christmas. I hope to recover my precious data.
 
My pool ZFSDATACENTER was recovered after many hours:

"zpool scrub ZFSDATACENTER " did its job.

Code:
# df
Filesystem       Type      Size  Used Avail Use% Mounted on
rpool/ROOT/pve-1 zfs       5.3T  797G  4.5T  15% /
efivarfs         efivarfs  304K   94K  206K  32% /sys/firmware/efi/efivars
rpool            zfs       4.5T  256K  4.5T   1% /rpool
rpool/ROOT       zfs       4.5T  128K  4.5T   1% /rpool/ROOT
rpool/data       zfs       4.5T  128K  4.5T   1% /rpool/data
/dev/fuse        fuse      128M   24K  128M   1% /etc/pve
ZFSDATACENTER    zfs        29T   16T   14T  55% /ZFSDATACENTER
total            -          48T   17T   31T  36% -
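For completeness, once the scrub has finished, its result can be checked with:

Bash:
zpool status -v ZFSDATACENTER   # the 'scan:' line should report the completed scrub and any errors found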
 