ZFS invalid checksum 0

kip

New Member
May 20, 2021
Greetings!

I have a server running Proxmox VE 6.4-6 on Debian 10 (buster) with all my data and VMs on a ZFS pool, "primary-storage". Yesterday my server froze up, and after about 30 minutes of no response I forced it off by holding the power button. Since then, I have been unable to access the pool.
  • When the server boots up to the login screen, I get a PANIC: primary-storage: blkptr at 000000006bf40adb has invalid CHECKSUM 0 message, followed by periodic INFO: task zpool:961 blocked for more than X seconds. messages.
  • The web interface fails to load the ZFS page for that node.
  • ps aux | grep zpool returns two processes which I cannot kill (both stuck in the D state; see the check below this list):
    root       961  0.0  0.0 568016  4688 ?  D  13:26  0:00 zpool import -d /dev/disk/by-id/ -o cachefile=none primary-storage
    root      1161  0.0  0.0  10640  4216 ?  D  13:26  0:00 zpool list -o name -H primary-storage
  • zpool import returns no pools available to import.
  • zpool list hangs seemingly forever.
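For completeness, here is how the stuck state shows up (a sketch of the checks I mean; PIDs and timings will differ):

Code:
# kernel log shows the PANIC plus the hung-task warnings
dmesg | grep -iE 'panic|blkptr|blocked'
# STAT "D" = uninterruptible sleep, which is why kill -9 does nothing
ps -o pid,stat,wchan:24,cmd -C zpool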
I don't have much experience working with ZFS directly; I mostly just set it up in the web UI and left it alone. I would appreciate any assistance getting my system back up and running (or at least pulling the data out, since I don't have a recent backup). Thanks!
 
Was that pool using some kind of RAID? If it's just a single disk, ZFS can't repair corrupted data. That only works if you use a mirror or raidz.
 
I have two identical drives mirrored. Are there some commands I need to run or could it be as simple as booting with only one drive plugged in? I somehow hadn't considered even trying that.
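Is it something like this? I'm just guessing from zpool-import(8) and haven't run it; the scratch directory is made up, and I'm assuming the data sits on the -part1 partition since ZFS was given the whole disks:

Code:
# expose only one half of the mirror to the importer
mkdir /tmp/one-disk
ln -s /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3LE8LLS-part1 /tmp/one-disk/
zpool import -d /tmp/one-disk -f primary-storage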
 
I was able to get more information about the zpool by removing my OS drive and installing Proxmox on another drive:

Code:
# zpool import
   pool: primary-storage
     id: 14124500962267091041
  state: ONLINE
status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
 config:
        primary-storage                               ONLINE
          mirror-0                                    ONLINE
            ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3LE8LLS  ONLINE
            ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N0SPT18Y  ONLINE

Still no progress actually importing it.
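For reference, this is the import attempt that still hangs and then panics (with -f, as the ZFS-8000-EY message suggests):

Code:
zpool import -d /dev/disk/by-id/ -f primary-storage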
 
My experience with similar errors is that this is not repairable, even with a mirror/RAID1; otherwise ZFS would have repaired it already (by reading from the non-faulty drive). Somehow either the metadata or its checksum got written incorrectly. Maybe there is a way to roll back to an earlier, still-correct version of the metadata (and lose some recent data)? Maybe zpool import -F, or even -F -X, or --rewind-to-checkpoint could help here? Please read the manual first, because -X is labeled with "WARNING: This option can be extremely hazardous to the health of your pool and should only be used as a last resort". Maybe someone more experienced can give advice on this?
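If I read zpool-import(8) correctly, the escalation path would be roughly the following; the -n dry run is worth trying first because it is non-destructive:

Code:
zpool import -f -F -n primary-storage   # dry run: only reports whether discarding recent transactions would make the pool importable
zpool import -f -F primary-storage      # actually discard the last few transactions and import
zpool import -f -F -X primary-storage   # extreme rewind, the "last resort" from the warning above
zpool import -f --rewind-to-checkpoint primary-storage   # only possible if a checkpoint exists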
 
If you don't want to risk losing data and you have a pair of spare drives, you should approach it the way data recovery is always done:
Do a block-level copy from the damaged drives to empty new drives and only work with the copies, so the original drives stay untouched and you can try again if something goes wrong.
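Something along these lines, assuming the spare shows up under /dev/disk/by-id (the of= target here is a placeholder; triple-check both sides against ls -l /dev/disk/by-id before running):

Code:
# block-level clone of one mirror member to a spare
# conv=noerror,sync continues past read errors; status=progress shows throughput
dd if=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3LE8LLS \
   of=/dev/disk/by-id/ata-YOUR-SPARE-DISK \
   bs=1M conv=noerror,sync status=progress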
 
I found a couple of drives I could free up for cloning to play it safe. The first dd clone is currently estimated at 1 day 20 hours... I'll report back later. Thanks for the suggestions so far!
 
Both zpool import -F -f primary-storage and zpool import -F -X -f primary-storage resulted in the same checksum panic I started with. zpool import --rewind-to-checkpoint -f primary-storage stated that there were no checkpoints (I didn't create any and I guess Proxmox didn't either).

Just to confirm - once there is a panic, there is no sense in waiting any longer, right? I gave each of the import panics about five minutes before rebooting to try another command.