Unable To Mount ZFS disk

PMVE-User (Jan 12, 2023)
A working drive no longer mounts, automatically or otherwise, when Proxmox boots up.

The drive is visible using `lsblk -f`, but not visible in the ZFS drives section. The drive was working 100% yesterday. I did a reboot of the PVE server and realized that the drive is no longer mounting, as the VM keeps failing to start.

How do I go about fixing this issue? Rebooting has not solved anything.

I have attached screenshots showing the issue with the ZFS pool "MV-Storage-3TB".

pm-error-imgs-0.png

pm-error-imgs-1.png

pm-error-imgs-2.png

pm-error-imgs-3.png

Any help on this issue is much appreciated.

Thank you
 
What's the output of `zpool status` and `zpool import`?
Code:
root@node1:~# zpool import
no pools available to import
root@node1:~# zpool status
  pool: Mirror-6TB-1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 29.2M in 00:00:07 with 0 errors on Sat Dec 17 18:01:01 2022
config:

        NAME                                          STATE     READ WRITE CKSUM
        Mirror-6TB-1                                  DEGRADED     0     0     0
          mirror-0                                    DEGRADED     0     0     0
            ata-WDC_WD30EZRX-00_WD-  DEGRADED     0     0     0  too many errors
            ata-WDC_WD30EZRX-00_WD-  ONLINE       0     0     0

errors: No known data errors
root@node1:~#

I will run `zpool clear` once I am able to get "MV-Storage-3TB" mounted.
 
Then there is the question of why your pool is degraded with "too many errors" when your disks aren't showing any errors.
And a `zpool clear` won't fix anything; you would just be ignoring the problem.

I would run a scrub to see if there are really no problems.
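For reference, starting and monitoring a scrub looks like this (a sketch using the pool name from the output above; the pause/stop flags exist on recent OpenZFS releases):

```shell
# Start a scrub on the degraded pool (it runs in the background)
zpool scrub Mirror-6TB-1

# Check progress; the "scan:" line shows percent done and an ETA
zpool status Mirror-6TB-1

# If needed: pause (-p) or cancel (-s) the running scrub
zpool scrub -p Mirror-6TB-1
zpool scrub -s Mirror-6TB-1
```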
 
Then there is the question of why your pool is degraded with "too many errors" when your disks aren't showing any errors.
And a `zpool clear` won't fix anything; you would just be ignoring the problem.
That's what I was thinking.

I would run a scrub to see if there are really no problems

What do you mean by "run a scrub"?
 
So I figured out you meant `zpool scrub`. Would that also fix or show errors on the "MV-Storage-3TB" drive, given that it does not show up in `zpool import` or `zpool status`?
 
Trying to run a scrub on MV-Storage-3TB results in the following:
Bash:
root@node1:~# zpool scrub MV-Storage-3TB
cannot open 'MV-Storage-3TB': no such pool

Current status of the Mirror-6TB-1 scrub:
Bash:
root@node1:~# zpool status
  pool: Mirror-6TB-1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Jan 13 00:51:20 2023
        128G scanned at 343M/s, 33.2G issued at 89.0M/s, 918G total
        0B repaired, 3.62% done, 02:49:43 to go
config:

        NAME                                          STATE     READ WRITE CKSUM
        Mirror-6TB-1                                  DEGRADED     0     0     0
          mirror-0                                    DEGRADED     0     0     0
            ata-WDC_WD30E_WD-  DEGRADED     0     0     0  too many errors
            ata-WDC_WD30EZ_WD-  ONLINE       0     0     0

errors: No known data errors
 
So I figured out you meant `zpool scrub`. Would that also fix or show errors on the "MV-Storage-3TB" drive, given that it does not show up in `zpool import` or `zpool status`?
It will read those entire disks and recalculate the checksums. If it finds a checksum error (= corrupted data/metadata), it will fix it using the healthy copy on the mirrored disk.
 
It will read those entire disks and recalculate the checksums. If it finds a checksum error (= corrupted data/metadata), it will fix it using the healthy copy on the mirrored disk.

That makes sense. The scrub on Mirror-6TB-1 should complete in the next 2 h 30 min.

I am still completely stumped about what to do with the 'MV-Storage-3TB' drive, which shows up as a disk but not in the ZFS tab, `zpool list`, or `zpool import`.

I am researching this issue but have not found anything similar yet.

This one comes close: https://forum.proxmox.com/threads/s...viously-working-zfs-disk-image-for-vm.108584/
 
That means nothing. You could even destroy that pool and it would still be listed in the PVE web UI and in storage.cfg.
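One way to see the mismatch is to compare what PVE has configured against what ZFS has actually imported (a sketch; storage and pool names taken from this thread):

```shell
# Storages PVE knows about -- MV-Storage-3TB will still be listed here
# even if the pool is gone, because it comes from storage.cfg
cat /etc/pve/storage.cfg
pvesm status

# Pools ZFS has actually imported -- MV-Storage-3TB is absent here
zpool list

# Devices/partitions the kernel sees, with filesystem labels
lsblk -f
```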
Interesting. I still cannot figure out why the drive shows up in `lsblk`, the "Storage" section, and the "Disks" section, but I'm not able to import the pool using `zpool`.

Mind-boggling things are occurring on my instance.
 
/dev/sdf is not even recognized as a "Hard Disk" in PVE anymore; instead it is "unknown"...

The full output, in code tags, of each of the following might be helpful:
  • qm config 105
  • fdisk -l /dev/sdf
  • smartctl -a /dev/sdf

Did you already run a short (`smartctl -t short /dev/sdf`) and/or long (`smartctl -t long /dev/sdf`) SMART test [1]?

[1] https://www.thomas-krenn.com/en/wiki/SMART_tests_with_smartctl
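In other words, something along these lines (the tests run inside the drive itself, so the host can keep using the disk in the meantime):

```shell
# Kick off a short self-test (~2 min on this drive per its polling time)
smartctl -t short /dev/sdf

# ...or the extended test (~492 min on this drive)
smartctl -t long /dev/sdf

# Afterwards, read back the self-test log
smartctl -l selftest /dev/sdf
```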

Correct, I only picked up on it recently after double-checking the "Storage" section, which then led me to check `/dev/disk/by-uuid/`; turns out it's there.

pm-error-imgs-5.png

The following `qm config 105` output shows the correct VM config with the correct drives.
Bash:
agent: 1
balloon: 2048
bios: ovmf
boot: order=sata0;net0;sata1;sata2;sata3
cores: 4
cpu: kvm64,flags=+aes
efidisk0: local-lvm:vm-105-disk-0,efitype=4m,size=4M
memory: 4096
meta: creation-qemu=6.2.0,ctime=1652466304
name: OpenMediaVault
net0: e1000=MAC_ADDR,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
sata0: local-lvm:vm-105-disk-1,size=34408M,ssd=1
sata1: MV-Storage-3TB:vm-105-disk-0,size=2600G,ssd=1
sata2: Storage-1:vm-105-disk-0,size=300G,ssd=1
sata3: Storage-1:vm-105-disk-1,size=100G,ssd=1
smbios1: uuid=UUID-XXXXX
sockets: 1
startup: order=1
vmgenid: VMGENID-XXXXXX

Double-checked the hardware as well; removed the drive and changed the cables.

The `fdisk -l /dev/sdf` results show that the drive is plugged in:
Code:
Disk /dev/sdf: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk model: WDC WD30EZ...
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: BFC8CA42-BCD7-2148-.....

Device          Start        End    Sectors  Size Type
/dev/sdf1        2048 5860515839 5860513792  2.7T Solaris /usr & Apple ZFS
/dev/sdf9  5860515840 5860532223      16384    8M Solaris reserved 1

The result of `smartctl -a /dev/sdf`:

Code:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Green
Device Model:     WDC XXXXXX
Serial Number:    WD-XXXXXX
LU WWN Device Id: XXXXX
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (51180) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 492) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   157   140   021    Pre-fail  Always       -       9125
  4 Start_Stop_Count        0x0032   078   078   000    Old_age   Always       -       22401
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   037   037   000    Old_age   Always       -       46291
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   092   092   000    Old_age   Always       -       8668
192 Power-Off_Retract_Count 0x0032   197   197   000    Old_age   Always       -       2988
193 Load_Cycle_Count        0x0032   039   039   000    Old_age   Always       -       484308
194 Temperature_Celsius     0x0022   116   091   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Not too sure what to make of this, as everything points, in my opinion, to a working drive, but it's not working.
 
I am currently running `smartctl --test=long /dev/sdf` to see what the results would be.

It might show some errors that give me a better idea of what could be causing this issue.
 
After running the tests, here are the results.

Code:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  25) The self-test routine was aborted by
                                        the host.
Total time to complete Offline
data collection:                (51180) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 492) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       14563
  3 Spin_Up_Time            0x0027   143   140   021    Pre-fail  Always       -       9841
  4 Start_Stop_Count        0x0032   078   078   000    Old_age   Always       -       22403
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   037   037   000    Old_age   Always       -       46303
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   092   092   000    Old_age   Always       -       8670
192 Power-Off_Retract_Count 0x0032   197   197   000    Old_age   Always       -       2990
193 Load_Cycle_Count        0x0032   039   039   000    Old_age   Always       -       484519
194 Temperature_Celsius     0x0022   108   091   000    Old_age   Always       -       44
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               90%     46303         -
# 2  Short offline       Aborted by host               80%     46303         -
# 3  Short offline       Aborted by host               90%     46303         -
# 4  Short offline       Completed without error       00%     46303         -
# 5  Extended offline    Completed: read failure       90%     46292         454199856

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
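As a side note, the `LBA_of_first_error` from the failed extended test above can be turned into concrete offsets with shell arithmetic (512-byte logical sectors and a partition start of 2048, both per the earlier fdisk output):

```shell
# Failing LBA from the extended self-test log; logical sectors are 512 B
lba=454199856

# Absolute byte offset of the failing sector on /dev/sdf
echo $((lba * 512))    # -> 232550326272

# Sector offset inside /dev/sdf1 (the ZFS partition starts at sector 2048)
echo $((lba - 2048))   # -> 454197808
```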

This is my first time working with an apparently faulty drive. To my limited but increasing knowledge of drives, I have come to the conclusion that the drive is on its last legs and it's probably time for a replacement.

The only issue with replacing the drive: the data on "MV-Storage-3TB" is 100% not backed up, as I was only planning to get new drives later this month to set up a RAID. Bad luck, I guess.

Has anyone repaired bad blocks, and is it possible to repair the drive? All I need is access to the drive to get all the data off it.

I am currently running the following to check where exactly the bad blocks are.
Code:
root@node1:~# badblocks -v /dev/sdf > /tmp/wd_bad_blocks.txt
Checking blocks 0 to 2930266583
Checking for bad blocks (read-only test):

This cannot be the end of the line; I'm almost certain that there's a way to fix the bad blocks, even if it requires rewriting them.
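For what it's worth, a pending sector is usually remapped by the drive only when it is overwritten, which destroys whatever data was in that sector. The dd invocation pattern is sketched here against a scratch file; on the real disk you would point it at /dev/sdf with seek set to the failing LBA, and only as a very last resort:

```shell
# Practice the exact dd shape on a scratch file, NOT the real disk
truncate -s 1M /tmp/sector-demo.img

# Overwrite exactly one 512-byte "sector" at sector offset 100, in place.
# conv=notrunc is essential: without it dd would truncate the target.
dd if=/dev/zero of=/tmp/sector-demo.img bs=512 count=1 seek=100 conv=notrunc status=none

# On real hardware the firmware remaps the sector on write; afterwards
# Current_Pending_Sector should drop in `smartctl -A /dev/sdf`.
```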

Code:
root@node1:/tmp# tune2fs -l /dev/sdf | grep Block
tune2fs: Bad magic number in super-block while trying to open /dev/sdf
Found a gpt partition table in /dev/sdf
root@node1:/tmp# tune2fs -l /dev/sdf1 
tune2fs 1.46.2 (28-Feb-2021)
tune2fs: Bad magic number in super-block while trying to open /dev/sdf1
/dev/sdf1 contains a zfs_member file system labelled 'MV-Storage-3TB'

`tune2fs` sees the drive, but does it also show where the bad block might be?
`tune2fs: Bad magic number in super-block while trying to open /dev/sdf1`
 
Is there perhaps anyone who has dealt with an issue like this and repaired a drive?

If so, could you link me to any useful and informative websites or videos, please?

If you could assist, that would be much appreciated.
 
Solved.

Since it was a ZFS pool, I studied all the recovery methods for zpools (OpenZFS).

So, if this happens to anyone else: the solution is extremely simple, although it took me hours of research, and a re-attempt a year later, to find it.

This is how you find out whether you can get access to the drive again:

Code:
zdb -e YourPoolName

For me it took 10 hours to run, as my drive is 3 TB.

Once that has completed, the end of the output should look something like this:

Bash:
ZFS_DBGMSG(zdb) START:
spa.c:6110:spa_import(): spa_import: importing MV-Storage-3TB
spa_misc.c:418:spa_load_note(): spa_load(MV-Storage-3TB, config trusted): LOADING
vdev.c:160:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/ata-WDC_WD30EZRX-00MMMB0_WD-WCAWZ2611164-part1': best uberblock found for spa MV-Storage-3TB. txg 2756441
spa_misc.c:418:spa_load_note(): spa_load(MV-Storage-3TB, config untrusted): using uberblock with txg=2756441
spa.c:8392:spa_async_request(): spa=MV-Storage-3TB async request task=2048
spa_misc.c:418:spa_load_note(): spa_load(MV-Storage-3TB, config trusted): LOADED
spa.c:8392:spa_async_request(): spa=MV-Storage-3TB async request task=32
spa.c:8392:spa_async_request(): spa=MV-Storage-3TB async request task=4
spa.c:8392:spa_async_request(): spa=MV-Storage-3TB async request task=4
ZFS_DBGMSG(zdb) END

Notice that we have a viable uberblock to use: txg 2756441.
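If you want to pull that txg number out mechanically, something like this works (demonstrated on the "best uberblock found" line from the log above; against a live pool you would pipe in `zdb -e MV-Storage-3TB 2>&1` instead):

```shell
# The zdb debug line that names the best uberblock for the pool
line="vdev.c:160:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/ata-WDC_WD30EZRX-00MMMB0_WD-WCAWZ2611164-part1': best uberblock found for spa MV-Storage-3TB. txg 2756441"

# Extract just the transaction-group number
echo "$line" | grep -o 'txg [0-9]*' | awk '{print $2}'   # -> 2756441
```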

Now we can do the following:

Code:
zpool import -o readonly=on -f -T 2756441 -F MV-Storage-3TB

That could take another 10 hours or so for a 3 TB drive.

Once that ran, the drive was mounted and I could read from it. The issue is that after a reboot it would not re-import itself. To re-import it normally, all you have to do is:

Code:
zpool import -f -T 2756441 -F MV-Storage-3TB

Let that run, then reboot. Once rebooted, you can simply import the pool as normal:

Code:
zpool import -f MV-Storage-3TB

That's it :)
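Condensed into one place, the procedure above looks like this (a sketch of the steps as I understand them; the explicit export before the read-write import is my assumption about the cleanest sequence):

```shell
# 1. Scan the offline pool for a viable uberblock (took ~10 h on this
#    3 TB disk); note the "txg NNNN" it reports at the end
zdb -e MV-Storage-3TB

# 2. Import read-only, rewound to that txg, and copy the data somewhere safe
zpool import -o readonly=on -f -T 2756441 -F MV-Storage-3TB

# 3. Export, re-import read-write at the same txg, then reboot
zpool export MV-Storage-3TB
zpool import -f -T 2756441 -F MV-Storage-3TB

# 4. After the reboot, a plain import works again
zpool import -f MV-Storage-3TB
```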
 