ZFS Suspended

raynaud

Member
Apr 29, 2022
Greetings, I have a problem. I have a disk with a ZFS pool that holds the virtual hard drives of all my VMs, and another disk that holds the backups. I normally run 3 VMs from that pool: a TrueNAS machine, a Linux server, and another Linux box that only runs APIs. Recently the pool began getting suspended, shortly after I updated Proxmox to 7.4-17. Thinking the disk might be failing, I bought a newer one with more space, but it gives me the same error. I'm not entirely sure whether it's a coincidence that the new disk is also damaged and giving the same errors. What I did was connect the new disk, create a new ZFS pool, copy all the VMs one by one to the new pool, and remove the old disk. I leave the log below; the SMART test gives me the following:
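
For reference, the per-VM copy was done more or less like this (the VM ID and storage names here are just examples, not my exact setup):
Code:
# Example only: VM 100 and the disk/storage names are placeholders.
qm move_disk 100 scsi0 MainStorage   # copy the VM's disk onto the new ZFS storage
qm config 100 | grep scsi0           # check that the disk now points at MainStorage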

LOG:
https://sharetxt.live/proxmoxLog1

SMART test:
Code:
SMART Error Log Version: 1
ATA Error Count: 5
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 occurred at disk power-on lifetime: 75 hours (3 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 10 10 0a 00 e0  Device Fault; Error: ABRT 16 sectors at LBA = 0x00000a10 = 2576

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 10 0a 00 e0 08      00:00:30.306  READ DMA
  ef 10 02 00 00 00 a0 08      00:00:30.298  SET FEATURES [Enable SATA feature]

Error 4 occurred at disk power-on lifetime: 75 hours (3 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 10 10 0a 00 e0  Device Fault; Error: ABRT 16 sectors at LBA = 0x00000a10 = 2576

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 10 0a 00 e0 08      00:00:30.106  READ DMA
  ef 10 02 00 00 00 a0 08      00:00:30.098  SET FEATURES [Enable SATA feature]

Error 3 occurred at disk power-on lifetime: 75 hours (3 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 10 10 0a 00 e0  Device Fault; Error: ABRT 16 sectors at LBA = 0x00000a10 = 2576

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 10 0a 00 e0 08      00:00:29.906  READ DMA
  ef 10 02 00 00 00 a0 08      00:00:29.898  SET FEATURES [Enable SATA feature]

Error 2 occurred at disk power-on lifetime: 75 hours (3 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 10 10 0a 00 e0  Device Fault; Error: ABRT 16 sectors at LBA = 0x00000a10 = 2576

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 10 0a 00 e0 08      00:00:29.706  READ DMA
  ef 10 02 00 00 00 a0 08      00:00:29.699  SET FEATURES [Enable SATA feature]

Error 1 occurred at disk power-on lifetime: 75 hours (3 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 10 10 0a 00 e0  Device Fault; Error: ABRT 16 sectors at LBA = 0x00000a10 = 2576

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 10 0a 00 e0 08      00:00:29.373  READ DMA
  ef 10 02 00 00 00 a0 08      00:00:29.366  SET FEATURES [Enable SATA feature]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@pve:~# zpool status
  pool: MainStorage
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
config:

        NAME                                        STATE     READ WRITE CKSUM
        MainStorage                                 ONLINE       0     0     0
          ata-WDC_WD20EFAX-68B2RN1_WD-WXA2A51HF5LN  ONLINE      33     0    40
errors: List of errors unavailable: pool I/O is currently suspended

I'm attaching some images of what is happening to me. I have already tried zpool import, and it does work, as does zpool clear; however, after starting a VM, the pool is suspended again minutes later. I am open to any support you can give me. Thank you very much and have an excellent day.
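
Roughly what I run each time to bring the pool back (it holds until a VM starts generating I/O again):
Code:
# Roughly the recovery sequence (pool name as shown above); it works until
# a VM starts doing I/O again and the pool suspends within minutes.
zpool import MainStorage     # re-import the pool if it is not showing up
zpool clear MainStorage      # clear the fault so the pool resumes I/O
zpool status -v MainStorage  # confirm the state before starting a VM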

(three screenshots attached)
 
If SMART reports errors, then it's the drive that's the problem. If SMART is fine but ZFS reports read/checksum errors, then it could be the cable, the disk controller, or memory.
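A quick way to compare the two views (the device path is just an example):
Code:
# /dev/sda is a placeholder; use the actual device backing the pool.
smartctl -H /dev/sda          # overall SMART health verdict
smartctl -A /dev/sda          # attributes: reallocated/pending sectors, CRC errors, etc.
zpool status -v MainStorage   # READ/WRITE/CKSUM counters as seen by ZFS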
Thank you very much for answering. What seems strange to me is that this happens with two different disks: one that is already more than a year old, and a recent one that hasn't even been working for three days. It could be a coincidence that the disk got damaged, but I wanted to rule out some kind of malfunction in my machine, so I changed its SATA port; the behavior is still very similar.
 
Run a long self-test (smartctl -t long /dev/sdX, wait until it is finished, and check again with smartctl -a /dev/sdX) to make sure.
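For example (replace /dev/sdX with the actual device node):
Code:
smartctl -t long /dev/sdX   # start the extended (long) self-test in the background
smartctl -c /dev/sdX        # shows the recommended polling time for the long test
# ...wait until the test has finished, then:
smartctl -a /dev/sdX        # full report including the self-test log and error log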
Drives can fail because of power issues, or maybe they are from the same problematic batch, or they have firmware problems that should be updated, or maybe they were dropped once too often, or maybe the system case vibrates too much?
It looks like the WD20EFAX is an SMR drive and those are problematic with ZFS, but in that case I would expect write errors (because they take too long to write)...
 
what seems strange to me is that this happens with two different disks,
What two disks? I only see sn WXA2A51HF5LN in your report.

Once the smartctl long test has concluded, your next steps are as follows:
1. Test is clean: run zpool clear MainStorage, and attach a second disk to the pool for the (near-certain) case that you will have another fault; ZFS on a single disk serves almost no purpose (see the sketch after this list).
2. Disk is faulty: here the response depends on how important the data is. Continued use is potentially destructive to data that is still accessible, so you need to consider your choices carefully.

If no data is important, toss the drive.
If some data is important, mount a recovery space, clear the error on the zpool, copy the important data to the recovery space, and cross your fingers.
If all data is important, attach another disk to MainStorage and let it rebuild. Hopefully it will capture most of the data before the rebuild dies; this process may take a LONG time, and there's a good chance it won't yield a working file system.
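
A rough sketch of the clear-and-attach path, assuming the pool is still the single-disk vdev shown in your zpool status (the second device path is a placeholder for whatever new drive you add):
Code:
# Placeholder: /dev/disk/by-id/ata-NEWDISK stands for the new disk.
zpool clear MainStorage
zpool attach MainStorage ata-WDC_WD20EFAX-68B2RN1_WD-WXA2A51HF5LN /dev/disk/by-id/ata-NEWDISK
zpool status -v MainStorage   # the pool becomes a mirror and starts resilvering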
 
Thank you very much for responding. This WXA2A51HF5LN disk is the replacement for the one I had before, a WD Blue 1TB I'd had for 5 years, which started to behave in the same way. What I did was create this MainStorage ZFS pool and move the virtual disks of each machine to this new disk, and after the third day it started to behave the same.
 
So, two disks having problems in the same machine in such a short time? Can you try the disks in another machine to check if it's really the disk?
 
Not yet, but that's what surprises me: that the same error repeats. I changed the disk's SATA port to rule that out, but the behavior is the same. The WD Blue still responds if I connect it (sometimes it doesn't start on the first try, but it does respond), unlike the new one. It's good advice though, thank you very much for responding.
 
