ZFS (4 x WD40EFRX) issues with HPE Gen10 Microserver / Marvell RAID

mat.ec

Active Member
Hi Forum,
I'm actually pretty new to ZFS, so I'm a bit confused about the behavior of my recently set up HPE server.

After roughly 10 years my Synology NAS refused to boot. So I took an old HPE Microserver Gen10 and added an SSD for the OS plus my four WD40 drives from the NAS.

One drive has about 13k power-on hours, the other three have 60k power-on hours. Even though the Synology was complaining about errors on one of the disks before it died, I was able to install Proxmox and create a ZFS pool (raidz2) without any issues.

The onboard Marvell RAID is not used - at least no virtual disk has been created. The server is running an updated BIOS in UEFI mode (ZA10A380).

I then restored my data backup onto the ZFS pool (partly as storage for a VM, partly as storage for containers).

However, after a while SMART tools and (!) ZFS complained about faulty disks (disks 3 and 4) and ZFS degraded the pool, but after a reboot everything was fine again. Now, after four more days (the server has just been running idle), the pool is degraded again (all four disks), yet the SMART status is passed for all of them.

After another reboot disk 1 (the youngest one) is marked as faulted, disks 2 & 4 are degraded, and disk 3 is online.

Am I just facing issues because of the age of my disks, so that adding four new drives would solve this, or is there anything to consider when using ZFS on an HPE Microserver Gen10?

Any guidance / help would be highly appreciated.

best,



Mat
 
Any advice will probably depend on the kind of errors you are seeing. It could be the drives, the cables, the controller, the system memory, temperature, power, etc.
ZFS read, checksum and write errors indicate different issues. What SMART issues are you seeing? What are the results of a long SMART self-test? What errors do you see in the system logs (error messages copied verbatim will be the most helpful)?
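
For reference, something along these lines should gather that information (just a sketch; the pool name and the by-id device path are placeholders you need to replace with your own):

Code:
# kernel messages about ATA / I/O / ZFS errors since the last boot
journalctl -k -b | grep -Ei 'ata|zio|i/o error'

# full SMART report (attributes, error log, self-test log) for one drive
smartctl -a /dev/disk/by-id/ata-YOURDRIVE

# pool state with per-device read/write/checksum error counters
zpool status -v yourpool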
 
Thanks for your swift reply. To avoid attaching the entire syslog, here are some extracts from journald:

Code:
Feb 12 10:23:39 pve kernel: zio pool=bdk vdev=/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5XJL1V0-part1 error=5 type=1 offset=5750882304 size=65536 flags=1605809
Feb 12 10:23:39 pve kernel: I/O error, dev sdd, sector 11234240 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 2
Feb 12 10:23:39 pve kernel: sd 3:0:0:0: [sdd] tag#7 CDB: Read(16) 88 00 00 00 00 00 00 ab 6b c0 00 00 00 80 00 00
Feb 12 10:23:39 pve kernel: sd 3:0:0:0: [sdd] tag#7 Add. Sense: Unaligned write command
Feb 12 10:23:39 pve kernel: sd 3:0:0:0: [sdd] tag#7 Sense Key : Illegal Request [current]
Feb 12 10:23:39 pve kernel: sd 3:0:0:0: [sdd] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=38s
Feb 12 10:23:39 pve kernel: ata4.00: configured for UDMA/33
Feb 12 10:23:39 pve kernel: ata4.00: error: { ABRT }
Feb 12 10:23:39 pve kernel: ata4.00: status: { DRDY ERR }
Feb 12 10:23:39 pve kernel: ata4.00: cmd c8/00:80:c0:6b:ab/00:00:00:00:00/e0 tag 7 dma 65536 in
                                     res 51/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
Feb 12 10:23:39 pve kernel: ata4.00: failed command: READ DMA
Feb 12 10:23:39 pve kernel: ata4.00: irq_stat 0x40000001
Feb 12 10:23:39 pve kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 12 10:23:34 pve kernel: ata4: EH complete
Feb 12 10:23:34 pve kernel: ata4.00: configured for UDMA/33
Feb 12 10:23:34 pve kernel: ata4.00: error: { ABRT }
Feb 12 10:23:34 pve kernel: ata4.00: status: { DRDY ERR }
Feb 12 10:23:34 pve kernel: ata4.00: cmd c8/00:00:50:6d:ab/00:00:00:00:00/e0 tag 13 dma 131072 in
                                     res 51/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
Feb 12 10:23:34 pve kernel: ata4.00: failed command: READ DMA
Feb 12 10:23:34 pve kernel: ata4.00: irq_stat 0x40000001
Feb 12 10:23:34 pve kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 12 10:23:31 pve kernel: ata4: EH complete
Feb 12 10:23:31 pve kernel: ata4.00: configured for UDMA/33
Feb 12 10:23:31 pve kernel: ata4.00: error: { ABRT }
Feb 12 10:23:31 pve kernel: ata4.00: status: { DRDY ERR }
Feb 12 10:23:31 pve kernel: ata4.00: cmd c8/00:80:c0:6b:ab/00:00:00:00:00/e0 tag 9 dma 65536 in
                                     res 51/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
Feb 12 10:23:31 pve kernel: ata4.00: failed command: READ DMA
Feb 12 10:23:31 pve kernel: ata4.00: irq_stat 0x40000001
Feb 12 10:23:31 pve kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 12 10:23:27 pve kernel: ata4: EH complete
Feb 12 10:23:27 pve kernel: ata4.00: configured for UDMA/33
Feb 12 10:23:27 pve kernel: ata4.00: error: { ABRT }
Feb 12 10:23:27 pve kernel: ata4.00: status: { DRDY ERR }
Feb 12 10:23:27 pve kernel: ata4.00: cmd c8/00:00:50:6d:ab/00:00:00:00:00/e0 tag 24 dma 131072 in
                                     res 51/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)


Sample SMART error-log output for one drive (the "youngest"):

Code:
Error 125 occurred at disk power-on lifetime: 11947 hours (497 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08      00:21:07.347  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08      00:21:07.344  IDENTIFY DEVICE
  c8 00 08 00 00 00 e0 08      00:21:06.801  READ DMA
  ec 00 00 00 00 00 a0 08      00:21:06.734  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:21:06.734  SET FEATURES [Set transfer mode]

Error 124 occurred at disk power-on lifetime: 11947 hours (497 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 08 00 00 00 e0  Device Fault; Error: ABRT 8 sectors at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 00 00 00 e0 08      00:21:06.801  READ DMA
  ec 00 00 00 00 00 a0 08      00:21:06.734  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:21:06.734  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08      00:21:06.709  IDENTIFY DEVICE
  c8 00 08 00 00 00 e0 08      00:21:06.196  READ DMA



Code:
root@pve:~# zpool status -v
  pool: bdk
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Feb 12 08:35:47 2024
        4.22T / 8.91T scanned at 450M/s, 2.26T / 8.91T issued at 240M/s
        11.2M repaired, 25.33% done, 08:04:04 to go
config:

        NAME                                          STATE     READ WRITE CKSUM
        bdk                                           DEGRADED     0     0     0
          raidz2-0                                    DEGRADED 1.13K     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2JTL54E  FAULTED    115     0    58  too many errors
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E6ZF8F8K  FAULTED     16     0    54  too many errors
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5XJL1V0  DEGRADED 1.31K     0   175  too many errors
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E3JH08V4  DEGRADED     1     0   165  too many errors

errors: Permanent errors have been detected in the following files:

        /bdk/subvol-501-disk-0/movies/Movie/R/...
        /bdk/subvol-501-disk-0/movies/Movie/R/...
        /bdk/subvol-501-disk-0/movies/Movie/S/...
 
You appear to be getting data-transfer errors from the drives, and ZFS also reports read and checksum errors. This points to a faulty drive, bad cables, or a faulty SATA controller. I'm not sure whether the drive itself would notice a faulty cable or controller, but the data coming from the drives is corrupt.
Run a long SMART self-test on each drive (smartctl -t long /dev/disk/by-id/ata-... and check the result after several hours with smartctl -a /dev/disk/by-id/ata-...). If that reports problems, then it's definitely (also) the drive, and you might want to contact the seller or WD for a replacement if it's still under warranty. If it comes back clean, then it's (only) a problem between the drive and the CPU (cable, controller, memory).
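
If it helps, here is a rough sketch for starting the long test on all four drives and reading the results later (assuming they all appear under /dev/disk/by-id with the WD40EFRX model prefix from your zpool status; the self-test runs in the background on the drive itself):

Code:
# start an extended (long) self-test on each whole-disk device
for d in /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-*; do
    case "$d" in *-part*) continue;; esac   # skip partition symlinks
    smartctl -t long "$d"
done

# after several hours, check the self-test log of each drive
for d in /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-*; do
    case "$d" in *-part*) continue;; esac
    echo "=== $d ==="
    smartctl -l selftest "$d"
done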

Off-topic: please note that a stripe of mirrors would give better performance and more usable space than raidz2 (but less redundancy: the pool is lost if two drives of the same mirror fail).
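
For illustration only, a four-drive stripe of mirrors would be created roughly like this (a sketch; tank and the device names are placeholders, and it obviously requires empty drives, not your current pool members):

Code:
# two 2-way mirrors striped together (RAID10-style layout)
zpool create tank \
  mirror /dev/disk/by-id/ata-DRIVE1 /dev/disk/by-id/ata-DRIVE2 \
  mirror /dev/disk/by-id/ata-DRIVE3 /dev/disk/by-id/ata-DRIVE4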
 
