ZFS RAID issue with new SSD disks

Inglebard

Hi,

I just installed Proxmox on a new server with new disks.

Since the beginning, I have noticed that the ZFS RAID breaks after some time.

Here is an example:

Code:
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:01:43 with 0 errors on Sun Jun 12 00:25:44 2022
config:

        NAME                                                     STATE     READ WRITE CKSUM
        rpool                                                    DEGRADED     0     0     0
          mirror-0                                               DEGRADED     0     0     0
            ata-SAMSUNG_MZ7L3480HCHQ-00A07_S664NE0RC02318-part3  FAULTED      0    10     0  too many errors
            ata-SAMSUNG_MZ7L3480HCHQ-00A07_S664NE0RC02441-part3  ONLINE       0     6     0
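
For anyone reading along, here is a minimal sketch of how the config table above could be checked for per-device error counters. The function name `devices_with_errors` and the regular expression are illustrative assumptions, not part of any ZFS tooling, and the parser only handles plain integer counters as shown in this output.

```python
import re

# Config table copied from the zpool status output above.
STATUS = """\
        NAME                                                     STATE     READ WRITE CKSUM
        rpool                                                    DEGRADED     0     0     0
          mirror-0                                               DEGRADED     0     0     0
            ata-SAMSUNG_MZ7L3480HCHQ-00A07_S664NE0RC02318-part3  FAULTED      0    10     0  too many errors
            ata-SAMSUNG_MZ7L3480HCHQ-00A07_S664NE0RC02441-part3  ONLINE       0     6     0
"""

# name, state, then three integer counters (READ, WRITE, CKSUM).
# The header row does not match because READ/WRITE/CKSUM are not digits.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def devices_with_errors(text):
    """Return {name: (state, read, write, cksum)} for rows with a non-zero counter."""
    found = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, state = match.group(1), match.group(2)
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        if read or write or cksum:
            found[name] = (state, read, write, cksum)
    return found


for name, counters in devices_with_errors(STATUS).items():
    print(name, counters)
```

Both mirror members are reported, not only the FAULTED one: the second disk is still ONLINE but already has 6 write errors.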

The RAID always breaks because of write errors.
Both disks are SSDs. I have already built a ZFS RAID with HDDs without any issue.

Is there anything I can do?


proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-6
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
Hi,
did you replace the drive or did you recreate the ZFS RAID to fix this? Can you show your SMART values for the "faulty" drive?
 
No, I didn't replace anything because everything is new.

I can see this in dmesg, and it doesn't look good:

Code:
[76319.167456] ata1.00: exception Emask 0x0 SAct 0x2820 SErr 0x0 action 0x0
[76319.167493] ata1.00: irq_stat 0x40000001
[76319.167503] ata1.00: failed command: SEND FPDMA QUEUED
[76319.167514] ata1.00: cmd 64/01:28:00:00:00/00:00:00:00:00/a0 tag 5 ncq dma 512 out
                        res 41/04:00:68:85:d7/00:00:06:00:00/40 Emask 0x1 (device error)
[76319.167549] ata1.00: status: { DRDY ERR }
[76319.167558] ata1.00: error: { ABRT }
[76319.167567] ata1.00: failed command: WRITE FPDMA QUEUED
[76319.167579] ata1.00: cmd 61/00:58:68:85:d7/01:00:06:00:00/40 tag 11 ncq dma 131072 out
                        res 41/04:00:68:85:d7/00:01:06:00:00/00 Emask 0x401 (device error) <F>
[76319.167614] ata1.00: status: { DRDY ERR }
[76319.167624] ata1.00: error: { ABRT }
[76319.168001] ata1.00: failed command: WRITE FPDMA QUEUED
[76319.168390] ata1.00: cmd 61/a8:68:68:86:d7/00:00:06:00:00/40 tag 13 ncq dma 86016 out
                        res 41/04:00:68:85:d7/00:00:06:00:00/40 Emask 0x1 (device error)
[76319.169158] ata1.00: status: { DRDY ERR }
[76319.169532] ata1.00: error: { ABRT }
[76319.170319] ata1.00: supports DRM functions and may not be fully accessible
[76319.174125] ata1.00: supports DRM functions and may not be fully accessible
[76319.177276] ata1.00: configured for UDMA/133
[76319.177745] sd 0:0:0:0: [sda] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[76319.178162] sd 0:0:0:0: [sda] tag#5 Sense Key : Illegal Request [current]
[76319.178567] sd 0:0:0:0: [sda] tag#5 Add. Sense: Unaligned write command
[76319.178967] sd 0:0:0:0: [sda] tag#5 CDB: Write same(16) 93 08 00 00 00 00 0c 37 d9 80 00 00 01 18 00 00
[76319.179448] blk_update_request: I/O error, dev sda, sector 204986752 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
[76319.179967] sd 0:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[76319.180452] sd 0:0:0:0: [sda] tag#11 Sense Key : Illegal Request [current]
[76319.180881] sd 0:0:0:0: [sda] tag#11 Add. Sense: Unaligned write command
[76319.181303] sd 0:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 06 d7 85 68 00 01 00 00
[76319.181723] blk_update_request: I/O error, dev sda, sector 114787688 op 0x1:(WRITE) flags 0x700 phys_seg 20 prio class 0
[76319.182168] zio pool=rpool vdev=/dev/disk/by-id/ata-SAMSUNG_MZ7L3480HCHQ-00A07_S664NE0RC02318-part3 error=5 type=2 offset=58233376768 size=131072 flags=40080c80
[76319.183105] sd 0:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[76319.183649] sd 0:0:0:0: [sda] tag#13 Sense Key : Illegal Request [current]
[76319.184212] sd 0:0:0:0: [sda] tag#13 Add. Sense: Unaligned write command
[76319.184752] sd 0:0:0:0: [sda] tag#13 CDB: Write(10) 2a 00 06 d7 86 68 00 00 a8 00
[76319.185235] blk_update_request: I/O error, dev sda, sector 114787944 op 0x1:(WRITE) flags 0x700 phys_seg 18 prio class 0
[76319.185737] zio pool=rpool vdev=/dev/disk/by-id/ata-SAMSUNG_MZ7L3480HCHQ-00A07_S664NE0RC02318-part3 error=5 type=2 offset=58233507840 size=86016 flags=40080c80
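
To tie the SCSI, block-layer and ZFS lines in the log above together, here is a small sketch that decodes the two failed WRITE(10) CDBs and compares them with the sectors and zio offsets reported next to them. It assumes 512-byte logical sectors. The helper names `decode_write10` and `offset_gaps` are made up for this illustration.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB into (start LBA, number of blocks)."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


# (CDB, sector from blk_update_request, zio offset, zio size), all from the log.
FAILED_WRITES = [
    ("2a 00 06 d7 85 68 00 01 00 00", 114787688, 58233376768, 131072),
    ("2a 00 06 d7 86 68 00 00 a8 00", 114787944, 58233507840, 86016),
]


def offset_gaps(entries):
    """Return the set of (disk byte offset - zio offset) over all entries."""
    gaps = set()
    for cdb_hex, sector, zio_offset, zio_size in entries:
        lba, blocks = decode_write10(cdb_hex)
        if lba != sector or blocks * SECTOR_SIZE != zio_size:
            raise ValueError("CDB does not match the kernel/ZFS log lines")
        gaps.add(lba * SECTOR_SIZE - zio_offset)
    return gaps


print(offset_gaps(FAILED_WRITES))
```

Both writes decode to the sector and size that the kernel and ZFS report, and both give the same gap of 537919488 bytes (1050624 sectors) between the whole-disk offset and the zio offset. My assumption is that this gap is simply where part3 starts on the disk; the log itself does not say so. If that holds, the failing writes are ordinary 512-byte-addressed writes at sector numbers divisible by 8, despite the "Unaligned write command" sense text.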



Here is the SMART info:

sda:
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2713
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       29
177 Wear_Leveling_Count     0x0013   099   099   005    Pre-fail  Always       -       34
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   100   100   010    Pre-fail  Always       -       438
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   061   057   000    Old_age   Always       -       39
194 Temperature_Celsius     0x0022   061   056   000    Old_age   Always       -       39 (Min/Max 25/44)
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
202 Unknown_SSD_Attribute   0x0033   100   100   010    Pre-fail  Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       17
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       9089360246
242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       3154214189
243 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
244 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
245 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       65535
246 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       65535
247 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       65535
251 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       9523516160


sdb:
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2713
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       25
177 Wear_Leveling_Count     0x0013   099   099   005    Pre-fail  Always       -       33
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   100   100   010    Pre-fail  Always       -       445
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   062   058   000    Old_age   Always       -       38
194 Temperature_Celsius     0x0022   062   057   000    Old_age   Always       -       38 (Min/Max 25/43)
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
202 Unknown_SSD_Attribute   0x0033   100   100   010    Pre-fail  Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       13
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       8754075881
242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       3015509542
243 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
244 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
245 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       65535
246 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       65535
247 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       65535
251 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       9188468416
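
As a quick way to read the two tables above, here is a sketch that pulls the raw values out of smartctl-style attribute rows and lists any non-zero error-type counters. The `WATCHED` list is my own selection of attributes, not an official one, and the sample contains only a few of the sda rows from above to keep it short.

```python
# A few rows copied from the sda table above.
SDA_SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   061   056   000    Old_age   Always       -       39 (Min/Max 25/44)
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
"""

# Assumption: these are the counters worth checking first for media or link errors.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Program_Fail_Cnt_Total",
    "Erase_Fail_Count_Total",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "UDMA_CRC_Error_Count",
)


def parse_attributes(table):
    """Map attribute name to raw value (first token of the RAW_VALUE column)."""
    attrs = {}
    for line in table.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) >= 10 and parts[0].isdigit():
            attrs[parts[1]] = int(parts[9])
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is not zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) != 0}


print(nonzero_watched(parse_attributes(SDA_SAMPLE)))
```

For the rows shown, nothing is flagged: the drive itself has logged no reallocations, uncorrectable errors or CRC errors, even though the kernel and ZFS report write failures. The same holds for the corresponding sdb rows above.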


Could this be related to https://github.com/openzfs/zfs/issues/10094?
 
Hi,

Based on the errors in dmesg, I would first try changing the cable to the disk and see whether that fixes the issue. If it does not help, I would recommend replacing the disk.

I only skimmed the linked OpenZFS issue, but at least some people in the thread said that changing the HBA controller, cable, or disk, or updating the firmware, fixed it for them.
 
