We configured a simple ZFS mirror of two Samsung 870 QVO 4TB SSDs.
Recently the pool state changed to "degraded": one of the SSDs went completely offline and did not even respond to simple SMART commands (I tried smartctl --all /dev/sda and it returned a generic error).
Here is the journalctl log:
Code:
Aug 23 07:04:47 pve1 kernel: ahci 0000:11:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x7dbc0000 flags=0x0000]
Aug 23 07:04:48 pve1 kernel: ata4.00: exception Emask 0x10 SAct 0x4000200 SErr 0x0 action 0x6 frozen
Aug 23 07:04:48 pve1 kernel: ata4.00: irq_stat 0x08000000, interface fatal error
Aug 23 07:04:48 pve1 kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Aug 23 07:04:48 pve1 kernel: ata4.00: cmd 61/48:48:68:2f:d6/00:00:bf:01:00/40 tag 9 ncq dma 36864 out
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Aug 23 07:04:48 pve1 kernel: ata4.00: status: { DRDY }
Aug 23 07:04:48 pve1 kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Aug 23 07:04:48 pve1 kernel: ata4.00: cmd 61/00:d0:68:2e:d6/01:00:bf:01:00/40 tag 26 ncq dma 131072 out
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Aug 23 07:04:48 pve1 kernel: ata4.00: status: { DRDY }
Aug 23 07:04:48 pve1 kernel: ata4: hard resetting link
Aug 23 07:04:57 pve1 xcloud-endpoint-manager[1609]: 2024-08-23 07:04:57.209 I [t 1619] Config file have no updates.
Aug 23 07:04:57 pve1 xcloud-endpoint-manager[1609]: 2024-08-23 07:04:57.210 I [t 1619] Connected to: 172.16.10.50
Aug 23 07:04:57 pve1 xcloud-endpoint-manager[1609]: 2024-08-23 07:04:57.231 I [t 1619] No work items to process.
Aug 23 07:04:58 pve1 kernel: ata4: softreset failed (1st FIS failed)
Aug 23 07:04:58 pve1 kernel: ata4: hard resetting link
Aug 23 07:05:08 pve1 kernel: ata4: softreset failed (1st FIS failed)
Aug 23 07:05:08 pve1 kernel: ata4: hard resetting link
Aug 23 07:05:43 pve1 kernel: ata4: softreset failed (1st FIS failed)
Aug 23 07:05:43 pve1 kernel: ata4: limiting SATA link speed to 3.0 Gbps
Aug 23 07:05:43 pve1 kernel: ata4: hard resetting link
Aug 23 07:05:48 pve1 kernel: ata4: softreset failed (1st FIS failed)
Aug 23 07:05:48 pve1 kernel: ata4: softreset failed
Aug 23 07:05:48 pve1 kernel: ata4: reset failed, giving up
Aug 23 07:05:48 pve1 kernel: ata4.00: disable device
Aug 23 07:05:48 pve1 kernel: ata4: EH complete
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=60s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#19 CDB: Write(16) 8a 00 00 00 00 01 bf d6 2e 68 00 00 01 00 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513452136 op 0x1:(WRITE) flags 0x0 phys_seg 5 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846886445056 size=131072 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#21 CDB: Write(16) 8a 00 00 00 00 01 bf cf b2 68 00 00 00 70 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513027176 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846668865536 size=57344 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#23 CDB: Write(16) 8a 00 00 00 00 01 bf d6 26 68 00 00 00 28 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513450088 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846885396480 size=20480 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#0 CDB: Write(16) 8a 00 00 00 00 01 bf d8 ed 10 00 00 00 08 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513632016 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846978543616 size=4096 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#1 CDB: Write(16) 8a 00 00 00 00 01 bf d8 f6 e8 00 00 00 10 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513634536 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846979833856 size=8192 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#2 CDB: Write(16) 8a 00 00 00 00 01 bf d8 fd 60 00 00 00 28 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513636192 op 0x1:(WRITE) flags 0x0 phys_seg 5 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846980681728 size=20480 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#3 CDB: Write(16) 8a 00 00 00 00 01 bf d9 03 c0 00 00 00 08 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513637824 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846981517312 size=4096 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#16 CDB: Write(16) 8a 00 00 00 00 01 bf d9 07 28 00 00 00 50 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513638696 op 0x1:(WRITE) flags 0x0 phys_seg 10 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846981963776 size=40960 flags=1074267264
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#19 CDB: Read(16) 88 00 00 00 00 00 00 00 0a 10 00 00 00 10 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 2576 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=270336 size=8192 flags=721089
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#21 CDB: Read(16) 88 00 00 00 00 01 d1 c0 74 10 00 00 00 10 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7814018064 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=4000776200192 size=8192 flags=721089
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=4000776462336 size=8192 flags=721089
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=1529785479168 size=32768 flags=1572992
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=1529785511936 size=32768 flags=1572992
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846886576128 size=36864 flags=1572992
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846885003264 size=28672 flags=1572992
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846967910400 size=8192 flags=1572992
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846967848960 size=4096 flags=1572992
Aug 23 07:05:48 pve1 kernel: Buffer I/O error on dev sda1, logical block 976752112, async page read
Aug 23 07:05:48 pve1 zed[817181]: eid=111 class=io pool='SSD-MirrorZFS' vdev=ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 size=8192 offset=270336 priority=0 err=5 flags=0xb00c1
Aug 23 07:05:48 pve1 zed[817183]: eid=112 class=io pool='SSD-MirrorZFS' vdev=ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 size=8192 offset=4000776200192 priority=0 err=5 flags=0xb00c1
Aug 23 07:05:48 pve1 zed[817185]: eid=113 class=io pool='SSD-MirrorZFS' vdev=ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 size=8192 offset=4000776462336 priority=0 err=5 flags=0xb00c1
Aug 23 07:05:48 pve1 zed[817186]: eid=114 class=probe_failure pool='SSD-MirrorZFS' vdev=ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1
Aug 23 07:05:48 pve1 kernel: Buffer I/O error on dev sda9, logical block 2032, async page read
Aug 23 07:05:48 pve1 kernel: Buffer I/O error on dev sda1, logical block 976752112, async page read
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=270336 size=8192 flags=721601
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=4000776200192 size=8192 flags=721601
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=4000776462336 size=8192 flags=721601
Aug 23 07:05:48 pve1 zed[817193]: eid=116 class=statechange pool='SSD-MirrorZFS' vdev=ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 vdev_state=FAULTED
Aug 23 07:05:48 pve1 zed[817192]: eid=115 class=probe_failure pool='SSD-MirrorZFS' vdev=ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1
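To cross-check that the block-layer I/O errors and the ZFS zio errors in the log above refer to the same requests, here is a minimal Python sketch. It rests on two assumptions that are not stated anywhere in the log: 512-byte logical sectors, and a first partition that starts at sector 2048 (1 MiB). The names pair_errors and decode_cdb16 are made up for this illustration. The sample lines are copied from the log.

```python
import re

SECTOR_SIZE = 512         # assumption: 512-byte logical sectors
PART_START_SECTOR = 2048  # assumption: part1 begins at sector 2048 (1 MiB)

# Three I/O error + zio pairs copied verbatim from the journalctl log.
LOG = """\
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513452136 op 0x1:(WRITE) flags 0x0 phys_seg 5 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846886445056 size=131072 flags=1572992
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 2576 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=270336 size=8192 flags=721089
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7814018064 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=4000776200192 size=8192 flags=721089
"""

IO_RE = re.compile(r"I/O error, dev (\w+), sector (\d+) op \S+:\((\w+)\)")
ZIO_RE = re.compile(r"zio pool=\S+ vdev=\S+ error=(\d+) type=(\d+) offset=(\d+) size=(\d+)")


def pair_errors(log_text):
    """Pair each block-layer I/O error with the zio line that follows it.

    Returns tuples (op, sector, zio_offset, matches), where matches is True
    when (sector - PART_START_SECTOR) * SECTOR_SIZE equals the zio offset.
    """
    pairs = []
    pending = None
    for line in log_text.splitlines():
        m = IO_RE.search(line)
        if m:
            pending = (m.group(3), int(m.group(2)))
            continue
        m = ZIO_RE.search(line)
        if m and pending is not None:
            op, sector = pending
            zio_offset = int(m.group(3))
            expected = (sector - PART_START_SECTOR) * SECTOR_SIZE
            pairs.append((op, sector, zio_offset, expected == zio_offset))
            pending = None
    return pairs


def decode_cdb16(cdb_hex):
    """Decode LBA and transfer length (in sectors) from a 16-byte SCSI
    READ(16)/WRITE(16) CDB: bytes 2-9 hold the LBA, bytes 10-13 the length,
    both big-endian."""
    raw = bytes.fromhex(cdb_hex.replace(" ", ""))
    lba = int.from_bytes(raw[2:10], "big")
    length = int.from_bytes(raw[10:14], "big")
    return lba, length


for op, sector, offset, ok in pair_errors(LOG):
    print(f"{op:5s} sector={sector} zio_offset={offset} consistent={ok}")

# First failed write from the log (tag#19).
lba, nsect = decode_cdb16("8a 00 00 00 00 01 bf d6 2e 68 00 00 01 00 00 00")
print(f"CDB LBA={lba} length={nsect} sectors ({nsect * SECTOR_SIZE} bytes)")
```

If every pair comes out consistent, that would suggest the zio errors are simply ZFS reporting the same requests the kernel failed after it disabled ata4.00, rather than a separate ZFS-side problem. That is only my reading under the assumptions above.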
I double-checked the SATA connections and the PSU and found nothing out of the ordinary.
Assuming the SSD was completely dead, I removed it from the Proxmox server and connected it to a second PC running Windows.
To my great surprise, the SSD seems to work correctly. It passed an extended SMART test and an in-depth block check.
I attached the output of smartctl -x for this SSD...
What is going on? Is this drive really faulted? Was it all just a ZFS error?