ZFS Mirror Degraded

Dylan96

Sep 10, 2024
We had configured a simple ZFS mirror between two Samsung 870 QVO 4TB SSDs.
Recently the pool state changed to "DEGRADED": one of the SSDs went completely offline and stopped responding even to simple SMART commands (I tried smartctl --all /dev/sda and it returned a generic error).

Here is the journalctl log:

Code:
Aug 23 07:04:47 pve1 kernel: ahci 0000:11:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x7dbc0000 flags=0x0000]
Aug 23 07:04:48 pve1 kernel: ata4.00: exception Emask 0x10 SAct 0x4000200 SErr 0x0 action 0x6 frozen
Aug 23 07:04:48 pve1 kernel: ata4.00: irq_stat 0x08000000, interface fatal error
Aug 23 07:04:48 pve1 kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Aug 23 07:04:48 pve1 kernel: ata4.00: cmd 61/48:48:68:2f:d6/00:00:bf:01:00/40 tag 9 ncq dma 36864 out
                                      res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Aug 23 07:04:48 pve1 kernel: ata4.00: status: { DRDY }
Aug 23 07:04:48 pve1 kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Aug 23 07:04:48 pve1 kernel: ata4.00: cmd 61/00:d0:68:2e:d6/01:00:bf:01:00/40 tag 26 ncq dma 131072 out
                                      res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Aug 23 07:04:48 pve1 kernel: ata4.00: status: { DRDY }
Aug 23 07:04:48 pve1 kernel: ata4: hard resetting link
Aug 23 07:04:57 pve1 xcloud-endpoint-manager[1609]: 2024-08-23 07:04:57.209 I [t 1619] Config file have no updates.
Aug 23 07:04:57 pve1 xcloud-endpoint-manager[1609]: 2024-08-23 07:04:57.210 I [t 1619] Connected to: 172.16.10.50
Aug 23 07:04:57 pve1 xcloud-endpoint-manager[1609]: 2024-08-23 07:04:57.231 I [t 1619] No work items to process.
Aug 23 07:04:58 pve1 kernel: ata4: softreset failed (1st FIS failed)
Aug 23 07:04:58 pve1 kernel: ata4: hard resetting link
Aug 23 07:05:08 pve1 kernel: ata4: softreset failed (1st FIS failed)
Aug 23 07:05:08 pve1 kernel: ata4: hard resetting link
Aug 23 07:05:43 pve1 kernel: ata4: softreset failed (1st FIS failed)
Aug 23 07:05:43 pve1 kernel: ata4: limiting SATA link speed to 3.0 Gbps
Aug 23 07:05:43 pve1 kernel: ata4: hard resetting link
Aug 23 07:05:48 pve1 kernel: ata4: softreset failed (1st FIS failed)
Aug 23 07:05:48 pve1 kernel: ata4: softreset failed
Aug 23 07:05:48 pve1 kernel: ata4: reset failed, giving up
Aug 23 07:05:48 pve1 kernel: ata4.00: disable device
Aug 23 07:05:48 pve1 kernel: ata4: EH complete
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=60s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#19 CDB: Write(16) 8a 00 00 00 00 01 bf d6 2e 68 00 00 01 00 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513452136 op 0x1:(WRITE) flags 0x0 phys_seg 5 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846886445056 size=131072 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#21 CDB: Write(16) 8a 00 00 00 00 01 bf cf b2 68 00 00 00 70 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513027176 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846668865536 size=57344 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#23 CDB: Write(16) 8a 00 00 00 00 01 bf d6 26 68 00 00 00 28 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513450088 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846885396480 size=20480 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#0 CDB: Write(16) 8a 00 00 00 00 01 bf d8 ed 10 00 00 00 08 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513632016 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846978543616 size=4096 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#1 CDB: Write(16) 8a 00 00 00 00 01 bf d8 f6 e8 00 00 00 10 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513634536 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846979833856 size=8192 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#2 CDB: Write(16) 8a 00 00 00 00 01 bf d8 fd 60 00 00 00 28 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513636192 op 0x1:(WRITE) flags 0x0 phys_seg 5 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846980681728 size=20480 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#3 CDB: Write(16) 8a 00 00 00 00 01 bf d9 03 c0 00 00 00 08 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513637824 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846981517312 size=4096 flags=1572992
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#16 CDB: Write(16) 8a 00 00 00 00 01 bf d9 07 28 00 00 00 50 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7513638696 op 0x1:(WRITE) flags 0x0 phys_seg 10 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846981963776 size=40960 flags=1074267264
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#19 CDB: Read(16) 88 00 00 00 00 00 00 00 0a 10 00 00 00 10 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 2576 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=270336 size=8192 flags=721089
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Aug 23 07:05:48 pve1 kernel: sd 3:0:0:0: [sda] tag#21 CDB: Read(16) 88 00 00 00 00 01 d1 c0 74 10 00 00 00 10 00 00
Aug 23 07:05:48 pve1 kernel: I/O error, dev sda, sector 7814018064 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=4000776200192 size=8192 flags=721089
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=4000776462336 size=8192 flags=721089
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=1529785479168 size=32768 flags=1572992
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=1529785511936 size=32768 flags=1572992
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846886576128 size=36864 flags=1572992
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846885003264 size=28672 flags=1572992
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846967910400 size=8192 flags=1572992
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=2 offset=3846967848960 size=4096 flags=1572992
Aug 23 07:05:48 pve1 kernel: Buffer I/O error on dev sda1, logical block 976752112, async page read
Aug 23 07:05:48 pve1 zed[817181]: eid=111 class=io pool='SSD-MirrorZFS' vdev=ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 size=8192 offset=270336 priority=0 err=5 flags=0xb00c1
Aug 23 07:05:48 pve1 zed[817183]: eid=112 class=io pool='SSD-MirrorZFS' vdev=ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 size=8192 offset=4000776200192 priority=0 err=5 flags=0xb00c1
Aug 23 07:05:48 pve1 zed[817185]: eid=113 class=io pool='SSD-MirrorZFS' vdev=ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 size=8192 offset=4000776462336 priority=0 err=5 flags=0xb00c1
Aug 23 07:05:48 pve1 zed[817186]: eid=114 class=probe_failure pool='SSD-MirrorZFS' vdev=ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1
Aug 23 07:05:48 pve1 kernel: Buffer I/O error on dev sda9, logical block 2032, async page read
Aug 23 07:05:48 pve1 kernel: Buffer I/O error on dev sda1, logical block 976752112, async page read
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=270336 size=8192 flags=721601
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=4000776200192 size=8192 flags=721601
Aug 23 07:05:48 pve1 kernel: zio pool=SSD-MirrorZFS vdev=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 error=5 type=1 offset=4000776462336 size=8192 flags=721601
Aug 23 07:05:48 pve1 zed[817193]: eid=116 class=statechange pool='SSD-MirrorZFS' vdev=ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1 vdev_state=FAULTED
Aug 23 07:05:48 pve1 zed[817192]: eid=115 class=probe_failure pool='SSD-MirrorZFS' vdev=ata-Samsung_SSD_870_QVO_4TB_S5STNF0W800666D-part1

I double-checked the SATA connections and the PSU; nothing out of the ordinary.
Assuming the SSD was completely dead, I removed it from the Proxmox server and connected it to a second Windows PC.
To my great surprise, the SSD seems to work correctly: it passed an extended SMART test and an in-depth check of its blocks.
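
For reference, the equivalent check on Linux would look something like this (a minimal sketch; /dev/sdX is a placeholder for wherever the drive shows up):

Code:
# Re-run the extended SMART self-test from Linux (/dev/sdX is a placeholder):
smartctl -t long /dev/sdX        # start the extended self-test in the background
smartctl -l selftest /dev/sdX    # read the self-test log once it has finished
smartctl -x /dev/sdX             # full SMART/device report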


I attached the output of smartctl -x for this SSD...

What is going on? Is this drive really faulted? Was it all just a ZFS error?
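
For what it's worth, if the drive were still attached, the usual way to re-test it in place would be something like the sketch below (pool name taken from the log above; only sensible while the other half of the mirror is healthy):

Code:
# Clear the error state so ZFS re-probes the device and resilvers it if it responds
# (pool name from the log above; run only while the other mirror half is healthy):
zpool clear SSD-MirrorZFS
zpool status -v SSD-MirrorZFS    # watch for a resilver and for new READ/WRITE/CKSUM errors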
 

QLC drives are really the worst choice for ZFS. Use them in your Windows PC, but not in a server.
There are plenty of threads about this here in the forum. Just search for QLC...
 
I had this problem and it turned out to be a bad/loose SATA cable. Replacing the cable solved the problem for me, even though the cable I replaced looked fine, with no visible defects.
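
One cheap way to check for that is the SATA CRC error counter, which increments on cable/connector-level transmission errors rather than NAND problems; a minimal sketch, assuming the drive is visible again as /dev/sdX:

Code:
# UDMA/SATA CRC errors usually point at the cable or connector, not the flash:
smartctl -A /dev/sdX | grep -i crc                   # non-zero and growing => suspect the link/cable
dmesg | grep -iE 'ata[0-9]+.*(SError|link|reset)'    # kernel-side link resets, like the ones in the log above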
 
Please post the output of zpool status and zpool list -v.
I already replaced the drives; I only have the logs.
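
In that case the journal is still useful. A sketch of how to pull the relevant kernel and ZED lines back out for the incident window (timestamps taken from the log above):

Code:
# Kernel ATA/ZFS errors and ZED events for the time window in question:
journalctl -k --since "2024-08-23 07:00" --until "2024-08-23 07:10" | grep -E 'ata4|zio|I/O error'
journalctl -u zfs-zed --since "2024-08-23 07:00" --until "2024-08-23 07:10"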

Regarding the bad/loose SATA cable suggestion: the SATA cable is brand new... we'll see if it happens again (with different drives)

I know, the QVOs were a lot cheaper when we bought them. I just want to understand why the drive is working fine now...
 
Hey guys, this thread is interesting. Have you read the first post? Dylan wrote that Proxmox marked the ZFS pool as degraded because one disk had failed, but that turned out not to be true: a few tests on the "failed" SSD confirmed it. The SSD seems to be in good shape now.
So, the questions are: why did Proxmox mark the pool as degraded, and why did it mark the SSD as FAULTED? Please help us understand the log...
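
For reference, the eid=111..116 lines in the log are events from ZED (the ZFS event daemon); a sketch of how one would normally inspect them while the pool still exists (as far as I know the event queue does not survive a reboot, so for this pool only the journal excerpt above remains):

Code:
# Inspect the ZFS event queue that ZED reacts to (pool name from the log above):
zpool events -v SSD-MirrorZFS    # full details of the io / probe_failure / statechange events
zpool status -v SSD-MirrorZFS    # per-vdev READ/WRITE/CKSUM counters and why the vdev is FAULTED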
 
Wrong "Proxmox marked a ZFS" that's not right!
Proxmox is a extension to Debian 12.7 and in the future to Debian 13.

The SSD SAMSUNG 870 QVO 4 TB has a Chiptyp: QLC thats bad. everytime 4 Bit must be written to the flash cells.
that's slow. IO don't see any dram cache on the device.
There may be write timeouts and data lost by zfs writs.
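
A sustained sequential write is the easiest way to see that behaviour: once the SLC cache of a QLC drive is exhausted, throughput drops sharply. A hypothetical fio run (the scratch file path is made up; never point this at a disk or dataset you care about):

Code:
# Sustained 50 GiB sequential write to expose the post-cache slowdown of a QLC drive.
# WARNING: writes real data; /mnt/scratch/fio.test is a placeholder on a throwaway filesystem.
fio --name=qlc-sustained --filename=/mnt/scratch/fio.test \
    --rw=write --bs=1M --size=50G --direct=1 --ioengine=libaio --iodepth=8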
 
