Hi, good morning everyone
I'm having issues with my VM RAIDZ pool crashing from time to time due to nonexistent errors (badblocks reports no errors and there is no corruption on the FS).
I reported this on the Linux kernel Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201693
I'm getting these errors in the journal:
ata1.00: exception Emask 0x10 SAct 0x7f80 SErr 0x440100 action 0x6 frozen
ata1.00: irq_stat 0x08000000, interface fatal error
ata1: SError: { UnrecovData CommWake Handshk }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/00:38:20:16:02/0a:00:65:00:00/40 tag 7 ncq dma 1310720 out
res 40/00:40:20:20:02/00:00:65:00:00/40 Emask 0x10 (ATA bus error)
ata1.00: status: { DRDY }
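For readers following the log: the SErr value is a bitmask, and the kernel prints the decoded flag names on the "SError:" line. Below is a minimal sketch of that decoding. The helper name decode_serr is hypothetical, and the bit table is written from memory of the SERR_* constants in the kernel's include/linux/ata.h, so treat it as an assumption; only the three bits set in this log are confirmed by the output above.

```shell
#!/bin/sh
# decode_serr: print the libata-style names of the bits set in an SError value.
# The bit/name table is an assumption recalled from include/linux/ata.h.
decode_serr() (
    value=$(( $1 ))
    names=""
    for entry in 0:RecovData 1:RecovComm 8:UnrecovData 9:Persist 10:Proto \
                 11:HostInt 16:PHYRdyChg 17:PHYInt 18:CommWake 19:10B8B \
                 20:Dispar 21:BadCRC 22:Handshk 23:LinkSeq 24:TrStaTrns \
                 25:UnrecFIS 26:DevExch; do
        bit=${entry%%:*}     # part before the colon: bit position
        name=${entry#*:}     # part after the colon: flag name
        if [ $(( value & (1 << bit) )) -ne 0 ]; then
            names="${names:+$names }$name"
        fi
    done
    printf '%s\n' "$names"
)

# The value from the log above:
decode_serr 0x440100   # prints: UnrecovData CommWake Handshk
```

The decoded flags match the "SError: { UnrecovData CommWake Handshk }" line, which points at the link between controller and drive rather than at the flash media.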
I can confirm this happens with the AMD SATA controller (https://www.supermicro.com/en/products/motherboard/M11SDV-8C+-LN4F) and the Samsung 870 EVO:
*-sata
description: SATA controller
product: FCH SATA Controller [AHCI mode]
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0.2
bus info: pci@0000:07:00.2
logical name: scsi0
logical name: scsi1
logical name: scsi2
logical name: scsi3
version: 51
width: 32 bits
clock: 33MHz
capabilities: sata pm pciexpress msi ahci_1.0 bus_master cap_list emulated
configuration: driver=ahci latency=0
resources: irq:46 memory:ef602000-ef602fff
*-disk:0
description: ATA Disk
product: Samsung SSD 870
physical id: 0
bus info: scsi@0:0.0.0
logical name: /dev/sda
version: 1B6Q
size: 931GiB (1TB)
capabilities: removable
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
*-medium
physical id: 0
logical name: /dev/sda
size: 931GiB (1TB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=1295f5ca-cc51-6449-b884-b9b76f930336
*-disk:1
description: ATA Disk
product: Samsung SSD 870
physical id: 1
bus info: scsi@1:0.0.0
logical name: /dev/sdb
version: 2B6Q
size: 931GiB (1TB)
capabilities: removable
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
*-medium
physical id: 0
logical name: /dev/sdb
size: 931GiB (1TB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=6bee9b39-d327-cc4b-b3df-d8582cce9d04
*-disk:2
description: ATA Disk
product: Samsung SSD 870
physical id: 2
bus info: scsi@2:0.0.0
logical name: /dev/sdc
version: 1B6Q
size: 931GiB (1TB)
capabilities: removable
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
*-medium
physical id: 0
logical name: /dev/sdc
size: 931GiB (1TB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=5445d2ac-f424-ee4b-ae77-4e7b03750b3f
*-disk:3
description: ATA Disk
product: Samsung SSD 870
physical id: 3
bus info: scsi@3:0.0.0
logical name: /dev/sdd
version: 1B6Q
size: 931GiB (1TB)
capabilities: removable
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
*-medium
physical id: 0
logical name: /dev/sdd
size: 931GiB (1TB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=4f3e0797-d8f7-2b4d-9834-00bcc939e156
I have 4 disks in a RAIDZ config:
=== START OF INFORMATION SECTION ===
Device Model: Samsung SSD 870 EVO 1TB
LU WWN Device Id: 5 002538 f311ab0dc
Firmware Version: SVT01B6Q
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Mar 19 11:07:39 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
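The lshw listing above reports version 2B6Q for /dev/sdb and 1B6Q for the other three disks, so it may be worth collecting the firmware version of every pool member in one place. A small sketch, assuming smartctl is installed and run as root; the helper names firmware_of and list_firmware are hypothetical.

```shell
#!/bin/sh
# firmware_of: read `smartctl -i` output on stdin and print only the
# firmware version string.
firmware_of() {
    sed -n 's/^Firmware Version:[[:space:]]*//p'
}

# list_firmware: print "<device> <firmware>" for each device given.
list_firmware() {
    for dev in "$@"; do
        printf '%s %s\n' "$dev" "$(smartctl -i "$dev" | firmware_of)"
    done
}

# Example (needs root):
#   list_firmware /dev/sda /dev/sdb /dev/sdc /dev/sdd
```

Whether the firmware difference matters here is not established; the listing only makes it easy to include in a bug report.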
Running Proxmox 7.1.10 with the 5.13 kernel:
5.13.19-6-pve #1 SMP PVE 5.13.19-14 (Thu, 10 Mar 2022 16:24:52 +0100) x86_64 GNU/Linux
I disabled NCQ per port (libata.force=1.00:noncq,2.00:noncq,3.00:noncq,4.00:noncq) and globally (libata.force=noncq) via GRUB, and also limited the queue depth per device:
for f in /sys/block/sd*/device/queue_depth; do echo 1 > "$f"; done
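To rule out the settings silently not being applied, they can be read back after boot. This is a rough sketch, not a definitive check: the function names are hypothetical, and the optional path arguments exist only so the helpers can be pointed at a test directory instead of the live /proc and /sys.

```shell
#!/bin/sh
# cmdline_has_noncq: succeed if the kernel command line carries a
# libata.force option that includes noncq (but not e.g. noncqtrim).
cmdline_has_noncq() (
    cmdline_file=${1:-/proc/cmdline}
    grep -Eq '(^| )libata\.force=([^ ]*[:,])?noncq(,| |$)' "$cmdline_file"
)

# report_queue_depth: print "<disk> queue_depth=<n>" for every sd* disk.
report_queue_depth() (
    sys_block=${1:-/sys/block}
    for f in "$sys_block"/sd*/device/queue_depth; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        dev=${f#"$sys_block"/}
        dev=${dev%%/*}
        printf '%s queue_depth=%s\n' "$dev" "$(cat "$f")"
    done
)

# Example:
#   cmdline_has_noncq && echo "noncq is on the kernel command line"
#   report_queue_depth
```

A queue depth of 1 on every disk indicates that NCQ is effectively off for those devices.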
I also set the SATA link power management policy to max_performance:
for f in /sys/class/scsi_host/host*/link_power_management_policy; do echo max_performance > "$f"; done
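The policy can be read back the same way to confirm that every host accepted it. A minimal sketch with a hypothetical helper name; the optional argument is only there so it can be run against a test directory.

```shell
#!/bin/sh
# report_lpm_policy: print "<host> <policy>" for every SCSI host that
# exposes a link_power_management_policy attribute.
report_lpm_policy() (
    scsi_host=${1:-/sys/class/scsi_host}
    for f in "$scsi_host"/host*/link_power_management_policy; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        host=${f#"$scsi_host"/}
        host=${host%%/*}
        printf '%s %s\n' "$host" "$(cat "$f")"
    done
)

# Example:
#   report_lpm_policy
```

Values written through sysfs do not survive a reboot, so the read-back is most useful right before reproducing the crash.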
None of these configurations works consistently; the ZFS array still crashes within minutes to hours. I have been running the same hardware on a TrueNAS system (FreeBSD) with constant rsyncs for a whole day on the same pool without a single error.
I also tested the latest 5.15 test kernel from PVE without success (this kernel release supposedly includes a patch that fixes this).
Is anyone else seeing the same issue?
Let me know if I can provide any more info to debug this.
Thanks in advance.