Problems with Samsung SSD and AMD SATA controller

Mar 19, 2022
Hi, good morning everyone :)

I'm having issues with my VM RAIDZ pool crashing from time to time due to nonexistent errors (no bad-block errors or corruption on the FS).

This is reported on the Linux kernel Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201693

And I'm receiving these errors in the journal:

ata1.00: exception Emask 0x10 SAct 0x7f80 SErr 0x440100 action 0x6 frozen
ata1.00: irq_stat 0x08000000, interface fatal error
ata1: SError: { UnrecovData CommWake Handshk }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/00:38:20:16:02/0a:00:65:00:00/40 tag 7 ncq dma 1310720 ou
res 40/00:40:20:20:02/00:00:65:00:00/40 Emask 0x10 (ATA bus error)
ata1.00: status: { DRDY }
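
To see which ports are affected, the "failed command" lines can be tallied per port. This is only a minimal sketch, assuming the kernel log is piped in on stdin (for example from `journalctl -k`); the function name `count_ata_failures` is made up for illustration:

```shell
# Tally "failed command" lines per ATA port from a kernel log on stdin.
# count_ata_failures is a hypothetical helper name, not an existing tool.
count_ata_failures() {
    grep -E 'ata[0-9]+\.[0-9]+: failed command:' \
        | sed -E 's/.*(ata[0-9]+)\.[0-9]+: failed command:.*/\1/' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Usage (needs access to the kernel log):
#   journalctl -k | count_ata_failures
```

Each output line is a port name followed by the number of failed commands logged for it.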

I can confirm this happens with the AMD SATA controller (https://www.supermicro.com/en/products/motherboard/M11SDV-8C+-LN4F) and the 870 EVO:

*-sata
description: SATA controller
product: FCH SATA Controller [AHCI mode]
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0.2
bus info: pci@0000:07:00.2
logical name: scsi0
logical name: scsi1
logical name: scsi2
logical name: scsi3
version: 51
width: 32 bits
clock: 33MHz
capabilities: sata pm pciexpress msi ahci_1.0 bus_master cap_list emulated
configuration: driver=ahci latency=0
resources: irq:46 memory:ef602000-ef602fff
*-disk:0
description: ATA Disk
product: Samsung SSD 870
physical id: 0
bus info: scsi@0:0.0.0
logical name: /dev/sda
version: 1B6Q
size: 931GiB (1TB)
capabilities: removable
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
*-medium
physical id: 0
logical name: /dev/sda
size: 931GiB (1TB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=1295f5ca-cc51-6449-b884-b9b76f930336
*-disk:1
description: ATA Disk
product: Samsung SSD 870
physical id: 1
bus info: scsi@1:0.0.0
logical name: /dev/sdb
version: 2B6Q
size: 931GiB (1TB)
capabilities: removable
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
*-medium
physical id: 0
logical name: /dev/sdb
size: 931GiB (1TB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=6bee9b39-d327-cc4b-b3df-d8582cce9d04
*-disk:2
description: ATA Disk
product: Samsung SSD 870
physical id: 2
bus info: scsi@2:0.0.0
logical name: /dev/sdc
version: 1B6Q
size: 931GiB (1TB)
capabilities: removable
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
*-medium
physical id: 0
logical name: /dev/sdc
size: 931GiB (1TB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=5445d2ac-f424-ee4b-ae77-4e7b03750b3f
*-disk:3
description: ATA Disk
product: Samsung SSD 870
physical id: 3
bus info: scsi@3:0.0.0
logical name: /dev/sdd
version: 1B6Q
size: 931GiB (1TB)
capabilities: removable
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
*-medium
physical id: 0
logical name: /dev/sdd
size: 931GiB (1TB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=4f3e0797-d8f7-2b4d-9834-00bcc939e156

I have 4 disks in a RAIDZ config:

=== START OF INFORMATION SECTION ===
Device Model: Samsung SSD 870 EVO 1TB
LU WWN Device Id: 5 002538 f311ab0dc
Firmware Version: SVT01B6Q
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Mar 19 11:07:39 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Running PROXMOX 7.1.10 with 5.13 kernel:

5.13.19-6-pve #1 SMP PVE 5.13.19-14 (Thu, 10 Mar 2022 16:24:52 +0100) x86_64 GNU/Linux

I disabled NCQ per port (libata.force=1.00:noncq,2.00:noncq,3.00:noncq,4.00:noncq) and globally (libata.force=noncq) via GRUB, and also through the per-device setting:

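# a redirect to a glob is ambiguous when several disks match, so loop over them
for q in /sys/block/sd*/device/queue_depth; do echo 1 > "$q"; done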
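
To confirm the setting took effect, the current queue depth of each disk can be read back from sysfs. A rough sketch; the function name `show_queue_depths` and its optional root argument are my own additions, the latter only so it can be tried against a copy of the tree:

```shell
# Print "<disk> <queue_depth>" for every sd* disk under a sysfs root.
# show_queue_depths is a hypothetical helper; the root defaults to /sys.
show_queue_depths() {
    local root="${1:-/sys}" f dev
    for f in "$root"/block/sd*/device/queue_depth; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        dev="${f#"$root"/block/}"        # strip the leading path
        dev="${dev%%/*}"                 # keep only the disk name
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}

# Usage:
#   show_queue_depths
```

A depth of 1 means the disk handles one command at a time, so NCQ is effectively off for that disk.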

I also set the link power management policy to max_performance:

for p in /sys/class/scsi_host/host*/link_power_management_policy; do echo max_performance > "$p"; done

None of these configurations works consistently; the ZFS array still crashes within minutes to hours. I have been running the same hardware on a TrueNAS system (FreeBSD) with constant rsyncs for a whole day on the same pool without a single error.

I tested the latest 5.15 test kernel from PVE without success (this kernel release supposedly has a patch that solves this).

Is anyone else having the same issue?

Let me know if I can provide any more info to debug this.

Thanks in advance.
 

Attachments

  • Screenshot 2022-03-11 18.26.33.png
Hi all!
This is my first post, and I had the same issue.
I installed the new pre-release kernel, version 5.15:
apt update && apt install pve-kernel-5.15

Then I ran apt upgrade, and it looks like it is resolved.

Regards!
 
Not the same here. What worked for me was setting libata.force=noncq,3.0 on the kernel command line, as suggested on the Bugzilla.
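
After a reboot it is worth checking that the option actually reached the kernel. A small sketch, assuming the running command line is read from /proc/cmdline; the helper name `libata_force_value` is invented here:

```shell
# Print the value of libata.force= from a kernel command line file.
# libata_force_value is a hypothetical helper; the file defaults to /proc/cmdline.
libata_force_value() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^libata\.force=//p'
}

# Usage:
#   libata_force_value
```

If nothing is printed, the option is not active on the running kernel.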
 
Hi,

To add to this issue: we have an MSI X570 A-Pro with a 5900X and 5x 4 TB Samsung 860, and this issue appeared recently.
Only the first two SATA ports produce issues; after swapping disks, the errors always stay on the first two ports.
It seems that the first two ports are on an ASMedia controller (26:00.0) and subsequent ports are on an AMD controller (2c:00.0).

```
26:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)
2b:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
2c:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
```

```
[0:0:0:0] disk ATA Samsung SSD 860 3B6Q /dev/sda
dir: /sys/bus/scsi/devices/0:0:0:0 [/sys/devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:04.0/0000:26:00.0/ata1/host0/target0:0:0/0:0:0:0]
[1:0:0:0] disk ATA Samsung SSD 860 3B6Q /dev/sdb
dir: /sys/bus/scsi/devices/1:0:0:0 [/sys/devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:04.0/0000:26:00.0/ata2/host1/target1:0:0/1:0:0:0]
[3:0:0:0] disk ATA Samsung SSD 860 3B6Q /dev/sdc
dir: /sys/bus/scsi/devices/3:0:0:0 [/sys/devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:0a.0/0000:2c:00.0/ata4/host3/target3:0:0/3:0:0:0]
[4:0:0:0] disk ATA Samsung SSD 860 3B6Q /dev/sdd
dir: /sys/bus/scsi/devices/4:0:0:0 [/sys/devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:0a.0/0000:2c:00.0/ata5/host4/target4:0:0/4:0:0:0]
[7:0:0:0] disk ATA Samsung SSD 860 3B6Q /dev/sde
dir: /sys/bus/scsi/devices/7:0:0:0 [/sys/devices/pci0000:00/0000:00:01.2/0000:20:00.0/0000:21:0a.0/0000:2c:00.0/ata8/host7/target7:0:0/7:0:0:0]

```
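
The same mapping can be pulled out of the sysfs path directly: the last PCI address in the path is the controller the disk hangs off. A minimal sketch, with `pci_addr_of_path` as a made-up helper name:

```shell
# Print the last PCI address (domain:bus:device.function) found in a sysfs path,
# which is the SATA controller for a disk path like the ones listed above.
# pci_addr_of_path is a hypothetical helper name.
pci_addr_of_path() {
    printf '%s\n' "$1" \
        | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' \
        | tail -n 1
}

# Usage:
#   pci_addr_of_path "$(readlink -f /sys/block/sda)"
```

The result can then be matched against the `lspci` listing to see which controller a given disk is attached to.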

In any case, disabling NCQ fixes the issue.
 
Hi guys, unfortunately I'm dealing with the same error. In my case I'm using a Minisforum HM80 with an AMD Ryzen 7 4800U. The SMART values of one of my two SSDs show constantly growing CRC errors. I changed the cable and the SSD with no success. The SSDs are Crucial MX500. I think this will put my RAID1 into a degraded state after a minimal number of I/O errors.

zpool status -v tank0
pool: tank0
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub repaired 0B in 00:03:14 with 0 errors on Wed Jan 11 12:58:33 2023
config:

NAME STATE READ WRITE CKSUM
tank0 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-CT1000MX500SSD1_2215E625EF1A ONLINE 2 0 0
ata-CT1000MX500SSD1_2213E6223DFB ONLINE 0 0 0

errors: No known data errors

So I am not sure if disabling NCQ is the best way for me. Is there anything else I should do? I have already contacted Minisforum support...
 
Same here... with the same device and the same SSDs :-( Already on PVE 8.0.3 with kernel 6.2.16-3. Any suggestions?
 
For the MX500, the Samsung 870 EVO, and any other non-datacenter SSD: have you checked your SMART status?
ZFS wears out consumer SSDs like these.
Many threads mention it, and many people replace them every year, depending on usage.
 
The whole device is one week old, and the errors appeared as soon as I started using the SSD ZFS RAID.
SMART is "fine", just transfer errors (UDMA).
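
To track whether the transfer errors keep growing, the raw value of the CRC attribute can be pulled out of the `smartctl -A` output. A rough sketch, assuming the drive reports it as attribute ID 199 (common, but not guaranteed for every model); `crc_error_count` is a made-up name:

```shell
# Print the raw value of SMART attribute 199 (interface CRC error count)
# from `smartctl -A` output on stdin.
# Assumption: the drive exposes the counter as ID 199 and the raw value is
# the 10th column. crc_error_count is a hypothetical helper name.
crc_error_count() {
    awk '$1 == 199 { print $10 }'
}

# Usage:
#   smartctl -A /dev/sda | crc_error_count
```

Running it before and after some heavy I/O shows whether the counter is still increasing.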
 
Could you please provide logs from a period when some I/O occurs?
If the error is identical to the one encountered with Samsung SSDs, the noncq fix should work.
 
Yes, thanks for the logs. If that's not enough, you could try `libata.force=2.00:noncq,2.00:3.0G`

BTW, I too have an HM80 as a home server running Proxmox, but I only use the NVMe drive. Great performance and stability for the power consumption ;)
 
Thanks for your quick reply. Where do I have to put these kernel options? So far I have only used this command:
echo "1" > /sys/block/sdX/device/queue_depth

Yeah, basically the HM80 is fantastic. It's my temporary server for moving from other bare-metal servers to virtualisation. If the support were a little better, I would buy many more of them. But this is what they replied to my error report:

Hi ,
Thanks for contacting Minisforum Support .
We are sorry that we can not provide you with Proxmov related advice, we are a computer hardware manufacturer, at present, can only provide Windows system basic services, other system software related issues, please consult the software official, if you suspect hardware failure, please install the windows system, easy to judge and confirm, thank you for your understanding and support.
 
Ok, so you hot-disabled queuing. This won't persist after a reboot.

You'll have to edit /etc/default/grub and the "GRUB_CMDLINE_LINUX_DEFAULT=" line.
Example
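
A minimal sketch of such a line, assuming the stock `quiet` option is already present and using the options suggested earlier for the second port (both are assumptions, adjust them to your setup):

```shell
# /etc/default/grub (shell-syntax file). Assumed starting point: the stock value "quiet".
# 2.00 addresses port ata2, device 0; change it to the port that logs the errors.
GRUB_CMDLINE_LINUX_DEFAULT="quiet libata.force=2.00:noncq,2.00:3.0G"
```

After the change is applied and the host rebooted, `cat /proc/cmdline` should show the new option.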


Then read on here https://pve.proxmox.com/wiki/Host_Bootloader to apply these changes.
Holy sh*t. This works. One 1 TB NVMe and two 500 GB SSDs as a ZFS RAID1. You are a genius and I don't have to return the device. Thanks a lot for helping me.
 
I'm getting the same error on an AMD X570 UD motherboard, is this an AMD error?

Interestingly I don't have any issues on a cheap JMICRON SATA card... I swapped power cables, enclosures, and SATA cables, same error on the same SATA ports.

No errors on the same drive on different controllers (LSI and JMICRON).

Only the first two SATA ports produce issues; after swapping disks, the errors always stay on the first two ports.

Yes, I think I see the same issue on ports 1 and 2.

No CRC errors, no SMART errors, just SATA link errors.

The error seems very much like https://bugzilla.kernel.org/show_bug.cgi?id=201693 except I see it with 4TB WD Red Pro drives.
 

Attachments

  • log.txt