ata exception and failed write command WRITE DMA EXT

merasil · Mar 9, 2020

Hi there,

I recently noticed a problem with my home server. When I logged onto my IPMI KVM, I was greeted by a lot of ATA exceptions and failed commands.

Code:

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.3.10-1-pve root=/dev/mapper/pve-root ro quiet libata.force=noncq
[    0.109971] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.3.10-1-pve root=/dev/mapper/pve-root ro quiet libata.force=noncq
[    0.178635] Memory: 32700520K/33443260K available (14339K kernel code, 2396K rwdata, 4848K rodata, 2664K init, 5048K bss, 742740K reserved, 0K cma-reserved)
[    0.626359] libata version 3.00 loaded.
[    1.393373] acpi_cpufreq: overriding BIOS provided _PSD data
[    1.418341] Write protecting the kernel read-only data: 22528k
[    1.548981] ata1: SATA max UDMA/133 abar m4096@0xefc02000 port 0xefc02100 irq 52
[    1.548982] ata2: SATA max UDMA/133 abar m4096@0xefc02000 port 0xefc02180 irq 53
[    1.548983] ata3: SATA max UDMA/133 abar m4096@0xefc02000 port 0xefc02200 irq 54
[    1.548985] ata4: SATA max UDMA/133 abar m4096@0xefc02000 port 0xefc02280 irq 55
[    1.548986] ata5: SATA max UDMA/133 abar m4096@0xefc02000 port 0xefc02300 irq 56
[    1.548987] ata6: SATA max UDMA/133 abar m4096@0xefc02000 port 0xefc02380 irq 57
[    1.548989] ata7: SATA max UDMA/133 abar m4096@0xefc02000 port 0xefc02400 irq 58
[    1.548990] ata8: SATA max UDMA/133 abar m4096@0xefc02000 port 0xefc02480 irq 59
[    1.861106] ata7: SATA link down (SStatus 0 SControl 300)
[    1.861356] ata6: SATA link down (SStatus 0 SControl 300)
[    1.861640] ata8: SATA link down (SStatus 0 SControl 300)
[    2.022250] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.022268] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.022346] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.022361] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.022378] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.022645] ata3.00: FORCE: horkage modified (noncq)
[    2.022669] ata2.00: FORCE: horkage modified (noncq)
[    2.022729] ata3.00: ATA-9: WDC WD10EFRX-68FYTN0, 82.00A82, max UDMA/133
[    2.022731] ata3.00: 1953525168 sectors, multi 16: LBA48 NCQ (not used)
[    2.022755] ata2.00: ATA-9: WDC WD10EFRX-68FYTN0, 82.00A82, max UDMA/133
[    2.022756] ata2.00: 1953525168 sectors, multi 16: LBA48 NCQ (not used)
[    2.023120] ata4.00: FORCE: horkage modified (noncq)
[    2.023187] ata4.00: ATA-8: WDC WD10EFRX-68JCSN0, 01.01A01, max UDMA/133
[    2.023188] ata4.00: 1953525168 sectors, multi 16: LBA48 NCQ (not used)
[    2.023224] ata1.00: FORCE: horkage modified (noncq)
[    2.023237] ata3.00: configured for UDMA/133
[    2.023273] ata2.00: configured for UDMA/133
[    2.023287] ata1.00: ATA-8: WDC WD10EFRX-68JCSN0, 01.01A01, max UDMA/133
[    2.023289] ata1.00: 1953525168 sectors, multi 16: LBA48 NCQ (not used)
[    2.023802] ata5.00: FORCE: horkage modified (noncq)
[    2.023879] ata5.00: supports DRM functions and may not be fully accessible
[    2.023880] ata5.00: ATA-9: Samsung SSD 850 EVO M.2 250GB, EMT21B6Q, max UDMA/133
[    2.023881] ata5.00: 488397168 sectors, multi 1: LBA48 NCQ (not used)
[    2.024024] ata4.00: configured for UDMA/133
[    2.024144] ata1.00: configured for UDMA/133
[    2.026084] ata5.00: supports DRM functions and may not be fully accessible
[    2.026880] ata5.00: configured for UDMA/133
[    5.373771] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
[ 2053.836204] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
[ 2053.836231] ata3.00: irq_stat 0x08000000, interface fatal error
[ 2053.836248] ata3: SError: { Handshk }
[ 2053.836260] ata3.00: failed command: WRITE DMA EXT
[ 2053.836276] ata3.00: cmd 35/00:00:00:90:bc/00:0a:0e:00:00/e0 tag 21 dma 1310720 out
[ 2053.836314] ata3.00: status: { DRDY }
[ 2053.836327] ata3: hard resetting link
[ 2054.312186] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 2054.313186] ata3.00: configured for UDMA/133
[ 2054.313204] ata3: EH complete
[ 3754.882265] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
[ 3754.882295] ata3.00: irq_stat 0x08000000, interface fatal error
[ 3754.882311] ata3: SError: { Handshk }
[ 3754.882323] ata3.00: failed command: WRITE DMA EXT
[ 3754.882339] ata3.00: cmd 35/00:00:c0:cd:09/00:0a:0f:00:00/e0 tag 2 dma 1310720 out
[ 3754.882378] ata3.00: status: { DRDY }
[ 3754.882391] ata3: hard resetting link
[ 3755.358268] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 3755.359289] ata3.00: configured for UDMA/133
[ 3755.359303] ata3: EH complete
[ 3826.353826] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
[ 3826.353857] ata3.00: irq_stat 0x08000000, interface fatal error
[ 3826.353875] ata3: SError: { Handshk }
[ 3826.353887] ata3.00: failed command: WRITE DMA EXT
[ 3826.353904] ata3.00: cmd 35/00:00:80:7c:4c/00:0a:0f:00:00/e0 tag 20 dma 1310720 out
[ 3826.353944] ata3.00: status: { DRDY }
[ 3826.353958] ata3: hard resetting link
[ 3826.829822] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 3826.830842] ata3.00: configured for UDMA/133
[ 3826.830856] ata3: EH complete
[ 3877.329530] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
[ 3877.329561] ata3.00: irq_stat 0x08000000, interface fatal error
[ 3877.329579] ata3: SError: { Handshk }
[ 3877.329591] ata3.00: failed command: WRITE DMA EXT
[ 3877.329608] ata3.00: cmd 35/00:00:80:16:7a/00:0a:0f:00:00/e0 tag 2 dma 1310720 out
[ 3877.329647] ata3.00: status: { DRDY }
[ 3877.329660] ata3: hard resetting link
[ 3877.805536] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 3877.806514] ata3.00: configured for UDMA/133
[ 3877.806528] ata3: EH complete
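
The `cmd` line in each report can be read field by field. Below is a minimal sketch in Python, assuming the field order libata uses when it prints a taskfile (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and the SError bit positions from the SATA specification. The function names and the short bit table are illustrative only and not part of any library.

```python
import re

# Subset of SError bits (positions per the SATA specification). The labels are
# descriptive, not the kernel's abbreviations; bit 22 is printed as "Handshk".
SERR_BITS = {
    16: "PhyRdy change",
    19: "10b-to-8b decode error",
    20: "disparity error",
    21: "CRC error",
    22: "handshake error",
    23: "link sequence error",
}

# Twelve two-digit hex fields separated by '/' or ':'.
TASKFILE_RE = re.compile(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_taskfile(line):
    """Decode the 'cmd xx/xx:...' part of a libata error line (LBA48 layout assumed)."""
    match = TASKFILE_RE.search(line)
    if match is None:
        raise ValueError("no taskfile found in line")
    fields = [int(x, 16) for x in re.split(r"[/:]", match.group(1))]
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, device) = fields
    # The "hob" (high order byte) fields carry the upper half of each value.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    sectors = hob_nsect << 8 | nsect
    if sectors == 0:
        sectors = 65536  # in LBA48 commands a count of 0 means 65536 sectors
    return {"command": command, "lba": lba, "sectors": sectors, "device": device}


def decode_serror(value):
    """Return descriptions of the known SError bits set in value."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


first = decode_taskfile(
    "ata3.00: cmd 35/00:00:00:90:bc/00:0a:0e:00:00/e0 tag 21 dma 1310720 out")
print(hex(first["command"]), first["lba"], first["sectors"], first["sectors"] * 512)
print(decode_serror(0x400000))
```

For the first report this gives command 0x35 (WRITE DMA EXT) and 2560 sectors, i.e. 2560 x 512 = 1310720 bytes, which matches the `dma 1310720 out` in the log. SErr 0x400000 has only bit 22 set, the handshake bit that the kernel prints as `Handshk`.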

At first I disabled NCQ because I had read that it could be the problem, but that didn't change anything except the error message, which went from "failed command: WRITE FPDMA QUEUED" to the one above. I swapped my HDDs around inside the bay to check whether it's the HDD or the cable/mainboard. Nothing changed, so I assume it's the cable/mainboard. How can I determine which port is the root of the problem?
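
One way to approach the question of which port is affected is to tally the exception reports per `ataN` port. This is a minimal sketch, assuming the dmesg line format shown in the log above; `tally_ata_errors` and `SAMPLE` are made-up names for illustration.

```python
import re
from collections import Counter

# Matches e.g. "ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 ..."
EXCEPTION_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: exception Emask")
# Matches e.g. "ata3.00: failed command: WRITE DMA EXT"
FAILED_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: failed command: (.+?)\s*$")

SAMPLE = """\
[ 2053.836204] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
[ 2053.836260] ata3.00: failed command: WRITE DMA EXT
[ 2053.836327] ata3: hard resetting link
[ 3754.882265] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
[ 3754.882323] ata3.00: failed command: WRITE DMA EXT
"""


def tally_ata_errors(log_text):
    """Count exception reports per port and failed commands per (port, command)."""
    exceptions = Counter()
    commands = Counter()
    for line in log_text.splitlines():
        match = EXCEPTION_RE.search(line)
        if match:
            exceptions[match.group(1)] += 1
            continue
        match = FAILED_RE.search(line)
        if match:
            commands[(match.group(1), match.group(2))] += 1
    return exceptions, commands


exceptions, commands = tally_ata_errors(SAMPLE)
for port, count in exceptions.most_common():
    print(f"{port}: {count} exception(s)")
for (port, command), count in commands.most_common():
    print(f"{port}: {count} x {command}")
```

To use it on a live system, the text output of `dmesg` can be passed in as `log_text`. In the log posted above, all four exception reports are on ata3.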

apoc · Mar 10, 2020

First I would go ahead and run SMART checks (the long ones) on each individual drive to see what they report.
Sometimes devices are not in the SMART database, so the database might need updating, or you have to interpret the "raw values" yourself.
Otherwise it's probably very hard to find out.
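
The long self-tests could be started on all drives in one go. This is a sketch, assuming smartmontools is installed and that the four HDDs show up as /dev/sda to /dev/sdd (the device names are a guess; check with `lsblk`). `smartctl -t long` starts the extended self-test and `smartctl -a` prints the full report, including the self-test log, once the test has finished.

```python
import subprocess

# Assumed device names; adjust to the actual system.
DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]


def long_test_command(device):
    """Command line that starts an extended (long) SMART self-test."""
    return ["smartctl", "-t", "long", device]


def report_command(device):
    """Command line that prints all SMART information for a device."""
    return ["smartctl", "-a", device]


def run(command):
    """Run a command (needs root for smartctl) and return its text output."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


def start_long_tests(devices=DEVICES):
    """Start the long self-test on every device; the tests run inside the drives."""
    for device in devices:
        print(run(long_test_command(device)))


def print_reports(devices=DEVICES):
    """Print the SMART report of every device; call this after the tests are done."""
    for device in devices:
        print(run(report_command(device)))
```

Nothing runs on import; `start_long_tests()` and later `print_reports()` have to be called explicitly as root.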

merasil · Mar 11, 2020

So I ran the long SMART test on all drives: no errors found. dmesg only shows this error when I copy a large amount of data onto my file server.

apoc · Mar 11, 2020

What kind of storage configuration do you use?
Any detail might help. What is a "large amount of data"?
What happens if you create I/O load on the file server itself (e.g. copy files within the server)?
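
Local write load of the kind asked about here could be generated with a small script. This is a sketch and not a benchmark; the target directory, file counts and sizes are placeholders. Each file is fsync'ed so that the writes actually reach the disks instead of staying in the page cache.

```python
import os
import time


def write_small_files(target_dir, count, size_bytes):
    """Write `count` files of `size_bytes` each and return the elapsed seconds."""
    payload = os.urandom(size_bytes)
    start = time.monotonic()
    for index in range(count):
        path = os.path.join(target_dir, f"loadtest_{index:06d}.bin")
        with open(path, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # force the data out to the device
    return time.monotonic() - start


def write_large_file(target_dir, total_bytes, chunk_bytes=1024 * 1024):
    """Write one file of `total_bytes` in chunks and return the elapsed seconds."""
    chunk = os.urandom(chunk_bytes)
    path = os.path.join(target_dir, "loadtest_large.bin")
    written = 0
    start = time.monotonic()
    with open(path, "wb") as handle:
        while written < total_bytes:
            step = min(chunk_bytes, total_bytes - written)
            handle.write(chunk[:step])
            written += step
        handle.flush()
        os.fsync(handle.fileno())
    return time.monotonic() - start
```

Calling, for example, `write_small_files("/path/on/the/array", 20000, 256 * 1024)` while watching `dmesg -w` in a second terminal would show whether the exceptions also appear without any network traffic involved. The path and the numbers are hypothetical.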

merasil · Mar 11, 2020

I've got a Supermicro M11SDV-8C+-LN4F with 4 SATA ports, which are connected to a backplane (the case is a CFI-A7879v2; I don't know the backplane vendor) via 4 SATA cables. My HDDs are 4x 1TB WD Red.
Proxmox is installed on an M.2 SSD.
The HDDs are configured via mdadm as RAID0 (yeah I know, but I don't need redundancy).
By "large amount of data" I mean anything beyond 4 GB (or at least that is the only amount I have tried). It doesn't matter if it's an image file, a virtual hard disk or a large number of pictures. But I found out that the error happens more often when a lot of small files get copied (e.g. pictures).

apoc · Mar 11, 2020

So typically, when you do I/O to the disk, it goes to memory and the system cache first. It gets buffered there until a certain point, at which the system is forced to start destaging (writing the buffered data out to disk).
Small files are more I/O-intensive for the (slow) SATA spindles, so it is plausible to me that this triggers the behavior earlier.
My approach would be to change the file-system caching behavior first. If there is less cache, the system will destage earlier, which perhaps helps.
On the other hand, if the problem then shows up earlier, it is load-related (and that is always a challenge).
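
On Linux, the point at which buffered writes get destaged is governed by the `vm.dirty_*` sysctls. That these are the settings meant here is an assumption. The sketch below only reads the current values from /proc/sys/vm and changes nothing; the function name is made up for illustration.

```python
from pathlib import Path

# Writeback-related settings exposed under /proc/sys/vm on Linux.
KNOBS = (
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_background_bytes",
    "dirty_bytes",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_writeback_settings(base="/proc/sys/vm"):
    """Return the current value of each knob, or None if it cannot be read."""
    settings = {}
    for name in KNOBS:
        try:
            settings[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings


for name, value in read_writeback_settings().items():
    print(f"vm.{name} = {value}")
```

Lower values for `vm.dirty_background_ratio` and `vm.dirty_ratio` make the kernel start writing back earlier. They can be changed at runtime with `sysctl -w`, for example `sysctl -w vm.dirty_background_ratio=5` (the value 5 is only an example).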

Since the EPYC processors are fairly new (compared to all the Intel stuff), the problem might just go away over time as the drivers improve, including those in the Linux kernel.
BIOS updates are also worth looking for.
And finally BIOS settings. Typically you can choose between "legacy IDE" and "AHCI" for SATA: check which one is enabled and try the other. My gut feeling is to prefer AHCI. But who knows ...
Additionally, check your SATA cables and don't underestimate the power supply. A bad PSU can cause all sorts of trouble....

You could also try limiting the I/O (there is an I/O limiter in Proxmox). You can configure it on each individual virtual disk. It specifies read/write bandwidth limits, IOPS limits, maximum bursts etc. That is kind of a workaround, but if it gets you going...

merasil · Mar 14, 2020

So I replaced the SATA cables with new ones. I think the errors have gotten worse since then. I looked at the BIOS and there are no options to change the SATA behavior. I can enable/disable hotplugging, that's all.
So if I understand you correctly, there could be a hardware fault (mainboard/backplane/cables). What keeps me wondering is that I don't see any performance issues at all. If I hadn't seen the errors on the main screen, I would not have thought that there is a problem.

*EDIT* I found out that the error is only write-related: no matter how much data I copy from the server, the error does not appear.

apoc · Mar 14, 2020

That is indeed strange, and a reason why I use ZFS. It is far more sensitive to such problems; after all, it doesn't help to store data somewhere that has issues on the HW side.

Try searching the web for the error.
I found multiple references to HW-related faults, including controllers.
That would perhaps back my theory about the controller support.

merasil · Mar 14, 2020

Yeah, the thing is, I could RMA all that stuff, but they will charge me if there is no issue with that specific piece of hardware. So I should at least find out which piece (case/backplane or mainboard) it is :/

apoc · Mar 14, 2020

I understand. Try taking the backplane out of the equation by attaching the cables directly to the disks.
Perhaps it's the backplane...

Mulvak · Oct 27, 2020

I'm having the same issue. Did you ever resolve yours?

I'm also on an EPYC 7251 CPU, on an ASRock Rack EPYCD8-2T. There is one standard SATA 6G channel, ata3. It holds my boot drive, a brand-new Samsung EVO 800 series. The board also has two M.2 NVMe (Crucial) drives, which show no issues.

After reading this I tried a new PSU and changed the cables. Same result. I found this:
https://community.amd.com/thread/253334
which lists the exact same errors I am getting. So I am thinking it may be the Samsung drive, which apparently is a known issue (for the past 1.5+ years!): https://bugzilla.kernel.org/show_bug.cgi?id=201693

So I was wondering whether you fixed your issue and whether, by chance, you were using a Samsung drive?

As soon as I get a chance I'm going to replace the drive. Unfortunately it is in a system where I JUST got PCI passthrough working beautifully, so I need to make sure I don't mess that up.

I'll post back with my results of the drive change.
