Kernel errors: failed command: WRITE FPDMA QUEUED

SheridansNL

New Member
Jul 30, 2023
Hello.

I recently bought 4 new drives (WDC WD60EFRX-68C5ZN0) for a virtual NAS solution on my server.
My server setup is also new and fully updated.

My setup:
  • ASRock Rack X570D4U-2L2T/BCM
  • AMD 4750G
  • MSI GTX 1050 Ti
  • 2x Samsung 980 Pro
  • 3x Samsung 870 EVO
  • 4x WDC WD60EFRX-68C5ZN0

I tried TrueNAS: errors.
Synology DSM: errors.
To test, I mounted one drive to a Windows 11 VM and ran some read/write tests there: errors again.

I get the following error messages on all 4 drives:

Code:
[14719.891148] ata4.00: exception Emask 0x0 SAct 0x200280 SErr 0xd0000 action 0x6 frozen
[14719.891173] ata4: SError: { PHYRdyChg CommWake 10B8B }
[14719.891186] ata4.00: failed command: READ FPDMA QUEUED
[14719.891197] ata4.00: cmd 60/08:38:a0:0c:5b/00:00:00:00:00/40 tag 7 ncq dma 4096 in
                        res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[14719.891227] ata4.00: status: { DRDY }
[14719.891237] ata4.00: failed command: WRITE FPDMA QUEUED
[14719.891249] ata4.00: cmd 61/00:48:60:7d:e9/08:00:00:00:00/40 tag 9 ncq dma 1048576 out
                        res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14719.891279] ata4.00: status: { DRDY }
[14719.891288] ata4.00: failed command: WRITE FPDMA QUEUED
[14719.891299] ata4.00: cmd 61/e8:a8:68:3f:59/00:00:00:00:00/40 tag 21 ncq dma 118784 out
                        res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[14719.891334] ata4.00: status: { DRDY }
[14719.891348] ata4: hard resetting link
[14720.375129] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[14720.378655] ata4.00: configured for UDMA/133
[14720.378725] ata4: EH complete

smartctl gives a pass.

I mounted the drive directly in PVE (as a directory, ext4), up/downloaded some amount of data, and installed and ran an Ubuntu VM on it.
I thought maybe it was an issue with the passthrough, but:
Code:
[16486.041224] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0xd0000 action 0x6 frozen
[16486.041255] ata4: SError: { PHYRdyChg CommWake 10B8B }
[16486.041268] ata4.00: failed command: FLUSH CACHE EXT
[16486.041280] ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 28
                        res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[16486.041319] ata4.00: status: { DRDY }
[16486.041330] ata4: hard resetting link
[16486.517818] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[16486.521549] ata4.00: configured for UDMA/133
[16486.521553] ata4.00: retrying FLUSH 0xea Emask 0x4
[16486.521726] ata4: EH complete
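In case it helps anyone reproduce this: a quick way to pull just these libata error-handler events out of the kernel log is a grep like the one below. The exact pattern is only a sketch, adjust it to your own log lines.

```shell
# Surface only libata error-handler events from the kernel log.
# Needs access to the kernel ring buffer (run as root on most distros).
# "|| true" keeps the exit status clean when no errors are present;
# swap "dmesg" for "dmesg --follow" to watch live.
dmesg 2>/dev/null | grep -E \
  'ata[0-9]+(\.[0-9]+)?: (exception|failed command|hard resetting link|EH complete)' \
  || true
```

If the errors are intermittent, leaving the `--follow` variant running in a tmux session while a scrub or sync hammers the drives makes it easy to catch the exact moment a link drops.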

I'm getting a bit desperate here and hope someone can help me resolve this issue.
 
I have the same situation with now 11 drives (3 of which aren't even needed, but I purchased them trying to solve the problem) and have been trying to solve it for months. They appear as drive interface errors and typically occur when the system is under heavy I/O load, e.g. SnapRAID syncing or scrubbing. Typically these errors happen over and over and kill any connectivity with the affected drive (multiple drives, including 4 brand new ones).

Code:
[Fri Aug  4 13:03:24 2023] ata3.00: exception Emask 0x11 SAct 0x400 SErr 0x0 action 0x6 frozen
[Fri Aug  4 13:03:24 2023] ata3.00: irq_stat 0x48000008, interface fatal error
[Fri Aug  4 13:03:24 2023] ata3.00: failed command: READ FPDMA QUEUED
[Fri Aug  4 13:03:24 2023] ata3.00: cmd 60/00:50:00:36:e1/02:00:17:00:00/40 tag 10 ncq dma 262144 in
                                    res 40/00:00:00:36:e1/00:00:17:00:00/40 Emask 0x10 (ATA bus error)
[Fri Aug  4 13:03:24 2023] ata3.00: status: { DRDY }
[Fri Aug  4 13:03:24 2023] ata3: hard resetting link
[Fri Aug  4 13:03:24 2023] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Fri Aug  4 13:03:24 2023] ata3.00: supports DRM functions and may not be fully accessible
[Fri Aug  4 13:03:25 2023] ata3.00: supports DRM functions and may not be fully accessible
[Fri Aug  4 13:03:25 2023] ata3.00: configured for UDMA/133
[Fri Aug  4 13:03:25 2023] sd 3:0:0:0: [sdb] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Fri Aug  4 13:03:25 2023] sd 3:0:0:0: [sdb] tag#10 Sense Key : Illegal Request [current]
[Fri Aug  4 13:03:25 2023] sd 3:0:0:0: [sdb] tag#10 Add. Sense: Unaligned write command
[Fri Aug  4 13:03:25 2023] sd 3:0:0:0: [sdb] tag#10 CDB: Read(16) 88 00 00 00 00 00 17 e1 36 00 00 00 02 00 00 00
[Fri Aug  4 13:03:25 2023] blk_update_request: I/O error, dev sdb, sector 400635392 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0
[Fri Aug  4 13:03:25 2023] ata3: EH complete

Things I've already tried:
- Various PVE Kernels, 5.x, 6.x
- Changed BIOS Settings
- Tried different mainboard firmware
- Swapped out and then bought new hard drive cables
- PCIe SATA controller passthroughs changed to individual drive passthroughs
- Swapped mainboard SATA with PCI-e SATA addon cards
- Swapped mainboard SATA with M1015 HBA + RES2SV240 SAS Expander cards
- Memtest86 (individual DIMMs, paired DIMMs, all DIMM configurations)
- Swapped power supply
- Swapped mainboard with another of same model (want ECC support)

Sometimes it can be a few days before the errors begin to occur, which leads me to feel I've found a solution only to later have the errors re-appear. As you were mentioning getting desperate, I'm getting ready to give up and move the storage functionality of the server back to non-virtualized hardware. Sad because it defeats the purpose of moving everything to one server and making it a powerful one.
 
What kind and brand of drives are you running?
How are they connected? (hot-swap bays or some other way)

Can you describe your setup?
Please also mention the brand and model of your server case.
Maybe there are some similarities..
 
I'd tried going bare metal with a fresh Ubuntu LTS install and was still getting constant errors. I tried using an older CPU and mainboard in a different case (with the drives still mounted in the Proxmox case) and I was still getting errors!

Having ruled out Proxmox (sorry for the rant above), I took it a step further, and moved the drives from the Proxmox system to the 5-in-3 hotswap bays of the old server. After swapping out two drives (not a good start), SnapRAID is now restoring those drives, reading from all of the other drives, without any errors for many hours now.

This is when I finally saw my chief suspect, the sexy but evil power splitters below. I found a listing on Amazon with a similar splitter where a recent review said that the cable was giving them server errors that took a long time to figure out (and they were very annoyed). I checked and the wires are only 22 AWG thickness!

These were right there in my face the whole time and I never noticed them, let alone thought to question them. The SnapRAID Fix will be running for at least another 24 hours. I'd recommend checking your setup for something small but very important like this. I can't believe all of the technical things I tried only for it to be something this basic the whole time.

Previously, where I had said I'd swapped in a different power supply, of course I only attached it through these same splitters, leaving the problem in the loop. When all the drives are spun up and working hard (SnapRAID scrub), the power draw is higher than these thin wires can deliver (even though a single 12 V rail is rated for the load), causing certain drives to malfunction.
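For anyone curious, the back-of-envelope math looks roughly like this. The numbers are assumptions pulled from typical datasheets and wire-gauge tables, not measurements of my setup:

```python
# Rough 12 V power-budget check for a 4-way SATA power splitter.
# Assumptions (check your own drive datasheets and cable specs):
#   - each 3.5" drive pulls ~1.8 A on the 12 V rail during spin-up
#   - a single 22 AWG wire is conservatively good for ~3 A of
#     continuous power transmission
drives_per_splitter = 4
spinup_amps_per_drive = 1.8   # assumed 12 V spin-up peak per drive
wire_rating_amps = 3.0        # assumed conservative 22 AWG rating

peak_draw = drives_per_splitter * spinup_amps_per_drive
print(f"peak 12 V draw through splitter: {peak_draw:.1f} A "
      f"(wire rated ~{wire_rating_amps:.1f} A)")
# More than double the wire's rating: enough voltage sag under load
# to make drives brown out and drop their SATA links.
```

So even though the PSU itself has plenty of headroom, the splitter wire is the bottleneck the moment several drives spin up or seek at once.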

Fingers crossed and good luck to you.
 

Attachments

  • evil splitter.jpg
THANK YOU so, so much. I made an account just to thank you. I was breaking my head until I saw your post!!!
Thanks again.
 
