Failed command: WRITE FPDMA QUEUED in VMs

cosminidis

Jul 2, 2022
Hello, I've had this issue for more than six months: a couple of Linux VMs log this error during or shortly after backup (Proxmox Backup Server, kept up to date).
kern.log on the VM shows:
Jul 22 08:00:44 vubuntu3-srv kernel: [ 146.627177] loop6: detected capacity change from 0 to 8
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.317041] ata3.00: exception Emask 0x0 SAct 0xe0 SErr 0x0 action 0x6 frozen
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.317242] ata3.00: failed command: WRITE FPDMA QUEUED
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.317399] ata3.00: cmd 61/08:28:80:a8:a1/00:00:03:00:00/40 tag 5 ncq dma 4096 out
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.317399] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.317713] ata3.00: status: { DRDY }
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.317827] ata3.00: failed command: WRITE FPDMA QUEUED
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.317945] ata3.00: cmd 61/10:30:00:a9:a1/00:00:03:00:00/40 tag 6 ncq dma 8192 out
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.317945] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.318191] ata3.00: status: { DRDY }
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.318320] ata3.00: failed command: WRITE FPDMA QUEUED
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.318460] ata3.00: cmd 61/08:38:00:a9:a2/00:00:03:00:00/40 tag 7 ncq dma 4096 out
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.318460] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.318707] ata3.00: status: { DRDY }
Jul 22 08:10:27 vubuntu3-srv kernel: [ 726.318868] ata3: hard resetting link
Jul 22 08:10:27 vubuntu3-srv kernel: [ 730.132665] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 4291693351 wd_nsec: 4291691291
Jul 22 08:10:27 vubuntu3-srv kernel: [ 730.446145] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jul 22 08:10:27 vubuntu3-srv kernel: [ 730.446950] ata3.00: configured for UDMA/100
Jul 22 08:10:27 vubuntu3-srv kernel: [ 730.446972] ata3.00: device reported invalid CHS sector 0
Jul 22 08:10:27 vubuntu3-srv kernel: [ 730.446977] ata3.00: device reported invalid CHS sector 0
Jul 22 08:10:27 vubuntu3-srv kernel: [ 730.446979] ata3.00: device reported invalid CHS sector 0
Jul 22 08:10:27 vubuntu3-srv kernel: [ 730.447000] ata3: EH complete
Jul 22 08:15:15 vubuntu3-srv kernel: [ 1017.592300] ata3.00: exception Emask 0x0 SAct 0x8000000 SErr 0x0 action 0x6 frozen
Jul 22 08:15:15 vubuntu3-srv kernel: [ 1017.592536] ata3.00: failed command: WRITE FPDMA QUEUED
Jul 22 08:15:15 vubuntu3-srv kernel: [ 1017.592655] ata3.00: cmd 61/10:d8:b8:e7:65/00:00:03:00:00/40 tag 27 ncq dma 8192 out
Jul 22 08:15:15 vubuntu3-srv kernel: [ 1017.592655] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 22 08:15:15 vubuntu3-srv kernel: [ 1017.592969] ata3.00: status: { DRDY }
Jul 22 08:15:15 vubuntu3-srv kernel: [ 1017.593118] ata3: hard resetting link
Jul 22 08:15:15 vubuntu3-srv kernel: [ 1017.930565] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jul 22 08:15:15 vubuntu3-srv kernel: [ 1017.931303] ata3.00: configured for UDMA/100
Jul 22 08:15:15 vubuntu3-srv kernel: [ 1017.931320] ata3.00: device reported invalid CHS sector 0
Jul 22 08:15:15 vubuntu3-srv kernel: [ 1017.931337] ata3: EH complete
Jul 22 08:16:01 vubuntu3-srv kernel: [ 1063.647412] ata3.00: exception Emask 0x0 SAct 0x8000 SErr 0x0 action 0x6 frozen
Jul 22 08:16:01 vubuntu3-srv kernel: [ 1063.647584] ata3.00: failed command: WRITE FPDMA QUEUED
Jul 22 08:16:01 vubuntu3-srv kernel: [ 1063.647754] ata3.00: cmd 61/10:78:f0:e7:65/00:00:03:00:00/40 tag 15 ncq dma 8192 out
Jul 22 08:16:01 vubuntu3-srv kernel: [ 1063.647754] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 22 08:16:01 vubuntu3-srv kernel: [ 1063.648021] ata3.00: status: { DRDY }
Jul 22 08:16:01 vubuntu3-srv kernel: [ 1063.648257] ata3: hard resetting link
Jul 22 08:16:01 vubuntu3-srv kernel: [ 1064.022547] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jul 22 08:16:01 vubuntu3-srv kernel: [ 1064.023182] ata3.00: configured for UDMA/100
Jul 22 08:16:01 vubuntu3-srv kernel: [ 1064.023195] ata3.00: device reported invalid CHS sector 0
Jul 22 08:16:01 vubuntu3-srv kernel: [ 1064.023217] ata3: EH complete
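
For anyone reading along, the fields in these lines can be decoded to see which queued writes timed out. The sketch below is only my reading of the libata log layout (command/feature:nsect:lbal:lbam:lbah/hob fields/device) and it assumes 512-byte logical sectors; both assumptions are at least consistent with the "tag" and "dma" values printed in the same lines. The function names are made up for illustration.

```python
import re

def sact_to_tags(sact: int) -> list[int]:
    """Return the NCQ tag numbers whose bits are set in the SAct bitmask."""
    return [bit for bit in range(32) if sact & (1 << bit)]

# Assumed layout of the libata "cmd" line (hex bytes):
#   cmd/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
CMD_RE = re.compile(
    r"cmd (?P<cmd>[0-9a-f]{2})/(?P<feature>[0-9a-f]{2}):(?P<nsect>[0-9a-f]{2}):"
    r"(?P<lbal>[0-9a-f]{2}):(?P<lbam>[0-9a-f]{2}):(?P<lbah>[0-9a-f]{2})/"
    r"(?P<hob_feature>[0-9a-f]{2}):(?P<hob_nsect>[0-9a-f]{2}):"
    r"(?P<hob_lbal>[0-9a-f]{2}):(?P<hob_lbam>[0-9a-f]{2}):(?P<hob_lbah>[0-9a-f]{2})/"
)

def decode_ncq_cmd(line: str):
    """Decode an NCQ 'cmd' log line into opcode, tag, size and LBA (or None)."""
    m = CMD_RE.search(line)
    if m is None:
        return None
    f = {name: int(value, 16) for name, value in m.groupdict().items()}
    # For NCQ commands the sector count travels in the feature fields
    # and the tag sits in the upper five bits of nsect.
    sectors = (f["hob_feature"] << 8) | f["feature"]
    lba = (f["lbal"] | f["lbam"] << 8 | f["lbah"] << 16
           | f["hob_lbal"] << 24 | f["hob_lbam"] << 32 | f["hob_lbah"] << 40)
    return {
        "opcode": f["cmd"],
        "tag": f["nsect"] >> 3,
        "sectors": sectors,
        "bytes": sectors * 512,  # assumes 512-byte logical sectors
        "lba": lba,
    }
```

With the values from the log, `sact_to_tags(0xe0)` gives tags 5, 6 and 7, which matches the three failed commands in the first exception, and the first cmd line decodes to a 4096-byte write with tag 5.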

This only happens when the VM disks are on a Compellent SC4020 iSCSI storage. If I move the disks to local storage, the VM is stable during backups. fsck doesn't show any errors, and neither does the iSCSI storage. None of the updates so far have fixed it; I'm on PVE 8.4.5 now.

I have multiple Linux and Windows VMs that run just fine on the iSCSI storage, with no backup-related issues. Only two Linux VMs show this behaviour, plus a Windows VM that resets (I think it's a BSOD), and only while they are backed by the iSCSI storage. Once moved to local storage, everything is OK.

So far I've tried:
- all the possible combinations of the VM config (VirtIO SCSI / LSI / even PVSCSI), Async IO set to threads / native / io_uring, SSD emulation, etc.
- net.ipv4.conf.all.arp_ignore = 1 and rp_filter = 1 in /etc/sysctl.conf on the host
- iSCSI header and data digests on and off on the SC4020

I'm out of ideas. Has anyone run into something similar? Any ideas on how to debug this issue?

Best regards,
 
No, I had never heard of this option before :-). Thank you for pointing it out. I'm setting this option on all the backup jobs and will see how things work out over the following days. I'll use a local thin LVM.
 
Ok, fleecing doesn't help. I have the exact same behaviour.
The connection to the iSCSI storage uses two NICs, both with IPs in the same network; each NIC goes to a different controller on the storage, inside the same fault domain. The two NICs are part of a Linux bridge (I tried Open vSwitch as well).
The thing is that all the other VMs behave normally, so I don't think it's an iSCSI configuration issue. Also, when moved to local storage, these particular VMs behave normally, so it doesn't look like a corrupt image.
I'm out of ideas.
 
When you back up the VM, are you using Snapshot mode? If you can, try Stop mode, and put the problematic VMs in their own separate backup job.

Are the PBS and the iSCSI storage on the same network / cable? When you get errors like that in-VM, it usually means the guest is having trouble completing write requests in a reasonable amount of time. That could be because of too much I/O over the same link, causing write delays to the iSCSI storage. It might also be that the backing storage on the iSCSI side is having disk issues.
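
To put numbers on those write delays, something like the following could be run inside the affected VM while a backup is in progress. It is only a rough sketch: the function names, the 4 KiB block size and the 1-second threshold are my own arbitrary choices, not anything from Proxmox or the kernel.

```python
import os
import time

def probe_fsync_latency(path, samples=5, block_size=4096, interval=0.0):
    """Write one block and fsync it, `samples` times; return latencies in seconds."""
    latencies = []
    buf = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(samples):
            start = time.monotonic()
            os.pwrite(fd, buf, 0)   # overwrite the same block each time
            os.fsync(fd)            # force the write down to the virtual disk
            latencies.append(time.monotonic() - start)
            time.sleep(interval)
    finally:
        os.close(fd)
    return latencies

def slow_samples(latencies, threshold=1.0):
    """Return (index, latency) pairs for samples slower than `threshold` seconds."""
    return [(i, lat) for i, lat in enumerate(latencies) if lat > threshold]
```

Calling `probe_fsync_latency` on a file that lives on the affected disk, with a large `samples` value and `interval=1.0`, and comparing the output of `slow_samples` against the timestamps of the ata3 exceptions would show whether the guest sees multi-second stalls before the link reset.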

Another option you can try is an in-VM backup to e.g. Veeam, instead of an "outside" backup to PBS.
 
The backup mode is indeed configured as Snapshot. I don't want to use Suspend or Stop, as I need those servers to work even during the night.
This is a production server, so I cannot power the VMs off before backup. Also, it doesn't answer my question: why does only that VM behave like that? I have other Ubuntu VMs, very similar to this one, which don't have this problem.
iSCSI has its own links, while the backup goes on a different NIC. There are no disk issues on the iSCSI box, it's all updated to the latest SCOS version and no errors in the logs.
Veeam is expensive, and I like how Proxmox Backup Server has support for offsite Remotes, so I want to stay on PBS.

I wonder if my issues could be related to the fact that both NICs used for iSCSI have IPs in the same subnet (direct connection, without a switch). The iSCSI storage demands IPs in the same subnet on both controllers for the fault domain, so I cannot change that. And still - why only those VMs, and none other?
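
As a quick sanity check of the same-subnet situation, the overlap can be confirmed from the configured addresses. With two NICs in one subnet, Linux can answer ARP for either address on either interface, which is the behaviour the arp_ignore setting is meant to restrict. This is only a sketch; the interface names and addresses in the usage note are placeholders, not my real configuration.

```python
import ipaddress
from itertools import combinations

def shared_subnets(interfaces):
    """Return (nic_a, nic_b, network) for every pair of NICs whose subnets overlap.

    `interfaces` maps an interface name to an address in CIDR form,
    e.g. {"nic0": "192.0.2.10/24"} (hypothetical values).
    """
    nets = {name: ipaddress.ip_interface(addr).network
            for name, addr in interfaces.items()}
    return [
        (a, b, str(nets[a]))
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]
```

For example, `shared_subnets({"nic0": "192.0.2.10/24", "nic1": "192.0.2.11/24"})` reports the pair as sharing 192.0.2.0/24, while NICs in different subnets produce an empty list.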