ATA errors in VMs on LVM

Hello, I have three Proxmox hosts, and some of the VMs on them are logging ATA errors:
Code:
Oct 19 22:54:07 hsotname kernel: [3699830.334837] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 19 22:54:07 hsotname kernel: [3699830.417547] ata1.00: failed command: WRITE DMA
Oct 19 22:54:07 hsotname kernel: [3699830.418078] ata1.00: cmd ca/00:08:28:22:80/00:00:00:00:00/e1 tag 0 dma 4096 out
Oct 19 22:54:07 hsotname kernel: [3699830.418078] res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Oct 19 22:54:07 hsotname kernel: [3699830.420693] ata1.00: status: { DRDY }
Oct 19 22:54:07 hsotname kernel: [3699830.424753] ata1: soft resetting link
Oct 19 22:54:07 hsotname kernel: [3699830.580393] ata1.01: NODEV after polling detection
Oct 19 22:54:07 hsotname kernel: [3699830.581356] ata1.00: configured for MWDMA2
Oct 19 22:54:07 hsotname kernel: [3699830.581360] ata1.00: device reported invalid CHS sector 0
Is this bad? What can I do about it?
 
Which OS do you run? Could this be related to the virtual CD-ROM?
 
No, two of these Proxmox hosts run on Debian 7 and one on Debian 8; I install plain Debian and then install Proxmox on top of it.
The VM OSes are also Debian 7 or 8.
 
Hi Proxmox Team.

We are getting the same errors on our Debian 8 machines.
The setup is qcow2 images with LVM inside the VM.

Code:
[So Dez 27 05:17:44 2015] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[So Dez 27 05:17:44 2015] ata1.00: failed command: WRITE DMA
[So Dez 27 05:17:44 2015] ata1.00: cmd ca/00:80:b8:4e:ce/00:00:00:00:00/eb tag 0 dma 65536 out res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[So Dez 27 05:17:44 2015] ata1.00: status: { DRDY }
[So Dez 27 05:17:44 2015] ata1: soft resetting link
[So Dez 27 05:17:45 2015] ata1.01: NODEV after polling detection
[So Dez 27 05:17:45 2015] ata1.00: configured for MWDMA2
[So Dez 27 05:17:45 2015] ata1.00: device reported invalid CHS sector 0
[So Dez 27 05:17:45 2015] ata1: EH complete

We have seen several behaviours after that:
- 9 times the VM stopped working and we had to press reset or reboot several times until it came back up
- 1 time we got a kernel panic afterwards

It does not seem to be a hardware defect, because the problem also occurs after migrating the VM to another node. The strange thing is that all Debian 7 VMs run fine; only the latest Debian 8 VMs get this error.

pveversion
Code:
proxmox-ve-2.6.32: 3.4-166 (running kernel: 2.6.32-43-pve)
pve-manager: 3.4-11 (running version: 3.4-11/6502936f)
pve-kernel-2.6.32-39-pve: 2.6.32-157
pve-kernel-2.6.32-37-pve: 2.6.32-150
pve-kernel-2.6.32-43-pve: 2.6.32-166
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-3
pve-cluster: 3.0-19
qemu-server: 3.4-6
pve-firmware: 1.1-5
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-34
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-14
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

pveperf
Code:
root@server14:/var/lib/vz# pveperf /var/lib/vz
CPU BOGOMIPS:      55203.24
REGEX/SECOND:      946969
HD SIZE:           2605.33 GB (/dev/mapper/pve-data)
BUFFERED READS:    130.43 MB/sec
AVERAGE SEEK TIME: 17.61 ms
FSYNCS/SECOND:     811.25
DNS EXT:           53.06 ms
DNS INT:           50.92 ms (xxx)

vmXXX.conf
Code:
#[hostname]
#[IP]
#
boot: cdn
bootdisk: ide0
cores: 2
ide0: local:110/vm-110-disk-1.qcow2,format=qcow2,cache=writethrough,size=201G
ide2: server16:iso/systemrescuecd-x86-4.6.1.iso,media=cdrom,size=459502K
memory: 4096
name: [FQDN]
net0: e1000=CE:C8:FE:B3:56:F8,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
smbios1: uuid=d5bc6275-b25a-4523-b927-0d0098a7cb74
sockets: 1

Hardware info
Code:
AMD Opteron(tm) Processor 6176 12 cores
Supermicro H8SGL
Adaptec 5405Z with ZMCP
2 x HGST HDN724030AL as RAID 1

All updated to the latest versions.

Does anyone have the same problems? Or has anyone solved it already?

kind regards
Michael
 
And again a total crash of a VM due to storage errors...
We need help, please. :(

Code:
kernel: [242495.848207] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: [242495.849075] ata1.00: failed command: FLUSH CACHE
kernel: [242495.849772] ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
kernel: [242495.849772]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
kernel: [242495.851831] ata1.00: status: { DRDY }
kernel: [242500.892182] ata1: link is slow to respond, please be patient (ready=0)
kernel: [242505.876134] ata1: device not ready (errno=-16), forcing hardreset
kernel: [242505.876246] ata1: soft resetting link
kernel: [242506.033244] ata1.00: configured for MWDMA2
kernel: [242506.033252] ata1.00: retrying FLUSH 0xe7 Emask 0x4
kernel: [242506.033620] ata1.00: device reported invalid CHS sector 0
kernel: [242506.033632] ata1: EH complete
kernel: [255097.832155] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: [255097.833034] ata1.00: failed command: FLUSH CACHE
kernel: [255097.833744] ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
kernel: [255097.833744]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
kernel: [255097.835810] ata1.00: status: { DRDY }
kernel: [255102.876126] ata1: link is slow to respond, please be patient (ready=0)
kernel: [255107.860130] ata1: device not ready (errno=-16), forcing hardreset
kernel: [255107.860153] ata1: soft resetting link
kernel: [255108.017093] ata1.00: configured for MWDMA2
kernel: [255108.017113] ata1.00: retrying FLUSH 0xe7 Emask 0x4
kernel: [255108.017537] ata1.00: device reported invalid CHS sector 0
kernel: [255108.017550] ata1: EH complete
kernel: [309438.824333] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: [309438.825198] ata1.00: failed command: FLUSH CACHE
kernel: [309438.825921] ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
kernel: [309438.825921]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
kernel: [309438.827996] ata1.00: status: { DRDY }
kernel: [309443.868140] ata1: link is slow to respond, please be patient (ready=0)
kernel: [309448.852147] ata1: device not ready (errno=-16), forcing hardreset
kernel: [309448.852175] ata1: soft resetting link
kernel: [309449.009123] ata1.00: configured for MWDMA2
kernel: [309449.009129] ata1.00: retrying FLUSH 0xe7 Emask 0x4
kernel: [309449.009532] ata1.00: device reported invalid CHS sector 0
kernel: [309449.009545] ata1: EH complete

This happens only inside the VM. There are no errors on the host system. :(

kind regards
Michael
 
I have similar issues. They happen intermittently, apparently when there is a communication failure with the disk (.raw) on the NFS server or when RAM usage on the Proxmox host is high. Sometimes the VM is not damaged, sometimes it goes read-only, but a few times the system crashed and the install, the disk partition table and the partitions got corrupted. In the case below, the system stayed OK after the failure (no network issues to the NFS server were detected; I believe it was a Proxmox host problem with high RAM usage, nearly 90%). I need a method to protect the VM disks in these cases, to avoid the risk of damaging them.


Code:
# more /etc/debian_version
7.1




Aug 12 19:06:57 drive kernel: [1886649.976714] ata3: hard resetting link
Aug 12 19:07:07 drive kernel: [1886659.488315] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Aug 12 19:07:07 drive kernel: [1886659.496664] ata3.00: configured for UDMA/100
Aug 12 19:07:07 drive kernel: [1886659.496664] ata3.00: device reported invalid CHS sector 0
Aug 12 19:07:07 drive kernel: [1886659.496664] ata3: EH complete
Aug 12 19:07:59 drive kernel: [1886711.779164] ata3: hard resetting link
Aug 12 19:08:09 drive kernel: [1886721.948317] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Aug 12 19:08:09 drive kernel: [1886721.949215] ata3.00: configured for UDMA/100
Aug 12 19:08:09 drive kernel: [1886721.949215] ata3.00: device reported invalid CHS sector 0
Aug 12 19:08:09 drive kernel: [1886721.949215] ata3: EH complete

...

Aug 18 04:55:46 drive kernel: [2353978.632080] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Aug 18 04:55:46 drive kernel: [2353978.635310] ata3.00: configured for UDMA/100
Aug 18 04:55:46 drive kernel: [2353978.635310] ata3.00: device reported invalid CHS sector 0
Aug 18 04:55:46 drive kernel: [2353978.635310] ata3.00: device reported invalid CHS sector 0
Aug 18 04:55:46 drive kernel: [2353978.635310] ata3.00: device reported invalid CHS sector 0
Aug 18 04:55:46 drive kernel: [2353978.635310] ata3.00: device reported invalid CHS sector 0
...
Aug 18 04:55:46 drive kernel: [2353978.635310] ata3.00: device reported invalid CHS sector 0
Aug 18 04:55:46 drive kernel: [2353978.635310] ata3.00: device reported invalid CHS sector 0
Aug 18 04:55:46 drive kernel: [2353978.635310] ata3.00: device reported invalid CHS sector 0
Aug 18 04:55:46 drive kernel: [2353978.635310] ata3.00: device reported invalid CHS sector 0
Aug 18 04:55:46 drive kernel: [2353978.635310] ata3: EH complete
Aug 18 05:00:31 drive kernel: [2354263.845084] ata3: hard resetting link
Aug 18 05:05:31 drive kernel: [2354520.808093] Modules linked in: nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc loop snd_pcm snd_page_alloc snd_timer i2c_piix4 snd soundcore i2c_core processor psmouse evdev serio_raw joydev pcspkr thermal_sys button ext4 crc16 jbd2 mbcache usbhid hid sd_mod crc_t10dif sg sr_mod cdrom uhci_hcd ehci_hcd usbcore ata_generic ahci libahci e1000 floppy usb_common ata_piix libata scsi_mod [last unloaded: scsi_wait_scan]
...
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa003e911>] ? sata_link_resume+0x57/0x132 [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa0042533>] ? sata_link_hardreset+0x101/0x1bd [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa0049007>] ? ata_eh_reset+0x3ed/0x9bf [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa00499ac>] ? ata_eh_recover+0x2c6/0xfde [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa004262d>] ? sata_std_hardreset+0x3e/0x3e [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa004e848>] ? sata_pmp_error_handler+0x9d/0x7ef [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa004a991>] ? ata_scsi_port_error_handler+0x232/0x53d [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa0046dab>] ? ata_scsi_cmd_error_handler+0xdd/0x116 [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa004ad28>] ? ata_scsi_error+0x8c/0xb5 [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa003e911>] ? sata_link_resume+0x57/0x132 [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa0042533>] ? sata_link_hardreset+0x101/0x1bd [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa0049007>] ? ata_eh_reset+0x3ed/0x9bf [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa00499ac>] ? ata_eh_recover+0x2c6/0xfde [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa004262d>] ? sata_std_hardreset+0x3e/0x3e [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa004e848>] ? sata_pmp_error_handler+0x9d/0x7ef [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa004a991>] ? ata_scsi_port_error_handler+0x232/0x53d [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa0046dab>] ? ata_scsi_cmd_error_handler+0xdd/0x116 [libata]
Aug 18 05:05:31 drive kernel: [2354263.856067]  [<ffffffffa004ad28>] ? ata_scsi_error+0x8c/0xb5 [libata]
...
Aug 18 05:05:31 drive kernel: [2354563.492100] Modules linked in: nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc loop snd_pcm snd_page_alloc snd_timer i2c_piix4 snd soundcore i2c_core processor psmouse evdev serio_raw joydev pcspkr thermal_sys button ext4 crc16 jbd2 mbcache usbhid hid sd_mod crc_t10dif sg sr_mod cdrom uhci_hcd ehci_hcd usbcore ata_generic ahci libahci e1000 floppy usb_common ata_piix libata scsi_mod [last unloaded: scsi_wait_scan]
Aug 18 05:05:31 drive kernel: [2354563.721396] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Aug 18 05:05:31 drive kernel: [2354563.724761] ata3.00: configured for UDMA/100
Aug 18 05:05:31 drive kernel: [2354563.724761] ata3.00: device reported invalid CHS sector 0
Aug 18 05:05:31 drive kernel: [2354563.724761] ata3.00: device reported invalid CHS sector 0
Aug 18 05:05:31 drive kernel: [2354563.724761] ata3.00: device reported invalid CHS sector 0
Aug 18 05:05:31 drive kernel: [2354563.724761] ata3: EH complete


In another case, a VM with Debian 8.5, the system crashed completely; the VM stayed online with read-only partitions. When I rebooted the VM, the partition table was lost, and the partitions too, so I had to reinstall the system. How can this be solved? Or at least, how can the disks be kept from getting corrupted in such cases?

Regards,
André
 
This exact issue is still current, even in Proxmox 4.4.
However, the workaround of moving the Debian VM off of LVM storage does 'fix' the problem (a sketch follows below).
I went from a RAID 5 to two mirrored sets to be able to create different types of storage for VMs.
As this only seems to happen in a very specific scenario, I do not think Proxmox (or Debian) will ever look into it.
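
(For reference, moving a disk off LVM onto a directory-backed storage can be done with qm move_disk; a minimal sketch, assuming the VM ID, disk name and target storage below are only placeholders.)
Code:
# move the VM's disk from LVM storage to the 'local' directory storage as qcow2
# (110, ide0 and local are examples - adjust to your setup)
qm move_disk 110 ide0 local --format qcow2
# add --delete to drop the old LVM volume once the VM boots fine from the copy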
 
Hi zedicus, all,
Do you know if LVM underneath a filesystem (as opposed to it accessed directly by proxmox) also causes this? I think i have observed the same with a Ubuntu VM after reinstalling to 5beta and changing storage setup as below. Basically, changing too many things at once to easily point the finger at any one thing :).
From
4.2
Swraid/md0 /luks /ext4/mountpoint/VM

To
5beta
Hwraid/sdb/luks/lvm vg /lv/ext4/mountpoint/VM

(Where i noticed the error)
Overnight I've just moved to what i "thought" as a saner:

Hwraid/sdb/luks/lvm vg/VM

Will see if that helps, but from your post i fear not. so I might try some different guest drive controller options tomorrow (SCSI ->SATA) for science. I had been hoping to move to
Lv/drbd/lvm VG/VM But that's an even more lvm based setup!

Anyone reading who has found workarounds or common causes for this kind of behaviour in general?
 
I only tested with LVM handed directly to Proxmox. However, I tested that THOROUGHLY and tried changing everything from the type of disk the VM saw to the physical hard drives the LVM sat on, and while some things seemed to help, the problem was always there. The weird part is that if the LVM was fairly close to idle, the Debian VM would not show the issue. It was only when the LVM set started getting activity (like restoring a backup VM to it) that the Debian VM sitting on the LVM would start having issues, and eventually the Debian VM would become inoperable.
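
(For reference, the kind of activity that triggered it could be as simple as restoring a backup onto the same LVM storage; a hedged example, with the archive name, VM ID and storage name as placeholders.)
Code:
# restore a backup onto the LVM storage that also hosts the running Debian VM
qmrestore /var/lib/vz/dump/vzdump-qemu-110-....vma.lzo 999 --storage lvm-data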

I did have some issues with 5 beta, but I did not test enough to say for sure that it was the same issue as this. As soon as I started seeing oddities in 5 beta, I rolled back to the 4.x branch, as I was on a timeline but did want to at least see the changes in 5 beta.

My workaround was a small, totally separate disk with a filesystem, mounted in Proxmox as a folder: hand it to Proxmox as a directory and install my Debian VMs on it. I understand this is not ideal, and this disk has ZERO redundancy as opposed to the hardware RAID 5 the LVM is on, but as I said, my timeline was ending and I had to do SOMETHING. Hopefully, now that the system is up, I can continue testing on LVM and develop a migration plan for the older Debian installs I have.
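
(A rough sketch of that kind of setup, assuming /dev/sdc1 and the paths below are only placeholders: format the spare disk, mount it, and register it with Proxmox as a directory storage.)
Code:
# format and mount the spare disk (partition it first; /dev/sdc1 is an example)
mkfs.ext4 /dev/sdc1
mkdir -p /mnt/vmdir
echo '/dev/sdc1 /mnt/vmdir ext4 defaults 0 2' >> /etc/fstab
mount /mnt/vmdir
# add it to Proxmox as a directory storage for VM images
pvesm add dir vmdir --path /mnt/vmdir --content images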
 
So I finally got around to diagnosing this, though it probably does not matter to most people at this point. It IS still possible to trigger the issue even on an up-to-date 4.4, so here is the what and why.

As stated above, ATA errors (not always even the same ATA error) are generated in VMs on direct LVM. No noticeable improvement comes from changing the cache mode or the kernel the VM uses, and I even messed with some host settings before getting to the root of the issue. If enough load is placed on the drives hosting the LVM, and the VM uses a SATA driver to connect to that LVM, the VM sees the hard drive as slow or non-responsive and generates the same ATA errors you would see on a physical PC with a bad/loose data cable or some other intermittent issue.

I know: why would you ever use LVM and set the guest OS to SATA or anything other than VirtIO? Well, on a restore it requires extra steps, not to mention that some kernels (ancient ones by today's standards) are not great with VirtIO. On SOME guest systems an IDE device can be set and will be usable (WinXP I was able to make work well on LVM by assigning the disk as IDE).

Also, using VirtIO does not actually fix the issue; however, the guest VirtIO driver is capable of handling the soft hang and will just generate a task timeout error. That is OK, as it will not destroy the filesystem the way hammering Linux with ATA errors can.

Also, if you have LVM configured, have assigned SATA disks to VMs, and have no issues, it isn't because you got a magic install or because I did something wrong; it is just that you have not generated enough traffic to push a guest into timeout status across its virtual bus. I had to come up with a test pattern that put the host and a VM into 100% write on the same set of disks at the same time for extended periods (the fastest I saw errors was after 15 minutes, the longest just over 2 hours) to reliably reproduce this issue.
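
(A minimal sketch of that kind of parallel write load, assuming plain dd is enough and that the paths and sizes below are only placeholders; run the first command on the host and the second inside the guest at the same time.)
Code:
# on the host, writing to the disks that back the LVM (adjust the path to your setup)
dd if=/dev/zero of=/var/lib/vz/host-load.bin bs=1M count=50000 oflag=direct
# inside the guest, at the same time
dd if=/dev/zero of=/root/guest-load.bin bs=1M count=50000 conv=fsync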

Long story short: if you have LVM storage connected directly to the host, anything sitting on it must use VirtIO. If you need VMs to use SATA, IDE, or some other disk type, then you need to create a filesystem somewhere and hand it through to Proxmox as a directory. Hopefully someone still finds this useful.
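
(In terms of the VM config shown earlier in the thread, that means a virtio disk entry instead of ide0/sata0; a hedged example, with the storage and volume names as placeholders.)
Code:
# vmXXX.conf - disk on LVM attached via VirtIO instead of IDE/SATA (names are examples)
bootdisk: virtio0
virtio0: lvm-storage:vm-110-disk-1,cache=none,size=201G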

(Note: the posts above make it seem as if ONLY Debian is affected. I can show that MOST OSes will have an issue, but Debian had the distinct ability to present the user with far more noticeable problems than Windows, or even other versions of Linux and BSD.)
 
Good analysis!
Did the WRITE DMA commands timing out make the kernel switch the associated mount to read-only (necessitating a reboot afterwards)?
Or did the VM just hang in your case?
 
No, it never caused a switch to read-only in the guest. The issue would slowly degrade the guest filesystem until the guest would no longer boot.

I was curious whether adding Ceph or even ZFS would be a solution; however, I felt it would just add unnecessary layers, so I never looked into them as a solution to this issue.
 
