SATA errors and VM freezes

Mar 2, 2018
I am just getting started with Proxmox and we are running into a show-stopping problem. We have an older but very powerful server (HP ML350p G8) with 64GB of RAM and two 10-core Xeon processors. We are using three 3TB hard disks in RAID 5 for storage (local-lvm).

We had a couple of VMs running on a headless VirtualBox installation that were converted for use on Proxmox. The VMs work fine, but there are issues whenever we do backups or anything else that loads the hard disks. We start getting ATA errors like:

[ 379.635347] ata3.00: status: { DRDY }
[ 379.635391] ata3: hard resetting link
[ 403.235946] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 403.236169] ata3.00: configured for UDMA/100
[ 403.236172] ata3.00: retrying FLUSH 0xea Emask 0x4
[ 403.236258] ata3.00: device reported invalid CHS sector 0
[ 403.236269] ata3: EH complete

This happens on both virtual machines, and only while doing something like a backup, cloning a VM or converting a disk image (anything that generates heavy disk I/O). What can be done to prevent this? As it is, we cannot do anything without slowing down or freezing the VMs. A simple backup should not bring everything down. Or is this a limitation of LVM? Any pointers? Thanks.
 
Hi,

do you get this error in the guest or on the host?
If you get it in the guest, please send the config of that guest.
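For example, the output of the following for that VM ID should be enough:

Code:
qm config <vmid>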
 
Sorry, I did not make that clear. The only two guests I am running at the moment are the ones hanging and giving those errors. They are running CentOS 7 and were previously running on VirtualBox. Here is the config:

ID: 102
agent: 1
boot: cdn
bootdisk: sata0
cores: 4
ide2: none,media=cdrom
memory: 4096
name: Dev
net0: e1000=08:00:27:36:AB:FB,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
sata0: local-lvm:vm-102-disk-1,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=c9c45bd9-4d82-44f0-a37a-ac86343c74a5
sockets: 1

ID: 103
agent: 1
boot: cdn
bootdisk: sata0
cores: 8
ide2: none,media=cdrom
memory: 8192
name: Prod
net0: e1000=08:00:27:C0:22:CC,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
sata0: local-lvm:vm-103-disk-1,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=56c42b87-998f-479f-8501-e37e876fa45f
sockets: 1

Those are the only two VMs running on the server at the moment. I am trying to import another one, but as I mentioned, I start getting those SATA errors and hangs when I run qm importdisk or any other process that causes a lot of disk activity on the local-lvm storage (like a backup of one of the guests).
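For reference, the import command I am running is basically this (the source path here is just a placeholder):

Code:
qm importdisk <vmid> /path/to/exported-disk.vmdk local-lvm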
 
Please use the SCSI bus with VirtIO (virtio-scsi) instead of the SATA bus.
The network card will also give you more performance if you use VirtIO.
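With the VM shut down, something along these lines should work to move the existing disk and NIC over (adjust the VM ID, volume name and MAC address to your setup):

Code:
qm set 102 --delete sata0
qm set 102 --scsihw virtio-scsi-pci
qm set 102 --scsi0 local-lvm:vm-102-disk-1
qm set 102 --bootdisk scsi0
qm set 102 --net0 virtio=08:00:27:36:AB:FB,bridge=vmbr0

Depending on the guest, you may also need to rebuild its initramfs so the virtio modules are included before it will boot from the new bus.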
 
OK, I found the procedure and made the changes. Now both VMs are using VirtIO for both SCSI and network. I started a clone job to test it out and I get this on one of the VMs:
[ 480.499434] INFO: task xfsaild/dm-0:439 blocked for more than 120 seconds.
[ 480.499492] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 480.499537] xfsaild/dm-0 D ffff88017c8adee0 0 439 2 0x00000000
[ 480.499540] Call Trace:
[ 480.499548] [<ffffffff8109affe>] ? try_to_del_timer_sync+0x5e/0x90
[ 480.499552] [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 480.499590] [<ffffffffc015ed46>] _xfs_log_force+0x1c6/0x2c0 [xfs]
[ 480.499593] [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[ 480.499606] [<ffffffffc016abcc>] ? xfsaild+0x16c/0x6f0 [xfs]
[ 480.499617] [<ffffffffc015ee6c>] xfs_log_force+0x2c/0x70 [xfs]
[ 480.499628] [<ffffffffc016aa60>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[ 480.499637] [<ffffffffc016abcc>] xfsaild+0x16c/0x6f0 [xfs]
[ 480.499647] [<ffffffffc016aa60>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[ 480.499650] [<ffffffff810b270f>] kthread+0xcf/0xe0
[ 480.499652] [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[ 480.499654] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[ 480.499656] [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[ 480.499666] INFO: task mysqld:1568 blocked for more than 120 seconds.
[ 480.499707] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 480.499753] mysqld D ffff8800bba05ee0 0 1568 1000 0x00000080
[ 480.499755] Call Trace:
[ 480.499759] [<ffffffff812fc6d3>] ? blk_peek_request+0x83/0x280
[ 480.499761] [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 480.499763] [<ffffffff816a92b9>] schedule_timeout+0x239/0x2c0
[ 480.499765] [<ffffffff812f96a3>] ? __blk_run_queue+0x33/0x40
[ 480.499768] [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
[ 480.499769] [<ffffffff816aae2d>] io_schedule_timeout+0xad/0x130
[ 480.499771] [<ffffffff816abedd>] wait_for_completion_io+0xfd/0x140
[ 480.499773] [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[ 480.499775] [<ffffffff812fe704>] blkdev_issue_flush+0xb4/0x110
[ 480.499787] [<ffffffffc01575a9>] xfs_blkdev_issue_flush+0x19/0x20 [xfs]
[ 480.499797] [<ffffffffc013fd10>] xfs_file_fsync+0x1b0/0x1e0 [xfs]
[ 480.499801] [<ffffffff81235765>] do_fsync+0x65/0xa0
[ 480.499802] [<ffffffff816b8930>] ? system_call_after_swapgs+0x15d/0x214
[ 480.499804] [<ffffffff81235a63>] SyS_fdatasync+0x13/0x20
[ 480.499805] [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[ 480.499807] [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
[ 480.499813] INFO: task bash:1812 blocked for more than 120 seconds.
[ 480.499853] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 480.499898] bash D ffff8802345d8fd0 0 1812 1806 0x00000084
[ 480.499900] Call Trace:
[ 480.499902] [<ffffffff812f9769>] ? queue_unplugged+0x39/0xa0
[ 480.499903] [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 480.499905] [<ffffffff816a92b9>] schedule_timeout+0x239/0x2c0
[ 480.499906] [<ffffffff812fcff4>] ? blk_finish_plug+0x14/0x40
[ 480.499917] [<ffffffffc013a3e4>] ? _xfs_buf_ioapply+0x334/0x460 [xfs]
[ 480.499919] [<ffffffff816abc5d>] wait_for_completion+0xfd/0x140
[ 480.499920] [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[ 480.499931] [<ffffffffc013c1e3>] ? _xfs_buf_read+0x23/0x40 [xfs]
[ 480.499940] [<ffffffffc013c0d6>] xfs_buf_submit_wait+0xe6/0x1d0 [xfs]
[ 480.499951] [<ffffffffc016be79>] ? xfs_trans_read_buf_map+0x199/0x400 [xfs]
[ 480.499961] [<ffffffffc013c1e3>] _xfs_buf_read+0x23/0x40 [xfs]
[ 480.499970] [<ffffffffc013c2f7>] xfs_buf_read_map+0xf7/0x160 [xfs]
[ 480.499980] [<ffffffffc016be79>] xfs_trans_read_buf_map+0x199/0x400 [xfs]
[ 480.499990] [<ffffffffc012fdee>] xfs_imap_to_bp+0x6e/0xf0 [xfs]
[ 480.500000] [<ffffffffc013056e>] xfs_iread+0x7e/0x340 [xfs]
[ 480.500003] [<ffffffff8121e22c>] ? inode_init_always+0x11c/0x1f0
[ 480.500014] [<ffffffffc0145736>] xfs_iget+0x316/0x7e0 [xfs]
[ 480.500017] [<ffffffff811a2f65>] ? kvfree+0x35/0x40
[ 480.500028] [<ffffffffc014e4f7>] xfs_lookup+0xf7/0x140 [xfs]
[ 480.500038] [<ffffffffc014b14b>] xfs_vn_lookup+0x7b/0xd0 [xfs]
[ 480.500040] [<ffffffff8120d52d>] lookup_real+0x1d/0x50
[ 480.500042] [<ffffffff8120de02>] __lookup_hash+0x42/0x60
[ 480.500045] [<ffffffff816a3602>] lookup_slow+0x42/0xa7
[ 480.500047] [<ffffffff812113ab>] path_lookupat+0x77b/0x7b0
[ 480.500049] [<ffffffff811e1675>] ? kmem_cache_alloc+0x35/0x1e0
[ 480.500051] [<ffffffff81213def>] ? getname_flags+0x4f/0x1a0
[ 480.500053] [<ffffffff8121140b>] filename_lookup+0x2b/0xc0
[ 480.500054] [<ffffffff81214f87>] user_path_at_empty+0x67/0xc0
[ 480.500058] [<ffffffff81116a02>] ? from_kgid_munged+0x12/0x20
[ 480.500060] [<ffffffff8120892f>] ? cp_new_stat+0x14f/0x180
[ 480.500062] [<ffffffff81214ff1>] user_path_at+0x11/0x20
[ 480.500063] [<ffffffff81208423>] vfs_fstatat+0x63/0xc0
[ 480.500065] [<ffffffff8120898e>] SYSC_newstat+0x2e/0x60
[ 480.500067] [<ffffffff816b895a>] ? system_call_after_swapgs+0x187/0x214
[ 480.500068] [<ffffffff816b8953>] ? system_call_after_swapgs+0x180/0x214
[ 480.500070] [<ffffffff816b894c>] ? system_call_after_swapgs+0x179/0x214
[ 480.500071] [<ffffffff816b8945>] ? system_call_after_swapgs+0x172/0x214
[ 480.500073] [<ffffffff816b893e>] ? system_call_after_swapgs+0x16b/0x214
[ 480.500074] [<ffffffff816b8937>] ? system_call_after_swapgs+0x164/0x214
[ 480.500076] [<ffffffff816b8930>] ? system_call_after_swapgs+0x15d/0x214
[ 480.500077] [<ffffffff816b8929>] ? system_call_after_swapgs+0x156/0x214
[ 480.500079] [<ffffffff816b8922>] ? system_call_after_swapgs+0x14f/0x214
[ 480.500080] [<ffffffff816b891b>] ? system_call_after_swapgs+0x148/0x214
[ 480.500082] [<ffffffff81208c6e>] SyS_newstat+0xe/0x10
[ 480.500084] [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[ 480.500086] [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
 
I also noticed that both VMs go over 100% CPU usage during the clone even though nothing inside them is really using that much; normally both machines rarely go above 50%.

VM103 (8 cores)
10:00:47 up 16 min, 1 user, load average: 5.93, 4.41, 2.47
VM102 (4 cores)
10:01:25 up 20 min, 1 user, load average: 3.31, 3.84, 2.16
 
Once the clone job ended, everything went back to normal. As it stands, Proxmox would be unusable in a production environment whenever you need to back up, clone, or do anything that loads the LVM volume.
 
That is strange, because CentOS 7 has VirtIO support enabled in the kernel.
Can you install from a current CentOS 7 ISO with VirtIO and test again?
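You can also check inside the guest that the VirtIO drivers are actually present and loaded, for example:

Code:
grep -i virtio /boot/config-$(uname -r)
lsmod | grep virtio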
 
Same result with a new installation:
ID: 104
agent: 1
bootdisk: scsi0
cores: 4
ide2: none,media=cdrom
memory: 2048
name: Carlos
net0: virtio=02:0F:86:EB:78:5D,bridge=vmbr0
numa: 0
ostype: l26
scsi0: local-lvm:vm-104-disk-2,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=a4f962ad-e947-437e-bbc7-a430e015097d
sockets: 1

Here is what I got when I tried to clone another VM:
Mar 7 16:43:34 carlos kernel: INFO: task xfsaild/dm-0:422 blocked for more than 120 seconds.
Mar 7 16:43:34 carlos kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 7 16:43:34 carlos kernel: xfsaild/dm-0 D ffff880035b09fa0 0 422 2 0x00000000
Mar 7 16:43:34 carlos kernel: Call Trace:
Mar 7 16:43:34 carlos kernel: [<ffffffff8109affe>] ? try_to_del_timer_sync+0x5e/0x90
Mar 7 16:43:34 carlos kernel: [<ffffffff816ab8a9>] schedule+0x29/0x70
Mar 7 16:43:34 carlos kernel: [<ffffffffc0243d46>] _xfs_log_force+0x1c6/0x2c0 [xfs]
Mar 7 16:43:34 carlos kernel: [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
Mar 7 16:43:34 carlos kernel: [<ffffffffc024fbcc>] ? xfsaild+0x16c/0x6f0 [xfs]
Mar 7 16:43:34 carlos kernel: [<ffffffffc0243e6c>] xfs_log_force+0x2c/0x70 [xfs]
Mar 7 16:43:34 carlos kernel: [<ffffffffc024fa60>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Mar 7 16:43:34 carlos kernel: [<ffffffffc024fbcc>] xfsaild+0x16c/0x6f0 [xfs]
Mar 7 16:43:34 carlos kernel: [<ffffffffc024fa60>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Mar 7 16:43:34 carlos kernel: [<ffffffff810b270f>] kthread+0xcf/0xe0
Mar 7 16:43:34 carlos kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
Mar 7 16:43:34 carlos kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90
Mar 7 16:43:34 carlos kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40

The imported VMs are now running with VirtIO SCSI and Net, and they also hang. The only thing that went away was the SATA errors.
 
I just noticed this on the PVE server:
Mar 7 09:52:50 pve kernel: [166143.648716] kvm D 0 16363 1 0x00000000
Mar 7 09:52:50 pve kernel: [166143.648719] Call Trace:
Mar 7 09:52:50 pve kernel: [166143.648726] __schedule+0x3e0/0x870
Mar 7 09:52:50 pve kernel: [166143.648728] schedule+0x36/0x80
Mar 7 09:52:50 pve kernel: [166143.648731] schedule_timeout+0x1dd/0x360
Mar 7 09:52:50 pve kernel: [166143.648734] ? dm_make_request+0x74/0xc0
Mar 7 09:52:50 pve kernel: [166143.648735] io_schedule_timeout+0x1e/0x50
Mar 7 09:52:50 pve kernel: [166143.648736] ? io_schedule_timeout+0x1e/0x50
Mar 7 09:52:50 pve kernel: [166143.648738] wait_for_completion_io+0xb4/0x140
Mar 7 09:52:50 pve kernel: [166143.648740] ? wake_up_q+0x80/0x80
Mar 7 09:52:50 pve kernel: [166143.648744] submit_bio_wait+0x68/0x90
Mar 7 09:52:50 pve kernel: [166143.648747] blkdev_issue_flush+0x5c/0x90
Mar 7 09:52:50 pve kernel: [166143.648749] blkdev_fsync+0x35/0x50
Mar 7 09:52:50 pve kernel: [166143.648752] vfs_fsync_range+0x4e/0xb0
Mar 7 09:52:50 pve kernel: [166143.648753] do_fsync+0x3d/0x70
Mar 7 09:52:50 pve kernel: [166143.648754] SyS_fdatasync+0x13/0x20
Mar 7 09:52:50 pve kernel: [166143.648757] entry_SYSCALL_64_fastpath+0x24/0xab
Mar 7 09:52:50 pve kernel: [166143.648758] RIP: 0033:0x7fa2347e763d
Mar 7 09:52:50 pve kernel: [166143.648759] RSP: 002b:00007fa1207fc5f0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Mar 7 09:52:50 pve kernel: [166143.648760] RAX: ffffffffffffffda RBX: 00007fa1207ff700 RCX: 00007fa2347e763d
Mar 7 09:52:50 pve kernel: [166143.648761] RDX: 00007fa2290bce50 RSI: 00005620a7a7a788 RDI: 0000000000000015
Mar 7 09:52:50 pve kernel: [166143.648762] RBP: 00007fff6373c990 R08: 0000000000000000 R09: 00007fa1207ff700
Mar 7 09:52:50 pve kernel: [166143.648762] R10: 00007fa1207fc620 R11: 0000000000000293 R12: 0000000000000000
Mar 7 09:52:50 pve kernel: [166143.648763] R13: 00007fff6373c8ff R14: 00007fa1207ff700 R15: 00007fa229033a00

I guess this points to the problem being on the server? Any ideas where to look?
 
Can you send me the output of

Code:
pveversion -v
 
root@pve:~# pveversion -v
proxmox-ve: 5.1-41 (running kernel: 4.13.13-6-pve)
pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
pve-kernel-4.13.13-6-pve: 4.13.13-41
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-common-perl: 5.0-28
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-17
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-2
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-11
pve-cluster: 5.0-20
pve-container: 2.0-19
pve-docs: 5.1-16
pve-firewall: 3.0-5
pve-firmware: 2.0-3
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.9.1-9
pve-xtermjs: 1.0-2
qemu-server: 5.0-22
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.6-pve1~bpo9
 
Can you tell me which RAID controller you are using?
In general, RAID 5 is not recommended for virtualization because it is slow.
But that should not normally produce such errors.
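You can see which controller is installed with lspci, and pveperf gives a rough idea of how many fsyncs per second the storage can sustain (point it at a directory on that array, for example /var/lib/vz):

Code:
lspci | grep -i raid
pveperf /var/lib/vz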
 
Do you think this may be a performance issue with the array controller? I am a bit new to virtualization, so I did not consider the performance penalty of a RAID 5 configuration. What would you recommend for this particular server? We have three 3TB hard disks.

I would have to reinstall PVE to make any changes to the storage configuration. Does the server ID change? If so, how do I get a new key?
 
I found critical firmware updates for that series on the HPE website.
But I am not 100% sure, because they use the internal part number.
So please update the firmware.

What would you recommend for this particular server?
RAID 10 or RAID 1 is recommended.