SATA errors and VM freezes

Mar 2, 2018
I am just getting started with Proxmox and we are running into a show-stopping problem. We have an older but very powerful server (HP ML350p G8) with 64GB of RAM and two 10-core Xeon processors. We are using three 3TB hard disks in RAID 5 for storage (local-lvm).

We had a couple of VMs running on a headless VirtualBox installation that were converted for use on Proxmox. The VMs work fine, but there are issues whenever we do backups or anything else that loads the hard disks. We start getting ATA errors like:

[ 379.635347] ata3.00: status: { DRDY }
[ 379.635391] ata3: hard resetting link
[ 403.235946] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 403.236169] ata3.00: configured for UDMA/100
[ 403.236172] ata3.00: retrying FLUSH 0xea Emask 0x4
[ 403.236258] ata3.00: device reported invalid CHS sector 0
[ 403.236269] ata3: EH complete

This happens on both virtual machines, and only while doing something like a backup, cloning a VM or converting a disk image (anything that generates heavy disk I/O). What can be done to prevent this? As it is, we cannot do anything without slowing down or freezing the VMs. A simple backup should not bring everything down. Or is this a limitation of LVM? Any pointers? Thanks.
 
Hi,

do you get this error in the guest or on the host?
If you get it in the guest, please send the config of that guest.
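For example, the output of the following for that VM ID should be enough:

Code:
qm config <vmid>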
 
Sorry, I did not make that clear. The only two guests I am running at the moment are the ones hanging and giving those errors. They are running CentOS 7 and were previously running on VirtualBox. Here is the config:

ID: 102
agent: 1
boot: cdn
bootdisk: sata0
cores: 4
ide2: none,media=cdrom
memory: 4096
name: Dev
net0: e1000=08:00:27:36:AB:FB,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
sata0: local-lvm:vm-102-disk-1,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=c9c45bd9-4d82-44f0-a37a-ac86343c74a5
sockets: 1

ID: 103
agent: 1
boot: cdn
bootdisk: sata0
cores: 8
ide2: none,media=cdrom
memory: 8192
name: Prod
net0: e1000=08:00:27:C0:22:CC,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
sata0: local-lvm:vm-103-disk-1,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=56c42b87-998f-479f-8501-e37e876fa45f
sockets: 1

Those are the only two VMs running on the server at the moment. I am trying to import another one, but as I mentioned, I start getting those SATA errors and hangs when I run qm importdisk or any other process that causes a lot of disk activity on the local-lvm storage (like a backup of one of the guests).
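For reference, the import command I am running is basically this (the source path here is just a placeholder):

Code:
qm importdisk <vmid> /path/to/exported-disk.vmdk local-lvm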
 
Please use the SCSI bus with VirtIO (virtio-scsi) instead of the SATA bus.
The network card will also give you more performance if you use VirtIO.
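With the VM shut down, something along these lines should work to move the existing disk and NIC over (adjust the VM ID, volume name and MAC address to your setup):

Code:
qm set 102 --delete sata0
qm set 102 --scsihw virtio-scsi-pci
qm set 102 --scsi0 local-lvm:vm-102-disk-1
qm set 102 --bootdisk scsi0
qm set 102 --net0 virtio=08:00:27:36:AB:FB,bridge=vmbr0

Depending on the guest, you may also need to rebuild its initramfs so the virtio modules are included before it will boot from the new bus.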
 
OK, I found the procedure and made the changes. Now both VMs are using VirtIO for both SCSI and network. I started a clone job to test it out and I get this on one of the VMs:
[ 480.499434] INFO: task xfsaild/dm-0:439 blocked for more than 120 seconds.
[ 480.499492] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 480.499537] xfsaild/dm-0 D ffff88017c8adee0 0 439 2 0x00000000
[ 480.499540] Call Trace:
[ 480.499548] [<ffffffff8109affe>] ? try_to_del_timer_sync+0x5e/0x90
[ 480.499552] [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 480.499590] [<ffffffffc015ed46>] _xfs_log_force+0x1c6/0x2c0 [xfs]
[ 480.499593] [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[ 480.499606] [<ffffffffc016abcc>] ? xfsaild+0x16c/0x6f0 [xfs]
[ 480.499617] [<ffffffffc015ee6c>] xfs_log_force+0x2c/0x70 [xfs]
[ 480.499628] [<ffffffffc016aa60>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[ 480.499637] [<ffffffffc016abcc>] xfsaild+0x16c/0x6f0 [xfs]
[ 480.499647] [<ffffffffc016aa60>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[ 480.499650] [<ffffffff810b270f>] kthread+0xcf/0xe0
[ 480.499652] [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[ 480.499654] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[ 480.499656] [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[ 480.499666] INFO: task mysqld:1568 blocked for more than 120 seconds.
[ 480.499707] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 480.499753] mysqld D ffff8800bba05ee0 0 1568 1000 0x00000080
[ 480.499755] Call Trace:
[ 480.499759] [<ffffffff812fc6d3>] ? blk_peek_request+0x83/0x280
[ 480.499761] [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 480.499763] [<ffffffff816a92b9>] schedule_timeout+0x239/0x2c0
[ 480.499765] [<ffffffff812f96a3>] ? __blk_run_queue+0x33/0x40
[ 480.499768] [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
[ 480.499769] [<ffffffff816aae2d>] io_schedule_timeout+0xad/0x130
[ 480.499771] [<ffffffff816abedd>] wait_for_completion_io+0xfd/0x140
[ 480.499773] [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[ 480.499775] [<ffffffff812fe704>] blkdev_issue_flush+0xb4/0x110
[ 480.499787] [<ffffffffc01575a9>] xfs_blkdev_issue_flush+0x19/0x20 [xfs]
[ 480.499797] [<ffffffffc013fd10>] xfs_file_fsync+0x1b0/0x1e0 [xfs]
[ 480.499801] [<ffffffff81235765>] do_fsync+0x65/0xa0
[ 480.499802] [<ffffffff816b8930>] ? system_call_after_swapgs+0x15d/0x214
[ 480.499804] [<ffffffff81235a63>] SyS_fdatasync+0x13/0x20
[ 480.499805] [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[ 480.499807] [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
[ 480.499813] INFO: task bash:1812 blocked for more than 120 seconds.
[ 480.499853] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 480.499898] bash D ffff8802345d8fd0 0 1812 1806 0x00000084
[ 480.499900] Call Trace:
[ 480.499902] [<ffffffff812f9769>] ? queue_unplugged+0x39/0xa0
[ 480.499903] [<ffffffff816ab8a9>] schedule+0x29/0x70
[ 480.499905] [<ffffffff816a92b9>] schedule_timeout+0x239/0x2c0
[ 480.499906] [<ffffffff812fcff4>] ? blk_finish_plug+0x14/0x40
[ 480.499917] [<ffffffffc013a3e4>] ? _xfs_buf_ioapply+0x334/0x460 [xfs]
[ 480.499919] [<ffffffff816abc5d>] wait_for_completion+0xfd/0x140
[ 480.499920] [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[ 480.499931] [<ffffffffc013c1e3>] ? _xfs_buf_read+0x23/0x40 [xfs]
[ 480.499940] [<ffffffffc013c0d6>] xfs_buf_submit_wait+0xe6/0x1d0 [xfs]
[ 480.499951] [<ffffffffc016be79>] ? xfs_trans_read_buf_map+0x199/0x400 [xfs]
[ 480.499961] [<ffffffffc013c1e3>] _xfs_buf_read+0x23/0x40 [xfs]
[ 480.499970] [<ffffffffc013c2f7>] xfs_buf_read_map+0xf7/0x160 [xfs]
[ 480.499980] [<ffffffffc016be79>] xfs_trans_read_buf_map+0x199/0x400 [xfs]
[ 480.499990] [<ffffffffc012fdee>] xfs_imap_to_bp+0x6e/0xf0 [xfs]
[ 480.500000] [<ffffffffc013056e>] xfs_iread+0x7e/0x340 [xfs]
[ 480.500003] [<ffffffff8121e22c>] ? inode_init_always+0x11c/0x1f0
[ 480.500014] [<ffffffffc0145736>] xfs_iget+0x316/0x7e0 [xfs]
[ 480.500017] [<ffffffff811a2f65>] ? kvfree+0x35/0x40
[ 480.500028] [<ffffffffc014e4f7>] xfs_lookup+0xf7/0x140 [xfs]
[ 480.500038] [<ffffffffc014b14b>] xfs_vn_lookup+0x7b/0xd0 [xfs]
[ 480.500040] [<ffffffff8120d52d>] lookup_real+0x1d/0x50
[ 480.500042] [<ffffffff8120de02>] __lookup_hash+0x42/0x60
[ 480.500045] [<ffffffff816a3602>] lookup_slow+0x42/0xa7
[ 480.500047] [<ffffffff812113ab>] path_lookupat+0x77b/0x7b0
[ 480.500049] [<ffffffff811e1675>] ? kmem_cache_alloc+0x35/0x1e0
[ 480.500051] [<ffffffff81213def>] ? getname_flags+0x4f/0x1a0
[ 480.500053] [<ffffffff8121140b>] filename_lookup+0x2b/0xc0
[ 480.500054] [<ffffffff81214f87>] user_path_at_empty+0x67/0xc0
[ 480.500058] [<ffffffff81116a02>] ? from_kgid_munged+0x12/0x20
[ 480.500060] [<ffffffff8120892f>] ? cp_new_stat+0x14f/0x180
[ 480.500062] [<ffffffff81214ff1>] user_path_at+0x11/0x20
[ 480.500063] [<ffffffff81208423>] vfs_fstatat+0x63/0xc0
[ 480.500065] [<ffffffff8120898e>] SYSC_newstat+0x2e/0x60
[ 480.500067] [<ffffffff816b895a>] ? system_call_after_swapgs+0x187/0x214
[ 480.500068] [<ffffffff816b8953>] ? system_call_after_swapgs+0x180/0x214
[ 480.500070] [<ffffffff816b894c>] ? system_call_after_swapgs+0x179/0x214
[ 480.500071] [<ffffffff816b8945>] ? system_call_after_swapgs+0x172/0x214
[ 480.500073] [<ffffffff816b893e>] ? system_call_after_swapgs+0x16b/0x214
[ 480.500074] [<ffffffff816b8937>] ? system_call_after_swapgs+0x164/0x214
[ 480.500076] [<ffffffff816b8930>] ? system_call_after_swapgs+0x15d/0x214
[ 480.500077] [<ffffffff816b8929>] ? system_call_after_swapgs+0x156/0x214
[ 480.500079] [<ffffffff816b8922>] ? system_call_after_swapgs+0x14f/0x214
[ 480.500080] [<ffffffff816b891b>] ? system_call_after_swapgs+0x148/0x214
[ 480.500082] [<ffffffff81208c6e>] SyS_newstat+0xe/0x10
[ 480.500084] [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[ 480.500086] [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
 
I also noticed that both VMs go over 100% CPU usage during the clone even though nothing inside them is really using that much; normally both machines rarely go above 50%.

VM103 (8 cores)
10:00:47 up 16 min, 1 user, load average: 5.93, 4.41, 2.47
VM102 (4 cores)
10:01:25 up 20 min, 1 user, load average: 3.31, 3.84, 2.16
 
Once the clone job ended, everything went back to normal. As it stands, Proxmox would be unusable in a production environment whenever you need to back up, clone, or do anything that loads the LVM volume.
 
That is strange, because CentOS 7 has VirtIO support enabled in the kernel.
Can you install from a current CentOS 7 ISO with VirtIO and test again?
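You can also check inside the guest that the VirtIO drivers are actually present and loaded, for example:

Code:
grep -i virtio /boot/config-$(uname -r)
lsmod | grep virtio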
 
Same result with a new installation:
ID: 104
agent: 1
bootdisk: scsi0
cores: 4
ide2: none,media=cdrom
memory: 2048
name: Carlos
net0: virtio=02:0F:86:EB:78:5D,bridge=vmbr0
numa: 0
ostype: l26
scsi0: local-lvm:vm-104-disk-2,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=a4f962ad-e947-437e-bbc7-a430e015097d
sockets: 1

Here is what I got when I tried to clone another VM:
Mar 7 16:43:34 carlos kernel: INFO: task xfsaild/dm-0:422 blocked for more than 120 seconds.
Mar 7 16:43:34 carlos kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 7 16:43:34 carlos kernel: xfsaild/dm-0 D ffff880035b09fa0 0 422 2 0x00000000
Mar 7 16:43:34 carlos kernel: Call Trace:
Mar 7 16:43:34 carlos kernel: [<ffffffff8109affe>] ? try_to_del_timer_sync+0x5e/0x90
Mar 7 16:43:34 carlos kernel: [<ffffffff816ab8a9>] schedule+0x29/0x70
Mar 7 16:43:34 carlos kernel: [<ffffffffc0243d46>] _xfs_log_force+0x1c6/0x2c0 [xfs]
Mar 7 16:43:34 carlos kernel: [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
Mar 7 16:43:34 carlos kernel: [<ffffffffc024fbcc>] ? xfsaild+0x16c/0x6f0 [xfs]
Mar 7 16:43:34 carlos kernel: [<ffffffffc0243e6c>] xfs_log_force+0x2c/0x70 [xfs]
Mar 7 16:43:34 carlos kernel: [<ffffffffc024fa60>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Mar 7 16:43:34 carlos kernel: [<ffffffffc024fbcc>] xfsaild+0x16c/0x6f0 [xfs]
Mar 7 16:43:34 carlos kernel: [<ffffffffc024fa60>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Mar 7 16:43:34 carlos kernel: [<ffffffff810b270f>] kthread+0xcf/0xe0
Mar 7 16:43:34 carlos kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
Mar 7 16:43:34 carlos kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90
Mar 7 16:43:34 carlos kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40

The imported VMs are now running with VirtIO SCSI and Net, and they also hang. The only thing that went away was the SATA errors.
 
I just noticed this on the PVE server:
Mar 7 09:52:50 pve kernel: [166143.648716] kvm D 0 16363 1 0x00000000
Mar 7 09:52:50 pve kernel: [166143.648719] Call Trace:
Mar 7 09:52:50 pve kernel: [166143.648726] __schedule+0x3e0/0x870
Mar 7 09:52:50 pve kernel: [166143.648728] schedule+0x36/0x80
Mar 7 09:52:50 pve kernel: [166143.648731] schedule_timeout+0x1dd/0x360
Mar 7 09:52:50 pve kernel: [166143.648734] ? dm_make_request+0x74/0xc0
Mar 7 09:52:50 pve kernel: [166143.648735] io_schedule_timeout+0x1e/0x50
Mar 7 09:52:50 pve kernel: [166143.648736] ? io_schedule_timeout+0x1e/0x50
Mar 7 09:52:50 pve kernel: [166143.648738] wait_for_completion_io+0xb4/0x140
Mar 7 09:52:50 pve kernel: [166143.648740] ? wake_up_q+0x80/0x80
Mar 7 09:52:50 pve kernel: [166143.648744] submit_bio_wait+0x68/0x90
Mar 7 09:52:50 pve kernel: [166143.648747] blkdev_issue_flush+0x5c/0x90
Mar 7 09:52:50 pve kernel: [166143.648749] blkdev_fsync+0x35/0x50
Mar 7 09:52:50 pve kernel: [166143.648752] vfs_fsync_range+0x4e/0xb0
Mar 7 09:52:50 pve kernel: [166143.648753] do_fsync+0x3d/0x70
Mar 7 09:52:50 pve kernel: [166143.648754] SyS_fdatasync+0x13/0x20
Mar 7 09:52:50 pve kernel: [166143.648757] entry_SYSCALL_64_fastpath+0x24/0xab
Mar 7 09:52:50 pve kernel: [166143.648758] RIP: 0033:0x7fa2347e763d
Mar 7 09:52:50 pve kernel: [166143.648759] RSP: 002b:00007fa1207fc5f0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Mar 7 09:52:50 pve kernel: [166143.648760] RAX: ffffffffffffffda RBX: 00007fa1207ff700 RCX: 00007fa2347e763d
Mar 7 09:52:50 pve kernel: [166143.648761] RDX: 00007fa2290bce50 RSI: 00005620a7a7a788 RDI: 0000000000000015
Mar 7 09:52:50 pve kernel: [166143.648762] RBP: 00007fff6373c990 R08: 0000000000000000 R09: 00007fa1207ff700
Mar 7 09:52:50 pve kernel: [166143.648762] R10: 00007fa1207fc620 R11: 0000000000000293 R12: 0000000000000000
Mar 7 09:52:50 pve kernel: [166143.648763] R13: 00007fff6373c8ff R14: 00007fa1207ff700 R15: 00007fa229033a00

I guess this points to the problem being on the server? Any ideas where to look?
 
Can you send me the output of

Code:
pveversion -v
 
root@pve:~# pveversion -v
proxmox-ve: 5.1-41 (running kernel: 4.13.13-6-pve)
pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
pve-kernel-4.13.13-6-pve: 4.13.13-41
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-common-perl: 5.0-28
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-17
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-2
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-11
pve-cluster: 5.0-20
pve-container: 2.0-19
pve-docs: 5.1-16
pve-firewall: 3.0-5
pve-firmware: 2.0-3
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.9.1-9
pve-xtermjs: 1.0-2
qemu-server: 5.0-22
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.6-pve1~bpo9
 
Can you tell me which RAID controller you are using?
In general, RAID 5 is not recommended for virtualization because it is slow.
But that should not normally produce such errors.
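You can see which controller is installed with lspci, and pveperf gives a rough idea of how many fsyncs per second the storage can sustain (point it at a directory on that array, for example /var/lib/vz):

Code:
lspci | grep -i raid
pveperf /var/lib/vz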
 
Do you think this may be a performance issue with the array controller? I am a bit new to virtualization, so I did not consider the performance penalty of a RAID 5 configuration. What would you recommend for this particular server? We have three 3TB hard disks.

I would have to reinstall PVE to make any changes to the storage configuration. Does the server ID change? If so, how do I get a new key?
 
I found critical firmware updates for that series on the HPE website.
But I am not 100% sure, because they use the internal part number.
So please update the firmware.

What would you recommend for this particular server?
RAID 10 or RAID 1 is recommended.