VM blocked due to hung_task_timeout_secs

Hi Guys,

Can you post your vmid.conf? Guest OS / guest kernel version? And also the Proxmox host kernel version.

Code:
Guest OS Kernel
Linux proxcpt2 2.6.18-406.el5PAE #1 SMP Tue Jun 2 18:06:34 EDT 2015 i686 i686 i386 GNU/Linux

Code:
pveversion  -v
proxmox-ve-2.6.32: 3.4-156 (running kernel: 2.6.32-39-pve)
pve-manager: 3.4-6 (running version: 3.4-6/102d4547)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-39-pve: 2.6.32-156
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-17
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Code:
cat /etc/pve/qemu-server/*.conf
cat 101.conf
boot: cdn
bootdisk: virtio0
cores: 2
ide2: local:iso/smeserver-8.1-i386.iso,media=cdrom
memory: 3072
name: sme8-.182
net0: virtio=12:53:17:38:98:B6,bridge=vmbr0
net1: rtl8139=7E:8D:25:BE:84:1F,bridge=vmbr1
onboot: 1
ostype: l26
smbios1: uuid=252969b4-92d3-40b2-9d39-e12b2d94d190
sockets: 1
virtio0: local:101/vm-101-disk-1.qcow2,format=qcow2,size=50G

cat 102.conf
boot: cdn
bootdisk: virtio0
cores: 2
ide2: local:iso/smeserver-8.1-i386.iso,media=cdrom
memory: 5120
name: sme8.184
net0: virtio=4E:28:58:73:3F:1A,bridge=vmbr0
net1: rtl8139=E6:07:87:F2:E1:79,bridge=vmbr1
onboot: 1
ostype: l26
smbios1: uuid=fda617d2-6fae-4bb2-9b38-4093ebf0eed2
sockets: 1
virtio0: local:102/vm-102-disk-1.qcow2,format=qcow2,size=50G
 
Code:
2.6.18-406.el5PAE

AFAIK, the virtio drivers are buggy with guest kernels < 2.6.32.

For example, flush/fsync don't work properly.
That could explain your problems.


As a workaround, you could try setting cache=directsync for your virtio disk.

(But if you can, upgrade your guest kernel!)
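
If it helps, here is a rough sketch of what that change would look like (assuming VM 101 and the virtio0 disk from the config above; double-check the syntax on your Proxmox version):

Code:
# either edit /etc/pve/qemu-server/101.conf and append cache=directsync
# to the existing virtio0 line:
virtio0: local:101/vm-101-disk-1.qcow2,format=qcow2,size=50G,cache=directsync

# or set it from the host CLI:
qm set 101 --virtio0 local:101/vm-101-disk-1.qcow2,format=qcow2,size=50G,cache=directsync

As far as I know, the new cache mode only takes effect after the VM has been stopped and started again (a reboot inside the guest is not enough).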
 
Hi Guys,

Can you post your vmid.conf? Guest OS / guest kernel version? And also the Proxmox host kernel version.

root@proxmox4:~# pveversion -v
proxmox-ve-2.6.32: 3.4-157 (running kernel: 3.10.0-10-pve)
pve-manager: 3.4-6 (running version: 3.4-6/102d4547)
pve-kernel-2.6.32-39-pve: 2.6.32-157
pve-kernel-3.10.0-10-pve: 3.10.0-34
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-18
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Here's an example VM conf. They're all configured pretty much the same. VMs are a mixture of Debian and Ubuntu, 2.6 & 3.x kernels. I've seen all of them hang. No pattern.

root@proxmox4:/etc/pve/qemu-server# cat 103.conf
balloon: 4096
boot: cdn
bootdisk: ide0
cores: 2
ide0: local:103/vm-103-disk-2.qcow2,format=qcow2,size=32G
ide2: none,media=cdrom
memory: 6144
name: mailedge2
net0: virtio=1E:6F:F4:E6:3B:27,bridge=vmbr1,tag=4
onboot: 1
ostype: l26
sockets: 1



BTW, I switched all VMs to IDE yesterday. Still had a few hangs on the 2.6.32 host kernel. Switched back to the 3.10 kernel last night and it's definitely more stable, but I still had one hang this morning. :(
 
I'm back. 26 days without a crash, and today ... one VM crashed v_v

Why are my VMs crashing? v_v
 
I had all kinds of issues with my HP machines that came with Broadcom BCM5719 and BCM5720 NICs using the tg3 driver.

Many like the ones you describe. I replaced them all with Intel NICs and the machines have been rock solid for 90+ days now.
 
I tried everything to resolve this on Proxmox 3.4. Finally I resorted to switching to Red Hat's oVirt. The I/O issues are gone. Same hardware, similar setup.
 
We still have this problem, with Intel network cards. Looks like changing the disk image format to raw fixes the problem :confused:
 
Changing all our VMs to raw format will be a hard task. Furthermore, I don't know if it's possible in our architecture.
 
Hi

Just had a VM lock up for the same reason. The VM is Debian 7 and the disk is in qcow2 format.
Can anybody confirm: does changing the disk format to raw fix this problem?
 
Hi

Just had a VM lock up for the same reason. The VM is Debian 7 and the disk is in qcow2 format.
Can anybody confirm: does changing the disk format to raw fix this problem?

Converting to raw did fix our problems. You can convert live by selecting 'move disk'.
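
For anyone who prefers the CLI, something like the following should be the equivalent of the GUI's 'Move disk' action (a sketch, assuming VMID 103, disk virtio0 and a target storage called 'local'):

Code:
# move the disk to the 'local' storage and convert it to raw in one step;
# the old qcow2 image stays behind as an unusedX entry until you remove it
qm move_disk 103 virtio0 local --format raw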
 
Great, thanks.
Converted the disk to raw and I also changed it to cache=none.
The underlying config is RAID0 with SSDs, and it seems that write speeds are pretty much the same as they were with cache=writethrough.
Now, we'll wait and see.
 
Update:
Two days ago, one VM locked up again.
Proxmox v 3.4-11
The VM was 64-bit Debian 7 and the disk was format=raw cache=none

Relevant data:
Dec 4 08:15:41 zabbix kernel: [66812.896175] ata3.00: exception Emask 0x0 SAct 0x10000 SErr 0x0 action 0x6 frozen
Dec 4 08:15:41 zabbix kernel: [66812.897001] ata3.00: failed command: WRITE FPDMA QUEUED
Dec 4 08:15:41 zabbix kernel: [66812.897459] ata3.00: cmd 61/18:80:50:64:25/00:00:01:00:00/40 tag 16 ncq 12288 out
Dec 4 08:15:41 zabbix kernel: [66812.897460] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 4 08:15:41 zabbix kernel: [66812.898770] ata3.00: status: { DRDY }
Dec 4 08:15:41 zabbix kernel: [66812.899117] ata3: hard resetting link
Dec 4 08:15:42 zabbix kernel: [66813.216295] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Dec 4 08:15:42 zabbix kernel: [66813.216847] ata3.00: configured for UDMA/100
Dec 4 08:15:42 zabbix kernel: [66813.216857] ata3.00: device reported invalid CHS sector 0
Dec 4 08:15:42 zabbix kernel: [66813.216866] ata3: EH complete
^C
root@zabbix:/var/log# uname -a
Linux zabbix 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u6 x86_64 GNU/Linux
root@zabbix:/var/log# cat /etc/debian_version
7.9

This log from Dec 4 was actually not a lockup, just some errors in the kernel log.
But two days ago the VM did lock up: the console was dead and CPU usage was at 100%.
I moved the VM to Proxmox 4.0.57.
And again, we'll wait and see (whether v4 has the same bug).
 
I'm well aware that this is a fairly old thread, but as I ran into the same problems while evaluating Proxmox, I think this is the best place to publish my solution, in case it's not common knowledge and best practice yet. Anyway:

Using the Phoronix / OpenBenchmarking test-suite pts/fio, I saw exactly the same error messages and wait states.

After applying a disk throttle of 200 MB/s for both read and write, the problem disappeared.
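
In case it is useful to others, this is roughly how such a throttle can be applied from the host (a sketch; VMID 100 and the disk path are placeholders, and mbps_rd / mbps_wr are the per-disk read/write bandwidth limits in MB/s):

Code:
# limit the disk to 200 MB/s for reads and for writes
qm set 100 --virtio0 local:100/vm-100-disk-1.qcow2,format=qcow2,size=32G,mbps_rd=200,mbps_wr=200

These options end up on the disk line in the vmid.conf, the same place the GUI's throttle settings are stored.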

There are still occasional error messages:

Nov 19 14:26:42 centos-a1 kernel: ata2: lost interrupt (Status 0x58)
Nov 19 14:26:42 centos-a1 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 19 14:26:42 centos-a1 kernel: ata2.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in#012 Get event status notification 4a 01 00 00 10 00 00 00 08 00res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Nov 19 14:26:42 centos-a1 kernel: ata2.00: status: { DRDY }
Nov 19 14:26:42 centos-a1 kernel: ata2: soft resetting link
Nov 19 14:26:43 centos-a1 kernel: ata2.00: configured for MWDMA2
Nov 19 14:26:43 centos-a1 kernel: ata2: EH complete

In the statistics I still see peaks above 1 GB/s, which are related to this error message, but I'm optimistic that I can get rid of them by playing with the disk throttle.

Attached please find a screenshot of the I/O diagram. The first half is before throttling the throughput.

Cheers,

Thomas

 
Are you still on PVE 3.x, or 4.x?
 
It's Proxmox 4.3-1.

I don't see any of these problems on a native server, nor in an ESXi VM.

I now tried with R/W limits and an R/W max burst of 150 MB/s, but there are still unexplained peaks in one of the tests. They don't show in the graph, but in atop I saw I/O busy rates of more than 400% at times. Then the kernel threw an error.
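
For reference, the burst variant uses the *_max options on the same disk line (again only a sketch, with the same hypothetical VM as above; values in MB/s):

Code:
# 150 MB/s sustained read/write limit plus a 150 MB/s burst allowance
virtio0: local:100/vm-100-disk-1.qcow2,format=qcow2,size=32G,mbps_rd=150,mbps_wr=150,mbps_rd_max=150,mbps_wr_max=150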

When I set up the VM with controller type SCSI, the original problem described by the thread starter is back, so VirtIO seems to be the better choice.

Let me summarize my findings:
  • The blocking problems and the crashes described in the first entry of this thread can be mitigated by setting the disk throttle to reasonable values.
  • The remaining problem is "kernel: sd 2:0:0:0: [sda] abort", which might be recoverable in a real-world application.
  • The error occurred only during very heavy load tests, though in an otherwise empty machine (only one VM).
  • From 12 tests, only one test failed: Random Write - IO Engine: Libaio - Buffered: No - Direct: Yes - Block Size: 8KB.
I don't regard this as a blocking issue, but as soon as we have purchased the licenses, I will file an incident providing detailed information.
 
I'm still evaluating Proxmox, especially in the areas of reliability, scalability and performance. In a parallel installation I'm running some tests against a VMware ESXi installation for comparison, where required. The I/O problems have occurred in these isolated tests - I'm not yet at a point where I run real-world tests.

To me it looks as if the write-cache and write-through mechanisms in Proxmox are somehow flaky. Theoretically, when choosing the no-cache option, the system should wait until the writes have been confirmed by the underlying file system (NFS in your case), but the extremely high write rates indicate that this isn't happening. I may have misunderstood the concept, though.

Did you experiment with the throttle settings in your NFS setup?

Like other backup systems, vzdump throughput can be limited using the -bwlimit option. Did you try that already? You might also try compression.
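
For example (a sketch; the VMID and the bandwidth value are placeholders, and -bwlimit is specified in KB/s):

Code:
# back up VM 103, limiting throughput to about 50 MB/s and compressing with LZO
vzdump 103 -bwlimit 51200 -compress lzo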
 
