Why are my VMs dying with "hung_task_timeout_secs"?

Jan 12, 2015
Last week I installed Proxmox 3.4 on two older servers which have been running Linux KVM (via virt-manager) for years (almost flawlessly). Now, when running on Proxmox, I am getting hung_task_timeout messages in several different VMs at different times of the day (see pic).

[Screenshot: hung_task_timeout messages on the hung guest's console]

I thought this was related to backup IO timeouts on the Proxmox server itself, so I migrated all the guests to SSD (one guest had been running from SATA RAID because it has a 150GB image), but it is also happening on the second of the two Proxmox servers, where only one guest is running (from SSD as well!). This is the same hardware that has been running the same 9 VMs for over a year, so I'm not sure why this is happening.

It seems to occur most frequently during the backups at 2am, so I am going to try the "suspend" option the next time the backups run. The odd thing is there was nothing going on with the second hypervisor running the single guest; both the guest and Proxmox were basically idle.
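If the suspend mode helps, I'll make it the default for all backups. A minimal sketch of what I'd put in /etc/vzdump.conf (the bwlimit figure, in KB/s, is just a guess at a throttle that leaves the guests some IO headroom):

mode: suspend
bwlimit: 40000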

I'm going to try replacing the 3ware RAID card with a newer Areca one I have lying around, but does anyone have any idea why this is happening?
 
Is your screenshot from the Proxmox node? If yes, you are not running a Proxmox kernel.

Post the output of:

> pveversion -v
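
You can also check the running kernel directly; a Proxmox kernel carries a -pve suffix, e.g. 2.6.32-37-pve:

> uname -r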
 
> pveversion -v
Hello, the screenshot is from the console of the hung guest. Here is the pveversion output from Proxmox.


# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-37-pve: 2.6.32-147
pve-kernel-2.6.32-34-pve: 2.6.32-140
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1
 
Just curious, any bonded interfaces? If so, which bond type and switch?
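
If there is a bond, its mode and slave state can be read straight from the kernel:

> cat /proc/net/bonding/bond0

The "Bonding Mode:" line shows whether it is running 802.3ad (LACP), active-backup, etc.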


 

Hi,
do you see any hints if you look at atop on the host? (apt-get install atop)
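
Especially the DSK lines while a backup is running; they show how busy each disk is. Something like this samples every 2 seconds (the interval is just an example):

# apt-get install atop
# atop 2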

Udo
 
> do you see any hints if you look at atop on the host?

The host (Proxmox) "seemed" normal. I checked dmesg, top, iostat... nothing abnormal. It all seemed to point to a temporary resource flood, something happening during the vzdump process perhaps? I was not able to get a login on the guest (Ubuntu 12.04/14.04) to run anything; it was hosed. I was going to dig through the logs and try to piece something together, but I've got a handful of other projects on hold right now, so I'm going to just take a chance that the RAID card swap will fix it. The 3ware card is a good 4 years old, I think.

The Proxmox server itself is a six-core AMD Opteron 2435 with 32G of RAM and had 9 guests running concurrently (all Ubuntu 12.04 and 14.04), all configured to use less than 28G of RAM in total. It seems like that box should handle that fine, especially since, as I mentioned before, the same guests were previously running on the same hardware without a problem (provisioned through virt-manager on Ubuntu). There are quite a few things different now that they are running under Proxmox, though, mostly the qcow2/raw image files. That is more filesystem overhead than what was on the box before (it was running straight from LVM partitions); not sure how much it matters on SSD, though.
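
In the meantime, a band-aid inside the guests is to relax the hung-task watchdog so a short IO stall during backups doesn't flood the console; it only quiets the warning, it doesn't fix the stall. 120 seconds is the kernel default, and 0 disables the check entirely:

# sysctl kernel.hung_task_timeout_secs
kernel.hung_task_timeout_secs = 120
# sysctl -w kernel.hung_task_timeout_secs=600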
 
My suspicion was heading toward LACP and bonding on the storage interfaces (with some iSCSI in the mix), FYI. I have been dealing with this quite randomly on 4 different servers; that's why I asked about the NIC/bonding. I have since configured my storage net with no LACP; some research indicated the two did not play well together.
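
Roughly what the storage net looks like now, a plain active-backup bond instead of LACP (interface names and addresses are placeholders):

auto bond1
iface bond1 inet static
    address 10.10.10.11
    netmask 255.255.255.0
    bond-slaves eth2 eth3
    bond-mode active-backup
    bond-miimon 100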

Jury is still out; ask me this time next year, and if it's still running then I fixed it :)
 
