[CRITICAL] Huge IO load causes freezing during backups

gkovacs

We have a new critical problem that started to appear on more than one of our PVE 3.0 nodes.

During nightly vzdump snapshot backups, possibly when a heavy file IO operation occurs inside a container, the entire server freezes in iowait. Processes still run, but seemingly no disk operations finish, and the load average climbs into the hundreds. Only a hard reset solves the problem (even shutdown -rn is unable to restart the server).

The problem started appearing after we upgraded to the new 2.6.32-20 kernel last week!

htop shows huge kernel iowait
iotop shows no userland io operations
console messages show hung task timeout
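
For anyone who wants to capture these symptoms in numbers before the box stops responding, here is a minimal Python sketch (not something used in this thread). It computes the iowait share from two readings of /proc/stat and picks out tasks in uninterruptible sleep (state D) from the contents of /proc/<pid>/status files. The function names are hypothetical, chosen only for illustration.

```python
def parse_cpu_line(stat_text):
    """Return the numeric fields of the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(x) for x in parts[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two /proc/stat readings.

    Field order is user, nice, system, idle, iowait, irq, softirq, steal.
    Guest time is already counted in user time, so only the first eight
    fields are summed.
    """
    a = parse_cpu_line(before)[:8]
    b = parse_cpu_line(after)[:8]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total > 0 else 0.0


def blocked_tasks(status_texts):
    """Return names of tasks whose /proc/<pid>/status shows state D."""
    names = []
    for text in status_texts:
        fields = {}
        for line in text.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key] = value.strip()
        if fields.get("State", "").startswith("D"):
            names.append(fields.get("Name", "?"))
    return names
```

To use it, read /proc/stat twice about a second apart and pass both strings to iowait_percent, then pass the contents of every readable /proc/<pid>/status file to blocked_tasks. A growing list of D-state tasks alongside high iowait would match the hung task messages on the console.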

proxmox-ioblock.jpg

Environment
Intel Core i7, Adaptec HW RAID
Proxmox VE 3.0
ext4 filesystem, deadline scheduler

Code:
pve-manager: 3.0-23 (pve-manager/3.0/957f0862)
running kernel: 2.6.32-20-pve
proxmox-ve-2.6.32: 3.0-100
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-18-pve: 2.6.32-88
lvm2: 2.02.95-pve3
clvm: 2.02.95-pve3
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-1
pve-cluster: 3.0-4
qemu-server: 3.0-20
pve-firmware: 1.0-22
libpve-common-perl: 3.0-4
libpve-access-control: 3.0-4
libpve-storage-perl: 3.0-8
vncterm: 1.1-4
vzctl: 4.0-1pve3
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-13
ksm-control-daemon: 1.1-1
 
Hi, can you test the latest 2.6.32-21-pve kernel from the pvetest repository?

Why, do you know any kernel modifications that could be connected to this problem?
If not, then no, these are production servers, so I'm not deliberately trying to induce problems like this.

BTW do we know the original OpenVZ version of the current -20 kernel, that we updated to last week?
I need it for my bugreport at the OpenVZ bugzilla.
 
Quote:
Why, do you know any kernel modifications that could be connected to this problem?
If not, then no, these are production servers, so I'm not deliberately trying to induce problems like this.

The new kernel in pvetest is based on RHEL 6.4, so there are countless changes. Note that this new kernel is based on the latest stable OpenVZ kernel.

Quote:
BTW do we know the original OpenVZ version of the current -20 kernel, that we updated to last week?
I need it for my bugreport at the OpenVZ bugzilla.

See changelog.

Code:
zless /usr/share/doc/pve-kernel-2.6.32-20-pve/changelog.Debian.gz
 
Can you check whether you see the same behavior with CFQ instead of deadline?
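
As a side note for readers, the kernel exposes each block device's scheduler in /sys/block/<device>/queue/scheduler, with the active one shown in square brackets (for example "noop [deadline] cfq"). A small Python sketch for reading it follows. The function names and the default device name "sda" are assumptions for illustration only.

```python
import re


def active_scheduler(sysfs_text):
    """Return the scheduler shown in brackets, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def available_schedulers(sysfs_text):
    """Return every scheduler listed, with the brackets stripped."""
    return [name.strip("[]") for name in sysfs_text.split()]


def read_scheduler(device="sda"):
    """Read the scheduler line for one block device (device name assumed)."""
    path = "/sys/block/{}/queue/scheduler".format(device)
    with open(path) as handle:
        return handle.read()
```

Switching for a test run is done by writing the scheduler name to the same file as root, for example `echo cfq > /sys/block/sda/queue/scheduler`. The change applies immediately but does not survive a reboot.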
 
Quote:
See changelog.

If I recall correctly, PVE 3.0 had 2.6.32-98 when we upgraded our PVE 2.3 cluster 2-3 weeks ago.

The problem did not appear until we upgraded to -99 or -100, so the culprit is probably in one of these two kernels, or we simply did not give the -98 kernel enough time for the problem to manifest.

We know that:
- the problem is in vzkernel-2.6.32-042stab076.7.
- we use ext4 filesystem and deadline scheduler
- the problem was induced by a client starting a database import that parses a directory with thousands of small files, during vzdump snapshot backups


Code:
root@proxmox:~# zless /usr/share/doc/pve-kernel-2.6.32-20-pve/changelog.Debian.gz
pve-kernel-2.6.32 (2.6.32-100) unstable; urgency=low
  * fix CVE-2013-2094
 -- Proxmox Support Team <support@proxmox.com>  Wed, 15 May 2013 08:09:23 +0200

pve-kernel-2.6.32 (2.6.32-99) unstable; urgency=low
  * set default scheduler to 'deadline'
 -- Proxmox Support Team <support@proxmox.com>  Tue, 07 May 2013 06:42:22 +0200

pve-kernel-2.6.32 (2.6.32-98) unstable; urgency=low
  * update to vzkernel-2.6.32-042stab076.7.src.rpm
  * remove xfs-trans-ail-fix.patch (fixed upstream)
  * update e1000e to 2.3.2
  * update igb to 4.2.16
  * update ixgbe to 3.14.5
 -- Proxmox Support Team <support@proxmox.com>  Fri, 03 May 2013 19:51:48 +0200
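
The mapping from a pve-kernel package version to its OpenVZ base can also be pulled out of this changelog mechanically. Below is a rough Python sketch that assumes the layout shown above (a header line with the version in parentheses, bullet lines starting with "*", newest entry first). The function names are hypothetical and the sample text is trimmed from the output above.

```python
import re

HEADER = re.compile(r"^(\S+) \(([^)]+)\)")
VZKERNEL = re.compile(r"vzkernel-(\S+?)\.src\.rpm")

SAMPLE = """\
pve-kernel-2.6.32 (2.6.32-100) unstable; urgency=low
* fix CVE-2013-2094
pve-kernel-2.6.32 (2.6.32-99) unstable; urgency=low
* set default scheduler to 'deadline'
pve-kernel-2.6.32 (2.6.32-98) unstable; urgency=low
* update to vzkernel-2.6.32-042stab076.7.src.rpm
* update e1000e to 2.3.2
"""


def parse_changelog(text):
    """Split changelog text into entries, newest first as in the file."""
    entries = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            entries.append({"version": header.group(2), "changes": []})
        elif entries and line.lstrip().startswith("*"):
            entries[-1]["changes"].append(line.lstrip()[1:].strip())
    return entries


def openvz_base(entries, version):
    """Find the vzkernel release that the given package version is built on.

    Starts at the requested version and walks towards older entries until
    one of them mentions a vzkernel source RPM.
    """
    reached = False
    for entry in entries:
        if entry["version"] == version:
            reached = True
        if reached:
            for change in entry["changes"]:
                found = VZKERNEL.search(change)
                if found:
                    return found.group(1)
    return None
```

With the sample above, looking up 2.6.32-100 walks back to the -98 entry, which is the last one that changed the OpenVZ base.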

As we have used the deadline scheduler for years with every single kernel, I'm not sure there is any point in checking CFQ.
First I will downgrade to -98 and see whether the problem persists.
 
We have downgraded to pve-kernel-2.6.32-19-pve: 2.6.32-96, and the problem still appears.

It seems to be tied to a single specific VE, and only during snapshot backups. If this VE is NOT on these ext4 nodes, they work fine. If I migrate it back, the nodes freeze during the first nightly backup, at a specific time, possibly exactly when this VE is being backed up.
 
If it hasn't already been posted here, see the following thread: http://forum.proxmox.com/threads/9097-vzdump-using-lvm-snapshot-kills-the-box?p=77084#post77084

I have made several posts there about a very similar issue: snapshots on the newest Proxmox 3 locking up the entire server. We have tried both deadline and CFQ with the same results. We are using EXT4 / LVM / iSCSI and NFS. This only happens on one of our Proxmox 3 servers; we have another independent cluster of 6 servers that backs up fine locally. See if any of my issues look familiar. We are still trying to get a fix, and in the meantime we cannot back up one of our production Proxmox servers because of this issue.

Cheers,
Joe

 
Joe, the thread you referred to does indeed seem to be about the same problem. I have filed an OpenVZ bug report, unfortunately without much reproducible information:
https://bugzilla.openvz.org/show_bug.cgi?id=2645

What we know so far:
- server IO locks up during vzdump, load climbs sky-high, and only a hard reset solves the problem
- it started appearing with Proxmox 2.x kernels for some users, with 3.x for others
- it only happens during LVM snapshot backups
- it is more likely to happen on EXT4 (we are using EXT4 and DEADLINE, but others have reported it on EXT3 and/or CFQ as well)
- for us, it is triggered during the backup of a single specific VE
- it needs some IO load on the VE to happen

I can provide access for the Proxmox devs to a server where I can trigger this event.
 
@gkovacs: if you use Adaptec RAID, test with the latest firmware on your controller and also upgrade the aacraid driver to v1.2.1-30200.

Then report your results.
 
@informant stop spamming every thread with your whining. This is definitely a bug, but most likely not in Proxmox but in the OpenVZ kernel, which has its own bugzilla.

Do not post here until you have new information about this subject.
 
@gkovacs

We created a ticket about this before as well, but the owner closed it; the message was that it is not a bug.
-> https://bugzilla.proxmox.com/show_bug.cgi?id=407
Another user filed one too:
-> https://bugzilla.proxmox.com/show_bug.cgi?id=411

Every time, the Proxmox team says it is not a bug. But what is it then?

Regards

How to use our bug tracker:
1. File a bug if one of our staff members asks you to do so.
2. If you have a reproducible problem, file a bug and include detailed steps to reproduce the issue. If it works here, the bug will be closed as "works here".

No one said that we are ignoring the problem, but as we have told you multiple times, reporting it the way you do makes no sense and does not help in finding a solution. You need to isolate the issue and find the cause. If you can't do that, ask someone else for help. The issue seems complex, and deep analysis on your servers is needed.

Also, you spam this forum with tens of threads, always with the same content. This is a waste of time and against the forum rules.
 
Quote:
@gkovacs: if you use Adaptec RAID, test with the latest firmware on your controller and also upgrade the aacraid driver to v1.2.1-30200.

Then report your results.

We updated the Adaptec RAID controller's firmware on both nodes to the latest version, 5.2.0 Build 19109 (21 Dec 2012), 5 days ago, and the problem has not appeared since.
Firmware versions 18512 and 19076 both exhibited the issue, so it seems it was fixed in the most recent build.

We have not touched the driver, we are still running 1.2-1[29900].

Will update this thread if the problem comes back again.
Thanks for the suggestion Tom!
 
Unfortunately the problem is back after 6 days.
It seems the Adaptec firmware update has decreased the chance of it happening, but did not eliminate it totally.

Next thing to check is the 30200 driver in the new PVE kernel. Any timeframe on when it's getting released into stable? We are not really happy running test kernels on production servers.
 
The new -107 kernel did help a little, but unfortunately did not completely solve the problem.
Now we have a freeze once every week, instead of every night.

The problem seems to be connected to a single VE that has thousands of files in one folder.
 
