[CRITICAL] Huge IO load causes freezing during backups

gkovacs · Jun 28, 2013

We have a new critical problem that started to appear on more than one of our PVE 3.0 nodes.

During nightly vzdump snapshot backups, possibly when a heavy file IO operation occurs inside a container, the entire server freezes in iowait. Processes still run, but seemingly no disk operations finish, and the load average climbs into the hundreds. Only a hard reset solves the problem (even shutdown -rn is unable to restart the server).

The problem started to appear after we upgraded to the new 2.6.32-20 kernel last week!

htop shows huge kernel iowait
iotop shows no userland io operations
console messages show hung task timeout
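
To see which tasks are actually stuck behind the iowait figure, it can help to list processes in uninterruptible sleep (state D). This is only a minimal sketch, assuming a procps-style ps; the helper name filter_d_state is made up for illustration.

```shell
# Keep only tasks in uninterruptible sleep (state D).
# Expects "STATE PID COMMAND" lines on stdin and prints matching lines unchanged.
filter_d_state() {
    awk '$1 ~ /^D/'
}

# Usage on a live node (not run here):
#   ps -eo state=,pid=,comm= | filter_d_state
# The hung task warnings on the console can also be pulled from the kernel log:
#   dmesg | grep 'blocked for more than'
```

If the list keeps growing while iotop shows no userland IO, that matches the symptoms described above.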

Environment
Intel Core i7, Adaptec HW RAID
Proxmox VE 3.0
ext4 filesystem, deadline scheduler

Code:

pve-manager: 3.0-23 (pve-manager/3.0/957f0862)
running kernel: 2.6.32-20-pve
proxmox-ve-2.6.32: 3.0-100
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-18-pve: 2.6.32-88
lvm2: 2.02.95-pve3
clvm: 2.02.95-pve3
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-1
pve-cluster: 3.0-4
qemu-server: 3.0-20
pve-firmware: 1.0-22
libpve-common-perl: 3.0-4
libpve-access-control: 3.0-4
libpve-storage-perl: 3.0-8
vncterm: 1.1-4
vzctl: 4.0-1pve3
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-13
ksm-control-daemon: 1.1-1

spirit · Jun 28, 2013

Hi, can you test the latest 2.6.32-21-pve kernel from the pvetest repository?

gkovacs · Jun 28, 2013

spirit said:
Hi, can you test the latest 2.6.32-21-pve kernel from the pvetest repository?

Why, do you know of any kernel modifications that could be connected to this problem?
If not, then no, these are production servers, so I'm not going to deliberately try to induce problems like this.

BTW do we know the original OpenVZ version of the current -20 kernel that we updated to last week?
I need it for my bug report at the OpenVZ bugzilla.

tom · Jun 28, 2013

gkovacs said:
Why, do you know of any kernel modifications that could be connected to this problem?
If not, then no, these are production servers, so I'm not going to deliberately try to induce problems like this.

the new kernel in pvetest is based on RHEL 6.4, so there are countless changes. just to note, this new kernel is based on the latest stable OpenVZ kernel.

gkovacs said:
BTW do we know the original OpenVZ version of the current -20 kernel that we updated to last week?
I need it for my bug report at the OpenVZ bugzilla.

see changelog.

Code:

zless /usr/share/doc/pve-kernel-2.6.32-20-pve/changelog.Debian.gz
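
If only the OpenVZ base version is needed, the changelog can be searched instead of paged through. This is a rough sketch, assuming the changelog names the upstream source package as vzkernel-<version>.src.rpm; the function name vzkernel_base is hypothetical.

```shell
# Print the most recent vzkernel source package named in a gzipped changelog.
# $1: path to changelog.Debian.gz (newest entries are expected to come first)
vzkernel_base() {
    gzip -dc < "$1" | grep -o 'vzkernel-[^[:space:]]*' | head -n 1
}

# Usage (not run here):
#   vzkernel_base /usr/share/doc/pve-kernel-2.6.32-20-pve/changelog.Debian.gz
```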

tom · Jun 28, 2013

can you check if you also see the same with CFQ instead of deadline?
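
For anyone who wants to try this, the scheduler can be inspected and switched per block device through sysfs, where the active one is shown in square brackets. This is a small sketch only; the device name sda and the helper name active_scheduler are assumptions for illustration.

```shell
# Print the active I/O scheduler from a sysfs line such as "noop deadline [cfq]".
# Reads the line on stdin and prints the name found inside the square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Usage on a live node (not run here; sda is a placeholder device):
#   active_scheduler < /sys/block/sda/queue/scheduler
# Switching at runtime needs root and does not survive a reboot:
#   echo cfq > /sys/block/sda/queue/scheduler
```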

gkovacs · Jun 28, 2013

tom said:
see changelog.

If I recall correctly, PVE 3.0 had 2.6.32-98 when we upgraded our PVE 2.3 cluster 2-3 weeks ago.

The problem did not appear until we upgraded to -99 or -100, so the culprit is probably in one of these two kernels, or we simply did not give the -98 kernel enough time for the problem to manifest itself.

We know that:
- the problem occurs with kernels based on vzkernel-2.6.32-042stab076.7
- we use ext4 filesystem and deadline scheduler
- the problem was induced by a client starting a database import that parses a directory with thousands of small files, during vzdump snapshot backups

Code:

root@proxmox:~# zless /usr/share/doc/pve-kernel-2.6.32-20-pve/changelog.Debian.gz
pve-kernel-2.6.32 (2.6.32-100) unstable; urgency=low
* fix CVE-2013-2094
-- Proxmox Support Team <support@proxmox.com>  Wed, 15 May 2013 08:09:23 +0200

pve-kernel-2.6.32 (2.6.32-99) unstable; urgency=low
* set default scheduler to 'deadline'
-- Proxmox Support Team <support@proxmox.com>  Tue, 07 May 2013 06:42:22 +0200

pve-kernel-2.6.32 (2.6.32-98) unstable; urgency=low
* update to vzkernel-2.6.32-042stab076.7.src.rpm
* remove xfs-trans-ail-fix.patch (fixed upstream)
* update e1000e to 2.3.2
* update igb to 4.2.16
* update ixgbe to 3.14.5
-- Proxmox Support Team <support@proxmox.com>  Fri, 03 May 2013 19:51:48 +0200

As we have used the deadline scheduler for years with every single kernel, I'm not sure there is any point in checking CFQ.
First I will downgrade to -98 and see whether the problem persists.

gkovacs · Jul 9, 2013

We have downgraded to pve-kernel-2.6.32-19-pve: 2.6.32-96, and the problem is still appearing.

It seems to be tied to a single specific VE, and only during snapshot backups. If this VE is NOT on these ext4 nodes, they work fine. If I migrate it back, the nodes freeze during the first nightly backups, at a specific time, possibly the same time this VE is being backed up.

subversion · Jul 9, 2013

If it hasn't already been posted here - see the following thread: http://forum.proxmox.com/threads/9097-vzdump-using-lvm-snapshot-kills-the-box?p=77084#post77084

I have made several posts about a very similar issue with the newest Proxmox 3, where snapshots lock up the entire server. We have tried both deadline and CFQ with the same results. We are using EXT4 / LVM / iSCSI and NFS. This only happens on one of our Proxmox 3 servers; we have another independent cluster of 6 servers that back up fine locally. See if any of my issues look familiar. We are still trying to get a fix, and we cannot back up one of our production Proxmox servers because of this issue.

Cheers,
Joe

gkovacs said:
We have downgraded to pve-kernel-2.6.32-19-pve: 2.6.32-96, and the problem is still appearing.

It seems to be tied to a single specific VE, and only during snapshot backups. If this VE is NOT on these ext4 nodes, they work fine. If I migrate it back, the nodes freeze during the first nightly backups, at a specific time, possibly the same time this VE is being backed up.

gkovacs · Jul 10, 2013

Joe, the thread you were referring to indeed seems to be about the same problem. I have filed an OpenVZ bug report, unfortunately without much reproducible information:
https://bugzilla.openvz.org/show_bug.cgi?id=2645

What we know so far:
- server IO locks up during vzdump, load climbs to the sky, only a hard reset can solve the problem
- it started to appear with Proxmox 2.x kernels for some users and with 3.x for others
- it only happens during LVM snapshot backups
- it is more likely to happen on EXT4 (we are using EXT4 and DEADLINE, but others have reported it on EXT3 and/or CFQ as well)
- for us, it is triggered during the backup of a single specific VE
- it needs some IO load on the VE to happen

I can provide access for the Proxmox devs to a server where I can trigger this event.

tom · Jul 10, 2013

@gkovacs: if you use adaptec raid, test with the latest firmware on your controller and also upgrade the aacraid driver to v1.2.1-30200.

and report your results.

informant · Jul 10, 2013

@gkovacs

We also filed a ticket earlier, but the owner closed it, saying it is not a bug.
-> https://bugzilla.proxmox.com/show_bug.cgi?id=407
Another user did the same:
-> https://bugzilla.proxmox.com/show_bug.cgi?id=411

Every time, the Proxmox team says it's not a bug, but what is it then?

regards

gkovacs · Jul 10, 2013

@informant stop spamming every thread with your whining. This is definitely a bug, but most likely not in Proxmox but in the OpenVZ kernel, which has its own bugzilla.

Do not post here until you have new information about this subject.

tom · Jul 10, 2013

informant said:
@gkovacs

We also filed a ticket earlier, but the owner closed it, saying it is not a bug.
-> https://bugzilla.proxmox.com/show_bug.cgi?id=407
Another user did the same:
-> https://bugzilla.proxmox.com/show_bug.cgi?id=411

Every time, the Proxmox team says it's not a bug, but what is it then?

regards

HowTo use our bugtracker:
1. File a bug if one of our staff members asks you to do it
2. If you have a reproducible problem, file a bug and include detailed steps to see the issue. If it works here, the bug will be closed with (works here).

No one said that we ignore the problem, but as we told you multiple times, reporting it the way you do makes no sense and does not help in finding a solution. You need to isolate the issue and find the reason. If you can't do it, ask someone else for help. The issue seems complex; deep analysis on your servers is needed.

Also, you spam this forum with tens of threads, always with the same content - this is a waste of time and against the forum rules.

informant · Jul 10, 2013

gkovacs said:
This is definitely a bug, but most likely not in Proxmox but in the OpenVZ kernel, which has its own bugzilla.

Thanks for this information.

regards

gkovacs · Jul 16, 2013

tom said:
@gkovacs: if you use adaptec raid, test with the latest firmware on your controller and also upgrade the aacraid driver to v1.2.1-30200.

and report your results.

We have updated the Adaptec RAID controller's firmware on both nodes to the latest version 5.2.0 Build 19109 (21 Dec 2012) 5 days ago, and since then the problem has not appeared.
Firmware versions 18512 and 19076 both exhibited the issue, so it seems it was fixed in the most recent build.

We have not touched the driver, we are still running 1.2-1[29900].

Will update this thread if the problem comes back again.
Thanks for the suggestion Tom!

gkovacs · Jul 18, 2013

Unfortunately the problem is back after 6 days.
It seems the Adaptec firmware update has decreased the chance of it happening, but did not eliminate it totally.

Next thing to check is the 30200 driver in the new PVE kernel. Any timeframe on when it's getting released into stable? We are not really happy running test kernels on production servers.

tom · Jul 18, 2013

gkovacs said:
Any timeframe on when it's getting released into stable? ...

later today.

stef1777 · Jul 18, 2013

Just to say that I have the same problem with Dell R420 servers. I also have an HP DL360G7 server in the pve cluster and it doesn't crash.

http://forum.proxmox.com/threads/14678-General-crash-during-snaphot-backup-of-CTs

The crash happens every 5 to 7 days.

e100 · Aug 1, 2013

I am seeing a similar problem myself, with an Areca 1880 controller.
Did the new kernel help?

gkovacs · Aug 1, 2013

The new -107 kernel did help a little, but unfortunately did not completely solve the problem.
Now we have a freeze once every week, instead of every night.

Problem seems to be connected to a single VE having thousands of files in a folder.
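
For others trying to find out whether one of their containers has the same kind of directory, something like the following could be used to list directories with a large number of entries. It is only a sketch: the function name big_dirs is hypothetical, the path in the usage line assumes the default OpenVZ private area with a made-up CTID 101, and it relies on the -mindepth/-maxdepth options of GNU find.

```shell
# List directories under a root that hold more than a given number of entries.
# $1: root directory to scan (stays on one filesystem), $2: entry threshold
# Output: "<count> <directory>" per match. File names containing newlines
# would be miscounted.
big_dirs() {
    find "$1" -xdev -type d | while IFS= read -r dir; do
        n=$(( $(find "$dir" -mindepth 1 -maxdepth 1 | wc -l) ))
        if [ "$n" -gt "$2" ]; then
            printf '%s %s\n' "$n" "$dir"
        fi
    done
}

# Usage (not run here; path and CTID are placeholders):
#   big_dirs /var/lib/vz/private/101 5000
```

Scanning itself generates IO, so it is probably best run outside the backup window.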
