[CRITICAL] Huge IO load causes freezing during backups

I am using a Synology NAS for backups, with 2x WD Red drives in a RAID 1.
My Dell server is a T420 with a PERC H710; it has 2x 1 TB WD RE and 2x 3 TB WD RE drives, each pair in a RAID 1. I use the standard Dell onboard Broadcom 5720 NICs in a bond configuration. So far the error only appears when I do snapshot backups. I have been running suspend-mode backups ever since and have never had an issue with them.

And no, I haven't tried going back to the noop scheduler yet. I wanted to wait for the new kernel and the updated LSI RAID drivers before I start testing again.
Does anyone see any similarities between our setups?
 
Just wondering: has anybody with these problems tried the 'noop' scheduler?

I have switched one of our nodes (one that was affected by the bug) to the noop scheduler.
Will report back, but since upgrading the Adaptec firmware and moving to the -107 kernel we only hit the bug once every 8-10 days, so it will be hard to establish a connection.
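
In case anyone else wants to test the same thing: switching the scheduler at runtime is just a sysfs write. This is only a sketch; sda is an example device name, repeat for each disk, and the change is not persistent across reboots unless you also set it on the kernel command line.

Code:
# show the available schedulers for a disk; the active one is in [brackets]
cat /sys/block/sda/queue/scheduler
# switch to noop at runtime (takes effect immediately, lost on reboot;
# add elevator=noop to the kernel command line to make it permanent)
echo noop > /sys/block/sda/queue/scheduler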
 
This is an elusive kernel issue. It is present on all kinds of hardware (SATA, SCSI, Adaptec/LSI/Intel/HP RAID, etc.), different filesystems (ext3 and ext4) and different kernel versions.

I agree 100%, the 2.6.32 kernel from red hat, where our kernel comes from, has been plagued with one IO issue after another.

A simple Google search makes this obvious.
http://lmgtfy.com/?q=2.6.32+io+load+el6
 

I'm not sure it's that simple.
The problem does not occur for all users. (I personally don't have any problems with KVM guests, KVM backups, NFS or an LSI card, but I'm not using OpenVZ or LVM snapshot backups.)

It could be a bug in the OpenVZ kernel patches or in LVM snapshots.
 
It would really help if we had a reproducible test case. Maybe you can try to write a script that generates many thousands of files in a folder, so that the bug triggers after running that script inside a standard Debian template?
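
Something along these lines could serve as a starting point (a rough sketch only; the target path and the file count are arbitrary guesses and may need tuning to actually trigger the problem):

Code:
#!/bin/sh
# crude reproducer attempt: create many thousands of small files in one folder
# (path and file count are arbitrary and may need adjusting)
DIR=/tmp/io-stress
mkdir -p "$DIR"
i=0
while [ $i -lt 100000 ]; do
    dd if=/dev/zero of="$DIR/file_$i" bs=4k count=1 2>/dev/null
    i=$((i + 1))
done
sync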

I'm not sure how to build a reliable reproducer, but I can give you access to the VE that triggered this on previous kernels (on -107 it rarely gives us the error, maybe once every two weeks).

If the VE was stopped, there was no freezing (maybe the LVM snapshot backup only runs when the VE is running, and otherwise it's just a plain tar?)... but more likely some IO load needs to be present on the snapshot area.

It looks like a complex kernel bug that is somehow related to LVM and the disk IO subsystem.
 
Hi, we have the same issue again with the new pvetest version.
During backups of CTs the IO load is very high.

What can we do, or is this a Red Hat bug?

Here is a backup log and a screenshot:

Code:
...
Aug 13 00:45:03 pegasus kernel: ext3_orphan_cleanup: deleting unreferenced inode 177766479
Aug 13 00:45:03 pegasus kernel: EXT3-fs (dm-3): 27 orphan inodes deleted
Aug 13 00:45:03 pegasus kernel: EXT3-fs (dm-3): recovery complete
Aug 13 00:45:03 pegasus kernel: EXT3-fs (dm-3): mounted filesystem with ordered data mode
Aug 13 00:46:51 pegasus pmxcfs[2484]: [status] notice: received log
Aug 13 01:07:07 pegasus pmxcfs[2484]: [dcdb] notice: data verification successful
Aug 13 01:07:07 pegasus rrdcached[2468]: flushing old values
Aug 13 01:07:07 pegasus rrdcached[2468]: rotating journals
Aug 13 01:07:07 pegasus rrdcached[2468]: started new journal /var/lib/rrdcached/journal/rrd.journal.1376348827.841279
Aug 13 01:07:07 pegasus rrdcached[2468]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1376341627.841287
Aug 13 01:17:01 pegasus /USR/SBIN/CRON[30528]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 13 01:30:01 pegasus /USR/SBIN/CRON[34433]: (root) CMD (vzdump 5133 --quiet 1 --mode snapshot --mailto info@domain.de --node pegasus --compress gzip --storage SLS-001)
Aug 13 01:30:01 pegasus vzdump[34492]: INFO: trying to get global lock - waiting...
Aug 13 01:30:01 pegasus vzdump[34434]: <root@pam> starting task UPID:pegasus:000086BC:033A17A5:52096FF9:vzdump::root@pam:
Aug 13 02:00:20 pegasus pmxcfs[2484]: [status] notice: received log
Aug 13 02:05:02 pegasus pmxcfs[2484]: [status] notice: received log
Aug 13 02:07:07 pegasus pmxcfs[2484]: [dcdb] notice: data verification successful
Aug 13 02:07:07 pegasus rrdcached[2468]: flushing old values
Aug 13 02:07:07 pegasus rrdcached[2468]: rotating journals
Aug 13 02:07:07 pegasus rrdcached[2468]: started new journal /var/lib/rrdcached/journal/rrd.journal.1376352427.841298
Aug 13 02:07:07 pegasus rrdcached[2468]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1376345227.841285
Aug 13 02:17:01 pegasus /USR/SBIN/CRON[49642]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 13 03:06:54 pegasus kernel: INFO: task kjournald:1554 blocked for more than 120 seconds.
Aug 13 03:06:54 pegasus kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 13 03:06:54 pegasus kernel: kjournald D ffff880e24ce21c0 0 1554 2 0 0x00000000
Aug 13 03:06:54 pegasus kernel: ffff880e2704fc70 0000000000000046 ffff880e2704fc34 ffff880784e00dc0
Aug 13 03:06:54 pegasus kernel: ffff880e2704fc30 ffffffff8160c0e0 0000000000100000 ffff880028250e50
Aug 13 03:06:54 pegasus kernel: 0000000000000000 0000000120959fdf ffff880e2704ffd8 ffff880e2704ffd8
Aug 13 03:06:54 pegasus kernel: Call Trace:
Aug 13 03:06:54 pegasus kernel: [<ffffffff811d56f0>] ? sync_buffer+0x0/0x50
Aug 13 03:06:54 pegasus kernel: [<ffffffff81540773>] io_schedule+0x73/0xc0
Aug 13 03:06:54 pegasus kernel: [<ffffffff811d5730>] sync_buffer+0x40/0x50
Aug 13 03:06:54 pegasus kernel: [<ffffffff81541680>] __wait_on_bit+0x60/0x90
Aug 13 03:06:54 pegasus kernel: [<ffffffff811d56f0>] ? sync_buffer+0x0/0x50
Aug 13 03:06:54 pegasus kernel: [<ffffffff8154172c>] out_of_line_wait_on_bit+0x7c/0x90
...


[Attached screenshots: loadA.png, loadB.png]


Could you please answer? Many thanks.

Regards
 
How is the NFS configured?
NFS server: sync or async? (The default is sync if nothing is configured.)
PVE node: udp or tcp (the default is udp), and lock or nolock (the default is lock)?
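
For reference, both sides can usually be inspected like this (a sketch only; on a Synology the export is normally managed from the web UI, but the effective options should still show up in the export list):

Code:
# on the NFS server: show the effective export options (sync/async, etc.)
exportfs -v
# on the PVE node: show the options the share is actually mounted with
nfsstat -m
# or
grep nfs /proc/mounts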
 
Hi @mir,

here is the config. It's the default config from the Proxmox web interface:

Code:
10.11.12.50:/volume1/storage /mnt/pve/SLS-001 nfs rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.11.12.50,mountvers=3,mountport=892,mountproto=udp,local_lock=none,addr=10.11.12.50 0 0
 
I had the same issue again over the weekend. It appears the lvremove command fails and the main VG gets suspended. I was able to issue "dmsetup resume <vg>" and the blocked processes continued.
From what I've read, it appears to be a Debian bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=718582. It has been fixed in unstable lvm2 (2.02.98-5), and the fixes were around some udev rules (55-dm.rules and 56-lvm.rules). Hopefully these changes can get into the next Proxmox lvm package?
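
In case it helps anyone else hitting the same hang, this is roughly what the recovery looks like (a sketch only; dmsetup operates on device-mapper devices, typically named <vg>-<lv>, and the device name below is a made-up example):

Code:
# find device-mapper devices stuck in the suspended state
dmsetup info | grep -B1 SUSPENDED
# resume the affected device (example name -- use the one reported above)
dmsetup resume pve-vzsnap--pegasus--0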
 
Hi dietmar and lynchie,

that is very useful information, thanks. If this turns out to be the solution, that would be very nice. Thanks for your support.

Regards
 
I think yes, the lvm package has been updated in Proxmox 3.1.
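
For what it's worth, you can check which lvm2 build a node actually has installed with something like:

Code:
# print the installed lvm2 package name and version on a node
dpkg -l lvm2 | awk '/^ii/ {print $2, $3}'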

We have updated our nodes today to PVE 3.1.

As the bug has only appeared once every two weeks lately, we will only know for sure it's finally fixed if it doesn't happen during the next month.
Will keep this thread updated with our findings.
 
I upgraded all my production Proxmox hosts to 3.1. Last night I ran a full set of backups on a server that used to crash without fail every time, and the backup completed successfully, which hasn't happened in about 2 months! I will continue to run backups and report if there are any other issues. Hopefully the update fixes the issues others are experiencing as well.

Thanks Proxmox team, it's good to have this working again - it certainly seems related to the kernel / megaraid driver.

Cheers,
Joe Jenkins
 
Is there a workaround for this?

I'm having the same issue. I run vzdump to migrate a KVM or OpenVZ container, and sometimes the IO load spikes and then the server hangs. The only way I can fix it is to restart the server, which brings down my production environment.

I'm trying to migrate virtual machines from a 2.0 server to a new 3.1 server with a shared SAN. This will give me much more flexibility. Because of our setup I can't upgrade the 2.0 server to 3.1, since I can't reboot it.

I do need to migrate individual servers over to the new environment, and I can shut down individual vms as I go along.

Is there a way to migrate the VMs or OpenVZ containers without vzdump?

or

Is there a way to apply the fix without upgrading?

thanks.
myles.
 
Answered my own question.

To migrate by hand:

1. Shut down the VM.
2. Transfer the image file to the new server.
3. In the GUI, create a container/VM with settings identical to the one you just stopped.
4. Copy the file to the new image location, or if using a SAN, use dd to copy the image onto the SAN volume.
5. Start the new VM. (A rough command-line sketch of these steps is below.)
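
Roughly, for a KVM guest it looks like this (a sketch only; VMID 101, the image filename, the target hostname and the SAN volume path are made-up examples, so adjust the storage paths to your setup):

Code:
# 1. stop the VM on the old node
qm stop 101
# 2. copy the disk image to the new server (default local storage path shown)
scp /var/lib/vz/images/101/vm-101-disk-1.raw newnode:/var/lib/vz/images/101/
# 3. create a VM with identical settings on the new node (GUI or 'qm create'),
#    then, if the target is a SAN/LVM volume, write the image onto it:
dd if=/var/lib/vz/images/101/vm-101-disk-1.raw of=/dev/san_vg/vm-101-disk-1 bs=1M
# 4. start the VM on the new node
qm start 101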

