LVM snapshot lockups on several servers since upgrading to Proxmox 2.3

HostVPS

Hi,

Since we upgraded our nodes to PVE 2.3 we cannot use LVM snapshot backups on OpenVZ CTs anymore. If we do, the LVM subsystem locks up, we lose all drive I/O, and we are forced to reboot the node.

At first I thought this was an issue with the LSI RAID card drivers on two of our newer nodes, but I then confirmed the driver had not changed between PVE 2.2 and 2.3. Over the last week I've been testing this issue on a couple of other nodes running completely different hardware RAID cards, main boards and CPUs, and even at zero load we still get an LVM lockup when using LVM snapshot backups on an OpenVZ CT.

All of these servers were completely stable before the upgrade from PVE 2.2 to PVE 2.3 and had performed hundreds of hours of LVM snapshot backups without a single problem. What has changed in PVE 2.3 that could be causing this new instability with LVM snapshots?
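
For reference, these are standard snapshot-mode vzdump jobs; the command we run looks roughly like this (the CTID and storage name below are just placeholders):

Code:
# example only - CTID and storage name are placeholders
vzdump 101 --mode snapshot --compress lzo --storage backup-nfs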

We are enjoying the new non-LVM live backups for KVM VPSs that the Proxmox team introduced (which are brilliant, BTW!) and will be supporting the team's work with some community support subscriptions for our nodes, but we still need to get this issue resolved, as we currently have no way of doing zero-downtime backups on our OpenVZ CTs.

Regards,

Bob
 
Sorry, I thought the details had changed enough to warrant a new thread, as this fault now affects several different server configurations and is not just an issue with LSI RAID card drivers as I first thought.

I have spent a lot of time over the last week trying to find more facts. The only change I can find is the new PVE version, and we are now seeing this on several completely different hardware configurations. I don't know where to go from here and was hoping you would have some more advice on how to resolve this issue.

Regards,

Bob
 
Hi Tom,

Thanks for your reply, but I don't think a remote support login to just one of our servers with the LVM problem is going to help, as it's affecting several servers with completely different hardware configurations; the only common factor is the new PVE version.

I've just found that if I use PVE 2.3 with the last PVE 2.2 kernel (2.6.32-17-pve) instead of the 2.6.32-18-pve kernel that came with PVE 2.3, LVM snapshot backups work again on each of the previously affected nodes. It took a while to test this, as we had to move a lot of client VPSs to other nodes first, but it does look like this LVM snapshot issue is caused by the latest 2.6.32-18-pve kernel.
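
In case it helps anyone else, dropping back to the older kernel was nothing special, roughly along these lines (package name as it appears in the PVE repository):

Code:
# install the older PVE 2.2 kernel alongside the current one, if it is not already installed
apt-get install pve-kernel-2.6.32-17-pve
# then pick the 2.6.32-17-pve entry in GRUB at boot, or set it as the default in /etc/default/grub and run update-grub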

Regards,

Bob
 
I've just found that if I use PVE 2.3 with the last PVE 2.2 kernel (2.6.32-17-pve) instead of the 2.6.32-18-pve kernel that came with PVE 2.3, LVM snapshot backups work again on each of the previously affected nodes.

This is strange, because the changelog does not mention any LVM-related change. Maybe you can also test with the latest kernel from the pvetest repository?
 
This is strange, because the changelog does not mention any LVM-related change. Maybe you can also test with the latest kernel from the pvetest repository?

Hi Dietmar,

I installed the latest 2.6.32-19-pve kernel from pvetest last night and left two previously affected nodes with different hardware configurations (one box running a 3ware 9690SA RAID card, the other an LSI MR9260, with different CPUs and main boards) running overnight doing large LVM snapshot backups to our NFS backup servers. We have not had any LVM snapshot issues at all with 2.6.32-17-pve or the latest 2.6.32-19-pve kernel, so it does seem to be the 2.6.32-18-pve kernel that's causing the issue.
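
For anyone wanting to repeat the test, pulling in the test kernel was roughly as follows (assuming the standard pvetest repository line for PVE 2.x on Squeeze):

Code:
# add the pvetest repository (assumed line for PVE 2.x on Debian Squeeze)
echo "deb http://download.proxmox.com/debian squeeze pvetest" >> /etc/apt/sources.list
apt-get update
apt-get install pve-kernel-2.6.32-19-pve
# reboot and confirm the running kernel with: uname -r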

Best regards,

Bob
 
Update: the LVM snapshot problem still exists when using the latest testing PVE kernel, 2.6.32-19-pve (I have tried both 2.6.32-19-pve_2.6.32-92 and 2.6.32-19-pve_2.6.32-93). The PVE volume group locks up as soon as the snapshot backup starts, and this happens most of the time, apart from the odd occasion when it works without crashing.

The last PVE 2.2 kernel, 2.6.32-17-pve, is now the only kernel that can perform an LVM snapshot on our servers without an LVM crash. This is happening across three different hardware configurations on several servers and is completely repeatable.

I hope this feedback helps resolve this matter in future versions of the PVE kernel. Please let me know if you want me to try out any new kernel builds; I'll keep a few nodes free for testing this issue.


Best regards,

Bob
 
As long as we can't see this in our test lab, it's more or less impossible to find the reason for your issue. We need a reproducible test case. Since we do not see it here and you get it every time, there must be something different on all your boxes.
 
Hi Tom,

I completely follow what you're saying. Just to confirm the details: these nodes were installed using the latest version of Proxmox 2.x with no changes to any of the standard Proxmox configuration. Our servers all run on the last three generations of Intel main boards and CPUs with LSI and 3ware RAID cards, and we never had an issue with LVM backups until using the 2.6.32-18-pve kernel in PVE 2.3 or the later testing kernel versions.

If you don't know anything about it, this must be something in the upstream kernel code. How can I confirm which upstream version each Proxmox kernel is based on? Many thanks for your help.

Regards,

Bob
 
As long as we can't see this in our test lab, it's more or less impossible to find the reason for your issue. We need a reproducible test case. Since we do not see it here and you get it every time, there must be something different on all your boxes.

Do you test against both ext3 and ext4 on your test lab servers? We are running ext4, which is a non-default install option that we told the Proxmox installer to use when setting up all of these nodes.
 
For OpenVZ, we always recommend ext3. As it's the default, it's the most tested setup.

In our labs we also run tests with several file systems; the result is that we recommend ext3 for OpenVZ.
 
For OpenVZ, we always recommend ext3. As it's the default, it's the most tested setup.

In our labs we also run tests with several file systems; the result is that we recommend ext3 for OpenVZ.

Thanks Tom, I came to the same conclusion and rebuilt one of the affected nodes using ext3. I'm testing now on a fresh PVE 2.3 install with the current default kernel and have not seen any LVM snapshot problems so far. It's weird that LVM snapshots on ext4 ran without issue until this latest kernel...
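
Just to be sure the rebuilt node really is on ext3 this time, I'm checking the data volume with something like:

Code:
grep pve-data /proc/mounts
# should now report ext3 rather than ext4 for /var/lib/vz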

Best regards,

Bob
 
So you are reporting that as soon as you run with ext4, LVM snapshot backups for CTs stop working.

Do you run any other non-default configs? Please let me know and I will do more tests here.
 
Hi Tom,

I've done over a dozen LVM snapshot backup tests on both nodes now, and I have not seen a single crash or problem with PVE 2.3 and the current 2.6.32-18-pve kernel as long as you run ext3, NOT ext4.

Regarding configs: I reinstalled two of the affected nodes, one box running a 3ware 9690SA RAID card and the other an LSI MR9260, both with different generations of Intel CPUs and server main boards. This time I used "linux maxroot=16 swapsize=16" at the installer boot prompt instead of our normal "linux ext4 maxroot=16 swapsize=16". We also add "size: 16380" to vzdump.conf on most nodes (though we still had crashes on boxes with standard vzdump settings when using ext4) and set up NFS shares on a private IP over a second network card. That's all we change on the Proxmox side of things; I also configure ebtables and a few other details for end-user hosting purposes, but all tests were run with just the basic Proxmox install as detailed here.
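
For clarity, that vzdump change is just the LVM snapshot size setting in /etc/vzdump.conf, i.e. something like:

Code:
# /etc/vzdump.conf - LVM snapshot size (in MB) used for snapshot-mode backups
size: 16380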

It looks like the current 2.6.32-18-pve kernel and ext4 make LVM snapshots completely unstable. Please let me know if you can reproduce the same issue in your labs; I've lost a lot of time to this problem this week and would like to know if we finally have it confirmed as a real bug. Many thanks for your help.

Best regards,

Bob
 
I still cannot reproduce it here. This test host (Adaptec 6805, RAID 5 with 6 x 250 GB SATA) uses an installation with ext4. The backup target is an NFS server (v3) on Debian Squeeze (with the PVE kernel).

Code:
INFO: starting new backup job: vzdump 110 --remove 0 --mode snapshot --compress lzo --storage store-nfs-mits2-hp --node proxmox-7-106
INFO: Starting Backup of VM 110 (openvz)
INFO: CTID 110 exist mounted running
INFO: status = running
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating lvm snapshot of /dev/mapper/pve-data ('/dev/pve/vzsnap-proxmox-7-106-0')
INFO:   Logical volume "vzsnap-proxmox-7-106-0" created
INFO: creating archive '/mnt/pve/store-nfs-mits2-hp/dump/vzdump-openvz-110-2013_03_19-09_04_07.tar.lzo'
INFO: Total bytes written: 146093312000 (137GiB, 44MiB/s)
INFO: archive file size: 71.16GB
INFO: Finished Backup of VM 110 (00:53:35)
INFO: Backup job finished successfully
TASK OK

Code:
root@proxmox-7-106:/# cat /proc/mounts
/dev/mapper/pve-data /var/lib/vz ext4 rw,relatime,barrier=1,data=ordered 0 0

root@proxmox-7-106:/# uname -a
Linux proxmox-7-106 2.6.32-18-pve #1 SMP Mon Jan 21 12:09:05 CET 2013 x86_64 GNU/Linux
 
I still cannot reproduce it here. This test host (Adaptec 6805, RAID 5 with 6 x 250 GB SATA) uses an installation with ext4. The backup target is an NFS server (v3) on Debian Squeeze (with the PVE kernel).


Many thanks for the info Tom, it's back to the drawing board for me then!

We were fine with ext4 until PVE 2.3 with the 2.6.32-18-pve kernel; even PVE 2.3 works when running 2.6.32-17-pve. But the 2.6.32-18-pve and testing 2.6.32-19-pve kernels are not stable for us with ext4 across three different hardware configurations, while all of them are fine with ext3. Weird!

I cannot expect any more from you, as it must be something in the upstream/OpenVZ kernel that's having issues with our hardware. It would be easier to accept if it were just one type of hardware setup or RAID card having the problem, but across three different hardware platforms it is very strange indeed!

I will post again if I ever get to the bottom of this issue, but we may just have to accept it and drop back to ext3 on our systems that need to do OpenVZ LVM snapshots.

Best regards,

Bob
 
You might want to report the problem to the OpenVZ bugzilla. The folks there are very knowledgeable and know the kernel inside out:
https://bugzilla.openvz.org/

Just don't forget to report the original OpenVZ kernel version numbers together with the Proxmox version numbers, so they can identify the affected kernel easily.
 
You might want to report the problem to the OpenVZ bugzilla. The folks there are very knowledgeable and know the kernel inside out:
https://bugzilla.openvz.org/

Just don't forget to report the original OpenVZ kernel version numbers together with the Proxmox version numbers, so they can identify the affected kernel easily.

Thanks for the good advice, I'll look into doing that.
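
If it helps anyone else filing a similar report, the version details I plan to gather are roughly these (the changelog path is an assumption based on the standard Debian package layout):

Code:
pveversion -v    # Proxmox package versions, including the pve-kernel package
uname -a         # running kernel build
# the pve-kernel changelog should mention which vzkernel (OpenVZ) release the build is based on
zcat /usr/share/doc/pve-kernel-$(uname -r)/changelog.Debian.gz | head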

Regards,

Bob
 
I updated one of our previously affected nodes today to the new kernel pve-kernel-2.6.32-19 (2.6.32-95) from the pvetest repository, which is based on the latest OpenVZ kernel (vzkernel-2.6.32-042stab076.5.src.rpm). We can now run OpenVZ LVM snapshot backups on ext4 again without the LVM lockup issue, so hopefully the kernel bug that affected our hardware has been resolved.
 
