Snapshot removal fails after backup

We have the exact same problem, running LVM on top of an Intel Modular Server's shared LUN (shared across 3 nodes) using Proxmox 1.7 (KVM only). Sometimes it doesn't happen for months, but now it happened yesterday AND today. That would be about the 9th time in 10 months. It's becoming a real showstopper; it's getting hard to explain that a solution designed to prevent downtime (among other goals) has only caused more downtime.

By the way, we were planning to upgrade to Proxmox 2.1 (more downtime...), but honestly I'm reluctant: it doesn't sound like this will be solved, and if it is, it would probably only be by happy coincidence. I also doubt a paid support subscription would help here.

@dietmar: FYI, once the problem has occurred, dmsetup resume doesn't help us either (it hangs as well).

@e100: did you try the proposed workaround from wyztix? On our side we tried another proposed workaround: a dmsetup suspend before the lvremove, followed by a dmsetup resume afterwards (sketched below). It didn't help either...
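For reference, the sequence we tried looked roughly like this. It's only a sketch, assuming the suspend is done on the snapshot's device-mapper device; all device and volume names are placeholders, not our real ones:

# Placeholder names; adjust to the actual VG/LV involved.
dmsetup suspend /dev/mapper/pve-vzsnap--example   # suspend the snapshot device before removal
lvremove -f /dev/pve/vzsnap-example               # attempt to remove the snapshot LV
dmsetup resume /dev/mapper/pve-vzsnap--example    # resume (only relevant if the removal failed)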
 
I'm afraid that's probably a different issue from the one solved in bug 127. In the case of the crash, the lvremove command doesn't even terminate, so there's no chance of executing the second lvremove. The lvremove command hanging and causing the kernel crash is the real problem here; the snapshot not being removed is, I believe, a consequence, not the problem itself.
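To illustrate why: as I understand it, the bug 127 approach boils down to retrying the removal when the first attempt fails, roughly like this (the path is a placeholder):

# Only meaningful if the first lvremove actually returns with an error:
lvremove -f /dev/pve/vzsnap-example || lvremove -f /dev/pve/vzsnap-example

In our case the first lvremove never returns, so the retry never runs.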

We also didn't have it for a long time (we therefore thought our workaround was successful), until it suddenly happened again. It's very random: it can take months or just one day to happen again.
 
Yes we do, still on 1.7. We want to have enough confidence that it's really solved before upgrading, because testing and upgrading also takes a lot of preparation and additional downtime.

nc1-node2:~# pveversion -v
pve-manager: 1.7-10 (pve-manager/1.7/5323)
running kernel: 2.6.32-4-pve
proxmox-ve-2.6.32: 1.7-30
pve-kernel-2.6.32-4-pve: 2.6.32-30
pve-kernel-2.6.18-2-pve: 2.6.18-5
qemu-server: 1.1-28
pve-firmware: 1.0-10
libpve-storage-perl: 1.0-16
vncterm: 0.9-2
vzctl: 3.0.24-1pve4
vzdump: 1.2-10
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.13.0-3
ksm-control-daemon: 1.0-4
 
I have over 20 nodes running 2.1 and haven't seen this issue since dietmar released the new packages to pve-test.

From what I remember, bug 127 resulted in changes to udev; I believe there was some race condition where udev was doing something, and if that happened at the same time as lvremove, the lockup would occur.
I suppose it is possible the bug still lingers and I have just not seen it, but I highly doubt that given the number of nodes I have running and the number of snapshot backups I perform weekly.
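If the race theory is right, one mitigation would be to let udev's event queue drain before removing the snapshot. This is just my sketch, not necessarily what the actual fix does, and the LV name is a placeholder:

# Wait until udev has processed all pending events, then remove the snapshot.
udevadm settle --timeout=30
lvremove -f /dev/pve/vzsnap-example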
 
Okay, that's a different story... Sounds good in any case. Guess this gives me enough to advocate upgrading. Thanks!
 
We only have two servers here, but we were plagued with the problem on a fairly regular basis. Since upgrading, the problems have completely disappeared. I highly recommend upgrading.
 
