lvremove hangs on backup, uninterruptible sleep, twice in a row now

Ray

New Member
Jul 9, 2009
19
0
1
Hello,

I use Proxmox with the latest 2.6.32 kernel (see detailed version information below), and this machine runs several KVM guests whose raw images live on an iSCSI LVM volume.

The machine has been running in this configuration for over a month now. On Saturday one KVM machine (a Zimbra server) suddenly stopped. On Monday I could not reach that KVM via ssh and Proxmox did not let me log in. I tried from another node but could not do anything with the node in question.

After restarting pvedaemon I noticed vgs and vgscan processes lying around. I investigated further and found that the cron job doing the backups was hanging uninterruptibly on lvremove.

There was nothing really relevant in the log files and the output from lsscsi looks fine, but all LVM tools (vgdisplay, pvdisplay) hang as soon as they touch the volume group in question (vmstorage); all other PVs and VGs display fine (after a Ctrl-C).
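For anyone who wants to check the same thing, this is roughly how I confirmed that the processes are stuck in uninterruptible sleep and got kernel stack traces of the blocked tasks (nothing Proxmox-specific, just standard tools):

# list processes in uninterruptible sleep (state D) together with the
# kernel function they are waiting in
ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/'

# dump stack traces of all blocked tasks to the kernel log
# (requires the magic SysRq key to be enabled)
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 100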

The other KVM machines seem to be fine as well; I can ssh into them, etc.

I rebooted the node and everything worked fine until the nightly backup, when the same problem occurred again. Now I am worried, because *nothing* in the whole setup has changed.

All KVM machines are running either Debian or Ubuntu 8.04 LTS.

All logs inside the Zimbra KVM machine just stop at a specific time. Around this time I cannot find anything unusual in the logs of the host machine.


Here are some more details gathered from the host machine:
root 3064 0.0 0.0 19832 1040 ? Ss Mar15 0:00 /usr/sbin/cron
root 8857 0.0 0.0 28372 992 ? S Mar15 0:00 \_ /USR/SBIN/CRON
root 8859 0.0 0.1 47824 13096 ? Ss Mar15 0:00 \_ /usr/bin/perl -w /usr/sbin/vzdump --quiet --node 1 --snapshot --compress --storage backup-bagdad
root 21200 0.0 0.0 38668 7892 ? S Mar15 0:00 \_ /usr/bin/perl -w /usr/sbin/pvesm lock KVM 60
root 21201 0.0 0.1 25840 13568 ? D<L Mar15 0:00 \_ lvremove -f /dev/vmstorage/vzsnap-node-04-0

Hanging pvedaemon (but interruptible):
root 21246 0.0 0.0 15492 1516 ? S Mar15 0:00 | \_ /sbin/vgs --separator : --noheadings --units k --unbuffered --nosuffix --options vg_name,vg_size
root 4494 0.0 0.2 88116 24096 ? S Mar15 0:08 \_ pvedaemon worker
root 21219 0.0 0.0 15492 1516 ? S Mar15 0:00 \_ /sbin/vgs --separator : --noheadings --units k --unbuffered --nosuffix --options vg_name,vg_size
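For context, as far as I understand it, vzdump in snapshot mode does roughly the following with LVM. This is a simplified sketch with example names, not the literal commands vzdump runs:

# create a temporary snapshot of the logical volume that holds the VM disk
lvcreate --size 1G --snapshot --name vzsnap-node-04-0 /dev/vmstorage/vm-101-disk-1

# ... read the snapshot and write the compressed backup archive ...

# remove the snapshot again -- this is the step that hangs here
lvremove -f /dev/vmstorage/vzsnap-node-04-0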


iSCSI Information:
Loading iSCSI transport class v2.0-870.
iscsi: registered transport (tcp)
iscsi: registered transport (iser)
scsi5 : iSCSI Initiator over TCP/IP
scsi6 : iSCSI Initiator over TCP/IP
scsi7 : iSCSI Initiator over TCP/IP
scsi 5:0:0:0: Direct-Access OPNFILER VIRTUAL-DISK 0 PQ: 0 ANSI: 4
sd 5:0:0:0: Attached scsi generic sg4 type 0
scsi 6:0:0:0: Direct-Access OPNFILER VIRTUAL-DISK 0 PQ: 0 ANSI: 4
sd 6:0:0:0: Attached scsi generic sg5 type 0
sd 6:0:0:0: [sdc] 52822016 512-byte logical blocks: (27.0 GB/25.1 GiB)
sd 5:0:0:0: [sdb] 246743040 512-byte logical blocks: (126 GB/117 GiB)
sd 5:0:0:0: [sdb] Write Protect is off
sd 5:0:0:0: [sdb] Mode Sense: 77 00 00 08
sd 6:0:0:0: [sdc] Write Protect is off
sd 6:0:0:0: [sdc] Mode Sense: 77 00 00 08
sd 5:0:0:0: [sdb] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
sd 6:0:0:0: [sdc] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
sdb:
sdc: sdc1
sd 6:0:0:0: [sdc] Attached SCSI disk
unknown partition table
sd 5:0:0:0: [sdb] Attached SCSI disk

Kernel:
Linux node-04 2.6.32-1-pve #1 SMP Fri Jan 15 11:37:39 CET 2010 x86_64 GNU/Linux

node-04:/var/log# dpkg -l | egrep "(lvm|devm)"
ii libdevmapper1.02.1 2:1.02.27-4 The Linux Kernel Device Mapper userspace library
ii lvm2 2.02.39-7 The Linux Logical Volume Manager


node-04:/var/log# cat /etc/debian_version
5.0.4


lsscsi output:
node-04:/etc# lsscsi --long
[0:0:0:0] cd/dvd TSSTcorp CDDVDW TS-L633B IB03 /dev/sr0
state=running queue_depth=1 scsi_level=6 type=5 device_blocked=0 timeout=30
[4:0:0:0] disk ATA ST9320423AS SDM1 -
state=running queue_depth=64 scsi_level=6 type=0 device_blocked=0 timeout=0
[4:0:1:0] disk ATA ST9320423AS SDM1 -
state=running queue_depth=64 scsi_level=6 type=0 device_blocked=0 timeout=0
[4:1:2:0] disk LSILOGIC Logical Volume 3000 /dev/sda
state=running queue_depth=64 scsi_level=3 type=0 device_blocked=0 timeout=30
[5:0:0:0] disk OPNFILER VIRTUAL-DISK 0 /dev/sdb
state=running queue_depth=32 scsi_level=5 type=0 device_blocked=0 timeout=30
[6:0:0:0] disk OPNFILER VIRTUAL-DISK 0 /dev/sdc
state=running queue_depth=32 scsi_level=5 type=0 device_blocked=0 timeout=30


Is anyone experiencing the same? Any solutions? Everything I found on Google relates to older LVM software problems which should already be fixed in the releases installed on this node.

best
Ray
 
I had the exact same issue this morning. I back up all my VMs from 1:00 to 4:00, with a one-hour delay between them. As of 3:00 that specific VM wasn't working anymore; the other VMs worked just fine, but the lvremove process was hanging and couldn't be killed, and an fdisk would stall the ssh connection.

The only difference in my setup is that I'm not using iSCSI but LVM over MD RAID with DRBD to the slave server in the cluster, and libdevmapper is newer:

vm01:/var/log# dpkg -l | egrep "(lvm|devm)"
ii libdevmapper1.02.1 2:1.02.38-2.1~bpo50+1 The Linux Kernel Device Mapper userspace lib
ii lvm2 2.02.39-7 The Linux Logical Volume Manager

Anyway, I ended up rebooting the server, which brings me to my next question:

Why couldn't I start the crashed VM on the slave server? What's the use of clustering and of using DRBD to keep the slave up to date if it's not possible to start the VM there (I got a "write access denied" error)?

If the master dies for some reason, I can't start the VMs on the slave.
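In case it helps: assuming a plain two-node DRBD setup where the resource is Primary only on the master, the slave refuses writes as long as its side is Secondary, so a VM cannot open the disk there. Something along these lines (the resource name r0 is just a placeholder) shows the state and promotes the slave:

# on the slave: check role, connection state and disk state of the resource
drbdadm role r0
drbdadm cstate r0
drbdadm dstate r0

# promote the slave to Primary so a VM can open the device read-write
# (only do this if the master is really down or has been demoted!)
drbdadm primary r0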
 
Hi, for me going back to Proxmox kernel 2.6.18 stabilized things somewhat; that is, the problem has not occurred once since going back to 2.6.18.

So my best guess is that it's some problem with the kernel, and I am afraid of upgrading my production systems to 2.6.32. That is also awkward because the new Lucid LTS will not run properly under the 2.6.18 kernel.

but my kernel problems are for another thread :)

best

Update:
I think it could be related to 2.6.32 and the way it handles the LVM volumes. I guess you access the same LVM volume on the cluster secondary as on the cluster primary? If so, it seems logical that while the lvremove hangs uninterruptibly, the LVM volume stays locked, which would explain the "write access denied" from the secondary ...

I do not know whether it would be possible to clear the lock on the LVM volume from the secondary node, or whether that would even be safe, but if only I had the time to play with these problems in a lab *sigh* ...
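If anybody wants to poke at it, the device-mapper state can at least be inspected without touching the LVM metadata. This is only how I would start looking, not a recommendation; vmstorage and vzsnap are the names from my setup:

# show open count and suspend state of all device-mapper devices;
# a suspended or still-open snapshot device is a good hint
dmsetup info -c

# show the mapping tables of the snapshot devices
dmsetup table | grep vzsnap

# LVM's own view (this may hang as well while the VG is blocked)
lvs -a -o +devices vmstorage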

Where should this problem be reported?
 
I'll post this, with a reference to this thread, to the LVM mailing list as well; let's see if they have something on it. I'll keep this thread updated.

best
 
Cool, thanks Ray. I'm keeping track of this one; hope to see some replies soon.
 
Hi
I also had the same issue.
Does anyone have a solution or any advice?
I can't downgrade my Proxmox for quite a while because it's a production environment.
What are your recommendations to avoid these crashes? I couldn't keep logs of the crash and I cannot say exactly whether two snapshots were running at the same time.
I use Bacula; every night it runs a backup job that creates an LVM snapshot in a run-before-job script (roughly as in the sketch below). LVM crashed during the post-backup script.
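Roughly, the before/after job scripts do something like this; the volume and mount point names below are only placeholders, not my real configuration:

# run-before-job script: create and mount an LVM snapshot of the data volume
lvcreate --size 2G --snapshot --name backup-snap /dev/vg0/vm-data
mount /dev/vg0/backup-snap /mnt/backup-snap

# run-after-job script: unmount and remove the snapshot
# (the lvremove here is where it hung)
umount /mnt/backup-snap
lvremove -f /dev/vg0/backup-snap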

Thanks for your help

Hugo
 
Hi
I posted on the pve mailing list.
I have done some tests under various conditions.
My script was doing a kpartx -d and then an lvremove. It crashed many times with the same symptoms as yours.
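In outline, the cleanup part of the script was just this (placeholder names again):

# drop the partition mappings that were created earlier with "kpartx -a"
kpartx -d /dev/vg0/backup-snap
# then remove the snapshot itself -- this is the command that used to hang
lvremove -f /dev/vg0/backup-snap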
I was on kernel package 2.6.32-30.
After upgrading to 2.6.32-32, the commands seem to run slowly, but it's stable. :)

Have you tried this kernel?
 
