iSCSI LVM issue

rgproxmox1

I have configured an iSCSI/LVM storage and things seem to be working OK most of the time, but what's weird is that I keep seeing messages like these:

root@prod-pmx6:~# pvscan
Found duplicate PV JGGQcMtzRlxRfdyjZSAxSMFat7ZnO1m6: using /dev/sdc not /dev/sdb
Found duplicate PV UXnEfTJrojVAGmfLHV9hxlf6hBOcXbpV: using /dev/sde not /dev/sdd
Found duplicate PV JGGQcMtzRlxRfdyjZSAxSMFat7ZnO1m6: using /dev/sdf not /dev/sdc
Found duplicate PV UXnEfTJrojVAGmfLHV9hxlf6hBOcXbpV: using /dev/sdg not /dev/sde
PV /dev/sdg VG nas-iscsi-vg2 lvm2 [1.95 TiB / 925.00 GiB free]
PV /dev/sdf VG nas-iscsi-vg lvm2 [2.50 TiB / 1.81 TiB free]
Total: 2 [4.45 TiB] / in use: 2 [4.45 TiB] / in no VG: 0 [0 ]
root@prod-pmx6:~# lvscan
Found duplicate PV JGGQcMtzRlxRfdyjZSAxSMFat7ZnO1m6: using /dev/sdc not /dev/sdb
Found duplicate PV UXnEfTJrojVAGmfLHV9hxlf6hBOcXbpV: using /dev/sde not /dev/sdd
Found duplicate PV JGGQcMtzRlxRfdyjZSAxSMFat7ZnO1m6: using /dev/sdf not /dev/sdc
Found duplicate PV UXnEfTJrojVAGmfLHV9hxlf6hBOcXbpV: using /dev/sdg not /dev/sde
inactive '/dev/nas-iscsi-vg2/vm-100-disk-1' [15.00 GiB] inherit
inactive '/dev/nas-iscsi-vg2/vm-108-disk-1' [500.00 GiB] inherit
inactive '/dev/nas-iscsi-vg2/vm-108-disk-2' [100.00 GiB] inherit
inactive '/dev/nas-iscsi-vg2/vm-108-disk-3' [460.00 GiB] inherit
inactive '/dev/nas-iscsi-vg/vm-101-disk-1' [130.00 GiB] inherit
inactive '/dev/nas-iscsi-vg/vm-102-disk-1' [20.00 GiB] inherit
inactive '/dev/nas-iscsi-vg/vm-103-disk-1' [20.00 GiB] inherit
inactive '/dev/nas-iscsi-vg/vm-112-disk-1' [10.00 GiB] inherit
inactive '/dev/nas-iscsi-vg/vm-104-disk-1' [100.00 GiB] inherit
inactive '/dev/nas-iscsi-vg/vm-151-disk-1' [25.00 GiB] inherit
ACTIVE '/dev/nas-iscsi-vg/vm-106-disk-1' [100.00 GiB] inherit
ACTIVE '/dev/nas-iscsi-vg/vm-106-disk-2' [100.00 GiB] inherit
inactive '/dev/nas-iscsi-vg/vm-107-disk-1' [100.00 GiB] inherit
inactive '/dev/nas-iscsi-vg/vm-107-disk-2' [100.00 GiB] inherit


and it seems to be affecting the overall health of the system, since I have trouble deleting unused disks, like this one: /dev/nas-iscsi-vg2/vm-108-disk-1. In fact, every time I try to delete that LV, the Proxmox host crashes and I have to reset it.

Any ideas on what might be configured incorrectly? This is a 3-host cluster (non-HA), and I'm running 3.2.4.

Thanks in advance!
 
By the way, more information...

We have a second 3-host cluster (also non-HA).

Both clusters can "see" the 2 NAS servers, but each has its own iSCSI/LVM area, which is not supposed to be shared between the two of them. The other cluster also reports similar warnings when running "pvscan" and "lvscan". I first noticed the warnings while checking the backup output, where the messages appear as each server is processed. Thanks again
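
If it helps pin down where the duplicates come from, I suppose the thing to check is which block device each iSCSI session actually provides and whether two of them are really the same LUN, e.g. something like:

iscsiadm -m session -P 3 | grep -E "Target|Attached scsi disk"
ls -l /dev/disk/by-id/ | grep -E "sd[b-g]$"

If two of the /dev/sdX devices turn out to be the same LUN reached through two sessions or portals, that would explain the duplicate PV messages.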
 
The last thing I tried was running:

lvremove /dev/nas-iscsi-vg2/vm-107-disk-1

from the Proxmox host command line. The command hung for more than 10 minutes with nothing happening, so I eventually killed the process. I ended up losing the host again this way.
 
Any input on this problem? This is happening even in another cluster with only one NAS, and I'm getting pressure to fix this.
 
I have a fairly similar problem here.

I have 2 identical nodes and one iSCSI NAS; VM disks from both nodes are on a single iSCSI LUN, and backups go to an NFS share on the same NAS.
One node shows the problem, the other doesn't :S

When I look at the web GUI backup logs from that node, I see that all (apparently successful) backups start like this:

"INFO: Starting Backup of VM 102 (qemu)
INFO: status = running
INFO: update VM 102: -lock backup
Found duplicate PV dB0Su2lTwsYfbcJPhby21PekoyeN3hHS: using /dev/sdc not /dev/sdb
Found duplicate PV dB0Su2lTwsYfbcJPhby21PekoyeN3hHS: using /dev/sdc not /dev/sdb

INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/pve_ts879/dump/vzdump-qemu-102-2015_01_14-01_00_02.vma.lzo'
INFO: started backup task 'f373a23c-82d3-4cd6-a5af-382858f0ac91'
INFO: status: 0% (36044800/12884901888), sparse 0% (3534848), duration 3, 12/10 MB/s"

while the backup log sent by email shows (for the exact same job):
102: Jan 14 01:00:02 INFO: Starting Backup of VM 102 (qemu)
102: Jan 14 01:00:02 INFO: status = running
102: Jan 14 01:00:03 INFO: update VM 102: -lock backup
102: Jan 14 01:00:03 INFO: backup mode: snapshot
102: Jan 14 01:00:03 INFO: ionice priority: 7
102: Jan 14 01:00:03 INFO: creating archive '/mnt/pve/pve_ts879/dump/vzdump-qemu-102-2015_01_14-01_00_02.vma.lzo'
102: Jan 14 01:00:04 INFO: started backup task 'f373a23c-82d3-4cd6-a5af-382858f0ac91'
102: Jan 14 01:00:07 INFO: status: 0% (36044800/12884901888), sparse 0% (3534848), duration 3, 12/10 MB/s

Backup logs from the other node are just fine, with no "Found duplicate" messages whatsoever.

Digging through the first node's logs from the GUI, all the related logs show this, but since the email apparently stripped those lines out, I never noticed.

Could I perhaps find other traces in other system logs? Where?
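
I suppose something like the following might turn up more traces, though I haven't tried it yet (just a guess on my part):

grep -i "duplicate PV" /var/log/syslog /var/log/messages
dmesg | grep -iE "sdb|sdc"

(the first searches the standard Debian log files for the LVM warning, the second checks the kernel ring buffer for anything odd about the two disks)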

Edit: also, the good node shows
#fdisk -l | grep "/dev/sd"
Disk /dev/mapper/pve-root doesn't contain a valid partition table
Disk /dev/mapper/pve-swap doesn't contain a valid partition table
Disk /dev/mapper/pve-data doesn't contain a valid partition table
Disk /dev/sdb doesn't contain a valid partition table
Disk /dev/sda: 72.0 GB, 71999422464 bytes
/dev/sda1 * 2048 1048575 523264 83 Linux
/dev/sda2 1048576 140623871 69787648 8e Linux LVM
Disk /dev/sdb: 1073.7 GB, 1073741824000 bytes

while the bad node shows
#fdisk -l | grep "/dev/sd"
Disk /dev/mapper/pve-root doesn't contain a valid partition table
Disk /dev/mapper/pve-swap doesn't contain a valid partition table
Disk /dev/mapper/pve-data doesn't contain a valid partition table
Disk /dev/sdb doesn't contain a valid partition table
Disk /dev/sdc doesn't contain a valid partition table
Disk /dev/sda: 72.0 GB, 71999422464 bytes
/dev/sda1 * 2048 1048575 523264 83 Linux
/dev/sda2 1048576 140623871 69787648 8e Linux LVM
Disk /dev/sdb: 1073.7 GB, 1073741824000 bytes
Disk /dev/sdc: 1073.7 GB, 1073741824000 bytes
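
To double-check whether /dev/sdb and /dev/sdc on the bad node are really the same LUN showing up twice (which is what the duplicate PV warning suggests), I guess one could compare their signatures, something like:

blkid /dev/sdb /dev/sdc
ls -l /dev/disk/by-id/ | grep -E "sd[bc]$"

If blkid reports the same LVM2_member UUID for both, it's one LUN appearing under two device names (two iSCSI sessions or two portals to the same target, probably).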


Marco
 
What is very confusing to me is the following:

If I run "lvscan" on one of the nodes, I get

Found duplicate PV JGGQcMtzRlxRfdyjZSAxSMFat7ZnO1m6: using /dev/sdd not /dev/sdc
Found duplicate PV UXnEfTJrojVAGmfLHV9hxlf6hBOcXbpV: using /dev/sdf not /dev/sde
Found duplicate PV JGGQcMtzRlxRfdyjZSAxSMFat7ZnO1m6: using /dev/sdg not /dev/sdd
Found duplicate PV UXnEfTJrojVAGmfLHV9hxlf6hBOcXbpV: using /dev/sdh not /dev/sdf

Notice that /dev/sdd is listed both as what I think is a "good" reference for a PV and as a "bad" one, so I can't simply build a filter in /etc/lvm/lvm.conf that discards /dev/sdd (the workaround that some articles found while googling the issue suggest)!
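
The closest thing to a workable filter I can come up with is to stop matching the unstable /dev/sdX names at all and only accept stable paths, something like this in the devices section of /etc/lvm/lvm.conf (just a sketch, untested on my side; the accept patterns would have to match the real local PV and the by-id names of my LUNs):

devices {
    # accept the local PV and the stable by-id names of the LUNs,
    # reject everything else so the duplicate /dev/sdX aliases are never scanned
    filter = [ "a|^/dev/sda2$|", "a|^/dev/disk/by-id/scsi-.*|", "r|.*|" ]
}

From what I've read, if the duplicates really are two paths to the same LUN, the cleaner fix would be dm-multipath, which hands LVM a single /dev/mapper device instead of two /dev/sdX ones.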
 
I just tried the command:

dmsetup remove /dev/<vg_group>/<lvm>

for one of the VMs that Proxmox complains it can't delete because it can't find the disk, and I got:

Device /dev/<vg_group>/<lvm> not found
Command failed

So I'm stuck, unable to delete VMs that have iSCSI LVs or inactive iSCSI disks.
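
For what it's worth, I realize device-mapper doesn't use the /dev/<vg>/<lv> path as its name anyway: the dm name joins VG and LV with a dash and doubles any dashes inside them, and an inactive LV has no device-mapper node at all, so the "not found" is probably expected in my case. For an active LV I'd expect something along these lines (hypothetical names and output):

dmsetup ls | grep vm-108
nas--iscsi--vg2-vm--108--disk--1   (253, 12)
dmsetup remove nas--iscsi--vg2-vm--108--disk--1

That still doesn't help with the inactive ones, of course.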
 
So I'm stuck, unable to delete VMs that have iSCSI LVs or inactive iSCSI disks.

well, deleting VMs can be done from the command line by removing the .conf file (and of course all its disks, usually...)

try removing your .conf file under /etc/pve/nodes/<nameofthenode>/qemu-server/

Marco
 
Thanks Marco. I've thought about that, but this workaround is very impractical in my organization. Most of my users don't have access to the configuration files and I wouldn't even dream of them executing the "dmsetup" command (if it were to work). My management is pretty adamant on getting this issue fixed the proper way so that people can delete disks/VMs from the Web Interface and I have run out of ideas on how to do that.
 
My management is pretty adamant on getting this issue fixed the proper way so that people can delete disks/VMs from the Web Interface and I have run out of ideas on how to do that.

well, in my case (similar iSCSI warning) I believe the NAS that serves the LUN may have problems (with one node in particular, since the other is working well; see also my other thread about this): when everything between PVE and the iSCSI target works correctly, PVE should have no problem removing the VMs and their disks from the web GUI (if users have enough/correct permissions in their roles)! I feel that, like in my case, another problem is making PVE behave weirdly...

Marco
 
After upgrading to the Proxmox 3.4 release, I again tried to delete a VM that had iSCSI disks associated with it. I ended up with a kernel crash while lvremove was executing, which is very concerning:

May 12 15:49:51 proxmox1 kernel: lvremove D ffff880087916040 0 544277 1 0 0x00000000
May 12 15:49:51 proxmox1 kernel: ffff880296885c28 0000000000000086 ffff880296885be8 ffff880415180fb8
May 12 15:49:51 proxmox1 kernel: ffff880432e5bcc0 0000000000000000 ffff880296885bb8 ffffffff8126ecd2
May 12 15:49:51 proxmox1 kernel: ffff880415180fb8 0000000000000000 0000000000000000 0000000000000003
May 12 15:49:51 proxmox1 kernel: Call Trace:
May 12 15:49:51 proxmox1 kernel: [<ffffffff8126ecd2>] ? elv_insert+0x102/0x1c0
May 12 15:49:51 proxmox1 kernel: [<ffffffff81277613>] ? blk_queue_bio+0x123/0x5d0
May 12 15:49:51 proxmox1 kernel: [<ffffffff81560d94>] schedule_timeout+0x204/0x300
May 12 15:49:51 proxmox1 kernel: [<ffffffff81014e89>] ? read_tsc+0x9/0x20
May 12 15:49:51 proxmox1 kernel: [<ffffffff810b12f4>] ? ktime_get_ts+0xb4/0xf0
May 12 15:49:51 proxmox1 kernel: [<ffffffff8155f296>] io_schedule_timeout+0x86/0xe0
May 12 15:49:51 proxmox1 kernel: [<ffffffff815606e7>] wait_for_completion_io+0xd7/0x110
May 12 15:49:51 proxmox1 kernel: [<ffffffff8106d750>] ? default_wake_function+0x0/0x20
May 12 15:49:51 proxmox1 kernel: [<ffffffff8127d497>] blkdev_issue_discard+0x207/0x230
May 12 15:49:51 proxmox1 kernel: [<ffffffff8127dfca>] blkdev_ioctl+0x59a/0x6d0
May 12 15:49:51 proxmox1 kernel: [<ffffffff811edd61>] block_ioctl+0x41/0x50
May 12 15:49:51 proxmox1 kernel: [<ffffffff811c486a>] vfs_ioctl+0x2a/0xa0
May 12 15:49:51 proxmox1 kernel: [<ffffffff811c4e9e>] do_vfs_ioctl+0x7e/0x5a0
May 12 15:49:51 proxmox1 kernel: [<ffffffff811b4b3a>] ? sys_newfstat+0x2a/0x40
May 12 15:49:51 proxmox1 kernel: [<ffffffff811c540f>] sys_ioctl+0x4f/0x80
May 12 15:49:51 proxmox1 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
May 12 15:49:59 proxmox1 kernel: sd 4:0:0:0: [sdc] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
May 12 15:49:59 proxmox1 kernel: sd 4:0:0:0: [sdc] CDB: Unmap/Read sub-channel: 42 00 00 00 00 00 00 00 18 00

(followed by repetitions of the DID_ABORT and Unmap/Read sub-channel messages)

Is this a symptom of a bug in the Debian kernel that Proxmox is using? Is there a fix for this in later releases? I couldn't find much information about it. The kernel my installation is running is:
2.6.32-37-pve
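
From the trace it looks like lvremove is stuck inside blkdev_issue_discard, i.e. it sends an UNMAP/discard down to the iSCSI LUN and the target keeps aborting it (the DID_ABORT on sdc). One thing I'm tempted to try, purely as a guess on my part, is disabling discards on LV removal in the devices section of /etc/lvm/lvm.conf and seeing whether the delete then completes:

devices {
    # 0 = don't send a discard/UNMAP to the underlying LUN when an LV is removed
    issue_discards = 0
}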

Thanks
 
