[SOLVED] Random migration errors

NdK73

Hello.

After the last upgrade, I sometimes get migration errors.
I'm using a shared storage system (a Dell MD3200).

This one looks like a race condition:
Code:
2019-04-16 12:29:34 migration status: completed
can't deactivate LV '/dev/DataBox1_r6/vm-116-disk-0': Logical volume DataBox1_r6/vm-116-disk-0 is used by another device.
2019-04-16 12:29:37 ERROR: volume deactivation failed: DataBox1_r6:vm-116-disk-0 at /usr/share/perl5/PVE/Storage.pm line 1087.
2019-04-16 12:29:38 ERROR: migration finished with problems (duration 00:00:40)
TASK ERROR: migration problems

At least it leaves the VM running on the target node.

Other times I'm left with the VM running but locked on the source node, and I have to use
Code:
qm unlock VMID
wait a bit, and then I can
Code:
qm migrate VMID DEST --online
and it usually works.
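When that happens, the manual recovery above can be sketched as a tiny script (VMID and DEST are placeholders; `qm` is the Proxmox VE VM manager CLI):

```shell
#!/bin/sh
# Recover from a migration that left the VM locked on the source node.
VMID=116        # placeholder - the VM that is stuck
DEST=target01   # placeholder - the node to migrate to

# Clear the stale migration lock
qm unlock "$VMID"

# Give the cluster a moment to settle before retrying
sleep 10

# Retry the live migration
qm migrate "$VMID" "$DEST" --online
```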

I once even saw a "no quorum" message after migrating a machine with quite intense activity.

Code:
# pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-50
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-25
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-19
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

Servers only have 2 * 1Gb network interfaces (the only PCIe slot is used by the HBA that accesses the MD3200). The network is configured as a balance-alb bond of the two interfaces, with VLANs for the different networks.

What other information should I collect to better pin down the issues and possibly have them resolved in the next version?

Thanks,
Diego
 
can't deactivate LV '/dev/DataBox1_r6/vm-116-disk-0': Logical volume DataBox1_r6/vm-116-disk-0 is used by another device.
hmm - after a quick search, most results hint at some other dm-mapping for that device. On a quick guess: what is vm-116, and does it by any chance have an LVM VG (for the guest inside)?

please post the output of:
* `pvs`
* `vgs`
* `lvs -a`
* `dmsetup ls`

If my assumption is correct, it can probably be fixed by adding the device nodes beneath DataBox1_r6 to the global_filter blacklist in lvm.conf.

Is the MD3200 connected via SAS/FC/iSCSI? Do you use multipath?

Hope this helps!
 
hmm - after a quick search, most results hint at some other dm-mapping for that device. On a quick guess: what is vm-116, and does it by any chance have an LVM VG (for the guest inside)?

please post the output of:
* `pvs`
Code:
# pvs
  PV                             VG             Fmt  Attr PSize    PFree
  /dev/DataBox1_r6/vm-116-disk-0 Dati           lvm2 a--  1024.00g     0
  /dev/Ricerca/vm-104-disk-1     str957-cluster lvm2 a--    40.02t     0
  /dev/mapper/mp_MD3200_Ricerca  Ricerca        lvm2 a--    40.02t     0
  /dev/mapper/mp_MD3200_r6_0     DataBox1_r6    lvm2 a--    36.38t  1.04t
  /dev/mapper/mp_MD3800i         Databox2_r6    lvm2 a--    32.74t 28.60t
  /dev/sde3                      pve            lvm2 a--   297.84g 16.00g

Code:
# vgs
  VG             #PV #LV #SN Attr   VSize    VFree
  DataBox1_r6      1  29   0 wz--n-   36.38t  1.04t
  Databox2_r6      1   8   0 wz--n-   32.74t 28.60t
  Dati             1   1   0 wz--n- 1024.00g     0
  Ricerca          1   1   0 wz--n-   40.02t     0
  pve              1   3   0 wz--n-  297.84g 16.00g
  str957-cluster   1   1   0 wz--n-   40.02t     0

* `lvs -a`
Code:
# lvs -a
  LV              VG             Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  vm-100-disk-1   DataBox1_r6    -wi-ao----   32.00g                                                   
  vm-100-disk-2   DataBox1_r6    -wi-ao----  512.00g                                                   
  vm-101-disk-1   DataBox1_r6    -wi-------   32.00g                                                   
  vm-101-disk-2   DataBox1_r6    -wi-------  512.00g                                                   
  vm-102-disk-1   DataBox1_r6    -wi-a-----  100.00g                                                   
  vm-102-disk-2   DataBox1_r6    -wi-a-----    1.00t                                                   
  vm-102-disk-3   DataBox1_r6    -wi-a-----    1.86t                                                   
  vm-102-disk-4   DataBox1_r6    -wi-a-----  500.00g                                                   
  vm-104-disk-1   DataBox1_r6    -wi-ao----   32.00g                                                   
  vm-104-disk-2   DataBox1_r6    -wi-ao----    5.00t                                                   
  vm-105-disk-1   DataBox1_r6    -wi-ao----   32.00g                                                   
  vm-105-disk-2   DataBox1_r6    -wi-ao----  500.00g                                                   
  vm-106-disk-1   DataBox1_r6    -wi-a-----  100.00g                                                   
  vm-107-disk-1   DataBox1_r6    -wi-a-----   50.00g                                                   
  vm-107-disk-2   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-108-disk-1   DataBox1_r6    -wi-a-----   50.00g                                                   
  vm-110-disk-1   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-110-disk-2   DataBox1_r6    -wi-a-----  400.00g                                                   
  vm-110-disk-3   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-115-disk-1   DataBox1_r6    -wi-ao----    4.88t                                                   
  vm-116-disk-0   DataBox1_r6    -wi-ao----    1.00t                                                   
  vm-116-disk-1   DataBox1_r6    -wi-ao----    1.00t                                                   
  vm-116-disk-2   DataBox1_r6    -wi-a-----    9.77t                                                   
  vm-120-disk-1   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-120-disk-2   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-125-disk-1   DataBox1_r6    -wi-a-----   50.00g                                                   
  vm-126-disk-1   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-127-disk-1   DataBox1_r6    -wi-ao----    1.95t                                                   
  vm-200-disk-1   DataBox1_r6    -wi-a-----    5.86t                                                   
  vm-103-disk-0   Databox2_r6    -wi-------   32.00g                                                   
  vm-103-disk-1   Databox2_r6    -wi-------   32.00g                                                   
  vm-103-disk-2   Databox2_r6    -wi-------    1.00t                                                   
  vm-109-disk-0   Databox2_r6    -wi-------   32.00g                                                   
  vm-109-disk-1   Databox2_r6    -wi-------    1.46t                                                   
  vm-113-disk-0   Databox2_r6    -wi-------    1.17t                                                   
  vm-113-disk-1   Databox2_r6    -wi-------  100.00g                                                   
  vm-131-disk-0   Databox2_r6    -wi-------  320.00g                                                   
  Cloud           Dati           -wi-a----- 1024.00g                                                   
  vm-104-disk-1   Ricerca        -wi-ao----   40.02t                                                   
  data            pve            twi-a-tz--  195.59g             0.00   0.05                           
  [data_tdata]    pve            Twi-ao----  195.59g                                                   
  [data_tmeta]    pve            ewi-ao----    2.00g                                                   
  [lvol0_pmspare] pve            ewi-------    2.00g                                                   
  root            pve            -wi-ao----   74.25g                                                   
  swap            pve            -wi-ao----    8.00g                                                   
  home            str957-cluster -wi-a-----   40.02t

* `dmsetup ls`
Code:
# dmsetup ls
DataBox1_r6-vm--102--disk--1   (253:8)
DataBox1_r6-vm--116--disk--2   (253:24)
Dati-Cloud   (253:37)
DataBox1_r6-vm--116--disk--1   (253:22)
mp_MD3800i   (253:39)
DataBox1_r6-vm--100--disk--2   (253:6)
pve-data_tdata   (253:35)
DataBox1_r6-vm--116--disk--0   (253:33)
DataBox1_r6-vm--115--disk--1   (253:26)
DataBox1_r6-vm--100--disk--1   (253:5)
DataBox1_r6-vm--120--disk--2   (253:29)
pve-data_tmeta   (253:34)
DataBox1_r6-vm--120--disk--1   (253:21)
DataBox1_r6-vm--108--disk--1   (253:12)
DataBox1_r6-vm--107--disk--2   (253:30)
str957--cluster-home   (253:38)
DataBox1_r6-vm--107--disk--1   (253:11)
pve-swap   (253:2)
pve-root   (253:3)
DataBox1_r6-vm--200--disk--1   (253:27)
pve-data   (253:36)
Ricerca-vm--104--disk--1   (253:4)
DataBox1_r6-vm--127--disk--1   (253:7)
DataBox1_r6-vm--106--disk--1   (253:18)
DataBox1_r6-vm--105--disk--2   (253:19)
DataBox1_r6-vm--110--disk--3   (253:20)
DataBox1_r6-vm--126--disk--1   (253:31)
DataBox1_r6-vm--105--disk--1   (253:17)
DataBox1_r6-vm--110--disk--2   (253:15)
mp_MD3200_Ricerca   (253:1)
DataBox1_r6-vm--104--disk--2   (253:16)
mp_MD3200_r6_0   (253:0)
DataBox1_r6-vm--102--disk--4   (253:28)
DataBox1_r6-vm--125--disk--1   (253:23)
DataBox1_r6-vm--110--disk--1   (253:13)
DataBox1_r6-vm--104--disk--1   (253:10)
DataBox1_r6-vm--102--disk--3   (253:32)
DataBox1_r6-vm--102--disk--2   (253:14)

If my assumption is correct, it can probably be fixed by adding the device nodes beneath DataBox1_r6 to the global_filter blacklist in lvm.conf.
You're probably right. And I'll have to do the same for the other shared storage (MD3800i, iSCSI).

Is the MD3200 connected via SAS/FC/iSCSI? Do you use multipath?
Via SAS, and it uses multipath.
 
/dev/DataBox1_r6/vm-116-disk-0 Dati lvm2 a-- 1024.00g 0
/dev/Ricerca/vm-104-disk-1 str957-cluster lvm2 a-- 40.02t 0

* Those two PVs from the `pvs` output are disks of guests - they should not be active on the PVE node itself.
* I'm pretty sure this is the root of the 'is used by another device' migration error (the lost quorum is probably a separate issue).
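For anyone hitting the same 'used by another device' failure, the stacking can be inspected directly. A sketch (the dm-33 minor number is taken from the `dmsetup ls` output above and will differ on other hosts):

```shell
# Show the device-mapper dependency tree; a guest LV that is itself
# a PV shows up with further dm devices stacked on top of it.
dmsetup ls --tree

# Ask sysfs which devices hold vm-116-disk-0 open
# (dm-33 is its minor number in the `dmsetup ls` output above).
ls /sys/block/dm-33/holders
```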

* Please add all shared storages containing guest images to the blacklist in lvm.conf (global_filter - there's a quite descriptive comment above it) - and report back if the issue still persists.
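For illustration, such a filter could look like this in /etc/lvm/lvm.conf (a sketch only - the VG names are taken from the `pvs` output in this thread; adapt the reject patterns to your own setup):

```
devices {
    # Reject the LV nodes of the shared VGs so the host never scans
    # PVs that live inside guest disks; the multipath PVs backing the
    # VGs (e.g. /dev/mapper/mp_MD3200_r6_0) remain accepted.
    global_filter = [ "r|/dev/DataBox1_r6/.*|", "r|/dev/Databox2_r6/.*|", "a|.*|" ]
}
```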

Hope this helps!
 
I'll have to do some more tests, but you're (quite certainly) right.
I've now added "r|/dev/mapper/.*-vm--[0-9]+--disk--[0-9]+|" to the global_filter line.
It excludes the contents of all disks created by Proxmox from the host's LVM visibility. I think it should be safe enough to be included in the default lvm.conf shipped with Proxmox.
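As a quick sanity check, the pattern can be exercised against device names from the `dmsetup ls` output above using grep's extended-regex syntax (which is close to LVM's filter-regex dialect):

```shell
pattern='/dev/mapper/.*-vm--[0-9]+--disk--[0-9]+'

# A guest disk created by Proxmox: should match, i.e. be rejected by the filter
echo '/dev/mapper/DataBox1_r6-vm--116--disk--0' | grep -Eq "$pattern" \
  && echo 'guest disk: matched (filtered out)'

# The multipath PV backing the VG: must NOT match, so it stays visible
echo '/dev/mapper/mp_MD3200_r6_0' | grep -Eq "$pattern" \
  || echo 'multipath PV: not matched (kept)'
```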

Thanks a lot!
 
Confirmed: with the given filter there are no more migration errors due to failed deactivations.
Anyway, a node reboot is required (unless you want to meddle with devices "wrongly" created by LVM)!
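For the record, the reboot can in principle be avoided by tearing the stale state down by hand - a sketch using the VG/LV names from this thread; double-check that nothing on the host is actually using them first:

```shell
# Deactivate the nested VGs that were activated from inside guest disks
vgchange -an Dati
vgchange -an str957-cluster

# Remove the stale device-mapper node of the nested LV, then let LVM
# re-scan devices under the new global_filter
dmsetup remove Dati-Cloud
pvscan --cache
```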
 
Servers only have 2 * 1Gb network interfaces (the only PCIe slot is used by the HBA that accesses the MD3200). The network is configured as a balance-alb bond of the two interfaces, with VLANs for the different networks.
balance-alb is an imperfect hammer, and it can cause you grief of the kind you describe. You'd get better results either by using the interfaces individually, or by making multiple active-backup bonds with alternating preferred primaries for your various VLANs. Most important of all: leave as dedicated a path as you can to cluster (corosync) traffic.
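Sketched in /etc/network/interfaces terms (eno1/eno2, the bridge name, and the address are assumptions - adapt them to your NICs), one such active-backup bond with an explicit preferred primary might look like:

```
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode active-backup
    bond-primary eno1
    bond-miimon 100

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
```

A second bond for cluster traffic would then set bond-primary eno2, so each link is the preferred path for a different class of traffic.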
 
