[SOLVED] Random migration errors

NdK73

Renowned Member
Jul 19, 2012
Bologna, Italy
www.csshl.net
Hello.

After the last upgrade, I sometimes get migration errors.
I'm using a shared storage system (Dell MD3200).

This one seems to be a race condition:
Code:
2019-04-16 12:29:34 migration status: completed
can't deactivate LV '/dev/DataBox1_r6/vm-116-disk-0': Logical volume DataBox1_r6/vm-116-disk-0 is used by another device.
2019-04-16 12:29:37 ERROR: volume deactivation failed: DataBox1_r6:vm-116-disk-0 at /usr/share/perl5/PVE/Storage.pm line 1087.
2019-04-16 12:29:38 ERROR: migration finished with problems (duration 00:00:40)
TASK ERROR: migration problems

At least it leaves the VM running on the target node.

Other times I'm left with the VM running but locked on the source node, and I have to run
Code:
qm unlock VMID
wait a bit, then I can run
Code:
qm migrate VMID DEST --online
and it usually works.

I once even saw a "no quorum" message after migrating a machine with quite intense activity.

Code:
# pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-50
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-25
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-19
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

The servers only have two 1Gb network interfaces (the only PCIe slot is used by the HBA that accesses the MD3200). The network is configured as a balance-alb bond of the two interfaces, with VLANs for the different networks.

What other information should I collect to better pin down these issues and possibly have them resolved in the next version?

Tks,
Diego
 
can't deactivate LV '/dev/DataBox1_r6/vm-116-disk-0': Logical volume DataBox1_r6/vm-116-disk-0 is used by another device.
Hmm - after a quick search, most results hint at some other dm-mapping for that device. A quick guess: what is vm-116, and does it by any chance have an LVM VG inside the guest?

please post the output of:
* `pvs`
* `vgs`
* `lvs -a`
* `dmsetup ls`

If my assumption is correct, it is probably fixed by adding the device nodes beneath DataBox1_r6 to the global_filter blacklist in lvm.conf.
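Something along these lines should do (a sketch only - keep whatever reject rules are already present in your global_filter; the comment block above the option describes the exact syntax):
Code:
# /etc/lvm/lvm.conf, devices { } section - reject the LV device nodes of the
# shared VG so the host never scans the PVs that the guests created inside them
global_filter = [ "r|/dev/mapper/DataBox1_r6-.*|" ]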

Is the MD3200 connected via SAS/FC/iSCSI? Do you use multipath?

Hope this helps!
 
Hmm - after a quick search, most results hint at some other dm-mapping for that device. A quick guess: what is vm-116, and does it by any chance have an LVM VG inside the guest?

please post the output of:
* `pvs`
Code:
# pvs
  PV                             VG             Fmt  Attr PSize    PFree
  /dev/DataBox1_r6/vm-116-disk-0 Dati           lvm2 a--  1024.00g     0
  /dev/Ricerca/vm-104-disk-1     str957-cluster lvm2 a--    40.02t     0
  /dev/mapper/mp_MD3200_Ricerca  Ricerca        lvm2 a--    40.02t     0
  /dev/mapper/mp_MD3200_r6_0     DataBox1_r6    lvm2 a--    36.38t  1.04t
  /dev/mapper/mp_MD3800i         Databox2_r6    lvm2 a--    32.74t 28.60t
  /dev/sde3                      pve            lvm2 a--   297.84g 16.00g

Code:
# vgs
  VG             #PV #LV #SN Attr   VSize    VFree
  DataBox1_r6      1  29   0 wz--n-   36.38t  1.04t
  Databox2_r6      1   8   0 wz--n-   32.74t 28.60t
  Dati             1   1   0 wz--n- 1024.00g     0
  Ricerca          1   1   0 wz--n-   40.02t     0
  pve              1   3   0 wz--n-  297.84g 16.00g
  str957-cluster   1   1   0 wz--n-   40.02t     0

* `lvs -a`
Code:
# lvs -a
  LV              VG             Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  vm-100-disk-1   DataBox1_r6    -wi-ao----   32.00g                                                   
  vm-100-disk-2   DataBox1_r6    -wi-ao----  512.00g                                                   
  vm-101-disk-1   DataBox1_r6    -wi-------   32.00g                                                   
  vm-101-disk-2   DataBox1_r6    -wi-------  512.00g                                                   
  vm-102-disk-1   DataBox1_r6    -wi-a-----  100.00g                                                   
  vm-102-disk-2   DataBox1_r6    -wi-a-----    1.00t                                                   
  vm-102-disk-3   DataBox1_r6    -wi-a-----    1.86t                                                   
  vm-102-disk-4   DataBox1_r6    -wi-a-----  500.00g                                                   
  vm-104-disk-1   DataBox1_r6    -wi-ao----   32.00g                                                   
  vm-104-disk-2   DataBox1_r6    -wi-ao----    5.00t                                                   
  vm-105-disk-1   DataBox1_r6    -wi-ao----   32.00g                                                   
  vm-105-disk-2   DataBox1_r6    -wi-ao----  500.00g                                                   
  vm-106-disk-1   DataBox1_r6    -wi-a-----  100.00g                                                   
  vm-107-disk-1   DataBox1_r6    -wi-a-----   50.00g                                                   
  vm-107-disk-2   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-108-disk-1   DataBox1_r6    -wi-a-----   50.00g                                                   
  vm-110-disk-1   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-110-disk-2   DataBox1_r6    -wi-a-----  400.00g                                                   
  vm-110-disk-3   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-115-disk-1   DataBox1_r6    -wi-ao----    4.88t                                                   
  vm-116-disk-0   DataBox1_r6    -wi-ao----    1.00t                                                   
  vm-116-disk-1   DataBox1_r6    -wi-ao----    1.00t                                                   
  vm-116-disk-2   DataBox1_r6    -wi-a-----    9.77t                                                   
  vm-120-disk-1   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-120-disk-2   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-125-disk-1   DataBox1_r6    -wi-a-----   50.00g                                                   
  vm-126-disk-1   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-127-disk-1   DataBox1_r6    -wi-ao----    1.95t                                                   
  vm-200-disk-1   DataBox1_r6    -wi-a-----    5.86t                                                   
  vm-103-disk-0   Databox2_r6    -wi-------   32.00g                                                   
  vm-103-disk-1   Databox2_r6    -wi-------   32.00g                                                   
  vm-103-disk-2   Databox2_r6    -wi-------    1.00t                                                   
  vm-109-disk-0   Databox2_r6    -wi-------   32.00g                                                   
  vm-109-disk-1   Databox2_r6    -wi-------    1.46t                                                   
  vm-113-disk-0   Databox2_r6    -wi-------    1.17t                                                   
  vm-113-disk-1   Databox2_r6    -wi-------  100.00g                                                   
  vm-131-disk-0   Databox2_r6    -wi-------  320.00g                                                   
  Cloud           Dati           -wi-a----- 1024.00g                                                   
  vm-104-disk-1   Ricerca        -wi-ao----   40.02t                                                   
  data            pve            twi-a-tz--  195.59g             0.00   0.05                           
  [data_tdata]    pve            Twi-ao----  195.59g                                                   
  [data_tmeta]    pve            ewi-ao----    2.00g                                                   
  [lvol0_pmspare] pve            ewi-------    2.00g                                                   
  root            pve            -wi-ao----   74.25g                                                   
  swap            pve            -wi-ao----    8.00g                                                   
  home            str957-cluster -wi-a-----   40.02t

* `dmsetup ls`
Code:
# dmsetup ls
DataBox1_r6-vm--102--disk--1   (253:8)
DataBox1_r6-vm--116--disk--2   (253:24)
Dati-Cloud   (253:37)
DataBox1_r6-vm--116--disk--1   (253:22)
mp_MD3800i   (253:39)
DataBox1_r6-vm--100--disk--2   (253:6)
pve-data_tdata   (253:35)
DataBox1_r6-vm--116--disk--0   (253:33)
DataBox1_r6-vm--115--disk--1   (253:26)
DataBox1_r6-vm--100--disk--1   (253:5)
DataBox1_r6-vm--120--disk--2   (253:29)
pve-data_tmeta   (253:34)
DataBox1_r6-vm--120--disk--1   (253:21)
DataBox1_r6-vm--108--disk--1   (253:12)
DataBox1_r6-vm--107--disk--2   (253:30)
str957--cluster-home   (253:38)
DataBox1_r6-vm--107--disk--1   (253:11)
pve-swap   (253:2)
pve-root   (253:3)
DataBox1_r6-vm--200--disk--1   (253:27)
pve-data   (253:36)
Ricerca-vm--104--disk--1   (253:4)
DataBox1_r6-vm--127--disk--1   (253:7)
DataBox1_r6-vm--106--disk--1   (253:18)
DataBox1_r6-vm--105--disk--2   (253:19)
DataBox1_r6-vm--110--disk--3   (253:20)
DataBox1_r6-vm--126--disk--1   (253:31)
DataBox1_r6-vm--105--disk--1   (253:17)
DataBox1_r6-vm--110--disk--2   (253:15)
mp_MD3200_Ricerca   (253:1)
DataBox1_r6-vm--104--disk--2   (253:16)
mp_MD3200_r6_0   (253:0)
DataBox1_r6-vm--102--disk--4   (253:28)
DataBox1_r6-vm--125--disk--1   (253:23)
DataBox1_r6-vm--110--disk--1   (253:13)
DataBox1_r6-vm--104--disk--1   (253:10)
DataBox1_r6-vm--102--disk--3   (253:32)
DataBox1_r6-vm--102--disk--2   (253:14)

If my assumption is correct, it is probably fixed by adding the device nodes beneath DataBox1_r6 to the global_filter blacklist in lvm.conf.
You're probably right. And I'll have to do the same for the other shared storage (MD3800i, iSCSI).

Is the MD3200 connected via SAS/FC/iSCSI? Do you use multipath?
Via SAS, and it uses multipath.
 
/dev/DataBox1_r6/vm-116-disk-0 Dati lvm2 a-- 1024.00g 0
/dev/Ricerca/vm-104-disk-1 str957-cluster lvm2 a-- 40.02t 0

* Those two PVs from the `pvs` output are disks of guests - they should not be active on the PVE node itself.
* I'm pretty sure that this is the root of the 'is used by another device' migration error (the quorum loss is probably a separate issue); one way to check what is holding the LV open is sketched below.

* Please add all of the shared storages containing guest images to the blacklist (global_filter - there's a quite descriptive comment above it) in lvm.conf, and report back if the issue still persists.
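One quick way to check what is sitting on top of such an LV on the host (a sketch, device names taken from your outputs above):
Code:
# show what is stacked on top of the guest disk on the host side
lsblk /dev/DataBox1_r6/vm-116-disk-0
# vm-116-disk-0 is dm-33 (253:33 in the dmsetup output); its holders show who keeps it open
ls /sys/block/dm-33/holders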

Hope this helps!
 
I'll have to do some more tests, but you're (quite certainly) right.
I now added "r|/dev/mapper/.*-vm--[0-9]+--disk--[0-9]+|" to the global_filter line.
It excludes the contents of the disks created by Proxmox from the host's LVM visibility. I think it should be safe enough to be included in the default lvm.conf shipped with Proxmox.
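To check whether the filter is actually in effect, something like this should be enough (assuming lvmetad is in use, hence the cache refresh):
Code:
# refresh the LVM metadata cache, then check that the guest-internal PVs disappeared
pvscan --cache
pvs    # /dev/DataBox1_r6/vm-116-disk-0 and /dev/Ricerca/vm-104-disk-1 should no longer show up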

Tks a lot!
 
Confirmed: with the given filter there are no more migration errors due to failed deactivations.
A node reboot is still required, though (unless you want to meddle with the device mappings that LVM "wrongly" created)!
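For the record, the "meddling" would be roughly this (an untested sketch - only if nothing on the host has those mappings open):
Code:
# the filter hides the nested VGs from LVM, so the stale mappings created
# before the change have to be removed by hand (names from `dmsetup ls` above)
dmsetup remove Dati-Cloud
dmsetup remove str957--cluster-home
dmsetup ls | grep -E 'Dati|str957'   # should return nothing afterwards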
 
The servers only have two 1Gb network interfaces (the only PCIe slot is used by the HBA that accesses the MD3200). The network is configured as a balance-alb bond of the two interfaces, with VLANs for the different networks.
balance-alb is an imperfect hammer and can cause exactly the kind of grief you describe. You'd get better results by either using the interfaces individually, or by making multiple active-backup bonds with alternating preferred primaries for your various VLANs. Most important of all: leave as dedicated a path as you can for the cluster (corosync) traffic.
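As a rough sketch of the simplest variant (interfaces used individually) - NIC names, VLAN and addresses are placeholders, adapt to your setup:
Code:
# /etc/network/interfaces (sketch)
# eno1: dedicated to corosync/cluster traffic
auto eno1
iface eno1 inet static
    address 10.10.50.11
    netmask 255.255.255.0

# eno2: management + guest traffic via a VLAN-aware bridge
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.11
    netmask 255.255.255.0
    gateway 192.168.1.1
    bridge_ports eno2
    bridge_stp off
    bridge_fd 0
    bridge_vlan_aware yes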