[SOLVED] Random migration errors

NdK73

Renowned Member
Jul 19, 2012
Bologna, Italy
www.csshl.net
Hello.

After the last upgrade, I sometimes get migration errors.
I'm using a shared storage system (Dell MD3200).

This one seems to be a race condition:
Code:
2019-04-16 12:29:34 migration status: completed
can't deactivate LV '/dev/DataBox1_r6/vm-116-disk-0': Logical volume DataBox1_r6/vm-116-disk-0 is used by another device.
2019-04-16 12:29:37 ERROR: volume deactivation failed: DataBox1_r6:vm-116-disk-0 at /usr/share/perl5/PVE/Storage.pm line 1087.
2019-04-16 12:29:38 ERROR: migration finished with problems (duration 00:00:40)
TASK ERROR: migration problems

At least it leaves the VM running on the target node.

Other times I'm left with the VM running but locked on the source node, and I have to run
Code:
qm unlock VMID
wait a bit, then I can run
Code:
qm migrate VMID DEST --online
and it usually works.

I once even saw a "no quorum" message after migrating a machine with quite intense activity.

Code:
# pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-50
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-25
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-19
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

The servers only have two 1Gb network interfaces (the only PCIe slot is used by the HBA that accesses the MD3200). The network is configured as a balance-alb bond of the two interfaces, with VLANs for the different networks.

What other information should I collect to better pin down these issues and possibly have them resolved in the next version?

Tks,
Diego
 
can't deactivate LV '/dev/DataBox1_r6/vm-116-disk-0': Logical volume DataBox1_r6/vm-116-disk-0 is used by another device.
Hmm - after a quick search, most results hint at some other dm-mapping for that device. A quick guess: what is vm-116, and does it by any chance have an LVM VG inside the guest?

please post the output of:
* `pvs`
* `vgs`
* `lvs -a`
* `dmsetup ls`

If my assumption is correct, it is probably fixed by adding the device nodes beneath DataBox1_r6 to the global_filter blacklist in lvm.conf.
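Something along these lines should do (a sketch only - keep whatever reject rules are already present in your global_filter; the comment block above the option describes the exact syntax):
Code:
# /etc/lvm/lvm.conf, devices { } section - reject the LV device nodes of the
# shared VG so the host never scans the PVs that the guests created inside them
global_filter = [ "r|/dev/mapper/DataBox1_r6-.*|" ]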

Is the MD3200 connected via SAS/FC/iSCSI? Do you use multipath?

Hope this helps!
 
Hmm - after a quick search, most results hint at some other dm-mapping for that device. A quick guess: what is vm-116, and does it by any chance have an LVM VG inside the guest?

please post the output of:
* `pvs`
Code:
# pvs
  PV                             VG             Fmt  Attr PSize    PFree
  /dev/DataBox1_r6/vm-116-disk-0 Dati           lvm2 a--  1024.00g     0
  /dev/Ricerca/vm-104-disk-1     str957-cluster lvm2 a--    40.02t     0
  /dev/mapper/mp_MD3200_Ricerca  Ricerca        lvm2 a--    40.02t     0
  /dev/mapper/mp_MD3200_r6_0     DataBox1_r6    lvm2 a--    36.38t  1.04t
  /dev/mapper/mp_MD3800i         Databox2_r6    lvm2 a--    32.74t 28.60t
  /dev/sde3                      pve            lvm2 a--   297.84g 16.00g

Code:
# vgs
  VG             #PV #LV #SN Attr   VSize    VFree
  DataBox1_r6      1  29   0 wz--n-   36.38t  1.04t
  Databox2_r6      1   8   0 wz--n-   32.74t 28.60t
  Dati             1   1   0 wz--n- 1024.00g     0
  Ricerca          1   1   0 wz--n-   40.02t     0
  pve              1   3   0 wz--n-  297.84g 16.00g
  str957-cluster   1   1   0 wz--n-   40.02t     0

* `lvs -a`
Code:
# lvs -a
  LV              VG             Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  vm-100-disk-1   DataBox1_r6    -wi-ao----   32.00g                                                   
  vm-100-disk-2   DataBox1_r6    -wi-ao----  512.00g                                                   
  vm-101-disk-1   DataBox1_r6    -wi-------   32.00g                                                   
  vm-101-disk-2   DataBox1_r6    -wi-------  512.00g                                                   
  vm-102-disk-1   DataBox1_r6    -wi-a-----  100.00g                                                   
  vm-102-disk-2   DataBox1_r6    -wi-a-----    1.00t                                                   
  vm-102-disk-3   DataBox1_r6    -wi-a-----    1.86t                                                   
  vm-102-disk-4   DataBox1_r6    -wi-a-----  500.00g                                                   
  vm-104-disk-1   DataBox1_r6    -wi-ao----   32.00g                                                   
  vm-104-disk-2   DataBox1_r6    -wi-ao----    5.00t                                                   
  vm-105-disk-1   DataBox1_r6    -wi-ao----   32.00g                                                   
  vm-105-disk-2   DataBox1_r6    -wi-ao----  500.00g                                                   
  vm-106-disk-1   DataBox1_r6    -wi-a-----  100.00g                                                   
  vm-107-disk-1   DataBox1_r6    -wi-a-----   50.00g                                                   
  vm-107-disk-2   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-108-disk-1   DataBox1_r6    -wi-a-----   50.00g                                                   
  vm-110-disk-1   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-110-disk-2   DataBox1_r6    -wi-a-----  400.00g                                                   
  vm-110-disk-3   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-115-disk-1   DataBox1_r6    -wi-ao----    4.88t                                                   
  vm-116-disk-0   DataBox1_r6    -wi-ao----    1.00t                                                   
  vm-116-disk-1   DataBox1_r6    -wi-ao----    1.00t                                                   
  vm-116-disk-2   DataBox1_r6    -wi-a-----    9.77t                                                   
  vm-120-disk-1   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-120-disk-2   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-125-disk-1   DataBox1_r6    -wi-a-----   50.00g                                                   
  vm-126-disk-1   DataBox1_r6    -wi-a-----   32.00g                                                   
  vm-127-disk-1   DataBox1_r6    -wi-ao----    1.95t                                                   
  vm-200-disk-1   DataBox1_r6    -wi-a-----    5.86t                                                   
  vm-103-disk-0   Databox2_r6    -wi-------   32.00g                                                   
  vm-103-disk-1   Databox2_r6    -wi-------   32.00g                                                   
  vm-103-disk-2   Databox2_r6    -wi-------    1.00t                                                   
  vm-109-disk-0   Databox2_r6    -wi-------   32.00g                                                   
  vm-109-disk-1   Databox2_r6    -wi-------    1.46t                                                   
  vm-113-disk-0   Databox2_r6    -wi-------    1.17t                                                   
  vm-113-disk-1   Databox2_r6    -wi-------  100.00g                                                   
  vm-131-disk-0   Databox2_r6    -wi-------  320.00g                                                   
  Cloud           Dati           -wi-a----- 1024.00g                                                   
  vm-104-disk-1   Ricerca        -wi-ao----   40.02t                                                   
  data            pve            twi-a-tz--  195.59g             0.00   0.05                           
  [data_tdata]    pve            Twi-ao----  195.59g                                                   
  [data_tmeta]    pve            ewi-ao----    2.00g                                                   
  [lvol0_pmspare] pve            ewi-------    2.00g                                                   
  root            pve            -wi-ao----   74.25g                                                   
  swap            pve            -wi-ao----    8.00g                                                   
  home            str957-cluster -wi-a-----   40.02t

* `dmsetup ls`
Code:
# dmsetup ls
DataBox1_r6-vm--102--disk--1   (253:8)
DataBox1_r6-vm--116--disk--2   (253:24)
Dati-Cloud   (253:37)
DataBox1_r6-vm--116--disk--1   (253:22)
mp_MD3800i   (253:39)
DataBox1_r6-vm--100--disk--2   (253:6)
pve-data_tdata   (253:35)
DataBox1_r6-vm--116--disk--0   (253:33)
DataBox1_r6-vm--115--disk--1   (253:26)
DataBox1_r6-vm--100--disk--1   (253:5)
DataBox1_r6-vm--120--disk--2   (253:29)
pve-data_tmeta   (253:34)
DataBox1_r6-vm--120--disk--1   (253:21)
DataBox1_r6-vm--108--disk--1   (253:12)
DataBox1_r6-vm--107--disk--2   (253:30)
str957--cluster-home   (253:38)
DataBox1_r6-vm--107--disk--1   (253:11)
pve-swap   (253:2)
pve-root   (253:3)
DataBox1_r6-vm--200--disk--1   (253:27)
pve-data   (253:36)
Ricerca-vm--104--disk--1   (253:4)
DataBox1_r6-vm--127--disk--1   (253:7)
DataBox1_r6-vm--106--disk--1   (253:18)
DataBox1_r6-vm--105--disk--2   (253:19)
DataBox1_r6-vm--110--disk--3   (253:20)
DataBox1_r6-vm--126--disk--1   (253:31)
DataBox1_r6-vm--105--disk--1   (253:17)
DataBox1_r6-vm--110--disk--2   (253:15)
mp_MD3200_Ricerca   (253:1)
DataBox1_r6-vm--104--disk--2   (253:16)
mp_MD3200_r6_0   (253:0)
DataBox1_r6-vm--102--disk--4   (253:28)
DataBox1_r6-vm--125--disk--1   (253:23)
DataBox1_r6-vm--110--disk--1   (253:13)
DataBox1_r6-vm--104--disk--1   (253:10)
DataBox1_r6-vm--102--disk--3   (253:32)
DataBox1_r6-vm--102--disk--2   (253:14)

If my assumption is correct, it is probably fixed by adding the device nodes beneath DataBox1_r6 to the global_filter blacklist in lvm.conf.
You're probably right. And I'll have to do the same for the other shared storage (MD3800i, iSCSI).

Is the MD3200 connected via SAS/FC/iSCSI? Do you use multipath?
Via SAS, and it uses multipath.
 
/dev/DataBox1_r6/vm-116-disk-0 Dati lvm2 a-- 1024.00g 0
/dev/Ricerca/vm-104-disk-1 str957-cluster lvm2 a-- 40.02t 0

* Those two PVs from the `pvs` output are disks of guests - they should not be active on the PVE node itself.
* I'm pretty sure that this is the root of the 'is used by another device' migration error (the quorum loss is probably a separate issue); one way to check what is holding the LV open is sketched below.

* Please add all of the shared storages containing guest images to the blacklist (global_filter - there's a quite descriptive comment above it) in lvm.conf, and report back if the issue still persists.
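One quick way to check what is sitting on top of such an LV on the host (a sketch, device names taken from your outputs above):
Code:
# show what is stacked on top of the guest disk on the host side
lsblk /dev/DataBox1_r6/vm-116-disk-0
# vm-116-disk-0 is dm-33 (253:33 in the dmsetup output); its holders show who keeps it open
ls /sys/block/dm-33/holders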

Hope this helps!
 
I'll have to do some more tests, but you're (quite certainly) right.
I now added "r|/dev/mapper/.*-vm--[0-9]+--disk--[0-9]+|" to the global_filter line.
It excludes the contents of the disks created by Proxmox from the host's LVM visibility. I think it should be safe enough to be included in the default lvm.conf shipped with Proxmox.
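To check whether the filter is actually in effect, something like this should be enough (assuming lvmetad is in use, hence the cache refresh):
Code:
# refresh the LVM metadata cache, then check that the guest-internal PVs disappeared
pvscan --cache
pvs    # /dev/DataBox1_r6/vm-116-disk-0 and /dev/Ricerca/vm-104-disk-1 should no longer show up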

Tks a lot!
 
Confirmed: with the given filter there are no more migration errors due to failed deactivations.
A node reboot is still required, though (unless you want to meddle with the device mappings that LVM "wrongly" created)!
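For the record, the "meddling" would be roughly this (an untested sketch - only if nothing on the host has those mappings open):
Code:
# the filter hides the nested VGs from LVM, so the stale mappings created
# before the change have to be removed by hand (names from `dmsetup ls` above)
dmsetup remove Dati-Cloud
dmsetup remove str957--cluster-home
dmsetup ls | grep -E 'Dati|str957'   # should return nothing afterwards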
 
The servers only have two 1Gb network interfaces (the only PCIe slot is used by the HBA that accesses the MD3200). The network is configured as a balance-alb bond of the two interfaces, with VLANs for the different networks.
balance-alb is an imperfect hammer and can cause exactly the kind of grief you describe. You'd get better results by either using the interfaces individually, or by making multiple active-backup bonds with alternating preferred primaries for your various VLANs. Most important of all: leave as dedicated a path as you can for the cluster (corosync) traffic.
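As a rough sketch of the simplest variant (interfaces used individually) - NIC names, VLAN and addresses are placeholders, adapt to your setup:
Code:
# /etc/network/interfaces (sketch)
# eno1: dedicated to corosync/cluster traffic
auto eno1
iface eno1 inet static
    address 10.10.50.11
    netmask 255.255.255.0

# eno2: management + guest traffic via a VLAN-aware bridge
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.11
    netmask 255.255.255.0
    gateway 192.168.1.1
    bridge_ports eno2
    bridge_stp off
    bridge_fd 0
    bridge_vlan_aware yes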