Migration stopped working

Emceh

New Member
Mar 20, 2018
Hi,

I went over the forum and tried all the solutions provided, but none helped.
I have a test lab to try out Proxmox. It's 3 servers in a cluster using GlusterFS.

On the fresh install live migration worked well.

When I try to live migrate or offline migrate, I get TASK OK from ha-manager. I see the VM on the destination node for a while and then it comes back to the original node. When I try to run the command manually I get this error:

kvm: could not acquire pid file: Resource temporarily unavailable

But the pid file is there and I can read it:

total 8
drwxr-xr-x 2 root root 160 Mar 19 20:49 .
drwxr-xr-x 31 root root 1420 Mar 20 15:58 ..
-rw------- 1 root root 5 Mar 19 19:15 100.pid
srwxr-x--- 1 root root 0 Mar 19 19:15 100.qmp
srwxr-x--- 1 root root 0 Mar 19 19:15 100.vnc
-rw------- 1 root root 5 Mar 19 19:13 101.pid
srwxr-x--- 1 root root 0 Mar 19 19:13 101.qmp
srwxr-x--- 1 root root 0 Mar 19 20:49 101.vnc

cat 101.pid
6166
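
That kvm error usually means the pid file is still locked by a running QEMU process. A quick check (just a sketch, assuming the default /var/run/qemu-server path and the pid 6166 shown above):

# is the pid from the lock file still a live kvm process?
ps -p $(cat /var/run/qemu-server/101.pid) -o pid,ppid,cmd

# which process, if any, still has the pid file open?
fuser -v /var/run/qemu-server/101.pid

If that shows a kvm process for VM 101 already running on the node, a second instance cannot acquire the lock, which would match the "Resource temporarily unavailable" message.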

qm unlock 101 didn't help

I even took the VMs down, restarted this node, and updated packages on all nodes to the latest versions. Still no luck.

Shared storage is GlusterFS.

pveversion -v

proxmox-ve: 5.1-42 (running kernel: 4.13.13-6-pve)
pve-manager: 5.1-47 (running version: 5.1-47/97a08ab2)
pve-kernel-4.13: 5.1-42
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-28
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-17
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 2.1.1-3
lxcfs: 2.0.8-2
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-12
pve-cluster: 5.0-21
pve-container: 2.0-19
pve-docs: 5.1-16
pve-firewall: 3.0-5
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-3
pve-xtermjs: 1.0-2
qemu-server: 5.0-23
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.6-pve1~bpo9

The cluster looks just fine:

Quorum information
------------------
Date: Tue Mar 20 16:03:42 2018
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 1/132
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.10.10
0x00000002 1 10.10.10.20
0x00000003 1 10.10.10.30 (local)

Any idea what else I can try?

Regards,
Martin
 
I have one group:

cat /etc/pve/ha/groups.cfg
group: GROUP
nodes n1:1,n2:2,n3:3
nofailback 0
restricted 0

VMs are at n3, unable to move to n1 or n2.

So my understanding was that priority 1 is more important than 3. Am I wrong?

edit:

Anyway, I created another group with nodes n1 and n2 only. The HA manager says it is migrating to this new group, but it ends with an error.
The only difference is that I was able to migrate one VM in offline mode.

edit2:
I have offline migrated 2 VMs to the new group. One VM was downloaded as a ready-to-use image (VMDK) and now gives me an error that this format is not supported for live migration. The other VM is qcow2 and live migrates fine.
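
(As a side note on the VMDK one: one possible way around that limitation, as a sketch only, is to convert the disk to qcow2 with qemu-img and point the VM at the converted file. Paths and VMID below are placeholders, not from this setup:

# placeholder paths/VMID, adjust to your GlusterFS storage layout
qemu-img convert -p -f vmdk -O qcow2 \
    /mnt/pve/<storage>/images/<vmid>/disk.vmdk \
    /mnt/pve/<storage>/images/<vmid>/disk.qcow2
# then update the disk entry in /etc/pve/qemu-server/<vmid>.conf to the qcow2 file,
# or use "qm move_disk <vmid> <disk> <storage> --format qcow2" if your version supports it)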
 
3 is a higher priority than 1 or 2, so the machines will stay on n3.

Create 3 groups with different priority sequences (one group per node); a CLI sketch follows the list:

pref_n1 n1:3,n2:2,n3:2
pref_n2 n2:3,n1:2,n3:2
pref_n3 n3:3,n1:2,n2:2
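
A rough sketch of creating those groups from the CLI (the same thing can be written into /etc/pve/ha/groups.cfg by hand; vm:101 is only an example resource ID):

ha-manager groupadd pref_n1 --nodes "n1:3,n2:2,n3:2"
ha-manager groupadd pref_n2 --nodes "n2:3,n1:2,n3:2"
ha-manager groupadd pref_n3 --nodes "n3:3,n1:2,n2:2"

# assign each HA resource to the group of its preferred node, e.g.
ha-manager set vm:101 --group pref_n3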

The identical priority for the other nodes is intentional, for the following reason:

Say n1 fails and you have n2:2,n3:1. Then _all_ VMs will be restarted on n2. Depending on how many resources are needed, this can overburden n2; maybe it fails too, and you have a chain reaction taking down all services.

If the priorities are the same (n2:2,n3:2), the VMs will be spread over both servers (hopefully evenly distributed).

You can of course play with more HA groups, but it's better to follow KISS and stick with one group per node.

Now for planned maintenance (the way I do it); a CLI sketch of these steps follows below:

For example, maintenance of n1:
temporarily set nofailback 1 on the pref_n1 group
bulk-migrate one half of the VMs to n2
bulk-migrate the other half of the VMs to n3
do the maintenance on n1 (e.g. apt-get dist-upgrade / reboot)
after n1 is up again and in a green state, unset nofailback -> the VMs will migrate back to n1
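
Roughly the same sequence from the CLI (a sketch; ha-manager migrate only requests the migration and the CRM then executes it, and the bulk-migrate step is easier from the GUI):

# 1. temporarily disable failback for the n1 group
ha-manager groupset pref_n1 --nofailback 1

# 2./3. migrate the HA-managed VMs away from n1 (example resource IDs)
ha-manager migrate vm:100 n2
ha-manager migrate vm:101 n3

# 4. maintenance on n1
apt-get update && apt-get dist-upgrade
reboot

# 5. once n1 is back and green, re-enable failback; the VMs move back to n1
ha-manager groupset pref_n1 --nofailback 0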
 
OK, seems reasonable. I'll try it out; I just need to create some more test VMs :)

Thanks for the prompt reply and patience :)
 
