Bulk migrate via command line

churnd

Active Member
Aug 11, 2013
Sometimes I want to bulk migrate my VMs from the command line rather than using the web interface. To do this, I just run a for loop like so:

Code:
for vm in $(qm list | awk '{print $1}' | grep -Eo '[0-9]{1,3}'); do qm migrate $vm node2 --online; done

Sometimes this works great, sometimes it doesn't. When it doesn't, I get an error in the logs just saying "service 'vm:205' - migration failed (exit code 1)". It seems to happen only on the VMs managed by HA, but I can't say that with 100% certainty. However, I notice that if I do them one at a time, it works fine. Is there a problem with using the for loop this way? Is there a better way to bulk migrate via command line?
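
For reference, a slightly more defensive variant of the same loop, which skips the header line of qm list and reports any VM whose migration exits non-zero (node2 is assumed as the target, as above), might look like this:

Code:
# sketch: migrate every listed VM to node2, reporting failures
for vm in $(qm list | awk 'NR>1 {print $1}'); do
    if ! qm migrate "$vm" node2 --online; then
        echo "migration of VM $vm failed" >&2
    fi
done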
 
No, it should be OK like it is, on PVE 4.x at least.

Can you please post the output from:
Code:
pveversion -v

Code:
proxmox-ve: 4.1-34 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-5 (running version: 4.1-5/f910ef5c)
pve-kernel-4.2.6-1-pve: 4.2.6-34
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-30
qemu-server: 4.0-46
pve-firmware: 1.1-7
libpve-common-perl: 4.0-43
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-21
pve-container: 1.0-37
pve-firewall: 2.0-15
pve-ha-manager: 1.0-18
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie

I did wonder if throwing a "sleep 5" into the loop would help, in that maybe it doesn't like them being submitted that close together.
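
A minimal sketch of that idea, assuming the same node2 target as above, would just be the loop with a pause between submissions:

Code:
# sketch: same loop as before, but wait 5 seconds between submissions
for vm in $(qm list | awk 'NR>1 {print $1}'); do
    qm migrate "$vm" node2 --online
    sleep 5
done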
 
I did wonder if throwing a "sleep 5" into the loop would help, in that maybe it doesn't like them being submitted that close together.


The difference between an HA-managed machine and a "normal" one is that when we perform an action (shutdown, start, migrate) on an HA resource (e.g. through `qm` or `pct`), the original task only redirects it to the HA manager, which executes the action in a background worker.

So for a non-HA-managed machine your script waits for each migration to finish before it starts the next, but if it's HA-managed it only changes the status of all the machines given to migrate, and they all get migrated at once, or at least four at a time, because we limit the number of active HA workers to 4.
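
A rough way to serialize HA-managed migrations from the shell, assuming ha-manager status keeps reporting the service in a migrate state while the migration is still in flight, could look like this:

Code:
# sketch: submit one migration, then wait until the HA manager no longer
# reports the service as migrating before submitting the next one
# (the grep on the ha-manager status output is an approximation)
for vm in $(qm list | awk 'NR>1 {print $1}'); do
    qm migrate "$vm" node2 --online
    while ha-manager status | grep -q "vm:$vm.*migrate"; do
        sleep 2
    done
done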

I tried to reproduce your problem but haven't managed to yet; I'll retry with more VMs.

Can you look in the log (journalctl) to see what happens in the background task? The error message only tells you that this task failed.
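
Assuming the HA services are the usual pve-ha-lrm and pve-ha-crm units, something along these lines should pull the relevant background-task logs:

Code:
# sketch: show the HA manager logs around the failed migration
journalctl -u pve-ha-lrm -u pve-ha-crm --since "1 hour ago"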
 
Thanks for the follow-up!

I was able to more or less reproduce this a few days ago and I'm on it.

The current state is that when a migration fails, we place the service in the started/stopped state (depending on whether it's enabled or not) on the original node.

Here we run into a race condition where the API says the migration failed, but shortly after that it succeeds nonetheless.
The HA manager then places the service in the started state on the origin node while it is already running on the target node => node mismatch.
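
A quick way to see whether you have hit that mismatch is to compare what the HA manager reports with where the guest actually runs, e.g. for the vm:205 from the error above:

Code:
# sketch: where does the HA manager think the service is started?
ha-manager status | grep 'vm:205'
# and where does it actually run? (run on each node)
qm list | awk '$1 == 205'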

Can you please open a bug report on https://bugzilla.proxmox.com/ for the ha-manager with the summed-up info, or at least a link to this thread?
It makes it easier to track the issue :)

A bug regarding failures of concurrent live migrations in our qemu-server package was also already fixed and is in the pvetest repo; this should also fix some issues related to this one.
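
If you want to try that fix, the pvetest repository can be enabled on a PVE 4.x (Jessie) node roughly like this; a sketch assuming the standard repository layout, and pvetest packages are of course not meant for production:

Code:
# sketch: enable the pvetest repository, then update
echo "deb http://download.proxmox.com/debian jessie pvetest" > /etc/apt/sources.list.d/pvetest.list
apt-get update && apt-get dist-upgrade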
 
