[SOLVED] Error while cold migrating after upgrade.

dtiKev

Well-Known Member
Both nodes are now upgraded and I've started moving some VMs to the newly upgraded node. One of them seems to be stuck in migration. I *think* I saw a brief hiccup in the quorum connection when the migration started, and it now reports as aborted. Looking at the log (below), it seems unable to find a file, and I can confirm from the CLI that the file doesn't exist. How can I recover from this?

Code:
2020-08-06 14:22:53 starting migration of VM 106 to node 'pvn2' (10.11.12.252)
2020-08-06 14:22:53 found local, replicated disk 'local-zfs:vm-106-disk-1' (in current VM config)
2020-08-06 14:22:53 replicating disk images
2020-08-06 14:22:53 start replication job
2020-08-06 14:22:53 guest => VM 106, running => 0
2020-08-06 14:22:53 volumes => local-zfs:vm-106-disk-1
2020-08-06 14:22:56 create snapshot '__replicate_106-1_1596738173__' on local-zfs:vm-106-disk-1
2020-08-06 14:22:56 using secure transmission, rate limit: none
2020-08-06 14:22:56 incremental sync 'local-zfs:vm-106-disk-1' (__replicate_106-1_1596717257__ => __replicate_106-1_1596738173__)
2020-08-06 14:22:59 send from @__replicate_106-1_1596717257__ to rpool/data/vm-106-disk-1@__replicate_106-1_1596738173__ estimated size is 71.9M
2020-08-06 14:22:59 total estimated size is 71.9M
2020-08-06 14:22:59 TIME        SENT   SNAPSHOT rpool/data/vm-106-disk-1@__replicate_106-1_1596738173__
2020-08-06 14:23:00 rpool/data/vm-106-disk-1@__replicate_106-1_1596717257__    name    rpool/data/vm-106-disk-1@__replicate_106-1_1596717257__    -
2020-08-06 14:23:00 14:23:00   2.10M   rpool/data/vm-106-disk-1@__replicate_106-1_1596738173__
2020-08-06 14:23:01 14:23:01   27.2M   rpool/data/vm-106-disk-1@__replicate_106-1_1596738173__
2020-08-06 14:23:02 14:23:02   63.1M   rpool/data/vm-106-disk-1@__replicate_106-1_1596738173__
2020-08-06 14:23:03 successfully imported 'local-zfs:vm-106-disk-1'
2020-08-06 14:23:03 delete previous replication snapshot '__replicate_106-1_1596717257__' on local-zfs:vm-106-disk-1
2020-08-06 14:23:04 (remote_finalize_local_job) delete stale replication snapshot '__replicate_106-1_1596717257__' on local-zfs:vm-106-disk-1
2020-08-06 14:23:05 end replication job
2020-08-06 14:23:05 copying local disk images
2020-08-06 14:23:05 ERROR: unable to open file '/etc/pve/nodes/pvn1/qemu-server/106.conf.tmp.29700' - Device or resource busy
2020-08-06 14:23:05 aborting phase 1 - cleanup resources
2020-08-06 14:23:05 ERROR: unable to open file '/etc/pve/nodes/pvn1/qemu-server/106.conf.tmp.29700' - Device or resource busy
2020-08-06 14:23:05 ERROR: migration aborted (duration 00:00:12): unable to open file '/etc/pve/nodes/pvn1/qemu-server/106.conf.tmp.29700' - Device or resource busy
TASK ERROR: migration aborted

Do I have to build a new VM and attach the existing qcow somehow? If so, is there a document outlining the right way?
 
I can't get this VM out of this state. I can't start it or stop it, and I can't get the migration to try again.

I'm not familiar with lsof, but 'lsof | grep 106' isn't providing much in the way of help.
 
what does 'systemctl status pve-cluster corosync' say? what about 'pvecm status'?
 
what does 'systemctl status pve-cluster corosync' say? what about 'pvecm status'?

I don't see anything specific there... here's the entire output:

Code:
# systemctl status pve-cluster corosync
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-08-06 08:24:12 EDT; 1 day 1h ago
  Process: 2597 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 2610 (pmxcfs)
    Tasks: 9 (limit: 4915)
   Memory: 63.2M
   CGroup: /system.slice/pve-cluster.service
           └─2610 /usr/bin/pmxcfs

Aug 07 03:38:54 pvn1 pmxcfs[2610]: [status] notice: received log
Aug 07 03:53:54 pvn1 pmxcfs[2610]: [status] notice: received log
Aug 07 03:54:35 pvn1 pmxcfs[2610]: [status] notice: received log
Aug 07 03:54:41 pvn1 pmxcfs[2610]: [status] notice: received log
Aug 07 04:24:11 pvn1 pmxcfs[2610]: [dcdb] notice: data verification successful
Aug 07 05:24:11 pvn1 pmxcfs[2610]: [dcdb] notice: data verification successful
Aug 07 06:24:11 pvn1 pmxcfs[2610]: [dcdb] notice: data verification successful
Aug 07 07:24:11 pvn1 pmxcfs[2610]: [dcdb] notice: data verification successful
Aug 07 08:24:11 pvn1 pmxcfs[2610]: [dcdb] notice: data verification successful
Aug 07 09:24:11 pvn1 pmxcfs[2610]: [dcdb] notice: data verification successful

● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-08-06 08:24:12 EDT; 1 day 1h ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
Main PID: 2727 (corosync)
    Tasks: 9 (limit: 4915)
   Memory: 153.0M
   CGroup: /system.slice/corosync.service
           └─2727 /usr/sbin/corosync -f

Aug 06 14:50:08 pvn1 corosync[2727]:   [KNET  ] link: host: 3 link: 0 is down
Aug 06 14:50:08 pvn1 corosync[2727]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 06 14:50:08 pvn1 corosync[2727]:   [KNET  ] host: host: 3 has no active links
Aug 06 14:50:09 pvn1 corosync[2727]:   [TOTEM ] Token has not been received in 750 ms
Aug 06 14:50:09 pvn1 corosync[2727]:   [TOTEM ] A processor failed, forming new configuration.
Aug 06 14:50:09 pvn1 corosync[2727]:   [KNET  ] rx: host: 3 link: 0 is up
Aug 06 14:50:09 pvn1 corosync[2727]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 06 14:50:09 pvn1 corosync[2727]:   [TOTEM ] A new membership (1.3ed) was formed. Members
Aug 06 14:50:09 pvn1 corosync[2727]:   [QUORUM] Members[2]: 1 3
Aug 06 14:50:09 pvn1 corosync[2727]:   [MAIN  ] Completed service synchronization, ready to provide service.

And I'll add that I have already migrated several other guests successfully using the GUI.
 
and what happens if you run 'qm config 106'?
 
and what happens if you run 'qm config 106'?

It spits out the config file.

Note: I'm running these commands from host 1, which is where the VM was (and still is), and I was trying to migrate to host 2. I'm not sure where these commands should be run.

Code:
~# qm config 106
bootdisk: sata0
cores: 2
ide2: none,media=cdrom
lock: migrate
memory: 2048
name: mMachine
net0: e1000=56:C4:65:88:70:AF,bridge=vmbr0
net1: rtl8139=8A:B1:34:95:9D:54,bridge=vmbr0
numa: 0
ostype: l26
sata0: local-zfs:vm-106-disk-1,size=76G
scsihw: virtio-scsi-pci
smbios1: uuid=9068dae4-5914-4034-9415-ab10b722563d
sockets: 2


Thanks,
dK
 
and 'qm start 106' prints an error?
 
then you can do 'qm unlock 106', and start/stop/migrate again (hopefully).
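For anyone landing here later, a hedged sketch of what 'qm unlock' amounts to: it drops the transient 'lock:' property from the VM config stored in pmxcfs. Simulated below on a scratch copy; on a live cluster use 'qm unlock' itself rather than editing files under /etc/pve by hand:

```shell
# Illustration only: 'qm unlock 106' effectively removes the 'lock:'
# line from the VM config. Shown here on a throwaway copy.
conf=$(mktemp)
cat > "$conf" <<'EOF'
bootdisk: sata0
cores: 2
lock: migrate
memory: 2048
EOF
sed -i '/^lock: /d' "$conf"   # the transient lock line goes away
cat "$conf"                   # remaining config, now unlocked
rm -f "$conf"
```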
 
then you can do 'qm unlock 106', and start/stop/migrate again (hopefully).

Wow... that was easy. Any idea what would/could cause it to get into that state? If there was a sync error between the nodes at just the right time, could that have done it? I can't say for certain, but in the GUI I could have sworn I saw the green check on the second host turn grey for a moment and then go green again.
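Since a brief quorum blip during a migration can apparently leave a stale lock behind, here's a quick one-liner sketch for spotting any leftover locks across the cluster. It assumes the standard pmxcfs layout under /etc/pve; adjust the path if yours differs:

```shell
# List any VM configs that still carry a transient lock, across all
# nodes in the pmxcfs tree (/etc/pve is the standard mount point).
grep -H '^lock:' /etc/pve/nodes/*/qemu-server/*.conf 2>/dev/null \
  || echo "no locked VM configs found"
```

Any hit here that isn't explained by an in-flight task is a candidate for 'qm unlock'.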