[SOLVED] Error while cold migrating after upgrade.

dtiKev

Well-Known Member
Both nodes are now upgraded and I've started moving some VMs to the newly upgraded node. One of them seems to be stuck in migration. I *think* I saw a brief hiccup in the quorum connection when the migration started, and it now reports as aborted. Looking at the log (below), it seems unable to find a file, and I can confirm from the CLI that the file doesn't exist. How can I recover from this?

Code:
2020-08-06 14:22:53 starting migration of VM 106 to node 'pvn2' (10.11.12.252)
2020-08-06 14:22:53 found local, replicated disk 'local-zfs:vm-106-disk-1' (in current VM config)
2020-08-06 14:22:53 replicating disk images
2020-08-06 14:22:53 start replication job
2020-08-06 14:22:53 guest => VM 106, running => 0
2020-08-06 14:22:53 volumes => local-zfs:vm-106-disk-1
2020-08-06 14:22:56 create snapshot '__replicate_106-1_1596738173__' on local-zfs:vm-106-disk-1
2020-08-06 14:22:56 using secure transmission, rate limit: none
2020-08-06 14:22:56 incremental sync 'local-zfs:vm-106-disk-1' (__replicate_106-1_1596717257__ => __replicate_106-1_1596738173__)
2020-08-06 14:22:59 send from @__replicate_106-1_1596717257__ to rpool/data/vm-106-disk-1@__replicate_106-1_1596738173__ estimated size is 71.9M
2020-08-06 14:22:59 total estimated size is 71.9M
2020-08-06 14:22:59 TIME        SENT   SNAPSHOT rpool/data/vm-106-disk-1@__replicate_106-1_1596738173__
2020-08-06 14:23:00 rpool/data/vm-106-disk-1@__replicate_106-1_1596717257__    name    rpool/data/vm-106-disk-1@__replicate_106-1_1596717257__    -
2020-08-06 14:23:00 14:23:00   2.10M   rpool/data/vm-106-disk-1@__replicate_106-1_1596738173__
2020-08-06 14:23:01 14:23:01   27.2M   rpool/data/vm-106-disk-1@__replicate_106-1_1596738173__
2020-08-06 14:23:02 14:23:02   63.1M   rpool/data/vm-106-disk-1@__replicate_106-1_1596738173__
2020-08-06 14:23:03 successfully imported 'local-zfs:vm-106-disk-1'
2020-08-06 14:23:03 delete previous replication snapshot '__replicate_106-1_1596717257__' on local-zfs:vm-106-disk-1
2020-08-06 14:23:04 (remote_finalize_local_job) delete stale replication snapshot '__replicate_106-1_1596717257__' on local-zfs:vm-106-disk-1
2020-08-06 14:23:05 end replication job
2020-08-06 14:23:05 copying local disk images
2020-08-06 14:23:05 ERROR: unable to open file '/etc/pve/nodes/pvn1/qemu-server/106.conf.tmp.29700' - Device or resource busy
2020-08-06 14:23:05 aborting phase 1 - cleanup resources
2020-08-06 14:23:05 ERROR: unable to open file '/etc/pve/nodes/pvn1/qemu-server/106.conf.tmp.29700' - Device or resource busy
2020-08-06 14:23:05 ERROR: migration aborted (duration 00:00:12): unable to open file '/etc/pve/nodes/pvn1/qemu-server/106.conf.tmp.29700' - Device or resource busy
TASK ERROR: migration aborted

Do I have to build a new VM and attach the existing qcow somehow? If so, is there a document outlining the right way?
 
I can't get this VM out of this state. I can't start it or stop it, and I can't get the migration to try again.

I'm not familiar with lsof, but 'lsof | grep 106' isn't providing much in the way of help.
 
what does 'systemctl status pve-cluster corosync' say? what about 'pvecm status'?
 
what does 'systemctl status pve-cluster corosync' say? what about 'pvecm status'?

I don't see anything specific there... here's the entire output:

Code:
# systemctl status pve-cluster corosync
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-08-06 08:24:12 EDT; 1 day 1h ago
  Process: 2597 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 2610 (pmxcfs)
    Tasks: 9 (limit: 4915)
   Memory: 63.2M
   CGroup: /system.slice/pve-cluster.service
           └─2610 /usr/bin/pmxcfs

Aug 07 03:38:54 pvn1 pmxcfs[2610]: [status] notice: received log
Aug 07 03:53:54 pvn1 pmxcfs[2610]: [status] notice: received log
Aug 07 03:54:35 pvn1 pmxcfs[2610]: [status] notice: received log
Aug 07 03:54:41 pvn1 pmxcfs[2610]: [status] notice: received log
Aug 07 04:24:11 pvn1 pmxcfs[2610]: [dcdb] notice: data verification successful
Aug 07 05:24:11 pvn1 pmxcfs[2610]: [dcdb] notice: data verification successful
Aug 07 06:24:11 pvn1 pmxcfs[2610]: [dcdb] notice: data verification successful
Aug 07 07:24:11 pvn1 pmxcfs[2610]: [dcdb] notice: data verification successful
Aug 07 08:24:11 pvn1 pmxcfs[2610]: [dcdb] notice: data verification successful
Aug 07 09:24:11 pvn1 pmxcfs[2610]: [dcdb] notice: data verification successful

● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-08-06 08:24:12 EDT; 1 day 1h ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
Main PID: 2727 (corosync)
    Tasks: 9 (limit: 4915)
   Memory: 153.0M
   CGroup: /system.slice/corosync.service
           └─2727 /usr/sbin/corosync -f

Aug 06 14:50:08 pvn1 corosync[2727]:   [KNET  ] link: host: 3 link: 0 is down
Aug 06 14:50:08 pvn1 corosync[2727]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 06 14:50:08 pvn1 corosync[2727]:   [KNET  ] host: host: 3 has no active links
Aug 06 14:50:09 pvn1 corosync[2727]:   [TOTEM ] Token has not been received in 750 ms
Aug 06 14:50:09 pvn1 corosync[2727]:   [TOTEM ] A processor failed, forming new configuration.
Aug 06 14:50:09 pvn1 corosync[2727]:   [KNET  ] rx: host: 3 link: 0 is up
Aug 06 14:50:09 pvn1 corosync[2727]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 06 14:50:09 pvn1 corosync[2727]:   [TOTEM ] A new membership (1.3ed) was formed. Members
Aug 06 14:50:09 pvn1 corosync[2727]:   [QUORUM] Members[2]: 1 3
Aug 06 14:50:09 pvn1 corosync[2727]:   [MAIN  ] Completed service synchronization, ready to provide service.

And I'll add that I have already migrated several other guests successfully using the GUI.
 
and what happens if you run 'qm config 106'?
 
and what happens if you run 'qm config 106'?

It spits out the config file.

Note: I'm running these commands from host 1, which is where the VM was (and still is), and I was trying to migrate to host 2. I'm not sure where these commands should be run.

Code:
~# qm config 106
bootdisk: sata0
cores: 2
ide2: none,media=cdrom
lock: migrate
memory: 2048
name: mMachine
net0: e1000=56:C4:65:88:70:AF,bridge=vmbr0
net1: rtl8139=8A:B1:34:95:9D:54,bridge=vmbr0
numa: 0
ostype: l26
sata0: local-zfs:vm-106-disk-1,size=76G
scsihw: virtio-scsi-pci
smbios1: uuid=9068dae4-5914-4034-9415-ab10b722563d
sockets: 2


Thanks,
dK
 
and 'qm start 106' prints an error?
 
then you can do 'qm unlock 106', and start/stop/migrate again (hopefully).
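For anyone landing here later, a hedged sketch of what 'qm unlock' amounts to: it drops the transient 'lock:' property from the VM config stored in pmxcfs. Simulated below on a scratch copy; on a live cluster use 'qm unlock' itself rather than editing files under /etc/pve by hand:

```shell
# Illustration only: 'qm unlock 106' effectively removes the 'lock:'
# line from the VM config. Shown here on a throwaway copy.
conf=$(mktemp)
cat > "$conf" <<'EOF'
bootdisk: sata0
cores: 2
lock: migrate
memory: 2048
EOF
sed -i '/^lock: /d' "$conf"   # the transient lock line goes away
cat "$conf"                   # remaining config, now unlocked
rm -f "$conf"
```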
 
then you can do 'qm unlock 106', and start/stop/migrate again (hopefully).

Wow... that was easy. Any idea what would/could cause it to get into that state? If there was a sync error between the nodes at just the right time, could that have done it? I can't say for certain, but in the GUI I could have sworn I saw the green check on the second host turn grey for a moment and then go green again.
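Since a brief quorum blip during a migration can apparently leave a stale lock behind, here's a quick one-liner sketch for spotting any leftover locks across the cluster. It assumes the standard pmxcfs layout under /etc/pve; adjust the path if yours differs:

```shell
# List any VM configs that still carry a transient lock, across all
# nodes in the pmxcfs tree (/etc/pve is the standard mount point).
grep -H '^lock:' /etc/pve/nodes/*/qemu-server/*.conf 2>/dev/null \
  || echo "no locked VM configs found"
```

Any hit here that isn't explained by an in-flight task is a candidate for 'qm unlock'.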