qm migrate caused server to reboot

cdukes

Renowned Member
Sep 11, 2015
pve-manager/4.2-18/158720b9 (running kernel: 4.4.16-1-pve)

This morning I tried (via the GUI) to migrate a VM. The task reported OK, but the VM did not move.
I found this page:
https://forum.proxmox.com/threads/vm-is-locked-migrate.9358/
and was having the same trouble (the VM would not unlock), so I tried restarting the cluster service as that page suggests.
After that, I tried qm unlock and qm migrate again, but the VM was still locked.
Next, I did an ls on the /var/lock directory, and then, for no apparent reason and without warning, the entire server rebooted.
Here's my SSH log:

Code:
root@<masked> [~]: # qm migrate 318 <masked>
VM is locked (migrate)
root@<masked> [~]: # qm unlock 318
trying to acquire lock...can't lock file '/var/lock/qemu-server/lock-318.conf' - got timeout
root@<masked> [~]: # qm migrate 318 bfs
VM is locked (migrate)
root@<masked> [~]: # df
Filesystem                      Size  Used Avail Use% Mounted on
udev                             10M     0   10M   0% /dev
tmpfs                            19G   18M   19G   1% /run
/dev/dm-0                        34G  2.8G   29G   9% /
tmpfs                            48G   66M   48G   1% /dev/shm
tmpfs                           5.0M     0  5.0M   0% /run/lock
tmpfs                            48G     0   48G   0% /sys/fs/cgroup
cgmfs                           100K     0  100K   0% /run/cgmanager/fs
/dev/fuse                        30M   52K   30M   1% /etc/pve
<masked>:/nfs              22T   10T   11T  49% /mnt/pve/<masked>
<masked2>:/volume1/proxmox   22T  9.3T   13T  43% /mnt/pve/<masked2>
root@<masked> [~]: # pvecm status
Quorum information
------------------
Date:             Thu Sep 29 03:36:25 2016
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          3/556
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 <masked>
0x00000001          2 <masked> (local)
0x00000002          1 <masked>
root@<masked> [~]: # /etc/init.d/pve-cluster stop
[ ok ] Stopping pve-cluster (via systemctl): pve-cluster.service.
root@<masked> [~]: # /etc/init.d/pve-cluster start
[ ok ] Starting pve-cluster (via systemctl): pve-cluster.service.
root@<masked> [~]: # qm migrate 318 <masked>
VM is locked (migrate)
root@<masked> [~]: # qm unlock 318
trying to acquire lock...can't lock file '/var/lock/qemu-server/lock-318.conf' - got timeout
root@<masked> [~]: # ll /var/lock/
total 0
drwxrwxrwt  7 root     root      260 Sep 26 15:06 ./
drwxr-xr-x 24 root     root     1.3K Sep 29 03:36 ../
-rw-r--r--  1 root     root        0 Sep 26 15:06 corosync
drwxrwxr-x  2 root     root       40 Sep 26 15:06 lockdev/
drwx------  2 root     root       40 Sep 26 15:06 lvm/
drwxr-xr-x  2 root     root       40 Sep 26 15:06 lxc/
-rw-r--r--  1 root     root        0 Sep 26 15:06 pvedaemon.lck
-rw-r--r--  1 root     root        0 Sep 26 15:06 pvefw.lck
-rw-r--r--  1 root     root        0 Sep 26 15:06 pvefw-logger.lck
-rw-r--r--  1 www-data www-data    0 Sep 26 15:06 pveproxy.lck
drwxr-xr-x  2 root     root      760 Sep 27 07:33 qemu-server/
-rw-r--r--  1 www-data www-data    0 Sep 26 15:06 spiceproxy.lck
drwxr-xr-x  2 root     root       40 Sep 26 15:06 subsys/
 
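For what it's worth, when qm unlock keeps timing out like this, the lock can also be cleared by hand: Proxmox records it as a "lock: migrate" line in the VM's config under /etc/pve/qemu-server/. A minimal sketch (the VMID is from this thread; the demo edits a throwaway copy so it's safe to dry-run, and you should only touch the real config once you're sure no migration task is still running):

```shell
# Sketch: clear a stale "lock: migrate" entry by hand when `qm unlock`
# times out. VMID 318 is from this thread; on a real node the file to
# edit is /etc/pve/qemu-server/318.conf, and you must first confirm the
# migration task is truly dead (check the task log in the GUI / `ps`).
VMID=318
CONF="/etc/pve/qemu-server/${VMID}.conf"

# Demo copy so this is safe to dry-run anywhere; swap in "$CONF" on a node.
demo=$(mktemp)
printf 'lock: migrate\nmemory: 4096\n' > "$demo"

sed -i '/^lock:/d' "$demo"   # drop the lock line, leave the rest intact
cat "$demo"
```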
Hi Dietmar,
I think I may know the cause, although it's probably worth checking your code/unit tests for why this would make the entire server reboot without any warning.
After the server came back up, I found that the local server's hosts file had the wrong IP (wrong subnet) for the PVE host I was migrating to.
So in the log entries below I see there was an IPC failure, and near the bottom you can see where the server rebooted, as indicated by the ^@'s (those are actually in the log, which is kind of strange).
(Found these in daemon.log.)

Code:
Sep 29 03:36:53 <masked> pve-ha-crm[2455]: ipcc_send_rec failed: Connection refused
Sep 29 03:36:53 <masked> pve-ha-crm[2455]: ipcc_send_rec failed: Connection refused
Sep 29 03:36:53 <masked> pve-ha-crm[2455]: ipcc_send_rec failed: Connection refused
Sep 29 03:36:54 <masked> pve-ha-lrm[18192]: Task 'UPID:<masked>:0000471B:014C0E94:57ED09F6:qmigrate:318:root@pam:' still active, waiting
Sep 29 03:36:55 <masked> systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 29 03:36:56 <masked> pmxcfs[19590]: [status] notice: update cluster info (cluster name  <masked>, version = 9)
Sep 29 03:36:56 <masked> pmxcfs[19590]: [status] notice: node has quorum
Sep 29 03:36:56 <masked> pmxcfs[19590]: [dcdb] notice: members: 1/19590, 2/2688, 3/2373
Sep 29 03:36:56 <masked> pmxcfs[19590]: [dcdb] notice: starting data syncronisation
Sep 29 03:36:56 <masked> pmxcfs[19590]: [dcdb] notice: received sync request (epoch 1/19590/00000001)
Sep 29 03:36:56 <masked> pmxcfs[19590]: [status] notice: members: 1/19590, 2/2688, 3/2373
Sep 29 03:36:56 <masked> pmxcfs[19590]: [status] notice: starting data syncronisation
Sep 29 03:36:56 <masked> pmxcfs[19590]: [status] notice: received sync request (epoch 1/19590/00000001)
Sep 29 03:36:56 <masked> pmxcfs[19590]: [dcdb] notice: received all states
Sep 29 03:36:56 <masked> pmxcfs[19590]: [dcdb] notice: leader is 2/2688
Sep 29 03:36:56 <masked> pmxcfs[19590]: [dcdb] notice: synced members: 2/2688, 3/2373
Sep 29 03:36:56 <masked> pmxcfs[19590]: [dcdb] notice: waiting for updates from leader
Sep 29 03:36:56 <masked> pmxcfs[19590]: [dcdb] notice: update complete - trying to commit (got 4 inode updates)
Sep 29 03:36:56 <masked> pmxcfs[19590]: [dcdb] notice: all data is up to date
Sep 29 03:36:56 <masked> pmxcfs[19590]: [status] notice: received all states
Sep 29 03:36:56 <masked> pmxcfs[19590]: [status] notice: all data is up to date
Sep 29 03:36:57 <masked> systemd[1]: Started The Proxmox VE cluster filesystem.
Sep 29 03:36:59 <masked> pve-ha-lrm[18192]: Task 'UPID:<masked>:0000471B:014C0E94:57ED09F6:qmigrate:318:root@pam:' still active, waiting
Sep 29 03:37:03 <masked> pve-ha-crm[2455]: status change wait_for_quorum => slave
Sep 29 03:37:04 <masked> pve-ha-lrm[18192]: Task 'UPID:<masked>:0000471B:014C0E94:57ED09F6:qmigrate:318:root@pam:' still active, waiting
Sep 29 03:37:09 <masked> pve-ha-lrm[18192]: Task 'UPID:<masked>:0000471B:014C0E94:57ED09F6:qmigrate:318:root@pam:' still active, waiting
Sep 29 03:37:14 <masked> pve-ha-lrm[18192]: Task 'UPID:<masked>:0000471B:014C0E94:57ED09F6:qmigrate:318:root@pam:' still active, waiting
Sep 29 03:37:19 <masked> pveproxy[13284]: ipcc_send_rec failed: Transport endpoint is not connected
Sep 29 03:37:19 <masked> pve-ha-lrm[18192]: Task 'UPID:<masked>:0000471B:014C0E94:57ED09F6:qmigrate:318:root@pam:' still active, waiting
Sep 29 03:37:20 <masked> pvedaemon[2447]: ipcc_send_rec failed: Transport endpoint is not connected
Sep 29 03:37:24 <masked> pve-ha-lrm[18192]: Task 'UPID:<masked>:0000471B:014C0E94:57ED09F6:qmigrate:318:root@pam:' still active, waiting
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Sep 29 03:40:43 <masked> systemd-modules-load[992]: Module 'fuse' is builtin
Sep 29 03:40:43 <masked> systemd-modules-load[992]: Inserted module 'qlge'
Sep 29 03:40:43 <masked> systemd-modules-load[992]: Inserted module '8021q'
Sep 29 03:40:43 <masked> systemd[1]: Started Create Static Device Nodes in /dev.
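For anyone hitting this later: since the root cause turned out to be a wrong-subnet /etc/hosts entry for the target node, a quick sanity check along these lines would have caught it before the migrate. The node names and the 10.1.1.0/24 subnet are made-up placeholders (not the masked hosts above); on a real node, run the awk line against /etc/hosts and compare the results with pvecm status:

```shell
# Sketch of the check that would have caught this earlier: scan a hosts
# file for cluster entries outside the expected subnet. Names and the
# 10.1.1.0/24 subnet are placeholders for illustration only.
hosts_file=$(mktemp)
cat > "$hosts_file" <<'EOF'
10.1.1.11 pve1
10.1.1.12 pve2
192.168.5.13 pve3
EOF

# Print any node whose address is not in the expected cluster subnet.
awk '$1 !~ /^10\.1\.1\./ {print "suspect: " $2 " -> " $1}' "$hosts_file"
```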