VM shown on wrong cluster node

JasonMHall

New Member
Jun 26, 2013
Preamble:
I always try to search my way out of problems, thus I rarely post on forums. So, please excuse any etiquette mistakes.

Problem:
I have a three-node HA cluster. I wanted to test migrating a couple of VMs (both managed by HA). When I migrate a VM to another node (from the 3rd node to the 1st), the task completes OK; however, the VM never moves in the web gui. I checked clustat to confirm that the VM had in fact been moved, and I also restarted pvestatd, yet the VM still shows under the 3rd node in the web gui when it should be under the 1st.
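For reference, this is roughly how I confirmed the owner from clustat. The service table below is a mocked-up sample in clustat's format (my node and service names, but not a live capture); on a real node you would pipe `clustat` itself instead of the heredoc.

```shell
# Illustrative: extract the owner of a pvevm service from a clustat-style
# service table. The heredoc stands in for real `clustat` output.
vmid=201
owner=$(awk -v svc="pvevm:$vmid" '$1 == svc { print $2 }' <<'EOF'
 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 pvevm:200                      ppat-pdnhvz003                 started
 pvevm:201                      ppat-pdnhvz001                 started
EOF
)
echo "pvevm:$vmid is owned by $owner"
```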

I can provide further information if required. I'm just wondering if anyone else has had a similar issue, since I couldn't google my way out of this one.



Thanks,
Jason.
 
Fairly sure. With the VM selected, I'm simply using "Migrate" from the upper-right menu.

The VM's disk is in raw format on GlusterFS storage. After the initial install of Proxmox, I removed the /dev/pve/data logical volume and shrank the physical volume on /dev/sda3 down to 600GB for /dev/pve/swap and /dev/pve/root, leaving 4TB or so for an XFS partition, /dev/sda4. There is a single brick on each server, making up a replicated volume (replica 3). I'm using an APC AP7901 PDU as the fencing device, and I have successfully tested the cluster.conf fencing setup with the APC SNMP fencing agent.
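In case it helps anyone comparing setups, the fencing test was along these lines (agent options from memory; the address, community string and outlet number are placeholders, and this obviously needs the live PDU):

```shell
# Query an outlet's state through the APC PDU's SNMP interface before
# trusting it for real fencing. IP, community and outlet number below
# are placeholders, not my actual values.
fence_apc_snmp -a 192.0.2.10 -c private -n 3 -o status
```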

This cluster is on an isolated network at my workplace, so I can't actually test any changes until Monday. But, I did grab logs from each node before I left today.
 
Testing again this Monday morning. I've removed the VM from the cluster.conf so that it is no longer managed by HA. Once this has been done, I can successfully live migrate the VM between all nodes. If I return the VM to being managed by HA, the problem returns and, though clustat tells me that the new node has taken ownership of the VM, the VM never moves in the web gui.
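(For context: "managed by HA" here just means the VM has a pvevm resource entry in the rgmanager section of /etc/pve/cluster.conf, so removing/re-adding the VM is effectively adding or deleting that one line. A trimmed, from-memory sketch, not my actual file:)

```xml
<!-- Illustrative fragment of /etc/pve/cluster.conf (trimmed, from memory).
     Deleting the vmid="201" line is what takes VM 201 out of HA management;
     re-adding it brings the problem back. -->
<rm>
  <pvevm autostart="1" vmid="200"/>
  <pvevm autostart="1" vmid="201"/>
</rm>
```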
 
While attempting to migrate the VM using the CLI from Node 1 with the command:

Code:
clusvcadm -M pvevm:201 -m ppat-pdnhvz003

I've come across some important information in the rgmanager logs of each node. I've edited the logs for readability.

Node 1 (ppat-pdnhvz001) - rgmanager.log:
Code:
Jun 02 09:05:35 rgmanager pvevm:201 was added to the config, but I am not initializing it.
Jun 02 09:05:42 rgmanager [pvevm] VM 200 is running
Jun 02 09:05:46 rgmanager Migration: pvevm:201 is running on 3
Jun 02 09:05:52 rgmanager [pvevm] VM 200 is running
Jun 02 09:06:32 rgmanager [pvevm] VM 200 is running
...
Jun 02 09:12:42 rgmanager [pvevm] VM 200 is running
Jun 02 09:13:02 rgmanager [pvevm] VM 200 is running
Jun 02 09:13:09 rgmanager Migration: pvevm:201 is running on 2
Jun 02 09:13:12 rgmanager [pvevm] VM 200 is running
Jun 02 09:13:53 rgmanager [pvevm] VM 200 is running
...
Jun 02 09:20:13 rgmanager [pvevm] VM 200 is running
Jun 02 09:20:53 rgmanager [pvevm] VM 200 is running
Jun 02 09:21:11 rgmanager Migration: pvevm:201 is running on 3
Jun 02 09:21:13 rgmanager [pvevm] VM 200 is running
Jun 02 09:21:33 rgmanager [pvevm] VM 200 is running
Jun 02 09:22:11 rgmanager Migration: pvevm:201 is running on 3
Jun 02 09:22:13 rgmanager [pvevm] VM 200 is running
Jun 02 09:22:23 rgmanager [pvevm] VM 200 is running
...
Jun 02 09:24:23 rgmanager [pvevm] VM 200 is running
Jun 02 09:24:33 rgmanager [pvevm] VM 200 is running
Jun 02 09:24:44 rgmanager BUG! Attempt to forward to myself!
Jun 02 09:24:53 rgmanager [pvevm] VM 200 is running
Jun 02 09:25:23 rgmanager [pvevm] VM 200 is running
Jun 02 09:25:30 rgmanager Migration: pvevm:201 is running on 2
Jun 02 09:25:34 rgmanager [pvevm] VM 200 is running
Jun 02 09:25:54 rgmanager [pvevm] VM 200 is running
...
Jun 02 09:30:14 rgmanager [pvevm] VM 200 is running
Jun 02 09:30:44 rgmanager [pvevm] VM 200 is running
Jun 02 09:31:14 rgmanager [pvevm] VM 201 is not running
Jun 02 09:31:24 rgmanager [pvevm] VM 200 is running
Jun 02 09:31:34 rgmanager [pvevm] VM 200 is running
...
Jun 02 09:37:14 rgmanager [pvevm] VM 200 is running
Jun 02 09:37:44 rgmanager [pvevm] VM 200 is running


Node 2 (ppat-pdnhvz002) - rgmanager.log:
Code:
Jun 02 09:05:31 rgmanager Reconfiguring
Jun 02 09:05:31 rgmanager Loading Service Data
Jun 02 09:05:33 rgmanager Stopping changed resources.
Jun 02 09:05:33 rgmanager Restarting changed resources.
Jun 02 09:05:33 rgmanager Starting changed resources.
Jun 02 09:05:33 rgmanager Initializing pvevm:201
Jun 02 09:05:33 rgmanager pvevm:201 was added to the config, but I am not initializing it.
Jun 02 09:05:34 rgmanager Migration: pvevm:200 is running on 1
Jun 02 09:05:43 rgmanager Starting stopped service pvevm:201
Jun 02 09:05:44 rgmanager [pvevm] Move config for VM 201 to local node
Jun 02 09:05:45 rgmanager Service pvevm:201 started
Jun 02 09:05:45 rgmanager [pvevm] VM 201 is running
Jun 02 09:05:46 rgmanager [pvevm] VM 201 is running
Jun 02 09:13:09 rgmanager [pvevm] VM 201 is running
Jun 02 09:13:12 rgmanager [pvevm] VM 201 is running
Jun 02 09:13:22 rgmanager [pvevm] VM 201 is running
Jun 02 09:13:52 rgmanager [pvevm] VM 201 is running
Jun 02 09:14:07 rgmanager Migrating pvevm:201 to ppat-pdnhvz003
Jun 02 09:14:08 rgmanager [pvevm] Task still active, waiting
Jun 02 09:14:09 rgmanager [pvevm] Task still active, waiting
...
Jun 02 09:24:07 rgmanager [pvevm] Task still active, waiting
Jun 02 09:24:08 rgmanager [pvevm] Task still active, waiting
Jun 02 09:24:09 rgmanager migrate on pvevm "201" returned 1 (generic error)
Jun 02 09:24:09 rgmanager Migration of pvevm:201 to ppat-pdnhvz003 failed; return code 1
Jun 02 09:24:09 rgmanager [pvevm] VM 201 is running
Jun 02 09:24:10 rgmanager [pvevm] VM 201 is running
Jun 02 09:24:41 rgmanager BUG! Attempt to forward to myself!
Jun 02 09:25:30 rgmanager [pvevm] VM 201 is running
Jun 02 09:25:33 rgmanager [pvevm] VM 201 is running
Jun 02 09:25:37 rgmanager Migrating pvevm:201 to ppat-pdnhvz003
Jun 02 09:25:38 rgmanager [pvevm] Task still active, waiting
Jun 02 09:25:39 rgmanager [pvevm] Task still active, waiting
...
Jun 02 09:35:37 rgmanager [pvevm] Task still active, waiting
Jun 02 09:35:38 rgmanager [pvevm] Task still active, waiting
Jun 02 09:35:39 rgmanager migrate on pvevm "201" returned 1 (generic error)
Jun 02 09:35:39 rgmanager Migration of pvevm:201 to ppat-pdnhvz003 failed; return code 1
Jun 02 09:35:39 rgmanager Stopping service pvevm:201
Jun 02 09:35:40 rgmanager [pvevm] Task still active, waiting
Jun 02 09:35:41 rgmanager Service pvevm:201 is disabled
Jun 02 09:35:42 rgmanager [pvevm] VM 201 is not running


Node 3 (ppat-pdnhvz003) - rgmanager.log:
Code:
Jun 02 09:05:31 rgmanager Reconfiguring
Jun 02 09:05:31 rgmanager Loading Service Data
Jun 02 09:05:33 rgmanager Stopping changed resources.
Jun 02 09:05:33 rgmanager Restarting changed resources.
Jun 02 09:05:33 rgmanager Starting changed resources.
Jun 02 09:05:33 rgmanager Initializing pvevm:201
Jun 02 09:05:33 rgmanager pvevm:201 was added to the config, but I am not initializing it.
Jun 02 09:05:35 rgmanager Migration: pvevm:200 is running on 1
Jun 02 09:05:44 rgmanager [pvevm] VM 201 is running
Jun 02 09:05:46 rgmanager [pvevm] VM 201 is running
...
Jun 02 09:23:33 rgmanager [pvevm] VM 201 is running
Jun 02 09:24:03 rgmanager [pvevm] VM 201 is running
Jun 02 09:24:09 rgmanager Stopping service pvevm:201
Jun 02 09:24:10 rgmanager stop on pvevm "201" returned 1 (generic error)
Jun 02 09:24:10 rgmanager #12: RG pvevm:201 failed to stop; intervention required
Jun 02 09:24:10 rgmanager Service pvevm:201 is failed
Jun 02 09:24:14 rgmanager BUG! Attempt to forward to myself!
Jun 02 09:24:27 rgmanager BUG! Attempt to forward to myself!
Jun 02 09:24:38 rgmanager BUG! Attempt to forward to myself!
 
You can only reliably add a VM to HA when the VM is not running, as the log also tells you:
Jun 02 09:05:33 rgmanager pvevm:201 was added to the config, but I am not initializing it.
 
I guess it is not a simple matter of stopping the VMs and removing/re-adding them to the cluster.conf through the web gui? I have just attempted that and receive the same "not initializing it" message.

*EDIT: Adding logs taken after removing the VMs from HA management, running a start/stop cycle on the VMs, and re-adding them to HA while they were stopped. 10:26am is when the new cluster.conf was activated.

Node1 - messages.log:
Code:
Jun  2 10:26:06 ppat-pdnhvz001 corosync[2533]:   [QUORUM] Members[3]: 1 2 3
Jun  2 10:26:06 ppat-pdnhvz001 rgmanager[826065]: Reconfiguring
Jun  2 10:26:06 ppat-pdnhvz001 rgmanager[826065]: Loading Service Data
Jun  2 10:26:08 ppat-pdnhvz001 rgmanager[826065]: Stopping changed resources.
Jun  2 10:26:08 ppat-pdnhvz001 rgmanager[826065]: Restarting changed resources.
Jun  2 10:26:08 ppat-pdnhvz001 rgmanager[826065]: Starting changed resources.
Jun  2 10:26:08 ppat-pdnhvz001 rgmanager[826065]: Initializing pvevm:200
Jun  2 10:26:08 ppat-pdnhvz001 rgmanager[826065]: pvevm:200 was added to the config, but I am not initializing it.
Jun  2 10:26:08 ppat-pdnhvz001 rgmanager[826065]: Initializing pvevm:201
Jun  2 10:26:08 ppat-pdnhvz001 rgmanager[826065]: pvevm:201 was added to the config, but I am not initializing it.
Jun  2 10:26:19 ppat-pdnhvz001 rgmanager[826065]: Migration: pvevm:200 is running on 3
Jun  2 10:26:20 ppat-pdnhvz001 rgmanager[826065]: Migration: pvevm:201 is running on 3


Node2 - messages.log:
Code:
Jun  2 10:26:06 ppat-pdnhvz002 corosync[2517]:   [QUORUM] Members[3]: 1 2 3
Jun  2 10:26:06 ppat-pdnhvz002 rgmanager[2879]: Reconfiguring
Jun  2 10:26:06 ppat-pdnhvz002 rgmanager[2879]: Loading Service Data
Jun  2 10:26:08 ppat-pdnhvz002 rgmanager[2879]: Stopping changed resources.
Jun  2 10:26:08 ppat-pdnhvz002 rgmanager[2879]: Restarting changed resources.
Jun  2 10:26:08 ppat-pdnhvz002 rgmanager[2879]: Starting changed resources.
Jun  2 10:26:08 ppat-pdnhvz002 rgmanager[2879]: Initializing pvevm:200
Jun  2 10:26:08 ppat-pdnhvz002 rgmanager[2879]: pvevm:200 was added to the config, but I am not initializing it.
Jun  2 10:26:08 ppat-pdnhvz002 rgmanager[2879]: Initializing pvevm:201
Jun  2 10:26:08 ppat-pdnhvz002 rgmanager[2879]: pvevm:201 was added to the config, but I am not initializing it.
Jun  2 10:26:20 ppat-pdnhvz002 rgmanager[2879]: Migration: pvevm:200 is running on 3


Node3 - messages.log:
Code:
Jun  2 10:26:06 ppat-pdnhvz003 corosync[2534]:   [QUORUM] Members[3]: 1 2 3
Jun  2 10:26:06 ppat-pdnhvz003 rgmanager[2940]: Reconfiguring
Jun  2 10:26:06 ppat-pdnhvz003 rgmanager[2940]: Loading Service Data
Jun  2 10:26:08 ppat-pdnhvz003 rgmanager[2940]: Stopping changed resources.
Jun  2 10:26:08 ppat-pdnhvz003 rgmanager[2940]: Restarting changed resources.
Jun  2 10:26:08 ppat-pdnhvz003 rgmanager[2940]: Starting changed resources.
Jun  2 10:26:08 ppat-pdnhvz003 rgmanager[2940]: Initializing pvevm:200
Jun  2 10:26:08 ppat-pdnhvz003 rgmanager[2940]: pvevm:200 was added to the config, but I am not initializing it.
Jun  2 10:26:08 ppat-pdnhvz003 rgmanager[2940]: Initializing pvevm:201
Jun  2 10:26:08 ppat-pdnhvz003 rgmanager[2940]: pvevm:201 was added to the config, but I am not initializing it.
Jun  2 10:26:08 ppat-pdnhvz003 rgmanager[2940]: Starting stopped service pvevm:200
Jun  2 10:26:09 ppat-pdnhvz003 rgmanager[188077]: [pvevm] Move config for VM 200 to local node
Jun  2 10:26:09 ppat-pdnhvz003 task UPID:ppat-pdnhvz003:0002DEC1:027D78A2:538C8981:qmstart:200:root@pam:: start VM 200: UPID:ppat-pdnhvz003:0002DEC1:027D78A2:538C8981:qmstart:200:root@pam:
Jun  2 10:26:09 ppat-pdnhvz003 pvevm: <root@pam> starting task UPID:ppat-pdnhvz003:0002DEC1:027D78A2:538C8981:qmstart:200:root@pam:
Jun  2 10:26:09 ppat-pdnhvz003 kernel: device tap200i0 entered promiscuous mode
Jun  2 10:26:09 ppat-pdnhvz003 kernel: vmbr0: port 3(tap200i0) entering forwarding state
Jun  2 10:26:10 ppat-pdnhvz003 pvevm: <root@pam> end task UPID:ppat-pdnhvz003:0002DEC1:027D78A2:538C8981:qmstart:200:root@pam: OK
Jun  2 10:26:10 ppat-pdnhvz003 rgmanager[2940]: Service pvevm:200 started
Jun  2 10:26:18 ppat-pdnhvz003 rgmanager[2940]: Starting stopped service pvevm:201
Jun  2 10:26:18 ppat-pdnhvz003 rgmanager[188137]: [pvevm] VM 201 is already running
Jun  2 10:26:19 ppat-pdnhvz003 rgmanager[2940]: Service pvevm:201 started

Looking at the last few entries on Node3, it appears to be falsely reporting that VM 201 is already running. The web gui is still showing VM 201 as stopped on Node 2.
 
This thread can be marked as solved.

Solution:
  1. Stop the running VMs
  2. Remove the VMs from the cluster.conf so that they are not managed by HA.
    1. In the web gui, select "Datacenter"
    2. Select the "HA" tab.
    3. Select the appropriate pvevm vmid.
    4. Click "Remove"
    5. Click "Activate" to activate the new cluster.conf.
  3. Reboot all nodes.
  4. Once rebooted, check the status of the cman and rgmanager services.
  5. Once again, add the VMs to the cluster.conf.
  6. Activate the new cluster.conf.
  7. Check the cluster status with the clustat command.
  8. When the VMs are started again, test live migration across all nodes.
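The status checks in steps 4 and 7 were just these commands, run on each node (Proxmox VE 3.x init scripts; they need a live cluster node):

```shell
# Step 4: after rebooting, both cluster daemons should report running
# on every node before the VMs are re-added to HA.
service cman status
service rgmanager status

# Step 7: after activating the new cluster.conf, every pvevm service
# should show the expected owner node and a "started" state.
clustat
```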

For me, all tasks completed OK and all changes were properly reflected in the web gui as well as in the cluster status.

Thanks, mir, for pointing me in the right direction.