New host added to cluster does not show in web interface for all other hosts

Sean Corbett
Aug 11, 2016
Hi,

I've got a 13-node cluster, 7 of which are just running Ceph (i.e. they don't host any VMs) and the other 6 of which are just hosting VMs. When I added the 13th host, it seemed to join the cluster just fine, and when I check pvecm status on any of the nodes, all 13 nodes show up and they have quorum. However, when I check the web interface on any of the 6 non-Ceph hosts, the new node does not appear and I cannot migrate any VMs to it (it says "no such cluster node" if I try). The new 13th host *does* show up in the web interface of the 7 Ceph-only hosts, however.

I've attempted to restart the pveproxy, corosync, and pve-cluster services on the hosts having the issue, but I wasn't sure what else to try. (I do know that multicast is working properly, because we had a multicast configuration issue on our switches previously that caused all kinds of trouble... since fixing that, the cluster has been working smoothly until this issue appeared.)
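
In case it matters, here is exactly what I restarted on the affected hosts (standard systemd unit names from my install, listed in the order I ran them):
Code:
systemctl restart pveproxy      # web interface / REST API
systemctl restart corosync      # cluster membership and quorum
systemctl restart pve-cluster   # pmxcfs, the filesystem behind /etc/pve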

Here is my corosync.conf, which seems to be correctly propagated to all 13 hosts:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vmhost85
    nodeid: 13
    quorum_votes: 1
    ring0_addr: vmhost85
  }

  node {
    name: vmhost84
    nodeid: 8
    quorum_votes: 1
    ring0_addr: vmhost84
  }

  node {
    name: vmhost82
    nodeid: 6
    quorum_votes: 1
    ring0_addr: vmhost82
  }

  node {
    name: vmhost-ceph-5
    nodeid: 10
    quorum_votes: 1
    ring0_addr: vmhost-ceph-5
  }

  node {
    name: vmhost-ceph-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: vmhost-ceph-1
  }

  node {
    name: vmhost-ceph-7
    nodeid: 12
    quorum_votes: 1
    ring0_addr: vmhost-ceph-7
  }

  node {
    name: vmhost-ceph-6
    nodeid: 11
    quorum_votes: 1
    ring0_addr: vmhost-ceph-6
  }

  node {
    name: vmhost81
    nodeid: 5
    quorum_votes: 1
    ring0_addr: vmhost81
  }
  node {
    name: vmhost-ceph-2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: vmhost-ceph-2
  }

  node {
    name: vmhost-ceph-4
    nodeid: 9
    quorum_votes: 1
    ring0_addr: vmhost-ceph-4
  }

  node {
    name: vmhost83
    nodeid: 7
    quorum_votes: 1
    ring0_addr: vmhost83
  }

  node {
    name: vmhost-ceph-3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: vmhost-ceph-3
  }

  node {
    name: vmhost80
    nodeid: 4
    quorum_votes: 1
    ring0_addr: vmhost80
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: gjr-virt-stack
  config_version: 13
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.1.2.61
    ringnumber: 0
  }

}

Edit:
Also this is a 4.1 cluster, with all software up-to-date in the repos. (we don't have a support key for these servers yet, but I'm working on that part...)
 
If you are on 4.1, you are not up to date.

Even if you have no support, you can use the no-subscription repository to get updates;
see https://pve.proxmox.com/wiki/Package_repositories
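
For 4.x on Jessie the repository entry should look roughly like this (check the wiki page above for the exact current line; the file name below is just a suggestion):
Code:
# /etc/apt/sources.list.d/pve-no-subscription.list
deb http://download.proxmox.com/debian jessie pve-no-subscription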

As for your problem, try restarting pvestatd, and make sure you can resolve the names of the hosts everywhere (vmhost84, vmhost85, etc.).
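
A quick way to check resolution is something along these lines on every node (hostnames taken from your corosync.conf):
Code:
for h in vmhost80 vmhost81 vmhost82 vmhost83 vmhost84 vmhost85; do
    getent hosts "$h"    # should print the same IP for each name on every node
done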

Also, what does
Code:
pvecm status

show?
 
Thank you for the reply.

pvecm status on all the nodes I've tried so far shows quorum:

From the new node:
Code:
root@vmhost85:~# pvecm status
Quorum information
------------------
Date:             Wed Aug 17 10:08:56 2016
Quorum provider:  corosync_votequorum
Nodes:            13
Node ID:          0x0000000d
Ring ID:          652844
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   13
Highest expected: 13
Total votes:      13
Quorum:           7 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.1.2.61
0x00000002          1 10.1.2.62
0x00000003          1 10.1.2.63
0x00000009          1 10.1.2.64
0x0000000a          1 10.1.2.65
0x0000000b          1 10.1.2.66
0x0000000c          1 10.1.2.67
0x00000004          1 10.1.2.80
0x00000005          1 10.1.2.81
0x00000006          1 10.1.2.82
0x00000007          1 10.1.2.83
0x00000008          1 10.1.2.84
0x0000000d          1 10.1.2.85 (local)

From one of the "trouble" nodes:
Code:
root@vmhost80:~# pvecm status
Quorum information
------------------
Date:             Wed Aug 17 10:08:10 2016
Quorum provider:  corosync_votequorum
Nodes:            13
Node ID:          0x00000004
Ring ID:          652844
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   13
Highest expected: 13
Total votes:      13
Quorum:           7 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.1.2.61
0x00000002          1 10.1.2.62
0x00000003          1 10.1.2.63
0x00000009          1 10.1.2.64
0x0000000a          1 10.1.2.65
0x0000000b          1 10.1.2.66
0x0000000c          1 10.1.2.67
0x00000004          1 10.1.2.80 (local)
0x00000005          1 10.1.2.81
0x00000006          1 10.1.2.82
0x00000007          1 10.1.2.83
0x00000008          1 10.1.2.84
0x0000000d          1 10.1.2.85

A new piece of the puzzle: I created a test VM on VMHost85 (the newest host), and suddenly the host appeared in the web interface on VMHost80 (one of the trouble hosts). The VM I created even appeared in the web interface, but it doesn't show the VM's name, only its ID, and the icon for VMHost85 is red. (As before, when I log into one of the Ceph-only nodes, VMHost85 behaves as expected: the icon is green, and the test VM and all expected info populate.)

What's odd is that in VMHost80's web interface, if I *click* on VMHost85, I can tab through all the properties of the host and even look at the VM info. But when I try to migrate a VM to VMHost85, I still get "no such cluster node."
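
One more data point I can gather: as far as I understand, pmxcfs keeps a member list under /etc/pve that the web interface relies on, so I will diff it between a behaving node and a trouble node (the path is from my install, so treat it as an assumption):
Code:
# run on vmhost-ceph-1 (behaving) and on vmhost80 (trouble), then compare
cat /etc/pve/.members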

Per your advice, I did restart pvestatd on several of the hosts and it didn't seem to change anything.

I had not upgraded all the way to 4.2 because I didn't want to risk taking any hosts offline (which I assume I have to do, since there's a new kernel). But if that's what you'd recommend, I can try that during off-hours.
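
If upgrading is the recommended route, my plan for each host would be roughly the following, one node at a time after migrating its VMs away (standard apt upgrade plus a reboot for the new kernel; please correct me if there is a better procedure):
Code:
apt-get update
apt-get dist-upgrade
reboot    # pick up the new kernel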

Any logs, etc, that I should be checking for clues?
 
Another clue: When I run "pvecm nodes" on one of the *behaving* nodes, it shows this:

Code:
root@vmhost-ceph-1:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 vmhost-ceph-1 (local)
         2          1 vmhost-ceph-2
         3          1 vmhost-ceph-3
         9          1 vmhost-ceph-4
        10          1 vmhost-ceph-5
        11          1 vmhost-ceph-6
        12          1 vmhost-ceph-7
         4          1 vmhost80
         5          1 vmhost81
         6          1 vmhost82
         7          1 vmhost83
         8          1 vmhost84
        13          1 vmhost85

However, when I run it on one of the trouble nodes, it shows this:
Code:
root@vmhost80:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 vmhost-ceph-1
         2          1 vmhost-ceph-2
         3          1 vmhost-ceph-3
         9          1 vmhost-ceph-4
        10          1 vmhost-ceph-5
        11          1 vmhost-ceph-6
        12          1 vmhost-ceph-7
         4          1 vmhost80 (local)
         5          1 vmhost81
         6          1 vmhost82
         7          1 vmhost83
         8          1 vmhost84
        13          1 vmhost85.gjr.org

Note the FQDN for the last node name.
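
For what it's worth, this smells like an /etc/hosts inconsistency to me. I would expect every node to carry an entry for the new host roughly like the one below (the gjr.org domain is taken from the output above; the exact layout is my assumption), and I can check what each node actually resolves with getent:
Code:
# expected /etc/hosts entry on every node (short name as an alias)
10.1.2.85   vmhost85.gjr.org vmhost85

# check what a given node resolves the name and the address to
getent hosts vmhost85
getent hosts 10.1.2.85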

When I try to migrate I get this:
Code:
root@vmhost80:~# qm migrate 232 vmhost85 -online
no such cluster node 'vmhost85'

But if I use the FQDN I get this:
Code:
root@vmhost80:~# qm migrate 232 vmhost85.gjr.org -online
400 Parameter verification failed.
target: invalid format - value does not look like a valid node name

qm migrate <vmid> <target> [OPTIONS]

Do you think that might be the issue? And if so, what's the best way to correct it?
 
Update: I managed to fix the node-name issue by correcting /etc/hosts on all the nodes and "force" re-joining VMHost85 to the cluster... however, that still didn't fix the migration issue, so I'm stuck again now.
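
For anyone hitting the same thing, the re-join I did was essentially the following, run on the new node after fixing /etc/hosts everywhere (the -force flag is what I used to overwrite the stale membership; double-check the pvecm man page before copying this):
Code:
# on vmhost85, pointing at an existing cluster member
pvecm add vmhost-ceph-1 -force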
 
I just had a similar thing happen with Proxmox 4.4 - added a new node, and on some (but not all) of the already-existing nodes 'pvecm nodes' showed the new node using its FQDN. Attempts to migrate VMs from a node where the new node was listed using its FQDN failed. I added 2 more nodes and the problem resolved itself, with 'pvecm nodes' on all nodes showing all nodes with just the hostname, not FQDN.
 
Any update on this? It has already happened three times on my up-to-date 5.1-36 Proxmox cluster, and I am looking for a better solution than removing and reinstalling the cluster node.