forcing .members file update

CTCcloud

Renowned Member
Apr 6, 2012
We have a cluster of 8 nodes; one was JUST recently added. When trying to migrate a VM from a particular node to the new node, it fails with "no such cluster node 'node name'".

After looking around, the /etc/pve/.members file doesn't match on all 8 nodes. It's the same on 5 nodes but 3 nodes have an older version that doesn't contain the new node.

How do I get the .members files updated forcibly?

Thanks in advance for any help,

CTC
 
Hi,

After looking around, the /etc/pve/.members file doesn't match on all 8 nodes. It's the same on 5 nodes but 3 nodes have an older version that doesn't contain the new node.

How do I get the .members files updated forcibly?

You got it the wrong way around: .members is a special read-only file that shows which nodes the cluster is configured for and aware of, plus their state. You cannot force anything into it; instead you need to fix the cluster/corosync config.
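For reference, on a healthy node it looks roughly like this (hostnames, IPs and counts here are purely illustrative); a node missing from "nodelist" is one the cluster stack on that host simply doesn't know about:
Code:
root@node1:~# cat /etc/pve/.members
{
"nodename": "node1",
"version": 5,
"cluster": { "name": "mycluster", "version": 5, "nodes": 3, "quorate": 1 },
"nodelist": {
  "node1": { "id": 1, "online": 1, "ip": "192.168.0.1"},
  "node2": { "id": 2, "online": 1, "ip": "192.168.0.2"},
  "node3": { "id": 3, "online": 1, "ip": "192.168.0.3"}
  }
}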

Can you post the corosync config from /etc/pve/corosync.conf (if on PVE 4.x)?

All in all it seems the node join wasn't entirely successful. Were there any errors or anything similar during the process?
 
No, there were no errors joining the node, and it was working fine in the beginning. We were able to move virtual machines from a node that now can't. The newest node is PMX-73-V.

Here is the corosync.conf

root@PMX-72-I:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: PMX-72-II
    nodeid: 4
    quorum_votes: 1
    ring0_addr: PMX-72-II
  }

  node {
    name: PMX-72-III
    nodeid: 5
    quorum_votes: 1
    ring0_addr: PMX-72-III
  }

  node {
    name: PM-72-N
    nodeid: 1
    quorum_votes: 1
    ring0_addr: PM-72-N
  }

  node {
    name: PM-61-I
    nodeid: 3
    quorum_votes: 1
    ring0_addr: PM-61-I
  }

  node {
    name: PMX-73-V
    nodeid: 8
    quorum_votes: 1
    ring0_addr: PMX-73-V
  }

  node {
    name: PMX-72-IV
    nodeid: 6
    quorum_votes: 1
    ring0_addr: PMX-72-IV
  }

  node {
    name: PMX-72-I
    nodeid: 2
    quorum_votes: 1
    ring0_addr: PMX-72-I
  }

  node {
    name: PMX-71-I
    nodeid: 7
    quorum_votes: 1
    ring0_addr: PMX-71-I
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: ctccloud
  config_version: 22
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 192.168.0.34
    ringnumber: 0
  }
}
 
Ok, the names are a little hard to distinguish, but the config looks good to me.
You can check whether it is the same on the failed nodes too.
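For example, something along these lines (run from a healthy node; the hostname is just an example) would show any difference:
Code:
ssh root@PMX-72-III cat /etc/pve/corosync.conf | diff - /etc/pve/corosync.conf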

We were able to move virtual machines from a node that now can't.
So a previously healthy node is not healthy anymore?


I would need the output from
Code:
pvecm st
systemctl status corosync
systemctl status pve-cluster
on a healthy node and on the failed ones please, to get an idea of what happened.

You may try to restart corosync and the pve-cluster service through systemctl restart, but I would prefer to see the output from above first, just in case.
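For reference, that restart would be roughly the following, run on the affected node:
Code:
systemctl restart corosync
systemctl restart pve-cluster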
 
It's not that they are unhealthy or even act unhealthy ... the 3 that can't migrate a VM to PMX-73-V simply give the error "no such node".

PMX-72-I ("healthy node")

root@PMX-72-I:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: active (running) since Tue 2016-04-26 16:52:10 EDT; 15h ago
Process: 3136 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
Main PID: 3145 (corosync)
CGroup: /system.slice/corosync.service
└─3145 corosync

Apr 26 16:52:09 PMX-72-I corosync[3145]: [TOTEM ] JOIN or LEAVE message was ....
Apr 26 16:52:09 PMX-72-I corosync[3145]: [TOTEM ] JOIN or LEAVE message was ....
Apr 26 16:52:09 PMX-72-I corosync[3145]: [TOTEM ] A new membership (192.168....2
Apr 26 16:52:09 PMX-72-I corosync[3145]: [QUORUM] Members[1]: 2
Apr 26 16:52:09 PMX-72-I corosync[3145]: [MAIN ] Completed service synchron....
Apr 26 16:52:09 PMX-72-I corosync[3145]: [TOTEM ] A new membership (192.168....6
Apr 26 16:52:09 PMX-72-I corosync[3145]: [QUORUM] This node is within the pr....
Apr 26 16:52:09 PMX-72-I corosync[3145]: [QUORUM] Members[8]: 1 7 8 2 3 4 5 6
Apr 26 16:52:09 PMX-72-I corosync[3145]: [MAIN ] Completed service synchron....
Apr 26 16:52:10 PMX-72-I corosync[3136]: Starting Corosync Cluster Engine (c...]
Hint: Some lines were ellipsized, use -l to show in full.

root@PMX-72-I:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: active (running) since Tue 2016-04-26 16:52:09 EDT; 15h ago
Process: 2904 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Process: 2856 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
Main PID: 2898 (pmxcfs)
CGroup: /system.slice/pve-cluster.service
└─2898 /usr/bin/pmxcfs

Apr 27 07:18:47 PMX-72-I pmxcfs[2898]: [status] notice: received log
Apr 27 07:18:53 PMX-72-I pmxcfs[2898]: [status] notice: received log
Apr 27 07:19:11 PMX-72-I pmxcfs[2898]: [status] notice: received log
Apr 27 07:28:02 PMX-72-I pmxcfs[2898]: [status] notice: received log
Apr 27 07:29:31 PMX-72-I pmxcfs[2898]: [status] notice: received log
Apr 27 07:31:24 PMX-72-I pmxcfs[2898]: [dcdb] notice: data verification suc...ul
Apr 27 07:43:02 PMX-72-I pmxcfs[2898]: [status] notice: received log
Apr 27 07:44:32 PMX-72-I pmxcfs[2898]: [status] notice: received log
Apr 27 07:58:02 PMX-72-I pmxcfs[2898]: [status] notice: received log
Apr 27 07:59:32 PMX-72-I pmxcfs[2898]: [status] notice: received log

PMX-72-III - ("unhealthy")

root@PMX-72-III:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: active (running) since Wed 2016-03-09 23:48:03 EST; 1 months 17 days ago
Main PID: 2758 (corosync)
CGroup: /system.slice/corosync.service
└─2758 corosync

Apr 25 13:58:02 PMX-72-III corosync[2758]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 25 14:11:03 PMX-72-III corosync[2758]: [TOTEM ] A new membership (192.168.x.x:192488) was formed. Members joined: 2
Apr 25 14:11:03 PMX-72-III corosync[2758]: [QUORUM] Members[8]: 1 7 8 2 3 4 5 6
Apr 25 14:11:03 PMX-72-III corosync[2758]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 26 16:38:55 PMX-72-III corosync[2758]: [TOTEM ] A new membership (192.168.x.x:192492) was formed. Members left: 2
Apr 26 16:38:55 PMX-72-III corosync[2758]: [QUORUM] Members[7]: 1 7 8 3 4 5 6
Apr 26 16:38:55 PMX-72-III corosync[2758]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 26 16:52:09 PMX-72-III corosync[2758]: [TOTEM ] A new membership (192.168.x.x:192496) was formed. Members joined: 2
Apr 26 16:52:09 PMX-72-III corosync[2758]: [QUORUM] Members[8]: 1 7 8 2 3 4 5 6
Apr 26 16:52:09 PMX-72-III corosync[2758]: [MAIN ] Completed service synchronization, ready to provide service.

root@PMX-72-III:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: active (running) since Tue 2016-04-26 14:54:44 EDT; 17h ago
Process: 13702 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Process: 13699 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
Main PID: 13700 (pmxcfs)
CGroup: /system.slice/pve-cluster.service
└─13700 /usr/bin/pmxcfs

Apr 27 07:14:30 PMX-72-III pmxcfs[13700]: [status] notice: received log
Apr 27 07:18:37 PMX-72-III pmxcfs[13700]: [status] notice: received log
Apr 27 07:18:53 PMX-72-III pmxcfs[13700]: [status] notice: received log
Apr 27 07:28:02 PMX-72-III pmxcfs[13700]: [status] notice: received log
Apr 27 07:29:31 PMX-72-III pmxcfs[13700]: [status] notice: received log
Apr 27 07:31:24 PMX-72-III pmxcfs[13700]: [dcdb] notice: data verification successful
Apr 27 07:43:02 PMX-72-III pmxcfs[13700]: [status] notice: received log
Apr 27 07:44:32 PMX-72-III pmxcfs[13700]: [status] notice: received log
Apr 27 07:58:02 PMX-72-III pmxcfs[13700]: [status] notice: received log
Apr 27 07:59:32 PMX-72-III pmxcfs[13700]: [status] notice: received log

Maybe I misled you by asking about the .members files ... the real problem is that 3 nodes out of 8 can't migrate to the newest node.
 
It's not that they are unhealthy or even act unhealthy ... the 3 that can't migrate a VM to PMX-73-V simply give the error "no such node".

I see, but that's quite strange behaviour; that's the reason I asked for those details.
So I guess if you do
Code:
touch /etc/pve/test.tmp
It appears on all nodes ("healthy" and "unhealthy")?

Maybe I misled you by asking about the .members files ... the real problem is that 3 nodes out of 8 can't migrate to the newest node.

Can you restart the pve-cluster service on a node where the .members file does not show all nodes?
Code:
systemctl restart pve-cluster
 
To answer your first question, yes, test.tmp appears on all nodes

I restarted the corosync service on the "unhealthy" nodes and now their .members files show all nodes and the correct version number.
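For what it's worth, a quick way to compare them across hosts is something like this (hostnames illustrative; it just counts the per-node "id" entries in each .members file):
Code:
for n in PMX-72-I PMX-72-III PMX-73-V; do
    ssh root@$n 'hostname; grep -c "\"id\"" /etc/pve/.members'
done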

I'll have to test the ability to migrate to the new node now, I guess, to see if restarting corosync did the trick ... looks like it may have ... almost as if the corosync service had gone a bit funky.
 
To answer your first question, yes, test.tmp appears on all nodes

It appeared already on all nodes before you restarted it? Strange; that would mean corosync per se was working, but the library used by the cluster file system had some problems, probably regarding communication. Quite strange and, yeah, funky.

But good to hear that that did the trick; the migration problem should be gone if the .members file is correct now.
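If it ever shows up again, a quick way to cross-check what corosync sees against what the cluster file system sees would be something like:
Code:
corosync-quorumtool -l    # membership as corosync reports it
cat /etc/pve/.members     # membership as pmxcfs reports it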
 
No, to be frank, I had already restarted the corosync service on the "unhealthy" nodes before trying your suggestion of adding the empty file to /etc/pve.

Your questions were enough to prod my thought process, so it was this back-and-forth dialog that got the ball rolling and made me think to restart the corosync service ... honestly, I'm still getting my head wrapped around systemd, so I wasn't totally sure how to restart the corosync service since it's not in /etc/init.d/ where many of the scripts are. Your questions gave me the answer.

If you think it'd be wise to do some further testing as to why corosync was in a funk on those 3 nodes, let me know ... if not, thank you very much for your help if those nodes can now migrate VMs to the new node

I will confirm whether migration now works as well ..
 
Just to confirm

Migration is now working just fine

Thanks again for prodding me in the right direction, and again, if you think it'd be good to test a bit more just let me know; if not, we'll call this one licked.
 
Yeah, systemd requires getting used to a little. But all in all it can make for a comfortable/seamless experience if configured correctly, imo.

Ok, good to hear that all is working again now.

Still a little strange; is multicast working reliably? I guess so, else the empty file would not have shown up.
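If in doubt, the usual quick check is something like the following, started at roughly the same time on all nodes (hostnames illustrative):
Code:
omping -c 600 -i 1 -q PMX-72-I PMX-72-III PMX-73-V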

I suspect that the config change gave corosync some problems which the restart fixed; testing that is hard, as it's not obvious how to trigger it.
I would tick it off as a rare hiccup from corosync for now, and if such a problem or a similar one happens again we'll give it a deeper look and try to find the root cause.
 
Multicast had been reliable, yes ... we ARE working on changing out from 1Gb switching to 10Gb switching and retrofitting the servers' NICs so that they communicate over 10Gb, which is helping in many ways.

I'm certainly good with considering it a rare hiccup from corosync and leaving it for now ... I will definitely keep an eye and if it comes back we'll get back on it.

Thanks again,

CTC
 
Sorry for reactivating this post, but it is the same for me.
Setup:
Virtual Environment 5.4-11
8 nodes
2 of 8 show just 7 nodes instead of 8.


I have restarted corosync, pve-cluster, pvedaemon, and pveproxy. Nothing works.

Khanh Nguyen
 
Sorry for reactivating this post, but it is the same for me.

Pretty sure it isn't the exact same root cause; this post is from 2016, and Proxmox VE 5.0 wasn't even released then.
Please open a new thread and post your corosync config and any errors from the syslog on the problematic node, thanks!
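Something along these lines should cover it (time window up to you):
Code:
cat /etc/pve/corosync.conf
journalctl -u corosync -u pve-cluster --since "1 hour ago"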
 
