Hey guys, we have a fresh cluster down on our bench getting ready to go out the door and have hit an odd issue I can't seem to figure out.

root@mlglickprox1:~# ha-manager status
quorum OK
master mlglickprox1 (active, Wed Feb 15 07:48:01 2017)
lrm mlgickprox1 (unable to read lrm status)
lrm mlglickprox1 (active, Wed Feb 15 07:48:04 2017)
lrm mlglickprox2 (active, Wed Feb 15 07:48:06 2017)
lrm mlglickprox3 (idle, Wed Feb 15 07:48:00 2017)
service vm:100 (mlglickprox1, started)
service vm:102 (mlglickprox2, started)
root@mlglickprox1:~#

It's just a 3-node cluster and I can't seem to find where it's getting "mlgickprox1". "ha-manager status" is the only command which shows this 4th node; "pvecm nodes" and "pvecm status" don't show it. I've dug around in /etc/pve but found no trace of mlgickprox1. It's not in /etc/pve/corosync.conf, and I can't "pvecm delnode mlgickprox1" as it says the node doesn't exist. Any ideas?
 
Hi,

It's just a 3-node cluster and I can't seem to find where it's getting "mlgickprox1".

Hmm, any possibility that you:
a) renamed a node, or
b) configured a service manually (through `ha-manager add` or an editor) which is limited to specific nodes, with one node name misspelled (see the example below)?
I tried to mess around and saw that we do not check whether nodes exist when configuring them in an HA manager group, but the HA stack can cope with that, and I could not reproduce your problem...
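For example, a purely hypothetical group entry in /etc/pve/ha/groups.cfg with such a typo would look like this (note the missing "l" in the first node name):
Code:
group: prefer_prox1
        nodes mlgickprox1,mlglickprox2
        restricted 0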

I've dug around in /etc/pve but found no trace of mlgickprox1. It's not in /etc/pve/corosync.conf

Those should be OK; the HA manager does its own record keeping of which nodes are there, so it can map jobs and their results to nodes even if they no longer appear in corosync.
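That bookkeeping lives in /etc/pve/ha/manager_status (plain JSON), so you can also inspect it directly:
Code:
cat /etc/pve/ha/manager_status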

Can you please post the output of
Code:
ha-manager status --verbose
# and, if you didn't check that already
grep -r mlgickprox1 /etc/pve

I suspect that somehow the manager status has this node included...

If there is sensitive information in the VM/CT names I suggest that you redact them.
 
I ended up removing both resources from HA, then adding them back and the issue is resolved.
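For reference, with the two services shown above that amounts to roughly:
Code:
ha-manager remove vm:100
ha-manager remove vm:102
ha-manager add vm:100
ha-manager add vm:102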
 
Just checked that out, but it looked identical to /etc/pve/corosync.conf.

This should normally be the case if pve-cluster is up and working. We sync any changes from /etc/pve/corosync.conf to /etc/corosync/corosync.conf the moment they happen :)
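A quick diff will confirm it; no output means the files are identical:
Code:
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf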

I ended up removing both resources from HA, then adding them back and the issue is resolved.

Good that you could resolve it. Did you look at what was written there, or whether the grep returned anything? It would be great to know for a possible fix, but no problem if not; I'll try to mess around and reproduce this a bit myself.
 
I did not, as this is getting installed tomorrow at our client's site. If I see it again on our bench I will report back. I really appreciate the input!
 
Found this thread before starting a new one.

I have a node that was removed from the cluster, then reinstalled.

The IP stayed the same for the reinstalled node; the hostname changed from sys3 > pve3.

Code:
# ha-manager status
quorum OK
master pve10 (active, Sun Nov 12 14:14:28 2017)
lrm pve10 (idle, Sun Nov 12 14:14:31 2017)
lrm pve3 (unable to read lrm status)
lrm sys3 (old timestamp - dead?, Tue Nov  7 13:29:02 2017)
lrm sys5 (active, Sun Nov 12 14:14:30 2017)
lrm sys8 (idle, Sun Nov 12 14:14:31 2017)
service ct:124 (pve3, starting)
service ct:200 (sys5, started)
service vm:119 (sys5, started)

* corosync.conf
Code:
nodelist {
 node {
   name: pve10
   nodeid: 3
   quorum_votes: 1
   ring0_addr: 10.1.10.10
 }
 node {
   name: pve3
   nodeid: 1
   quorum_votes: 1
   ring0_addr: 10.1.10.3
 }
 node {
   name: sys5
   nodeid: 4
   quorum_votes: 1
   ring0_addr: 10.1.10.5
 }
 node {
   name: sys8
   nodeid: 2
   quorum_votes: 1
   ring0_addr: 10.1.10.8
 }
}

Note that I had changed ring0_addr to use IPs instead of hostnames.

I see that I may have neglected to increment the version number in the totem section; I just did that now. I do not think that was causing the issue, but am not sure. In any case, increasing the number did not fix it.
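(For anyone following along: the value to bump is config_version in the totem section of corosync.conf; the version key there is the configuration file format and always stays 2. Hypothetical excerpt, your cluster_name will differ:)
Code:
totem {
  cluster_name: mycluster
  # bump config_version on every edit
  config_version: 6
  # "version" is the config file format, always 2
  version: 2
}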

I tried grep to find sys3 in the config files:
Code:
# grep -r sys3 /etc/pve
/etc/pve/ha/manager_status:{"node_status":{"pve10":"online","pve3":"online","sys5":"online","sys8":"online","sys3":"gone"},"timestamp":1510514779,"service_status":{"vm:119":{"uid":"IIlHClaZtkPqpn9TtvZ1Ew","running":1,"state":"started","node":"sys5"},"ct:124":{"state":"started","node":"pve3","uid":"3YdvpwsVj5TMKj7vqs25Xg"},"ct:200":{"running":1,"state":"started","node":"sys5","uid":"EreNAbkjtLYHJqIALeGBxw"}},"master_node":"pve10"}

For now I'll remove the resource that pve3 is supposed to handle.
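That is, something like:
Code:
ha-manager remove ct:124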

If more info is needed then please reply.
 
That sys3 still appears in the ha-manager's status is intended; we only delete a node once it has been out of the member list (i.e. deleted from the cluster) for over an hour.

I see that I may have neglected to increment the version number in the totem section; I just did that now.

Did you then ensure that /etc/corosync/corosync.conf is exactly the same on pve3 and on one of the other cluster nodes?
Because if pve3 was not really in the cluster, it missed the sync of /etc/pve/corosync.conf. It may work nonetheless, but it is always good to check.
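For instance, checksums make the comparison quick (assuming root SSH from a cluster node to pve3):
Code:
md5sum /etc/corosync/corosync.conf /etc/pve/corosync.conf
ssh pve3 md5sum /etc/corosync/corosync.conf /etc/pve/corosync.conf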

What does
Code:
pvecm status
tell you on pve3 and on one of the "old" nodes?

lrm pve3 (unable to read lrm status)

This looks like the lrm could not write its status out; that may be a result of a read-only pmxcfs (/etc/pve) on pve3. If that's not the case, please also post the output of:
Code:
systemctl status pve-ha-lrm pve-ha-crm
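(A quick test for a read-only /etc/pve on pve3 is to attempt any small write there, for example:)
Code:
touch /etc/pve/writetest && rm /etc/pve/writetest && echo pmxcfs is writable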
 
Regarding /etc/corosync/corosync.conf: I just checked, and the 6 files on the 3 nodes are the same.

* pvecm status on the reinstalled node:
Code:
pve3  /etc/pve # pvecm status
Quorum information
------------------
Date:             Mon Nov 13 04:26:57 2017
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1/9268
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.1.10.3 (local)
0x00000004          1 10.1.10.5
0x00000002          1 10.1.10.8
0x00000003          1 10.1.10.10
* and on one of the original nodes:
Code:
sys5  /etc/pve # pvecm status
Quorum information
------------------
Date:             Mon Nov 13 04:30:19 2017
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000004
Ring ID:          1/9268
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.1.10.3
0x00000004          1 10.1.10.5 (local)
0x00000002          1 10.1.10.8
0x00000003          1 10.1.10.10

And the systemd status output:
Code:
sys5  /etc/pve # systemctl status pve-ha-lrm pve-ha-crm
● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2017-11-07 11:04:51 EST; 5 days ago
 Main PID: 4049 (pve-ha-lrm)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/pve-ha-lrm.service
           └─4049 pve-ha-lrm

Nov 07 11:04:50 sys5 systemd[1]: Starting PVE Local HA Ressource Manager Daemon...
Nov 07 11:04:51 sys5 pve-ha-lrm[4049]: starting server
Nov 07 11:04:51 sys5 pve-ha-lrm[4049]: status change startup => wait_for_agent_lock
Nov 07 11:04:51 sys5 systemd[1]: Started PVE Local HA Ressource Manager Daemon.
Nov 12 13:49:50 sys5 pve-ha-lrm[4049]: successfully acquired lock 'ha_agent_sys5_lock'
Nov 12 13:49:50 sys5 pve-ha-lrm[4049]: watchdog active
Nov 12 13:49:50 sys5 pve-ha-lrm[4049]: status change wait_for_agent_lock => active

● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2017-11-07 11:04:50 EST; 5 days ago
 Main PID: 4006 (pve-ha-crm)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/pve-ha-crm.service
           └─4006 pve-ha-crm

Nov 07 11:04:50 sys5 systemd[1]: Starting PVE Cluster Ressource Manager Daemon...
Nov 07 11:04:50 sys5 pve-ha-crm[4006]: starting server
Nov 07 11:04:50 sys5 pve-ha-crm[4006]: status change startup => wait_for_quorum
Nov 07 11:04:50 sys5 systemd[1]: Started PVE Cluster Ressource Manager Daemon.
Nov 12 13:49:53 sys5 pve-ha-crm[4006]: status change wait_for_quorum => slave
sys5  /etc/pve #