Node in my cluster is always offline

integart

Member
Sep 13, 2017
Hi,
I keep having a problem with one node in the cluster. The web interface always shows it as offline, even though the server itself is running.
I noticed that the problem is probably with the corosync service.

service corosync status
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: failed (Result: exit-code) since Wed 2017-09-13 10:50:16 CEST; 8min ago
Process: 19975 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Sep 13 10:49:15 node-111 corosync[19982]: [QB] server name: cmap
Sep 13 10:49:15 node-111 corosync[19982]: [SERV] Service engine loaded: corosync configuration service [1]
Sep 13 10:49:15 node-111 corosync[19982]: [QB] server name: cfg
Sep 13 10:49:15 node-111 corosync[19982]: [SERV] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 13 10:49:15 node-111 corosync[19982]: [QB] server name: cpg
Sep 13 10:49:15 node-111 corosync[19982]: [SERV] Service loading: corosync loading service [4]
Sep 13 10:50:16 node-111 corosync[19975]: Starting Corosync Cluster Engine (corosync): [FAILED]
Sep 13 10:50:16 node-111 systemd[1]: corosync.service: control process exited, code=exited status=1
Sep 13 10:50:16 node-111 systemd[1]: Failed to start Corosync Cluster Engine.
Sep 13 10:50:16 node-111 systemd[1]: Unit corosync.service entered failed state.


My corosync configuration looks like this:
cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node-112
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node-112
  }

  node {
    name: node-110
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node-110
  }

  node {
    name: node-114
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.30.10.114
  }

  node {
    name: node-113
    nodeid: 3
    quorum_votes: 1
    ring0_addr: node-113
  }

  node {
    name: node-111
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.2.111
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: INTEGART
  config_version: 9
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 172.30.10.112
    ringnumber: 0
  }

}


For comparison, a node that works properly has the following corosync configuration:

cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node-112
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node-112
  }

  node {
    name: node-110
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node-110
  }

  node {
    name: node-114
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.30.10.114
  }

  node {
    name: node-113
    nodeid: 3
    quorum_votes: 1
    ring0_addr: node-113
  }

  node {
    name: node-111
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.2.111
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: INTEGART
  config_version: 9
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 172.30.10.112
    ringnumber: 0
  }

}

Please help.
 
node {
name: node-111
nodeid: 5
quorum_votes: 1
ring0_addr: 192.168.2.111
}
This node has a completely different IP than the bindnetaddr. I presume the other nodes also use 172.30.xxx.xxx addresses. Is node-111 the failing one?
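
The mismatch pointed out here can also be checked mechanically. Below is a minimal Python sketch, not a Proxmox tool: it assumes the cluster network is a /24 around bindnetaddr (the netmask is not stated in this thread), and it skips ring0_addr entries that are hostnames, since those would need name resolution. The function name and the shortened sample text are made up for illustration.

```python
import ipaddress
import re

def addrs_outside_network(conf_text, prefix_len=24):
    """Return (name, ring0_addr) pairs whose address lies outside the bindnetaddr network.

    prefix_len=24 is an assumption; the thread does not state the netmask.
    """
    bind = re.search(r"bindnetaddr:\s*(\S+)", conf_text).group(1)
    network = ipaddress.ip_network(f"{bind}/{prefix_len}", strict=False)

    outside = []
    # Node blocks contain no nested braces, so a lazy match up to "}" is enough.
    for block in re.findall(r"node\s*\{(.*?)\}", conf_text, flags=re.S):
        name = re.search(r"name:\s*(\S+)", block).group(1)
        addr = re.search(r"ring0_addr:\s*(\S+)", block).group(1)
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            continue  # a hostname such as node-112: skipped in this sketch
        if ip not in network:
            outside.append((name, addr))
    return outside

# Shortened excerpt of the configuration posted above.
SAMPLE = """
nodelist {
  node {
    name: node-114
    ring0_addr: 172.30.10.114
  }
  node {
    name: node-111
    ring0_addr: 192.168.2.111
  }
}
totem {
  interface {
    bindnetaddr: 172.30.10.112
  }
}
"""

print(addrs_outside_network(SAMPLE))  # [('node-111', '192.168.2.111')]
```

Only node-111 is reported, which matches the observation that its address does not belong to the 172.30.10.x network.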
 
I noticed that it is entered incorrectly. But how do I change it? And can it be done on a running cluster?
 
If the network and naming setup of that node is correct, then in the simplest case you only need to change the address in corosync.conf.
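
To make that change concrete: it means replacing the node's ring0_addr, and on Proxmox VE the config_version in the totem section should be increased with every edit so that the new file counts as newer than the old one. Here is a rough Python sketch of that text transformation, assuming the layout of the posted file (name listed before ring0_addr inside each node block). The function name is hypothetical, and it works on a string rather than on /etc/pve/corosync.conf itself.

```python
import re

def set_ring0_addr(conf_text, node_name, new_addr):
    """Return conf_text with node_name's ring0_addr replaced and config_version + 1."""
    # Match the node's name line, then lazily everything up to its ring0_addr value.
    # [^}]*? keeps the match inside a single node { ... } block.
    pattern = re.compile(
        r"(name:\s*" + re.escape(node_name) + r"\s[^}]*?ring0_addr:\s*)\S+"
    )
    updated, count = pattern.subn(lambda m: m.group(1) + new_addr, conf_text)
    if count != 1:
        raise ValueError(f"expected exactly one node named {node_name!r}, found {count}")

    # Bump config_version so the edited file is treated as the newer one.
    return re.sub(
        r"(config_version:\s*)(\d+)",
        lambda m: m.group(1) + str(int(m.group(2)) + 1),
        updated,
    )

# Shortened excerpt of the posted configuration.
SAMPLE_CONF = """node {
  name: node-111
  nodeid: 5
  quorum_votes: 1
  ring0_addr: 192.168.2.111
}
totem {
  config_version: 9
}
"""

print(set_ring0_addr(SAMPLE_CONF, "node-111", "172.30.10.111"))
```

The sketch only shows which two values change. It is safer to review the result by hand and keep a backup of the original before replacing the real file.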
 
I changed the address to the correct one (172.30.10.111) and restarted the server, but after the reboot the previous address (192.168.2.111) was back.
 
Okay, thank you, there is some progress. However, when I log in to another node through the web interface and try to open the node that now shows as online, I get the messages shown in the attachments.
 

Attachments

  • Bez tytułu.png (72.3 KB)
  • Bez tytułu2.png (85 KB)
  • Bez tytułu3.png (74.9 KB)
What does "pvecm status" give you? Is your name resolution set up properly?
 
pvecm status:

Quorum information
------------------
Date: Wed Sep 13 15:13:02 2017
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000005
Ring ID: 2/532
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 172.30.10.110
0x00000005          1 172.30.10.111 (local)
0x00000001          1 172.30.10.112
0x00000003          1 172.30.10.113
0x00000004          1 172.30.10.114
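
The vote figures in this output are consistent with a simple majority rule: with 5 expected votes, 3 are needed for quorum. A tiny Python sketch of that arithmetic, assuming default votequorum behaviour with no special options (such as two_node) configured; the function name is made up:

```python
def quorum_threshold(expected_votes):
    """Smallest number of votes that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1

expected = 5  # "Expected votes" in the pvecm status output
total = 5     # "Total votes" in the pvecm status output
needed = quorum_threshold(expected)
print(needed, total >= needed)  # 3 True
```

With all five nodes voting, the cluster is quorate, so at this point the remaining problem is not a quorum problem.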

I also noticed that I cannot reach the web interface of this node (https://192.168.2.111:8006).
 
Check your network config and naming. In the worst case, remove the node from the cluster and do a clean re-install.
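
One way to check the naming part is to compare what each hostname resolves to with the address it is expected to have. The following Python sketch is only an illustration: the helper name is hypothetical, and the expected mapping has to be supplied by you (for this cluster, presumably node-110 to node-114 on 172.30.10.110 to 172.30.10.114, as inferred from the pvecm output).

```python
import socket

def check_names(expected, resolve=socket.gethostbyname):
    """Compare what each hostname resolves to against the address it should have.

    expected maps hostname -> expected IPv4 address. The result maps each
    mismatching hostname to (expected, actual); actual is None when the
    name does not resolve at all.
    """
    mismatches = {}
    for host, want in expected.items():
        try:
            got = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            got = None
        if got != want:
            mismatches[host] = (want, got)
    return mismatches
```

Run on every node, for example as check_names({"node-111": "172.30.10.111"}), an empty result means the names resolve as expected there. A leftover /etc/hosts entry pointing at the old 192.168.2.111 address would show up as a mismatch.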
 
All nodes can ping each other, and the names are the same on all nodes. I guess I will actually have to remove the node from the cluster, reinstall it, and add it again.
Is there a safe procedure for removing a node from the cluster or disconnecting it? Or should I just reinstall the server and add it again?