Node in my cluster is always offline

integart

Sep 13, 2017
Hi,
I keep having a problem with one node in the cluster. The web interface always shows it as offline, although the server itself is running.
I noticed that there is probably a problem with the corosync service.

service corosync status
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: failed (Result: exit-code) since Wed 2017-09-13 10:50:16 CEST; 8min ago
Process: 19975 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Sep 13 10:49:15 node-111 corosync[19982]: [QB] server name: cmap
Sep 13 10:49:15 node-111 corosync[19982]: [SERV] Service engine loaded: corosync configuration service [1]
Sep 13 10:49:15 node-111 corosync[19982]: [QB] server name: cfg
Sep 13 10:49:15 node-111 corosync[19982]: [SERV] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 13 10:49:15 node-111 corosync[19982]: [QB] server name: cpg
Sep 13 10:49:15 node-111 corosync[19982]: [SERV] Service loading: corosync loading service [4]
Sep 13 10:50:16 node-111 corosync[19975]: Starting Corosync Cluster Engine (corosync): [FAILED]
Sep 13 10:50:16 node-111 systemd[1]: corosync.service: control process exited, code=exited status=1
Sep 13 10:50:16 node-111 systemd[1]: Failed to start Corosync Cluster Engine.
Sep 13 10:50:16 node-111 systemd[1]: Unit corosync.service entered failed state.


My corosync configuration looks like this:
cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node-112
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node-112
  }

  node {
    name: node-110
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node-110
  }

  node {
    name: node-114
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.30.10.114
  }

  node {
    name: node-113
    nodeid: 3
    quorum_votes: 1
    ring0_addr: node-113
  }

  node {
    name: node-111
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.2.111
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: INTEGART
  config_version: 9
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 172.30.10.112
    ringnumber: 0
  }
}


By comparison, a node that works properly has this corosync configuration:

cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node-112
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node-112
  }

  node {
    name: node-110
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node-110
  }

  node {
    name: node-114
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.30.10.114
  }

  node {
    name: node-113
    nodeid: 3
    quorum_votes: 1
    ring0_addr: node-113
  }

  node {
    name: node-111
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.2.111
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: INTEGART
  config_version: 9
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 172.30.10.112
    ringnumber: 0
  }
}

Please help.
 
node {
name: node-111
nodeid: 5
quorum_votes: 1
ring0_addr: 192.168.2.111
}
This node has a completely different IP than the bindnetaddr. I presume the other nodes also use 172.30.xxx.xxx addresses. Is node-111 also the failing one?
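To make the mismatch easy to spot, here is a minimal Python sketch that compares each ring0_addr from the posted nodelist against the network of the bindnetaddr. The /24 prefix length is an assumption (the thread never states the netmask), and the function name `classify` is just an illustrative helper. Entries given as hostnames are only listed, not resolved.

```python
import ipaddress

# ring0_addr values copied from the nodelist posted above.
RING0_ADDRS = {
    "node-112": "node-112",
    "node-110": "node-110",
    "node-114": "172.30.10.114",
    "node-113": "node-113",
    "node-111": "192.168.2.111",
}
BINDNETADDR = "172.30.10.112"
PREFIX_LEN = 24  # assumption: the netmask is not given in the thread


def classify(ring0_addrs, bindnetaddr, prefix_len):
    """Group nodes by whether their ring0_addr lies in the bind network."""
    network = ipaddress.ip_network(f"{bindnetaddr}/{prefix_len}", strict=False)
    result = {"inside": [], "outside": [], "hostname": []}
    for name, addr in ring0_addrs.items():
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            # Not a literal IP, so it depends on name resolution instead.
            result["hostname"].append(name)
            continue
        result["inside" if ip in network else "outside"].append(name)
    return result


report = classify(RING0_ADDRS, BINDNETADDR, PREFIX_LEN)
print(report)
```

Under that /24 assumption, only node-111 lands in the "outside" group. The hostname entries would still need a separate name resolution check.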
 
I noticed that the address is set incorrectly. But how do I change it? And can this be done on a running cluster?
 
If your network and naming setup of that node is correct, then in the simplest case you need to change the address in the corosync.conf.
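As a rough illustration of what such an edit involves, the sketch below rewrites the ring0_addr of one node in a copy of the configuration text and increments config_version, which the Proxmox cluster documentation asks for on every change to corosync.conf. The helper `set_ring0_addr` and the shortened `SAMPLE` text are hypothetical; the sketch works on a string only and does not touch /etc/pve/corosync.conf. The new address 172.30.10.111 is an assumption based on the pattern of the other nodes.

```python
def set_ring0_addr(conf_text, node_name, new_addr):
    """Return conf_text with node_name's ring0_addr replaced and config_version + 1."""
    lines = conf_text.splitlines()
    current_node = None
    changed = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        indent = line[: len(line) - len(line.lstrip())]
        if stripped.startswith("name:"):
            # Remember which node block we are in ("cluster_name:" does not match).
            current_node = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("ring0_addr:") and current_node == node_name:
            lines[i] = f"{indent}ring0_addr: {new_addr}"
            changed = True
        elif stripped.startswith("config_version:"):
            version = int(stripped.split(":", 1)[1])
            lines[i] = f"{indent}config_version: {version + 1}"
    if not changed:
        raise ValueError(f"node {node_name!r} not found")
    return "\n".join(lines) + "\n"


# Shortened excerpt of the posted configuration, for demonstration only.
SAMPLE = """\
nodelist {
  node {
    name: node-111
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.2.111
  }
}

totem {
  cluster_name: INTEGART
  config_version: 9
}
"""

updated = set_ring0_addr(SAMPLE, "node-111", "172.30.10.111")
print(updated)
```

Working on a copy and reviewing the result before putting it in place is the safer route, since a broken corosync.conf can affect the whole cluster.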
 
I changed the address to the correct one (172.30.10.111) and restarted the server, but after the reboot the previous address (192.168.2.111) was back.
 
Okay, thank you, there is some progress. However, when I log in to another node through the web interface and try to open the node that now shows as online, I get the messages shown in the attachments.
 

Attachments

  • Bez tytułu.png (72.3 KB)
  • Bez tytułu2.png (85 KB)
  • Bez tytułu3.png (74.9 KB)
What does "pvecm status" give you? Is your name resolution set up properly?
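One way to check name resolution is to compare what each node name resolves to against the address the cluster is expected to use. The Python sketch below does that with `socket.gethostbyname`. The `EXPECTED` table uses only the two addresses stated in this thread, and `stale_resolver` is a hypothetical stand-in that simulates a leftover 192.168.2.111 entry so the example runs without network access. On a real node you would call `check_resolution(EXPECTED)` with the default resolver.

```python
import socket

# Addresses mentioned in the thread; extend with the other nodes as needed.
EXPECTED = {
    "node-111": "172.30.10.111",
    "node-114": "172.30.10.114",
}


def check_resolution(expected, resolve=socket.gethostbyname):
    """Return {name: resolved_value} for every name that does not match."""
    mismatches = {}
    for name, want in expected.items():
        try:
            got = resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            got = f"unresolved ({exc})"
        if got != want:
            mismatches[name] = got
    return mismatches


def stale_resolver(name):
    """Hypothetical resolver simulating an outdated hosts entry for node-111."""
    table = {"node-111": "192.168.2.111", "node-114": "172.30.10.114"}
    return table[name]


mismatches = check_resolution(EXPECTED, resolve=stale_resolver)
print(mismatches)
```

Running the same check on every node would show whether they all agree on the names, which matters because several ring0_addr entries in the configuration are hostnames.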
 
pvecm status:

Quorum information
------------------
Date: Wed Sep 13 15:13:02 2017
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000005
Ring ID: 2/532
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.30.10.110
0x00000005 1 172.30.10.111 (local)
0x00000001 1 172.30.10.112
0x00000003 1 172.30.10.113
0x00000004 1 172.30.10.114

I also noticed that I cannot reach the web interface of this node (https://192.168.2.111:8006).
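For reference, the "Quorum: 3" line in the status output above is consistent with a simple majority of the expected votes. The Python sketch below reproduces that arithmetic from the posted membership list. The majority formula is the usual one for vote-based quorum, and `majority_quorum` is an illustrative helper, not part of any Proxmox or corosync tool.

```python
def majority_quorum(expected_votes):
    """Smallest number of votes that is more than half of expected_votes."""
    return expected_votes // 2 + 1


# Votes per node ID, copied from the membership information above.
VOTES = {
    "0x00000002": 1,
    "0x00000005": 1,
    "0x00000001": 1,
    "0x00000003": 1,
    "0x00000004": 1,
}
EXPECTED_VOTES = 5

total_votes = sum(VOTES.values())
quorum = majority_quorum(EXPECTED_VOTES)
quorate = total_votes >= quorum
spare_votes = total_votes - quorum  # votes that can be lost while staying quorate

print(f"total={total_votes} quorum={quorum} quorate={quorate} spare={spare_votes}")
```

With all five votes present, the cluster stays quorate even if two nodes drop out, so corosync membership itself looks healthy at this point in the thread.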
 
Check your network config and naming. In the worst case, remove the node from the cluster and do a clean re-install.
 
All nodes can ping each other, and the names are the same on all nodes. I guess I will actually have to remove the node from the cluster, reinstall it, and add it again.
Is there a safe procedure for removing a node from the cluster or disconnecting it? Or should I just reinstall the server and add it again?
 
