Node in my cluster is always offline

integart · Sep 13, 2017

Hi,
I keep having a problem with one node in the cluster. The web interface always shows it as offline, even though the server itself is running.
I noticed that the problem is probably with the corosync service.

service corosync status
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: failed (Result: exit-code) since Wed 2017-09-13 10:50:16 CEST; 8min ago
Process: 19975 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Sep 13 10:49:15 node-111 corosync [19982]: [QB] server name: cmap
Sep 13 10:49:15 node-111 corosync [19982]: [SERV] Service engine loaded: corosync configuration service [1]
Sep 13 10:49:15 node-111 corosync [19982]: [QB] server name: cfg
Sep 13 10:49:15 node-111 corosync [19982]: [SERV] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 13 10:49:15 node-111 corosync [19982]: [QB] server name: cpg
Sep 13 10:49:15 node-111 corosync [19982]: [SERV] Service loading: corosync loading service [4]
Sep 13 10:50:16 node-111 corosync [19975]: Starting Corosync Cluster Engine (corosync): [FAILED]
Sep 13 10:50:16 node-111 systemd [1]: corosync.service: control process exited, code=exited status=1
Sep 13 10:50:16 node-111 systemd [1]: Failed to start Corosync Cluster Engine.
Sep 13 10:50:16 node-111 systemd [1]: Unit corosync.service entered failed state.

My corosync configuration looks like this:
cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node-112
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node-112
  }

  node {
    name: node-110
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node-110
  }

  node {
    name: node-114
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.30.10.114
  }

  node {
    name: node-113
    nodeid: 3
    quorum_votes: 1
    ring0_addr: node-113
  }

  node {
    name: node-111
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.2.111
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: INTEGART
  config_version: 9
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 172.30.10.112
    ringnumber: 0
  }
}

For comparison, a node that works properly has this corosync configuration:

cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node-112
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node-112
  }

  node {
    name: node-110
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node-110
  }

  node {
    name: node-114
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.30.10.114
  }

  node {
    name: node-113
    nodeid: 3
    quorum_votes: 1
    ring0_addr: node-113
  }

  node {
    name: node-111
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.2.111
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: INTEGART
  config_version: 9
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 172.30.10.112
    ringnumber: 0
  }
}

Please help.

Alwin · Sep 13, 2017

integart said:
node {
name: node-111
nodeid: 5
quorum_votes: 1
ring0_addr: 192.168.2.111
}

This node has a completely different IP than the bindnetaddr. I presume the other nodes also use a 172.30.xxx.xxx address. Is node-111 also the failing one?
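
The mismatch described above can also be checked mechanically. What follows is a minimal Python sketch, not a Proxmox tool: the function names parse_ring0_addrs and find_off_network are made up for illustration, and the /24 prefix length is an assumption, since the thread never states the netmask. It flags nodes whose ring0_addr is a literal IP outside the bindnetaddr network and skips entries that are hostnames.

```python
import ipaddress
import re

# Shortened copy of the configuration posted above (three of the five nodes).
SAMPLE_CONF = """\
nodelist {
  node {
    name: node-112
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node-112
  }
  node {
    name: node-114
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.30.10.114
  }
  node {
    name: node-111
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.2.111
  }
}
totem {
  interface {
    bindnetaddr: 172.30.10.112
    ringnumber: 0
  }
}
"""


def parse_ring0_addrs(conf_text):
    """Return {node name: ring0_addr} for every node block in the text."""
    nodes = {}
    # Node blocks contain no nested braces, so a non-greedy match is enough.
    for block in re.findall(r"\bnode\s*\{(.*?)\}", conf_text, flags=re.S):
        fields = dict(
            re.findall(r"^[ \t]*(\w+):[ \t]*(\S+)[ \t]*$", block, flags=re.M)
        )
        if "name" in fields and "ring0_addr" in fields:
            nodes[fields["name"]] = fields["ring0_addr"]
    return nodes


def find_off_network(conf_text, prefix_len=24):
    """Return nodes whose literal ring0_addr lies outside the bindnetaddr network.

    prefix_len=24 is an assumption; pass the real netmask length if it differs.
    """
    match = re.search(r"^[ \t]*bindnetaddr:[ \t]*(\S+)", conf_text, flags=re.M)
    if match is None:
        raise ValueError("no bindnetaddr found in configuration")
    network = ipaddress.ip_network(f"{match.group(1)}/{prefix_len}", strict=False)

    off_network = {}
    for name, addr in parse_ring0_addrs(conf_text).items():
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            continue  # hostname such as node-112; name resolution is out of scope
        if ip not in network:
            off_network[name] = addr
    return off_network


if __name__ == "__main__":
    print(find_off_network(SAMPLE_CONF))  # prints {'node-111': '192.168.2.111'}
```

Hostnames such as node-112 are skipped on purpose, so a clean result here does not prove that name resolution is correct on every node.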

integart · Sep 13, 2017

I noticed that it is misconfigured. But how do I change it? And can it be done on a running cluster?

Alwin · Sep 13, 2017

If your network and naming setup of that node is correct, then in the simplest case you need to change the address in the corosync.conf.

integart · Sep 13, 2017

Okay, but only on the non-functioning node? Does the change propagate to all nodes automatically?

integart · Sep 13, 2017

I changed the address to the correct one (172.30.10.111) and restarted the server, but after the reboot the previous address (192.168.2.111) was back.

integart · Sep 13, 2017

It still does not work.

Alwin · Sep 13, 2017

The IP needs to be changed on the failing node and on one node of the functioning cluster, in /etc/pve/corosync.conf.
Also check on the failed node that /etc/corosync/corosync.conf has the same content.
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_manager
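
As an illustration of the edit itself, here is a minimal Python sketch that rewrites one node's ring0_addr in a corosync.conf text and raises config_version by one, which the admin guide linked above asks for after each change to the file, as far as I know. The helper name set_ring0_addr is hypothetical, and the sketch only transforms a string; writing the result back to /etc/pve/corosync.conf and /etc/corosync/corosync.conf is left out deliberately.

```python
import re

# Shortened copy of the configuration from the first post.
SAMPLE_CONF = """\
nodelist {
  node {
    name: node-114
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.30.10.114
  }
  node {
    name: node-111
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.2.111
  }
}
totem {
  cluster_name: INTEGART
  config_version: 9
  interface {
    bindnetaddr: 172.30.10.112
    ringnumber: 0
  }
}
"""


def set_ring0_addr(conf_text, node_name, new_addr):
    """Return conf_text with node_name's ring0_addr replaced.

    config_version is incremented only if something actually changed.
    """
    name_line = re.compile(
        rf"^[ \t]*name:[ \t]*{re.escape(node_name)}[ \t]*$", flags=re.M
    )

    def fix_node(match):
        block = match.group(0)
        if not name_line.search(block):
            return block  # a different node, leave it untouched
        return re.sub(
            r"^([ \t]*ring0_addr:[ \t]*)\S+",
            lambda m: m.group(1) + new_addr,
            block,
            flags=re.M,
        )

    # Node blocks contain no nested braces, so a non-greedy match is enough.
    updated = re.sub(r"\bnode\s*\{.*?\}", fix_node, conf_text, flags=re.S)
    if updated == conf_text:
        return conf_text

    # Bump the version so the cluster can tell the file is newer.
    return re.sub(
        r"^([ \t]*config_version:[ \t]*)(\d+)",
        lambda m: m.group(1) + str(int(m.group(2)) + 1),
        updated,
        flags=re.M,
    )


if __name__ == "__main__":
    print(set_ring0_addr(SAMPLE_CONF, "node-111", "172.30.10.111"))
```

With the sample above, the node-111 block ends up with ring0_addr: 172.30.10.111 and the totem section with config_version: 10; all other lines are unchanged.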

integart · Sep 13, 2017

Okay, thank you, there is some progress. However, when I log in to another node through the web interface and try to open the node that now shows as online, I get the messages shown in the attachment.

Alwin · Sep 13, 2017

What does "pvecm status" give you? Is your name resolution set up properly?

integart · Sep 13, 2017

pvecm status:

Quorum information
------------------
Date: Wed Sep 13 15:13:02 2017
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000005
Ring ID: 2/532
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 172.30.10.110
0x00000005          1 172.30.10.111 (local)
0x00000001          1 172.30.10.112
0x00000003          1 172.30.10.113
0x00000004          1 172.30.10.114
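
For readers wondering where the "Quorum: 3" line in this output comes from: with five expected votes, a simple majority is three. A minimal Python sketch of that arithmetic follows; the function names are made up for illustration, and it ignores special votequorum options such as two_node.

```python
def majority_quorum(expected_votes):
    """Smallest number of votes that is more than half of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(total_votes, expected_votes):
    """True if the votes currently present reach the majority threshold."""
    return total_votes >= majority_quorum(expected_votes)


if __name__ == "__main__":
    print(majority_quorum(5))  # prints 3
    print(is_quorate(5, 5))    # prints True
    print(is_quorate(2, 5))    # prints False
```

Under this rule the five-node cluster stays quorate with up to two nodes offline.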

I also noticed that I cannot reach the web interface of this node (https://192.168.2.111:8006).

Alwin · Sep 13, 2017

Check your network config and naming. In the worst case, remove the node from the cluster and do a clean re-install.

integart · Sep 14, 2017

All nodes can ping each other. The names are the same on all nodes. I guess I will actually have to remove the node from the cluster, reinstall it, and add it again.
Is there a safe procedure to remove a node from the cluster or unplug it? Or should I just reinstall the server and add it again?

Alwin · Sep 14, 2017

Check out our docs for it.

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node
