What could cause this?

adamb

Hey guys, we did some site failover testing this past weekend.

When bringing the clusters back up at our main site, we found that corosync would fail to start and the nodes wouldn't join the cluster.

After lots of head banging, the only way we could get this corrected was to add an entry for each node to the hosts file on all cluster nodes. We typically only have an entry in the hosts file for that specific node, not all of them.

Here is what we used to have:

10.211.45.4 bunkmiscrit1.ccs.com bunkmiscrit1 pvelocalhost

Here is what we have now:

10.211.45.4 bunkmiscrit1.ccs.com bunkmiscrit1 pvelocalhost
10.211.45.5 bunkmiscrit2.ccs.com bunkmiscrit2 pvelocalhost
10.211.45.6 bunkmiscrit3.ccs.com bunkmiscrit3 pvelocalhost

We have 30 or so clusters out in the field on older Proxmox 5 versions that have just a single hosts file entry. I also have 2 in-house test clusters that only have the single hosts file entry, and they still work.

Any ideas on what I could be missing?
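
One way to compare the two hosts files above is to list which cluster node names have no entry at all. The following is a minimal sketch, not a Proxmox tool: the function names are made up for illustration, and it only looks at the text of a hosts file, not at DNS.

```python
def names_in_hosts(hosts_text):
    """Collect every hostname and alias that appears in hosts-file text."""
    names = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2:                    # first field is the IP, the rest are names
            names.update(fields[1:])
    return names


def missing_from_hosts(node_names, hosts_text):
    """Return the node names that have no entry in the hosts-file text."""
    known = names_in_hosts(hosts_text)
    return [name for name in node_names if name not in known]


# Hosts file as it looked before the change (single entry for the local node).
OLD_HOSTS = "10.211.45.4 bunkmiscrit1.ccs.com bunkmiscrit1 pvelocalhost\n"
NODES = ["bunkmiscrit1", "bunkmiscrit2", "bunkmiscrit3"]

print(missing_from_hosts(NODES, OLD_HOSTS))
```

On a real node the text would come from reading /etc/hosts; any name reported here has to be resolved some other way, typically via DNS.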
 
Hmm - do all of the nodes also have working DNS entries?
Please also check /etc/corosync/corosync.conf on all nodes. Recent PVE versions write the IP addresses there (to prevent the issue you were experiencing), but AFAIR older versions wrote the node names there (in which case working name resolution via /etc/hosts or DNS was needed).

I hope this helps!
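
The check suggested above (node names versus IP addresses in corosync.conf) can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, the parser is a simple regular expression rather than corosync's own, and the sample text is a shortened, made-up nodelist.

```python
import ipaddress
import re


def ring_addresses(conf_text):
    """Map node name -> ring0_addr for each node block in corosync.conf text."""
    nodes = {}
    for block in re.findall(r"\bnode\s*\{(.*?)\}", conf_text, flags=re.DOTALL):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        if "ring0_addr" in fields:
            nodes[fields.get("name", fields["ring0_addr"])] = fields["ring0_addr"]
    return nodes


def is_ip_literal(value):
    """True if value is an IPv4/IPv6 literal, False if it is a name needing resolution."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


# Made-up sample: one node addressed by name, one by IP.
SAMPLE = """
nodelist {
  node {
    name: nodea
    nodeid: 1
    ring0_addr: nodea
  }
  node {
    name: nodeb
    nodeid: 2
    ring0_addr: 10.0.0.2
  }
}
"""

for name, addr in ring_addresses(SAMPLE).items():
    kind = "IP address" if is_ip_literal(addr) else "hostname (needs /etc/hosts or DNS)"
    print(f"{name}: ring0_addr={addr} -> {kind}")
```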
 
They have working DNS entries for their LAN IPs, but not for these 10.211.45.x backend cluster IPs. corosync.conf on all the nodes has the hostname, not the IP. Here is an example from one node.

root@bunkmiscrit1:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: bunkmiscrit3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: bunkmiscrit3
  }

  node {
    name: bunkmiscrit2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: bunkmiscrit2
  }

  node {
    name: bunkmiscrit1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: bunkmiscrit1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: bunkmiscrit
  config_version: 3
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.211.45.4
    ringnumber: 0
  }

}

So based on this, we definitely need each node defined in /etc/hosts, otherwise it's actually forming the cluster over our LAN network? It's deceiving, as the output of pvecm status shows our backend IPs and not our LAN IPs.

0x00000001 1 10.211.45.4 (local)
0x00000002 1 10.211.45.5
0x00000003 1 10.211.45.6
 
I've confused myself even further.

On my test cluster, each node only has itself defined in the hosts file. I also removed the entry from /etc/resolv.conf, so DNS is broken as well.

root@testprox1:~# ping testprox1
PING testprox1.ccs.com (10.211.45.1) 56(84) bytes of data.
64 bytes from testprox1.ccs.com (10.211.45.1): icmp_seq=1 ttl=64 time=0.024 ms
64 bytes from testprox1.ccs.com (10.211.45.1): icmp_seq=2 ttl=64 time=0.018 ms
--- testprox1.ccs.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1007ms
rtt min/avg/max/mdev = 0.018/0.021/0.024/0.003 ms

root@testprox1:~# ping testprox2
ping: testprox2: Temporary failure in name resolution

root@testprox1:~# ping testprox3
ping: testprox3: Temporary failure in name resolution

/etc/corosync/corosync.conf is using hostnames.

However, the cluster still comes up and is A-OK. I don't understand how it's doing that if it can't resolve the hostnames in /etc/corosync/corosync.conf.
 
Ok I reproduced the issue.

I actually have to have a DNS server in /etc/resolv.conf that isn't working. If the one listed there isn't working, DNS requests hang, whereas having no DNS server at all is an instant failure.
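
The difference between a hanging lookup and an instant failure can be measured directly. Below is a rough sketch using Python's standard `socket.getaddrinfo`, which goes through the system resolver; `timed_lookup` is a hypothetical helper name, and this is not how corosync itself performs lookups.

```python
import socket
import time


def timed_lookup(name):
    """Resolve name via the system resolver and report how long it took.

    Returns (result, seconds), where result is a sorted list of IPv4
    addresses on success or the socket.gaierror on failure.
    """
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
        result = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result = exc
    return result, time.monotonic() - start
```

Calling `timed_lookup("testprox2")` on a node should return almost immediately when no DNS server is configured, and only after the resolver's timeout when the configured server does not answer.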

I still have the question though: if corosync.conf has hostnames, /etc/hosts doesn't have all the nodes, and a DNS server isn't configured, how is corosync determining the IPs of the other nodes?
 
Hi,

Through the normal Linux name resolution: hosts first, then DNS. My 2 (euro) cents: use both (hosts and DNS) for all of your pmx nodes. It is also better to have 2 different DNS servers.


Good luck / Bafta !
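
On glibc-based systems the lookup order mentioned above is set by the `hosts:` line in /etc/nsswitch.conf, where `files` stands for /etc/hosts and `dns` for the servers in /etc/resolv.conf. A small sketch for reading that order follows; `hosts_lookup_order` is a hypothetical helper name, and the sample text is an assumed typical configuration, not one taken from these nodes.

```python
def hosts_lookup_order(nsswitch_text):
    """Return the service names on the 'hosts:' line of nsswitch.conf text, in order."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments
        if line.startswith("hosts:"):
            tokens = line[len("hosts:"):].split()
            # Skip action items such as [NOTFOUND=return]; keep service names only.
            return [tok for tok in tokens if not tok.startswith("[")]
    return []


# Assumed example content; read /etc/nsswitch.conf on a real node instead.
SAMPLE_NSSWITCH = """
passwd:  files
hosts:   files dns
"""

print(hosts_lookup_order(SAMPLE_NSSWITCH))
```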
 
But that's not the case. I don't have them configured in the hosts file and there is no DNS server set, yet corosync is still working.

It seems to me that if a DNS server is set but not working, it shouldn't prevent the nodes from joining the cluster. It should behave just like a DNS server not being present.

Seems like a bug to me.
 
Maybe it is working for only some of your pmx hosts but not for all, or maybe the DNS results were cached (TTL) and the DNS server broke afterwards. Linux resolution is simple (by default): DNS and hosts.

Anyway, how a misconfigured server managed to work is not so important. What matters is how to correctly configure your pmx cluster.
 
Yep, it is simple, which is why I am here scratching my head.

If a DNS server is set but not working, and the hosts file doesn't have entries for each node, corosync fails to start.

If no DNS server is set and the hosts file doesn't have entries for each node, corosync works 100%.
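
To tell which of these two situations a given node is in, it helps to check whether /etc/resolv.conf lists any nameserver at all. The sketch below only encodes the name-lookup behaviour reported in this thread; both function names are hypothetical, and it does not explain why corosync itself still starts in the second case.

```python
def configured_nameservers(resolv_text):
    """Return the addresses on active 'nameserver' lines of resolv.conf text."""
    servers = []
    for line in resolv_text.splitlines():
        fields = line.split()
        # Commented-out lines start with '#' or ';' and therefore never match.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def expected_lookup_behaviour(resolv_text, all_nodes_in_hosts):
    """Describe how node-name lookups behave, per the observations in this thread."""
    if all_nodes_in_hosts:
        return "resolved from /etc/hosts"
    if configured_nameservers(resolv_text):
        return "sent to DNS; may hang if the server does not answer"
    return "fails quickly; no DNS server configured"


# Assumed example: a nameserver is listed, but not every node is in /etc/hosts.
print(expected_lookup_behaviour("nameserver 192.0.2.53\n", all_nodes_in_hosts=False))
```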
 
