Corosync redundancy over a second NIC with a public IP

TheMrg · Aug 1, 2019

We have 3 nodes. Each has eth0 with a private IP and eth1 with a public IP. No more NICs are possible.

Private net: 192.168.0.0 - eth0
Public: different public IPs, but all in the same datacenter - eth1
We are not able to change this. Two private networks are not possible due to datacenter restrictions.

We set up the nodes with:
pvecm add 192.168.0.1 -link0 192.168.0.2

Sadly, if eth0 has a lot of traffic, we sometimes get:
Aug 01 20:43:07 storage1 corosync[3283]: [KNET ] link: host: 4 link: 0 is down
Aug 01 20:43:07 storage1 corosync[3283]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 01 20:43:07 storage1 corosync[3283]: [KNET ] host: host: 4 has no active links
Aug 01 20:43:07 storage1 corosync[3283]: [TOTEM ] Token has not been received in 61 ms
Aug 01 20:43:07 storage1 corosync[3283]: [TOTEM ] Retransmit List: e89
Aug 01 20:43:07 storage1 corosync[3283]: [TOTEM ] Retransmit List: e89
Aug 01 20:43:07 storage1 corosync[3283]: [TOTEM ] Retransmit List: e89
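
To see how often this happens, the KNET "is down" lines can be tallied per host and link. A minimal sketch, assuming the log lines have the shape shown above; the function name count_link_down is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read corosync log lines on stdin and count
# "link: N is down" events per (host, link) pair.
count_link_down() {
    awk '
        /KNET/ && /is down/ {
            h = ""; l = ""
            # Scan for "host: <number>" and "link: <number>" token pairs,
            # so the exact column positions do not matter.
            for (i = 1; i < NF; i++) {
                if ($i == "host:" && $(i + 1) ~ /^[0-9]+$/) h = $(i + 1)
                if ($i == "link:" && $(i + 1) ~ /^[0-9]+$/) l = $(i + 1)
            }
            if (h != "" && l != "") count[h " " l]++
        }
        END {
            for (k in count) {
                split(k, p, " ")
                printf "host=%s link=%s count=%d\n", p[1], p[2], count[k]
            }
        }
    ' | sort
}
```

On a node it could be fed from the journal, for example `journalctl -u corosync | count_link_down`, assuming corosync logs to the systemd journal there.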

So we want to know how to send the corosync traffic via eth1 (public IP),
or how to set up redundancy as described in
https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy
so that traffic is sent via eth1 if eth0 is slow.

How can we do this, so that when eth0 traffic is high, corosync does not lose its links and sometimes kill the nodes?

Thanks

schinzelh · Aug 2, 2019

Hi,

we have a similar setup up and running: ring0 with private IP over eth0, ring1 with public IP over eth1. We only got it working with public IPs being from the same /24 subnet, so it depends on what public IPs you have assigned.

So e.g.

103.43.75.180
103.43.75.190
103.43.75.232

works for us, while

103.43.75.180
103.43.75.190
103.43.88.232

does not.
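
The two address lists above can be checked mechanically. A small sketch, assuming plain dotted IPv4 addresses with no input validation; the function name same_slash24 is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print "same" if all given IPv4 addresses share
# their first three octets (the same /24), otherwise print "different".
same_slash24() {
    first=""
    for ip in "$@"; do
        prefix=${ip%.*}            # drop the last octet
        if [ -z "$first" ]; then
            first=$prefix          # remember the first prefix seen
        elif [ "$prefix" != "$first" ]; then
            echo different
            return 1
        fi
    done
    echo same
}
```

With the addresses from this post, `same_slash24 103.43.75.180 103.43.75.190 103.43.75.232` reports the first set as one /24, while swapping in 103.43.88.232 does not.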

TheMrg · Aug 2, 2019

We have IPs like 52.78.. 52.76 and so on.

schinzelh · Aug 2, 2019

Like I wrote: we only got it working with IPs from the same /24 subnet. Using a /16 with corosync 2 did not work for us, but maybe things have changed with corosync 3.

TheMrg · Aug 2, 2019

We use Proxmox 6, which comes with corosync 3, we think.

guletz · Aug 2, 2019

Hi,

Maybe you can try one of these two variants:

- create a VPN over the external interface, using a /24 address range (UDP VPN), for the second ring
- use a DSCP (flash) mark for corosync traffic (this is what I use and I do not see any corosync problems, but I mark with DSCP at the switch level)

Good luck
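
For the VPN variant, each node entry in corosync.conf would get a second link address (ring1_addr) pointing at the node's VPN IP. A rough sketch that only prints such an entry for review; the helper name node_stanza, the node id, and the 10.99.0.1 VPN address are assumptions, not values from this thread:

```shell
#!/bin/sh
# Hypothetical helper: print a corosync nodelist entry with two links.
# Arguments: node name, node id, link 0 address, link 1 address.
node_stanza() {
    cat <<EOF
  node {
    name: $1
    nodeid: $2
    quorum_votes: 1
    ring0_addr: $3
    ring1_addr: $4
  }
EOF
}

# Example: private address on link 0, assumed VPN address on link 1.
node_stanza storage1 1 192.168.0.1 10.99.0.1
```

The output is meant for review only. According to the Cluster Manager documentation linked earlier in the thread, edits to /etc/pve/corosync.conf also need an increased config_version to take effect.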

TheMrg · Aug 2, 2019

Thanks. Any suggestions for the VPN?
We tested with tinc,
but it uses a lot of CPU when we test high traffic on this net.

Second: sadly we have no access to the infrastructure in our datacenter, so a VPN is the only way we see.

guletz · Aug 3, 2019

TheMrg said:
Any suggestions for the VPN?
We tested with tinc,
but it uses a lot of CPU when we test high traffic on this

No. You will use this VPN only for corosync, and not for anything else. Corosync does not generate a lot of traffic, so I cannot imagine how you could get high CPU usage with any UDP VPN!

schinzelh · Aug 3, 2019

TheMrg said:
We have IPs like 52.78.. 52.76 and so on.

Maybe a bit offtopic but just a suggestion:

Judging by the IP range, you are using Amazon AWS [1][2], which is not a good platform to run Proxmox on: you are trying to run a hypervisor (Proxmox/KVM) on virtualized hardware (EC2/Xen), so if CPU load on the host is high, your virtualized network cards will start to lag, which is exactly what you are seeing.

I'd suggest looking for a different datacenter where you can rent bare-metal dedicated servers; most probably you will get 2 private networks/NICs there as well.

[1] http://geoiplookup.net/ip/52.78.0.0
[2] http://geoiplookup.net/ip/52.76.0.0

TheMrg · Aug 4, 2019

Thanks. Corosync on its own works well.
Thanks, the IPs are just examples.
Corosync is all fine with tinc now.

But we have other "Problems".

We use eth0 for corosync, and eth1 is used for the Proxmox internal net (communication between hosts/nodes) and is also the internal net for the KVM VMs.
So if eth1 is down, corosync (on eth0) says all is fine, but it is not:
The GUI does not find the other nodes.
The internal net between VMs is offline.
So HA VMs are also not restarted on other hosts, because corosync says all is fine. No fencing is done.

How can we solve this?

Chriswiss · Aug 18, 2019

Good evening,

I have exactly the same problem, and I haven't found a solution.

I think this problem comes from the hosts file.

So, what is the point of several Corosync links?

Chris

Dominic · Aug 19, 2019

Chriswiss said:
So, what is the point of several Corosync links?

Besides the obvious benefit that the cluster keeps working after a switch failure, maintenance of your systems also becomes easier. A firmware upgrade of a switch, for example, can be done on a running cluster with no downtime, as the other ring handles the traffic in the meantime. (https://pve.proxmox.com/wiki/Separate_Cluster_Network)

Chriswiss · Aug 19, 2019

Hi Dominic,

I misspoke. That's what I'm thinking about.
--> https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy

Chris

Dominic · Aug 20, 2019

Chriswiss said:
I misspoke. That's what I'm thinking about.
--> https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy

I think we're talking at cross purposes.

ringX_addr actually specifies a corosync link address; the name "ring" is a remnant of older corosync versions that is kept for backwards compatibility.
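
To make that mapping visible, the ringX_addr entries of a config can be listed as link numbers per node. A minimal sketch, assuming each node block lists name: before its ringX_addr lines; the function name list_link_addrs is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read a corosync.conf on stdin and print
# "<node name> link<N> <address>" for every ringN_addr entry.
list_link_addrs() {
    awk '
        $1 == "name:" { name = $2 }      # remember the current node name
        $1 ~ /^ring[0-9]+_addr:$/ {
            n = $1
            sub(/^ring/, "", n)          # strip the prefix ...
            sub(/_addr:$/, "", n)        # ... and suffix, leaving the number
            print name, "link" n, $2
        }
    '
}
```

It could be run as `list_link_addrs < /etc/pve/corosync.conf`; for the live state of each link, `corosync-cfgtool -s` on a node is the usual tool.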

TheMrg said:
Corosync is all fine with tinc now.

But we have other "Problems".

To me, this sounds like a good moment to open a new thread.

TheMrg said:
How can we solve this?

Could you provide the (redacted) output of cat /etc/network/interfaces and cat /etc/pve/corosync.conf?
