[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

Status
Not open for further replies.

Apollon77

Member
Sep 24, 2018
134
10
23
43
So, I also upgraded to corosync 3 again on my PVE 5 system. If it stays stable until Sunday, I will upgrade the first host to PVE 6 :)
So far I have 24 hours of stability, and only 3 "Token Retransmit list" messages over the whole day with the new version and settings (there were many more with the old config and 2.x).
 
Jul 16, 2018
18
1
3
50
@efinley


So, you also lose SSH access?
If yes, I don't think it's corosync related. Maybe a NIC driver bug, or another kernel bug.
Do you have any logs in /var/log/daemon.log or /var/log/kern.log?
Can you send your /etc/network/interfaces config?

That's what I thought initially too. But SSH is inaccessible because sshd needs files that live in the /etc/pve hierarchy. /etc/pve becomes inaccessible when corosync/pve-cluster stop functioning, so anything that touches it hangs.

*and*

restarting corosync/pve-cluster brings the node back online, so...
 

bofh

Member
Nov 7, 2017
108
10
23
40
@spirit just FYI, I found out that OVH does indeed guarantee that 100 Mbit freebie VLAN.
Well, their service quality is just shit, but it is the same as the paid 1 Gbit version.
 
Apr 12, 2018
27
0
6
33
Hello, my cluster doesn't work after upgrading corosync from 2.4.4 to 3.0.3.
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: mycluster
  config_version: 6
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  link_mode: passive
  secauth: on
  version: 2
}

node1 :
Code:
# pvecm status
Quorum information
------------------
Date:             Tue May 12 17:06:05 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.6ea5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.10.1 (local)
0x00000002          1 10.10.10.2
node2:
Code:
# pvecm status
Quorum information
------------------
Date:             Tue May 12 17:07:32 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.6eb9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.10.1
0x00000002          1 10.10.10.2 (local)
node3:
Code:
# pvecm status
Quorum information
------------------
Date:             Tue May 12 17:08:08 2020
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000003
Ring ID:          3.6ca1
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 10.10.10.3 (local)

I tried restarting the corosync and pve-cluster services.
Do you have any idea?
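
A minimal sketch for comparing outputs like the three above at a glance. It only parses the text format shown in this post; the function name `quorum_summary` is made up for illustration and is not part of any Proxmox tool.

```shell
# Summarise `pvecm status` output read from stdin: the quorate flag,
# total/expected votes and the node IDs listed under "Membership information".
quorum_summary() {
  awk '
    /^Quorate:/               { quorate = $2 }
    /^Expected votes:/        { expected = $3 }
    /^Total votes:/           { total = $3 }
    $1 ~ /^0x[0-9a-fA-F]+$/   { members = members (members ? "," : "") $1 }
    END { printf "quorate=%s votes=%s/%s members=%s\n", quorate, total, expected, members }
  '
}

# Usage on a node (not executed here):
#   pvecm status | quorum_summary
```

Run on each node, this makes the partition visible in one line per node: node1 and node2 list each other, while node3 lists only itself.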
 

gradinaruvasile

Active Member
Oct 22, 2015
66
9
28
Stop pve-ha-lrm on all nodes.
Stop pve-ha-crm on all nodes.
Stop corosync on all nodes
Then
Start corosync on all nodes
Start pve-ha-crm on all nodes
Start pve-ha-lrm on all nodes
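
The sequence above could be sketched roughly as follows. This is an illustration, not an official procedure: the helper names (`run`, `stop_cluster_services`, `start_cluster_services`) are invented here, the stop phase simply mirrors the services that the start phase brings back, and by default the commands are only printed.

```shell
# Print (default) or execute a command.
# DRY_RUN=1 only prints; set DRY_RUN=0 to really run the commands.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Stop phase: run this on every node before starting anything again.
stop_cluster_services() {
  run systemctl stop pve-ha-lrm
  run systemctl stop pve-ha-crm   # mirrors the start phase below
  run systemctl stop corosync
}

# Start phase: run this on every node once all nodes are stopped.
start_cluster_services() {
  run systemctl start corosync
  run systemctl start pve-ha-crm
  run systemctl start pve-ha-lrm
}
```

The point of the two separate functions is the "on all nodes" part: finish the stop phase everywhere first, then begin the start phase.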
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
3,260
592
133
South Tyrol/Italy
shop.maurer-it.com
Hello, my cluster doesn't work after upgrading corosync from 2.4.4 to 3.0.3.

Ensure the config is the same on all nodes, i.e., /etc/pve/corosync.conf and also the local one /etc/corosync/corosync.conf
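
One way to check this across nodes is to compare checksums of both copies. A small sketch, assuming root SSH access between nodes; the function names and the node names in the usage line are placeholders.

```shell
# Read "checksum  path" lines (md5sum output) from stdin and succeed
# only if there is at least one line and all checksums are identical.
all_checksums_equal() {
  awk '
    { seen[$1] = 1 }
    END { n = 0; for (k in seen) n++; exit (n == 1) ? 0 : 1 }
  '
}

# Collect checksums of both config copies from every node given as argument.
collect_conf_checksums() {
  for node in "$@"; do
    ssh "root@$node" md5sum /etc/pve/corosync.conf /etc/corosync/corosync.conf
  done
}

# Usage (not executed here):
#   collect_conf_checksums node1 node2 node3 | all_checksums_equal && echo "configs match"
```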
 
Apr 12, 2018
27
0
6
33
@gradinaruvasile I tried but it didn't change anything
@t.lamprecht I have the same result on all nodes:
Code:
root@node1:~# md5sum /etc/pve/corosync.conf
0164366e7424ffcdc99c881ac5c7960d  /etc/pve/corosync.conf
root@node1:~# md5sum /etc/corosync/corosync.conf
0164366e7424ffcdc99c881ac5c7960d  /etc/corosync/corosync.conf
 
Apr 12, 2018
27
0
6
33
After restarting lrm, crm and corosync :
https://hastebin.com/ebaxujaqed

After restarting pve-cluster :
https://hastebin.com/opedatudiq


Here is a part of the second one:

Code:
May 13 08:49:45 node3 pmxcfs[29854]: [status] notice: cpg_send_message retry 30
May 13 08:49:46 node3 corosync[27946]:   [KNET  ] rx: host: 2 link: 0 is up
May 13 08:49:46 node3 corosync[27946]:   [KNET  ] rx: host: 1 link: 0 is up
May 13 08:49:46 node3 corosync[27946]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 13 08:49:46 node3 corosync[27946]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 13 08:49:46 node3 pmxcfs[29854]: [status] notice: cpg_send_message retry 40
May 13 08:49:46 node3 corosync[27946]:   [TOTEM ] A new membership (3.7387) was formed. Members left: 1 2
May 13 08:49:46 node3 corosync[27946]:   [TOTEM ] Failed to receive the leave message. failed: 1 2
May 13 08:49:46 node3 corosync[27946]:   [CPG   ] downlist left_list: 2 received
May 13 08:49:46 node3 corosync[27946]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 13 08:49:46 node3 corosync[27946]:   [QUORUM] Members[1]: 3
May 13 08:49:46 node3 corosync[27946]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 13 08:49:46 node3 pmxcfs[29854]: [status] notice: node lost quorum
May 13 08:49:46 node3 pmxcfs[29854]: [dcdb] notice: members: 3/29854
May 13 08:49:46 node3 pmxcfs[29854]: [dcdb] notice: all data is up to date
May 13 08:49:46 node3 pmxcfs[29854]: [dcdb] crit: received write while not quorate - trigger resync
May 13 08:49:46 node3 pmxcfs[29854]: [dcdb] crit: leaving CPG group
May 13 08:49:46 node3 pmxcfs[29854]: [status] notice: members: 3/29854
May 13 08:49:46 node3 pmxcfs[29854]: [status] notice: all data is up to date
May 13 08:49:46 node3 pmxcfs[29854]: [status] notice: cpg_send_message retried 41 times
May 13 08:49:47 node3 corosync[27946]:   [TOTEM ] A new membership (1.738b) was formed. Members joined: 1 2
May 13 08:49:47 node3 corosync[27946]:   [TOTEM ] Retransmit List: 3
May 13 08:49:47 node3 corosync[27946]:   [CPG   ] downlist left_list: 0 received
May 13 08:49:47 node3 corosync[27946]:   [CPG   ] downlist left_list: 2 received
May 13 08:49:47 node3 corosync[27946]:   [CPG   ] downlist left_list: 2 received
May 13 08:49:47 node3 pmxcfs[29854]: [status] notice: members: 1/40759, 2/4521, 3/29854
May 13 08:49:47 node3 pmxcfs[29854]: [status] notice: starting data syncronisation
May 13 08:49:47 node3 corosync[27946]:   [QUORUM] This node is within the primary component and will provide service.
May 13 08:49:47 node3 corosync[27946]:   [QUORUM] Members[3]: 1 2 3
May 13 08:49:47 node3 corosync[27946]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 13 08:49:47 node3 pmxcfs[29854]: [status] notice: node has quorum
May 13 08:49:47 node3 pmxcfs[29854]: [dcdb] notice: start cluster connection
May 13 08:49:47 node3 pmxcfs[29854]: [dcdb] crit: cpg_join failed: 14
May 13 08:49:47 node3 pmxcfs[29854]: [dcdb] crit: can't initialize service
May 13 08:49:47 node3 pmxcfs[29854]: [dcdb] crit: cpg_send_message failed: 9

EDIT: I read in another topic that it could be a multicast issue.
The nodes are connected with a tinc VPN. Could that be related?
 

TechLineX

Active Member
Mar 2, 2015
213
4
38
Hello,

I'm planning an upgrade from Proxmox 5 to 6 and found this thread. Is it still a big problem to upgrade corosync from v2 to v3? I saw that the thread has 13 pages.

Best regards
 

t.lamprecht

The nodes are connected with a tinc VPN. Could that be related?

That's not really supported, you have steady retransmits from corosync, your network cannot keep up (latency wise, not necessarily bandwidth wise), and that's why a node leaves and joins the quorate partition constantly.
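
To get a feel for how often this happens, the corosync log can be scanned for the two message types that appear in the excerpt posted earlier. A rough sketch; `flap_summary` is a made-up helper name, and it only matches the message texts seen in that log.

```shell
# Count flapping indicators in corosync log lines read from stdin:
# newly formed memberships and retransmit lists.
flap_summary() {
  awk '
    /TOTEM/ && /A new membership/  { memberships++ }
    /TOTEM/ && /Retransmit List:/  { retransmits++ }
    END { printf "memberships=%d retransmits=%d\n", memberships, retransmits }
  '
}

# Usage on a node (not executed here):
#   journalctl -u corosync --since "1 hour ago" | flap_summary
```

A stable cluster should show few or no new memberships over such a window; steadily growing counts point to the network not keeping up.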
 

t.lamprecht

I'm planning an upgrade from Proxmox 5 to 6 and found this thread. Is it still a big problem to upgrade corosync from v2 to v3? I saw that the thread has 13 pages.

Yeah, this thread is quite big, but it mixes lots of different issues and troubleshooting posts. Some were actual issues in corosync/kronosnet and got addressed, some were network or configuration problems (e.g., ringX_addr wasn't resolvable, which worked by luck with corosync 2 but not with corosync 3), and some posts are totally unrelated.
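
For the ringX_addr case, a quick pre-upgrade check could look like the sketch below. It is an illustration only: the function names are invented, and it relies on `getent ahosts`, which resolves both hostnames and literal IP addresses through the system resolver.

```shell
# Print every ringX_addr value found in a corosync.conf read from stdin.
ring_addrs() {
  awk '$1 ~ /^ring[0-9]+_addr:$/ { print $2 }'
}

# Report ring addresses that the local resolver cannot resolve.
# The config path defaults to the node-local copy.
check_ring_addrs() {
  ring_addrs < "${1:-/etc/corosync/corosync.conf}" | while read -r addr; do
    getent ahosts "$addr" > /dev/null || echo "unresolvable: $addr"
  done
}

# Usage on a node (not executed here):
#   check_ring_addrs
```

No output means every address resolved on that node; the check has to be repeated on each node, since resolution depends on the local /etc/hosts and DNS setup.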

Actually I'll close this thread pretty soon, as new ones make much more sense to have now.

But anyway, as long as you closely follow the upgrade docs ( https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0 ) and use the pve5to6 checklist helper tool, you should be fine: if your setup was supported before, it still is now.
 

t.lamprecht

EDIT : I read in another topic that it could be a multicast issue.

Multicast isn't used in corosync 3 with kronosnet for now, but for three nodes the unicast/multicast differences should be marginal.
 
Apr 12, 2018
27
0
6
33
That's not really supported, you have steady retransmits from corosync, your network cannot keep up (latency wise, not necessarily bandwidth wise), and that's why a node leaves and joins the quorate partition constantly.
But why does it work well with corosync 2?
Is there a way to increase the timeout?
And it seems I don't have a latency issue (ping ~14 ms).
 

t.lamprecht

And it seems I don't have a latency issue (ping ~14 ms).

Yeah well, that is a latency issue. <= 2 ms is LAN performance and would be ideal; > 8 ms starts to cause issues fast and is not really recommended; > 10 ms isn't really usable.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_network_requirements
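
The thresholds above can be turned into a quick check. A sketch under assumptions: the helper names are invented, the label for the 2 to 8 ms band is my own wording since the post does not rate it, and `avg_rtt` expects the summary line printed by Linux iputils `ping`.

```shell
# Classify an average round-trip time in milliseconds using the
# thresholds from the post above. The "workable" label for the
# 2-8 ms band is an assumption; the post does not name that band.
classify_rtt() {
  awk -v rtt="$1" 'BEGIN {
    if (rtt <= 2)       print "ideal"
    else if (rtt <= 8)  print "workable"
    else if (rtt <= 10) print "not recommended"
    else                print "not usable"
  }'
}

# Extract the average RTT from the "rtt min/avg/max/mdev = ..." summary line.
avg_rtt() {
  awk -F'/' '/^rtt|^round-trip/ { print $5 }'
}

# Usage on a node (not executed here):
#   classify_rtt "$(ping -c 20 -q 10.10.10.3 | avg_rtt)"
```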

But why does it work well with corosync 2?

I'd rather say it barely worked there, not well. And that is probably due to the reduced packet load thanks to it using multicast, but nothing is sure.

Is there a way to increase the timeout ?

There is, but normally adjusting it is only required for big clusters (we know of working clusters with more than 50 nodes on corosync 3 with some tuning, so it can work at scale).

You could try increasing token_coefficient from 650ms to 1000ms, see https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_edit_corosync_conf
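
For orientation, corosync.conf(5) describes the effective token timeout as token + (number_of_nodes - 2) * token_coefficient once the nodelist has at least three nodes. A small sketch of that arithmetic; the function name is made up, and the token value of 1000 ms in the usage lines is an assumption, so check the default of your corosync version.

```shell
# Effective token timeout in ms, following the formula in corosync.conf(5):
#   token + (nodes - 2) * token_coefficient   if the nodelist has >= 3 nodes
#   token                                      otherwise
effective_token_ms() {
  token=$1
  coefficient=$2
  nodes=$3
  if [ "$nodes" -ge 3 ]; then
    echo $(( token + (nodes - 2) * coefficient ))
  else
    echo "$token"
  fi
}

# Usage (token of 1000 ms assumed):
#   effective_token_ms 1000 650 3    # coefficient 650 on three nodes
#   effective_token_ms 1000 1000 3   # coefficient raised to 1000
```

On a three-node cluster the coefficient is applied only once, so raising it from 650 to 1000 adds just 350 ms to the effective timeout.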
In your case, with the flapping links, it may be a bit difficult to edit. If problems arise, just ensure that /etc/corosync/corosync.conf is the same on all nodes. If it is not, copy a known good one with an increased config_version value to all nodes, restart corosync, then pve-cluster.
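
Increasing config_version by hand is easy to forget, so here is a minimal sketch that does it on a copy of the file. The function name and the output path in the usage lines are placeholders.

```shell
# Print a corosync.conf (read from stdin) with config_version increased by one.
# All other lines, including indentation, are passed through unchanged.
bump_config_version() {
  awk '
    $1 == "config_version:" { sub(/[0-9]+/, $2 + 1) }
    { print }
  '
}

# Usage (not executed here):
#   bump_config_version < /etc/corosync/corosync.conf > /root/corosync.conf.new
#   # review the new file, copy it to /etc/corosync/corosync.conf on all nodes, then:
#   #   systemctl restart corosync
#   #   systemctl restart pve-cluster
```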
 

t.lamprecht

But why does it work well with corosync 2?

Also, you nest one VPN/cluster/tunneling protocol (kronosnet) inside another such layer (tinc), which surely adds some overhead as well.
Maybe try using it directly.
 
Apr 12, 2018
27
0
6
33
Average ping is about 14.5 ms through the tinc VPN and 13.5 ms without the VPN.
The nodes are at different sites connected by an enterprise network.
So it seems difficult to get LAN performance.
In your case with the flapping links it may be a bit difficult to edit, if problems arise just ensure that the /etc/corosync/corosync.conf is the same on all nodes, if not copy a known good one with increased config_version value to all, restart corosync, then pve-cluster.
Currently, the easiest way to edit corosync.conf is to restore the cluster by downgrading to corosync 2 (apt install corosync=2.4.4-pve1)
You could try increasing token_coefficient from 650ms to 1000ms
I tried token_coefficient = 5000, but it didn't work.
 
Apr 12, 2018
27
0
6
33
It works by adding "transport: udp"!
This option only works with "crypto_cipher: none" and "crypto_hash: none".
I think disabling cryptography is not a problem since the nodes are connected through a dedicated VPN.
My final corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: mycluster
  config_version: 11
  interface {
    bindnetaddr: 10.10.10.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
  transport: udp
  crypto_cipher: none
  crypto_hash: none
}
 

t.lamprecht

It works by adding "transport: udp"!
This option only works with "crypto_cipher: none" and "crypto_hash: none".
I think disabling cryptography is not a problem since the nodes are connected through a dedicated VPN.

It can still be a problem. It implies that any program able to receive and send traffic on the tinc network has implicit root permissions, just as a heads up.
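
To make that trade-off visible, for example in a monitoring script, a config can be checked for disabled crypto. A sketch only: `crypto_disabled` is a made-up helper, and it flags just the explicit "none" settings shown in the config above, not whatever defaults apply when the options are absent.

```shell
# Succeed if the corosync.conf read from stdin explicitly sets both
# crypto_cipher and crypto_hash to "none".
crypto_disabled() {
  awk '
    $1 == "crypto_cipher:" { cipher = $2 }
    $1 == "crypto_hash:"   { hash = $2 }
    END { exit (cipher == "none" && hash == "none") ? 0 : 1 }
  '
}

# Usage on a node (not executed here):
#   crypto_disabled < /etc/corosync/corosync.conf \
#     && echo "WARNING: corosync traffic is neither encrypted nor authenticated"
```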

I mean, good on you for making it work at all, just keep the above in mind.

I'll now really close this thread. All real issues of kronosnet/corosync 3 on supported setups (and networks) got fixed, and for questions or new issues it's far better to open a new thread.
 