[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

So, I also upgraded to corosync 3 again on my PVE 5 system ... if it stays stable until Sunday I will upgrade the first host to PVE 6 :)
So I have 24h of stability so far ... and only 3 "Token Retransmit list" cases the whole day with the new version and settings (it was much more with the old config and 2.x)
 
@efinley


So, you also lose SSH access?
If yes, I don't think it's corosync related. Maybe a NIC driver bug, or some other kernel bug.
Do you have any logs in /var/log/daemon.log or /var/log/kern.log?
Can you send your /etc/network/interfaces config?

That's what I thought initially too. But the reason SSH is inaccessible is that SSHD needs files that live in the /etc/pve hierarchy. /etc/pve becomes inaccessible when corosync/pve-cluster stop functioning, so anything that touches it hangs.

*and*

restarting corosync/pve-cluster brings the node back online, so...
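For reference, here is roughly what that recovery looks like from the console/IPMI once SSH hangs (a sketch; the authorized_keys symlink is how a standard PVE install is laid out, so verify it on your own node):
Code:
# root's key file normally points into /etc/pve, which is why sshd logins hang
# as soon as pmxcfs is stuck
ls -l /root/.ssh/authorized_keys

# restarting the cluster stack brings the node back online
systemctl restart corosync pve-cluster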
 
@spirit just fyi, I found out that OVH does indeed guarantee that 100 Mbit freebie vlan
well, their service quality is just shit, but it's the same as the paid 1 Gbit version
 
Hello, my cluster doesn't work after upgrading corosync from 2.4.4 to 3.0.3
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: mycluster
  config_version: 6
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  link_mode: passive
  secauth: on
  version: 2
}

node1 :
Code:
# pvecm status
Quorum information
------------------
Date:             Tue May 12 17:06:05 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.6ea5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.10.1 (local)
0x00000002          1 10.10.10.2
node2:
Code:
# pvecm status
Quorum information
------------------
Date:             Tue May 12 17:07:32 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.6eb9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.10.1
0x00000002          1 10.10.10.2 (local)
node3:
Code:
# pvecm status
Quorum information
------------------
Date:             Tue May 12 17:08:08 2020
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000003
Ring ID:          3.6ca1
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 10.10.10.3 (local)

I tried restarting the corosync and pve-cluster services.
Do you have any idea?
 
Stop pve-ha-lrm on all nodes.
Stop pve-ha-crm on all nodes.
Stop corosync on all nodes.
Then:
Start corosync on all nodes.
Start pve-ha-crm on all nodes.
Start pve-ha-lrm on all nodes.
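In shell form, that is roughly the following (a sketch using the standard systemd unit names; run each step on every node before moving on to the next):
Code:
# on every node, in this order
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm
systemctl stop corosync

# then, again on every node
systemctl start corosync
systemctl start pve-ha-crm
systemctl start pve-ha-lrm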
 
Hello, my cluster doesn't work after upgrading corosync from 2.4.4 to 3.0.3

Ensure the config is the same on all nodes, i.e., /etc/pve/corosync.conf and also the local one /etc/corosync/corosync.conf
 
@gradinaruvasile I tried but it didn't change anything
@t.lamprecht I have the same result on all nodes:
Code:
root@node1:~# md5sum /etc/pve/corosync.conf
0164366e7424ffcdc99c881ac5c7960d  /etc/pve/corosync.conf
root@node1:~# md5sum /etc/corosync/corosync.conf
0164366e7424ffcdc99c881ac5c7960d  /etc/corosync/corosync.conf
 
After restarting lrm, crm and corosync:
https://hastebin.com/ebaxujaqed

After restarting pve-cluster:
https://hastebin.com/opedatudiq


Here is part of the second:

Code:
May 13 08:49:45 node3 pmxcfs[29854]: [status] notice: cpg_send_message retry 30
May 13 08:49:46 node3 corosync[27946]:   [KNET  ] rx: host: 2 link: 0 is up
May 13 08:49:46 node3 corosync[27946]:   [KNET  ] rx: host: 1 link: 0 is up
May 13 08:49:46 node3 corosync[27946]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 13 08:49:46 node3 corosync[27946]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 13 08:49:46 node3 pmxcfs[29854]: [status] notice: cpg_send_message retry 40
May 13 08:49:46 node3 corosync[27946]:   [TOTEM ] A new membership (3.7387) was formed. Members left: 1 2
May 13 08:49:46 node3 corosync[27946]:   [TOTEM ] Failed to receive the leave message. failed: 1 2
May 13 08:49:46 node3 corosync[27946]:   [CPG   ] downlist left_list: 2 received
May 13 08:49:46 node3 corosync[27946]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 13 08:49:46 node3 corosync[27946]:   [QUORUM] Members[1]: 3
May 13 08:49:46 node3 corosync[27946]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 13 08:49:46 node3 pmxcfs[29854]: [status] notice: node lost quorum
May 13 08:49:46 node3 pmxcfs[29854]: [dcdb] notice: members: 3/29854
May 13 08:49:46 node3 pmxcfs[29854]: [dcdb] notice: all data is up to date
May 13 08:49:46 node3 pmxcfs[29854]: [dcdb] crit: received write while not quorate - trigger resync
May 13 08:49:46 node3 pmxcfs[29854]: [dcdb] crit: leaving CPG group
May 13 08:49:46 node3 pmxcfs[29854]: [status] notice: members: 3/29854
May 13 08:49:46 node3 pmxcfs[29854]: [status] notice: all data is up to date
May 13 08:49:46 node3 pmxcfs[29854]: [status] notice: cpg_send_message retried 41 times
May 13 08:49:47 node3 corosync[27946]:   [TOTEM ] A new membership (1.738b) was formed. Members joined: 1 2
May 13 08:49:47 node3 corosync[27946]:   [TOTEM ] Retransmit List: 3
May 13 08:49:47 node3 corosync[27946]:   [CPG   ] downlist left_list: 0 received
May 13 08:49:47 node3 corosync[27946]:   [CPG   ] downlist left_list: 2 received
May 13 08:49:47 node3 corosync[27946]:   [CPG   ] downlist left_list: 2 received
May 13 08:49:47 node3 pmxcfs[29854]: [status] notice: members: 1/40759, 2/4521, 3/29854
May 13 08:49:47 node3 pmxcfs[29854]: [status] notice: starting data syncronisation
May 13 08:49:47 node3 corosync[27946]:   [QUORUM] This node is within the primary component and will provide service.
May 13 08:49:47 node3 corosync[27946]:   [QUORUM] Members[3]: 1 2 3
May 13 08:49:47 node3 corosync[27946]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 13 08:49:47 node3 pmxcfs[29854]: [status] notice: node has quorum
May 13 08:49:47 node3 pmxcfs[29854]: [dcdb] notice: start cluster connection
May 13 08:49:47 node3 pmxcfs[29854]: [dcdb] crit: cpg_join failed: 14
May 13 08:49:47 node3 pmxcfs[29854]: [dcdb] crit: can't initialize service
May 13 08:49:47 node3 pmxcfs[29854]: [dcdb] crit: cpg_send_message failed: 9

EDIT: I read in another topic that it could be a multicast issue.
The nodes are connected with a tinc VPN, could that be related?
 
Hello,

I'm planning an upgrade from Proxmox 5 to 6 and found this thread. Is it still a big problem to upgrade corosync from v2 to v3? I saw that the thread has 13 pages.

Best regards
 
The nodes are connected with a tinc VPN, could that be related?

That's not really supported. You have steady retransmits from corosync; your network cannot keep up (latency-wise, not necessarily bandwidth-wise), and that's why a node constantly leaves and rejoins the quorate partition.
 
I'm planning an upgrade from Proxmox 5 to 6 and found this thread. Is it still a big problem to upgrade corosync from v2 to v3? I saw that the thread has 13 pages.

Yeah, this thread is quite big, but there are lots of different mixed issues and troubleshooting posts in it: some were actual issues in corosync/kronosnet and got addressed, some were network or configuration problems (e.g., a ringX_addr that wasn't resolvable, which worked by luck with corosync 2 but not with corosync 3), and some posts are totally unrelated.

Actually I'll close this thread pretty soon, as opening new ones makes much more sense now.

But anyway, as long as you closely follow the upgrade docs ( https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0 ) and use the pve5to6 checklist helper tool, you should be fine; if your setup was supported before, it still is now.
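The checker itself is a single command per node (a minimal sketch; the exact repository changes for the corosync 3 and PVE 6 packages are described in the wiki page above):
Code:
# run on every node before you start, and again between the major upgrade steps
pve5to6

# the wiki then walks through upgrading corosync to 3.x on all nodes first,
# followed by the Buster / PVE 6 dist-upgrade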
 
EDIT: I read in another topic that it could be a multicast issue.

Multicast isn't used by corosync 3 / kronosnet for now, but for three nodes the unicast/multicast difference should be marginal.
 
That's not really supported. You have steady retransmits from corosync; your network cannot keep up (latency-wise, not necessarily bandwidth-wise), and that's why a node constantly leaves and rejoins the quorate partition.
But why did it work well with corosync 2?
Is there a way to increase the timeout?
And it seems I don't have a latency issue (ping ~14 ms).
 
And it seems I don't have a latency issue (ping ~14 ms).

Yeah well, that is a latency issue. <= 2 ms is LAN performance and would be ideal, > 8 ms starts to cause issues fast and is not really recommended, and > 10 ms isn't really usable.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_network_requirements
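To get a rough idea of where a link sits relative to those numbers, plain ping against another node's ring address is enough (a sketch, using one of the ring0 addresses from the config posted earlier):
Code:
# sample round-trip latency over a short burst and compare avg/max
# against the <= 2 ms ideal and the ~10 ms hard limit mentioned above
ping -c 100 -i 0.2 10.10.10.2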

But why did it work well with corosync 2?

I'd rather say it barely worked there, not that it worked well. That was probably due to the reduced packet load from using multicast, but I'm not too sure.

Is there a way to increase the timeout?

There is, but normally adjusting them is only required for big clusters (we know of working > 50 node clusters with corosync 3 with some tuning, so it can work at scale).

You could try increasing token_coefficient from 650 ms to 1000 ms, see https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_edit_corosync_conf
In your case, with the flapping links, it may be a bit difficult to edit. If problems arise, just ensure that /etc/corosync/corosync.conf is the same on all nodes; if not, copy a known-good one with an increased config_version to all of them, restart corosync, then pve-cluster.
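For illustration, the tuning would go into the totem section of corosync.conf, roughly like this (a sketch only; the values are examples and config_version must be bumped past your current one):
Code:
totem {
  cluster_name: mycluster
  # must be higher than the currently deployed value
  config_version: 7
  # keep your existing interface/ip_version/link_mode/secauth/version lines,
  # then add the increased coefficient (in milliseconds)
  token_coefficient: 1000
}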
 
But why did it work well with corosync 2?

Also, you are nesting one VPN/cluster tunneling protocol (kronosnet) inside another such layer (tinc), which surely adds some overhead too...
Maybe try using the network directly, without tinc.
 
Average ping is about 14.5 ms through the tinc VPN and 13.5 ms without the VPN.
The nodes are at different sites connected by an enterprise network.
So it seems difficult to get LAN performance.
In your case, with the flapping links, it may be a bit difficult to edit. If problems arise, just ensure that /etc/corosync/corosync.conf is the same on all nodes; if not, copy a known-good one with an increased config_version to all of them, restart corosync, then pve-cluster.
Currently, the easiest way to edit corosync.conf is to restore the cluster by downgrading to corosync 2 (apt install corosync=2.4.4-pve1)
You could try increasing token_coefficient from 650ms to 1000ms
I tried token_coefficient = 5000, but it didn't work.
 
It works by adding "transport: udp"!
This option only works with "crypto_cipher: none" and "crypto_hash: none".
I think disabling cryptography is not a problem since the nodes are connected through a dedicated VPN.
My final corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: mycluster
  config_version: 11
  interface {
    bindnetaddr: 10.10.10.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
  transport: udp
  crypto_cipher: none
  crypto_hash: none
}
 
It works by adding "transport: udp"!
This option only works with "crypto_cipher: none" and "crypto_hash: none".
I think disabling cryptography is not a problem since the nodes are connected through a dedicated VPN.

It can still be a problem: it implies that any program able to send and receive traffic on the tinc network has implicit root permissions, just as a heads-up.

I mean, good on you for making it work at all, just keep the above in mind.

I'll now really close this thread. All real issues of kronosnet/corosync 3 on supported setups (and networks) got fixed, and for questions or new issues it's far better to open a new thread.
 