Our new cluster is being prepared for use, and while slowly starting to use it I see some messages in syslog:
Dec 25 06:27:52 prxa06 pmxcfs: [dcdb] notice: data verification successful
Dec 25 06:34:26 prxa06 corosync: [TOTEM ] Retransmit List: b819f
Dec 25 06:35:35...
My syslog on all nodes is basically page after page of:
Dec 12 13:03:27 putsproxp10 corosync: [KNET ] pmtud: Starting PMTUD for host: 7 link: 0
Dec 12 13:03:27 putsproxp10 corosync: [KNET ] udp: detected kernel MTU: 1500
Dec 12 13:03:27 putsproxp10 corosync: [KNET ]...
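When syslog is page after page of corosync messages like the ones quoted above, a rough tally per host and subsystem can show which one dominates before digging further. This is only a sketch: the regex assumes the classic syslog layout seen in these excerpts, and `tally_corosync` is a hypothetical helper name, not part of any Proxmox or corosync tool.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts in this thread listing:
#   Dec 25 06:34:26 prxa06 corosync: [TOTEM ] Retransmit List: b819f
# An optional "[pid]" after "corosync" is tolerated as well.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s+(?P<host>\S+)\s+corosync(?:\[\d+\])?:\s+"
    r"\[(?P<subsys>\w+)\s*\]\s+(?P<msg>.*)$"
)


def tally_corosync(lines):
    """Count corosync log lines per (host, subsystem); other lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("host"), match.group("subsys"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Dec 25 06:27:52 prxa06 pmxcfs: [dcdb] notice: data verification successful",
        "Dec 25 06:34:26 prxa06 corosync: [TOTEM ] Retransmit List: b819f",
        "Dec 12 13:03:27 putsproxp10 corosync: [KNET ] pmtud: Starting PMTUD for host: 7 link: 0",
        "Dec 12 13:03:27 putsproxp10 corosync: [KNET ] udp: detected kernel MTU: 1500",
    ]
    for (host, subsys), n in sorted(tally_corosync(sample).items()):
        print(host, subsys, n)
```

To run it against a real log, pass an open file object instead of the sample list, for example `tally_corosync(open("/var/log/syslog"))`; the path is an assumption and depends on the distribution.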
I have a Ceph cluster with the following details:
The cluster runs on a separate NIC with active-backup bonding, a separate DELL 10G switch, and a separate IP range on 10Gbit.
On all of the nodes there are some KNET link down entries when there is heavy load on some of the nodes.
After upgrading from PVE 6 to PVE 7, the syslog of both servers is spammed with:
corosync: [KNET ] nsscrypto: Incorrect packet size.
After some googling I haven't found a solution yet.
Hopefully someone can help.
Related corosync GitHub issue
The reason was the...
Today our cluster lost synchronization. Most of the nodes were shown as offline or unknown. The nodes were up, but every node could see only itself and a few other nodes.
Restarting pve-cluster and corosync didn't help, so we brought everything down and started them one by one.
Previously I had a cluster of 13 nodes. I added 3 more nodes and within 5 minutes I lost the whole cluster. I restarted corosync on the nodes one by one, but when I start a 15th node I get this message:
corosync: [TOTEM ] Token has not been received in 380 ms
then after a few minutes the cluster...
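For anyone wondering how the token timeout scales as nodes are added: a minimal sketch, assuming the rule described in the corosync.conf(5) man page, where with a nodelist of at least 3 nodes the effective timeout is `token + (number_of_nodes - 2) * token_coefficient`. The function name and the example values (1000 ms and 650 ms) are illustrative assumptions; check the `totem` section of your own corosync.conf for the values actually in use.

```python
def effective_token_timeout(token_ms, token_coefficient_ms, node_count):
    """Effective totem token timeout in milliseconds.

    Assumes the corosync.conf(5) rule: the coefficient only applies when the
    nodelist contains at least 3 nodes.
    """
    if node_count < 3:
        return token_ms
    return token_ms + (node_count - 2) * token_coefficient_ms


if __name__ == "__main__":
    # Example values are assumptions, not read from any real cluster.
    for nodes in (13, 16):
        print(nodes, effective_token_timeout(1000, 650, nodes))
```

This only shows how the configured timeout grows with node count; it does not explain a specific "Token has not been received" message, which depends on the network and the actual configuration.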
I've been discussing this with corosync developers and they've told me this:
Multicast was only recommended for corosync 1.x, because unicast had not been tested yet
For corosync 2.x, they recommend using unicast (Proxmox currently uses...