Hello everyone!
I ran into problems while testing the 6.3-4 upgrade in a staging environment; this thread is specifically about the corosync part of the upgrade. The reasons were not clear from the corosync logs, even with debug: on.
State before:
33 online nodes in the cluster and 2 unreachable (powered-off) nodes.
All nodes have identical versions:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
corosync: 3.0.4-pve1
libcorosync-common4: 3.0.4-pve1
Config with the legacy bindnetaddr in the interface section:
Code:
logging {
  debug: off
  timestamp: on
  to_syslog: yes
}

nodelist {
  node {
    name: vmm01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.192.220.20
  }
  ...
  node {
    name: vmm35
    nodeid: 34
    quorum_votes: 1
    ring0_addr: 10.192.220.54
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster1
  config_version: 47
  interface {
    ringnumber: 0
    bindnetaddr: 10.192.220.0
  }
  ip_version: ipv4
  join: 500
  knet_compression_model: zlib
  max_messages: 12
  merge: 600
  netmtu: 1300
  secauth: on
  send_join: 250
  token: 100000
  version: 2
  window_size: 30
}
Upgrade test scenario:
Two nodes received apt update && apt dist-upgrade; PVE 6.3-4 and corosync 3.1.0 were installed.
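For reference, this is roughly what each test node got (the grep pattern is just my way of picking the relevant packages out of pveversion -v):
Code:
# pull in the latest packages from the configured repositories
apt update && apt dist-upgrade

# confirm the resulting package versions
pveversion -v | grep -E 'proxmox-ve|corosync'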
Corosync then failed to start on both nodes; these were the last log lines:
Code:
Feb 26 20:11:48 vmm06 corosync[1638201]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Feb 26 20:11:48 vmm06 corosync[1638201]: [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Feb 26 20:11:48 vmm06 corosync[1638201]: [MAIN ] Please migrate config file to nodelist.
Feb 26 20:11:48 vmm06 corosync[1638201]: [TOTEM ] Initializing transport (Kronosnet).
Feb 26 20:11:48 vmm06 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Feb 26 20:11:48 vmm06 systemd[1]: corosync.service: Failed with result 'signal'.
Feb 26 20:11:48 vmm06 systemd[1]: Failed to start Corosync Cluster Engine.
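In case a backtrace would help, here is a sketch of how I could capture one from the segfault; it assumes systemd-coredump and debug symbols for corosync are installed (both are assumptions on my side):
Code:
# list recorded corosync crashes (needs systemd-coredump)
coredumpctl list corosync

# open the newest core dump in gdb, then run "bt full" at the prompt
coredumpctl gdb corosync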
I decided to remove bindnetaddr from the config. That worked for the running 3.0.4 nodes, and it also let corosync start on one of the test nodes with 3.1.0, but on the other test node corosync was still failing.
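For clarity, a sketch of the resulting totem section, i.e. the one above with only bindnetaddr dropped (the config_version bump is shown for illustration; the rest is unchanged):
Code:
totem {
  cluster_name: cluster1
  # config_version bumped so the edited config propagates (illustrative)
  config_version: 48
  interface {
    ringnumber: 0
  }
  ip_version: ipv4
  join: 500
  knet_compression_model: zlib
  max_messages: 12
  merge: 600
  netmtu: 1300
  secauth: on
  send_join: 250
  token: 100000
  version: 2
  window_size: 30
}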
I am attaching logs from the failing node with corosync 3.1.0 and debug: on.
The whole cluster has been running happily since PVE 5.
Can you advise where to dig further, please? The nodes are identical in terms of network settings and number of interfaces; the hardware, however, is slightly different.
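For completeness, this is the kind of side-by-side check I mean for the network settings (plain iproute2 commands, run on both the working and the failing node):
Code:
# interface names, link state and MTU at a glance
ip -br link

# addresses on the corosync ring network
ip -br addr | grep '10.192.220.'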