[SOLVED] 6.3-4 and Corosync 3.1.0 startup failed

Anton G

New Member
Feb 28, 2021
Hello everyone!
I ran into problems while testing the 6.3-4 upgrade in a staging environment; this thread is specifically about the corosync upgrade. Even with debug: on, the corosync logs did not make the reason clear.

State before:
33 online nodes in the cluster and 2 unreachable (turned off) nodes.
All nodes ran identical versions:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
corosync: 3.0.4-pve1
libcorosync-common4: 3.0.4-pve1

Config with the legacy bindnetaddr in the interface section:
Code:
logging {
  debug: off
  timestamp: on
  to_syslog: yes
}

nodelist {
  node {
    name: vmm01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.192.220.20
  }
...
  node {
    name: vmm35
    nodeid: 34
    quorum_votes: 1
    ring0_addr: 10.192.220.54
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster1
  config_version: 47
  interface {
    ringnumber: 0
    bindnetaddr: 10.192.220.0
  }
  ip_version: ipv4
  join: 500
  knet_compression_model: zlib
  max_messages: 12
  merge: 600
  netmtu: 1300
  secauth: on
  send_join: 250
  token: 100000
  version: 2
  window_size: 30
}
Upgrade test scenario:
Two nodes received apt update && apt dist-upgrade, which installed PVE 6.3-4 and corosync 3.1.0 (see the sketch below).
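Roughly, this is what was run on each of the two test nodes; the version and service checks are just what I'd use to verify the result (standard pveversion/systemctl tooling, not part of the upgrade itself):
Code:
# on each of the two test nodes
apt update && apt dist-upgrade                   # pulls in PVE 6.3-4 and corosync 3.1.0
pveversion -v | grep -E 'proxmox-ve|corosync'    # confirm the installed versions
systemctl status corosync                        # did corosync come back up?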

Corosync failed to start on both nodes; these were the last log lines:
Code:
Feb 26 20:11:48 vmm06 corosync[1638201]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Feb 26 20:11:48 vmm06 corosync[1638201]: [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Feb 26 20:11:48 vmm06 corosync[1638201]: [MAIN ] Please migrate config file to nodelist.
Feb 26 20:11:48 vmm06 corosync[1638201]: [TOTEM ] Initializing transport (Kronosnet).
Feb 26 20:11:48 vmm06 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Feb 26 20:11:48 vmm06 systemd[1]: corosync.service: Failed with result 'signal'.
Feb 26 20:11:48 vmm06 systemd[1]: Failed to start Corosync Cluster Engine.

I decided to remove bindnetaddr from the config. That succeeded on the running 3.0.4 nodes and allowed corosync to start on one test node with 3.1.0, but on the other node corosync kept failing. The resulting totem section is sketched below.
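For reference, the totem section after the change looked roughly like this (the interface block is gone; config_version: 48 is just the next bump, the exact number doesn't matter as long as it increases):
Code:
totem {
  cluster_name: cluster1
  config_version: 48
  ip_version: ipv4
  join: 500
  knet_compression_model: zlib
  max_messages: 12
  merge: 600
  netmtu: 1300
  secauth: on
  send_join: 250
  token: 100000
  version: 2
  window_size: 30
}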

I am attaching logs from the failing node running corosync 3.1.0 with debug: on.
The whole cluster has been running fine since PVE 5.

Can you advise where to dig further, please? The nodes are identical in terms of network settings and number of interfaces; only the hardware differs slightly.
 


Hi,

Please provide the syslog from around the time of the upgrade, from the nodes in the cluster, as well as your network configuration.
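Something along these lines would do; adjust the time window to match when your upgrade actually ran (units and paths are the standard PVE ones):
Code:
journalctl -u corosync -u pve-cluster --since "2021-02-26 20:00" --until "2021-02-26 20:30" > corosync-upgrade.log
cat /etc/network/interfaces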
 
Are there any log lines after this? Any chance a coredump was collected?
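If systemd-coredump is installed, something like this should find and extract it (standard coredumpctl usage, sketched here for convenience):
Code:
coredumpctl list corosync                    # any recorded crashes?
coredumpctl info corosync                    # metadata and backtrace summary
coredumpctl dump corosync -o corosync.core   # extract the core file itself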
 
Hello folks,
Sorry for taking so long.
I came back to this upgrade and played around a bit.
I picked the same host from this thread (vmm06) and upgraded all packages. Corosync was upgraded from 3.0.4 to 3.1 successfully and started.
After a reboot it started crashing.
I collected a coredump and attached it.
The syslog from around that time is attached.
Network settings are attached.
@fabian - no, those were all the log lines, even with debug enabled.
 


Corosync 3.1.2 is now available on pvetest with the bug fix. You can also disable knet compression as a workaround; the bug/crash should only affect setups that have it enabled (see the sketch below).
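Concretely: in /etc/pve/corosync.conf set the compression model to none (or drop the line entirely; none is the default) and bump config_version so the change propagates. A rough excerpt, with an illustrative version number:
Code:
totem {
  # ... rest of the totem section unchanged ...
  config_version: 49              # must increase for the change to propagate
  knet_compression_model: none    # default value; disables knet compression
}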
 
