[SOLVED] 6.3-4 and Corosync 3.1.0 startup failed

Anton G

New Member
Feb 28, 2021
Hello everyone!
I ran into problems while testing the 6.3-4 upgrade in a staging environment; this thread is specifically about the corosync upgrade. Even with debug: on, the corosync logs did not make the reason clear.

State before:
33 online nodes in the cluster and 2 unreachable (turned off) nodes.
All nodes ran identical versions:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
corosync: 3.0.4-pve1
libcorosync-common4: 3.0.4-pve1

Config with the legacy bindnetaddr in the interface section:
Code:
logging {
  debug: off
  timestamp: on
  to_syslog: yes
}

nodelist {
  node {
    name: vmm01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.192.220.20
  }
...
  node {
    name: vmm35
    nodeid: 34
    quorum_votes: 1
    ring0_addr: 10.192.220.54
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster1
  config_version: 47
  interface {
    ringnumber: 0
    bindnetaddr: 10.192.220.0
  }
  ip_version: ipv4
  join: 500
  knet_compression_model: zlib
  max_messages: 12
  merge: 600
  netmtu: 1300
  secauth: on
  send_join: 250
  token: 100000
  version: 2
  window_size: 30
}
Upgrade test scenario:
Two nodes received apt update && apt dist-upgrade, which installed PVE 6.3-4 and corosync 3.1.0 (see the sketch below).
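Roughly, this is what was run on each of the two test nodes; the version and service checks are just what I'd use to verify the result (standard pveversion/systemctl tooling, not part of the upgrade itself):
Code:
# on each of the two test nodes
apt update && apt dist-upgrade                   # pulls in PVE 6.3-4 and corosync 3.1.0
pveversion -v | grep -E 'proxmox-ve|corosync'    # confirm the installed versions
systemctl status corosync                        # did corosync come back up?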

Corosync failed to start on both nodes; these were the last log lines:
Code:
Feb 26 20:11:48 vmm06 corosync[1638201]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Feb 26 20:11:48 vmm06 corosync[1638201]: [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Feb 26 20:11:48 vmm06 corosync[1638201]: [MAIN ] Please migrate config file to nodelist.
Feb 26 20:11:48 vmm06 corosync[1638201]: [TOTEM ] Initializing transport (Kronosnet).
Feb 26 20:11:48 vmm06 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Feb 26 20:11:48 vmm06 systemd[1]: corosync.service: Failed with result 'signal'.
Feb 26 20:11:48 vmm06 systemd[1]: Failed to start Corosync Cluster Engine.

I decided to remove bindnetaddr from the config. That succeeded on the running 3.0.4 nodes and allowed corosync to start on one test node with 3.1.0, but on the other node corosync kept failing. The resulting totem section is sketched below.
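For reference, the totem section after the change looked roughly like this (the interface block is gone; config_version: 48 is just the next bump, the exact number doesn't matter as long as it increases):
Code:
totem {
  cluster_name: cluster1
  config_version: 48
  ip_version: ipv4
  join: 500
  knet_compression_model: zlib
  max_messages: 12
  merge: 600
  netmtu: 1300
  secauth: on
  send_join: 250
  token: 100000
  version: 2
  window_size: 30
}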

I am attaching logs from the failing node running corosync 3.1.0 with debug: on.
The whole cluster has been running fine since PVE 5.

Can you advise where to dig further, please? The nodes are identical in terms of network settings and number of interfaces; only the hardware differs slightly.
 


Hi,

Please provide the syslog from around the time of the upgrade, from the nodes in the cluster, as well as your network configuration.
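Something along these lines would do; adjust the time window to match when your upgrade actually ran (units and paths are the standard PVE ones):
Code:
journalctl -u corosync -u pve-cluster --since "2021-02-26 20:00" --until "2021-02-26 20:30" > corosync-upgrade.log
cat /etc/network/interfaces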
 
Are there any log lines after this? Any chance a coredump was collected?
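If systemd-coredump is installed, something like this should find and extract it (standard coredumpctl usage, sketched here for convenience):
Code:
coredumpctl list corosync                    # any recorded crashes?
coredumpctl info corosync                    # metadata and backtrace summary
coredumpctl dump corosync -o corosync.core   # extract the core file itself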
 
Hello folks,
Sorry for taking so long.
I came back to this upgrade and played around a bit.
I picked the same host from this thread (vmm06) and upgraded all packages. Corosync was upgraded from 3.0.4 to 3.1 successfully and started.
After a reboot it started crashing.
I collected a coredump and attached it.
The syslog from around that time is attached.
Network settings are attached.
@fabian - no, those were all the log lines, even with debug enabled.
 


Corosync 3.1.2 is now available on pvetest with the bug fix. You can also disable knet compression as a workaround; the bug/crash should only affect setups that have it enabled (see the sketch below).
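Concretely: in /etc/pve/corosync.conf set the compression model to none (or drop the line entirely; none is the default) and bump config_version so the change propagates. A rough excerpt, with an illustrative version number:
Code:
totem {
  # ... rest of the totem section unchanged ...
  config_version: 49              # must increase for the change to propagate
  knet_compression_model: none    # default value; disables knet compression
}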
 
