PVE 5.4-11 + Corosync 3.x: major issues

I'm also able to report 7 healthy clusters with zero false positive fencing events over the last week. We always configure Corosync to run on LACP OvS bonds, so the changes @spirit recommended are perfect for our use case (detailed here).

The cluster where nodes would get fenced regularly (the one where the client isn't willing to follow our recommendation of upgrading 2 x 1 GbE to 2 x 10 GbE) logs many retransmits, but Corosync now always catches up and continues working...
 

ahovda

New Member
If somebody still has a corosync crash/segfault, can you enable debug logs in corosync.conf and apt install systemd-coredump?
This will help the corosync devs debug the problem.
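For reference, that boils down to roughly the following (a sketch; the logging section goes into /etc/pve/corosync.conf, and as always remember to increment config_version when editing it):
Code:
# capture core dumps of crashing daemons
apt install systemd-coredump

# in /etc/pve/corosync.conf, add/extend the logging section
logging {
  debug: on
  to_syslog: yes
}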
Got a segfault today. Coredump attached (recompressed with xz for size and gz to match the forum's accepted file types).

Code:
           PID: 2200707 (corosync)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Thu 2019-10-03 16:42:21 CEST (18min ago)
  Command Line: /usr/sbin/corosync -f
    Executable: /usr/sbin/corosync
 Control Group: /system.slice/corosync.service
          Unit: corosync.service
         Slice: system.slice
       Boot ID: d6655aeac94f4c6697bdf860d329c246
    Machine ID: 6e925d11b497446e8e7f2ff38e7cf891
      Hostname: osl107pve
       Storage: /var/lib/systemd/coredump/core.corosync.0.d6655aeac94f4c6697bdf860d329c246.2200707.1570113741000000.lz4
       Message: Process 2200707 (corosync) of user 0 dumped core.

                Stack trace of thread 2200707:
                #0  0x00007feef7a920f1 n/a (libc.so.6)
                #1  0x00005585e63a8b64 n/a (corosync)
                #2  0x00005585e63a05e6 n/a (corosync)
                #3  0x00005585e63a10e4 n/a (corosync)
                #4  0x00005585e63ab459 n/a (corosync)
                #5  0x00007feef78d50af n/a (libqb.so.0)
                #6  0x00007feef78d4c8d qb_loop_run (libqb.so.0)
                #7  0x00005585e63750f5 n/a (corosync)
                #8  0x00007feef795709b __libc_start_main (libc.so.6)
                #9  0x00005585e63757ba n/a (corosync)

                Stack trace of thread 2200715:
                #0  0x00007feef79f9720 __nanosleep (libc.so.6)
                #1  0x00007feef7a24874 usleep (libc.so.6)
                #2  0x00007feef7b2e64a n/a (libknet.so.1)
                #3  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #4  0x00007feef7a2c4cf __clone (libc.so.6)

                Stack trace of thread 2200719:
                #0  0x00007feef79f9720 __nanosleep (libc.so.6)
                #1  0x00007feef7a24874 usleep (libc.so.6)
                #2  0x00007feef7b2e43f n/a (libknet.so.1)
                #3  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #4  0x00007feef7a2c4cf __clone (libc.so.6)

                Stack trace of thread 2200717:
                #0  0x00007feef7a2c7ef epoll_wait (libc.so.6)
                #1  0x00007feef7b32486 n/a (libknet.so.1)
                #2  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #3  0x00007feef7a2c4cf __clone (libc.so.6)

                Stack trace of thread 2200718:
                #0  0x00007feef7a2c7ef epoll_wait (libc.so.6)
                #1  0x00007feef7b2f7e3 n/a (libknet.so.1)
                #2  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #3  0x00007feef7a2c4cf __clone (libc.so.6)

                Stack trace of thread 2200713:
                #0  0x00007feef7a2c7ef epoll_wait (libc.so.6)
                #1  0x00007feef7b34cb4 n/a (libknet.so.1)
                #2  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #3  0x00007feef7a2c4cf __clone (libc.so.6)

                Stack trace of thread 2200714:
                #0  0x00007feef7a2c7ef epoll_wait (libc.so.6)
                #1  0x00007feef7b35650 n/a (libknet.so.1)
                #2  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #3  0x00007feef7a2c4cf __clone (libc.so.6)

                Stack trace of thread 2200716:
                #0  0x00007feef7a2c7ef epoll_wait (libc.so.6)
                #1  0x00007feef7b2ddc0 n/a (libknet.so.1)
                #2  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #3  0x00007feef7a2c4cf __clone (libc.so.6)
 


bofh

Member
Got a segfault today. Coredump attached (recompressed with xz for size and gz to match the forum's accepted file types).
I assume you have the latest patch, provided in this thread by spirit, installed on all nodes?
 

ahovda

New Member
I assume you have the latest patch, provided in this thread by spirit, installed on all nodes?
Yes, all nodes are completely up to date as of today (and corosync and/or the node restarted). The cluster has been stable ever since the first patched knet 1.11 update from pvetest, and has since been upgraded to the latest version, 1.12. I've also applied the suggested knet timeout tweaks since we only have a single ring.
 

Apollon77

Member
FYI, the latest info from the issue:


So we hope to get a real fix soon; in the meanwhile you could adapt /etc/pve/corosync.conf to have a reduced netmtu:

totem {
netmtu: 1446
...
}

This would be for a real detected PMTU of 1500; for other values, just reduce the detected PMTU by >= 54, e.g., 8946 for a 9000 MTU.

I'll update this report here as soon as we have more information.
 

pfoo

Member
My 2 cents on this
Background: since previous Proxmox versions required multicast, and my network (OVH without vRack) does not support it, I built the cluster over a tinc mesh network. This had been working flawlessly since Proxmox 2.x. The cluster is mainly there for centralized administration, no HA or migration.

I upgraded to pve6 yesterday, and my 3-node cluster fell apart with lots of corosync retransmits in the logs, knet link down/up events, and quorum loss.
knet/pmtud is also complaining about MTU misconfiguration.

According to tinc's debug output, my PMTU is 1451, so I forced a netmtu in corosync.conf (sketched below) and rebooted the 3 nodes.

At this point, I have:
- libknet 1.13-pve1 (from repo)
- netmtu 1397
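For completeness, the only change in the totem section is the netmtu line, following the "PMTU minus at least 54" rule from earlier in the thread (1451 - 54 = 1397); a sketch:
Code:
totem {
  netmtu: 1397
  ...
}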

The cluster is not failing anymore and there are no corosync crashes, but I'm still seeing knet down/up events, "Token has not been received" messages, and retransmits in the logs (though far fewer than before lowering the MTU).
The tinc link is stable with no visible loss.

I'm still investigating and don't know whether those errors are significant with knet, but I did not have any of them with the pve5 cluster.
 

Apollon77

Member
Then edited the PVE-distributed Corosync configuration file (remember to increment config_version):
Code:
[admin@kvm1 ~]# cat /etc/pve/corosync.conf


totem {
  cluster_name: cluster1
  config_version: 6
  interface {
    linknumber: 0
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 1000
}

Essentially: set interface link0 to ping with an interval of 0.2s and a timeout of 5s, and count a single pong as successful.
Now that the problems are basically solved, my question is whether your changes are still "best practice" for a 7-node cluster?! ;-)

My current corosync config for the totem part looks like:


Code:
totem {
  cluster_name: apollon-pm
  config_version: 9
  interface {
    bindnetaddr: 192.168.178.76
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
So I would add the lines there?
I don't get why that "bindnetaddr:" is pointing to the IP of the very first host in the cluster ... but at least with 2.0 it works ;-)

So I would simply add the 4 lines to the "interface" section plus the "token" one, roughly like the sketch below.
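What I would expect the result to look like (just my own sketch based on the values quoted above, with config_version bumped, not yet applied anywhere):
Code:
totem {
  cluster_name: apollon-pm
  config_version: 10
  interface {
    bindnetaddr: 192.168.178.76
    ringnumber: 0
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
  }
  ip_version: ipv4
  secauth: on
  token: 1000
  version: 2
}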

Can I already do this on the 2.x setup, before the update to corosync 3?

Thank you for your support!
 

Apollon77

Member
Hi,

We run corosync on the VLAN Ceph replicates on, over a redundant LACP channel, instead of a dedicated NIC. False positive fencing events are extremely disruptive, so we continue to run with those settings in place...
Thank you, then I'll also try them that way ;-)
 

Apollon77

Member
So, I also upgraded to corosync 3 again on my pve5 system ... if it stays stable till Sunday I will upgrade the first host to pve6 :)
 
efinley
Hopefully this is still being looked at. I still have nodes that go offline and can be brought back by restarting corosync and then restarting pve-cluster. I have two 4-node clusters. One node in each cluster has never gone offline; those have 128 GB of RAM. The other three nodes in each cluster have gone offline; they each have 1.5 TB of RAM. I only mention this because the problem has been ongoing ever since the 5 -> 6 upgrade and the RAM difference is the only material difference between the servers. This could be coincidence, but it's been going on long enough now that I think not.
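For reference, "brought back" just means restarting the two services on the affected node (assuming nothing else is hung):
Code:
systemctl restart corosync
systemctl restart pve-cluster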

I have remote console access to the servers, so when one of them goes offline (no SSH or GUI access), I can access the console. If any dev wants access, I'll gladly let them take a look around.

This problem may be mitigated, but it is NOT solved!

Code:
root@vsys06:/etc/pve# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-3-pve)
pve-manager: 6.0-9 (running version: 6.0-9/508dcee0)
pve-kernel-5.0: 6.0-9
pve-kernel-helper: 6.0-9
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-7
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-7
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-9
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve1
 

spirit

Famous Member
@efinley
so when one of them goes offline (no SSH or GUI access), I can access the console
So, you also lose SSH access?
If yes, I don't think it's corosync related; maybe a NIC driver bug or some other kernel bug.
Do you have any logs in /var/log/daemon.log or /var/log/kern.log?
Can you send your /etc/network/interfaces config?
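Something like this would be enough to start with (just an example of what to look for, adjust the pattern as needed):
Code:
grep -iE 'corosync|knet|link down|segfault' /var/log/daemon.log /var/log/kern.log
cat /etc/network/interfaces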
 
