[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

I'm also able to report 7 healthy clusters with zero false-positive fencing events over the last week. We always configure Corosync to run on LACP OvS bonds, so the changes @spirit recommended are perfect for our use case (detailed here).

The cluster where nodes would get fenced regularly (the one where the client isn't willing to follow our recommendation of upgrading 2 x 1 GbE to 2 x 10 GbE) logs many retransmits, but Corosync now always catches up and keeps working...
 
If somebody still has corosync crashes/segfaults, can you enable debug logging in corosync.conf and install systemd-coredump (apt install systemd-coredump)?
This will help the corosync devs debug the problem.
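For example (a rough sketch; the logging section goes into corosync.conf, and corosync needs a restart for it to take effect):

Code:
# add to the logging section of corosync.conf
logging {
    to_syslog: yes
    debug: on
}

# install the coredump handler, then restart corosync
apt install systemd-coredump
systemctl restart corosync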
Got a segfault today. Attached coredump (recompressed with xz for size and gz for accepted filetype in forum).

Code:
           PID: 2200707 (corosync)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Thu 2019-10-03 16:42:21 CEST (18min ago)
  Command Line: /usr/sbin/corosync -f
    Executable: /usr/sbin/corosync
 Control Group: /system.slice/corosync.service
          Unit: corosync.service
         Slice: system.slice
       Boot ID: d6655aeac94f4c6697bdf860d329c246
    Machine ID: 6e925d11b497446e8e7f2ff38e7cf891
      Hostname: osl107pve
       Storage: /var/lib/systemd/coredump/core.corosync.0.d6655aeac94f4c6697bdf860d329c246.2200707.1570113741000000.lz4
       Message: Process 2200707 (corosync) of user 0 dumped core.

                Stack trace of thread 2200707:
                #0  0x00007feef7a920f1 n/a (libc.so.6)
                #1  0x00005585e63a8b64 n/a (corosync)
                #2  0x00005585e63a05e6 n/a (corosync)
                #3  0x00005585e63a10e4 n/a (corosync)
                #4  0x00005585e63ab459 n/a (corosync)
                #5  0x00007feef78d50af n/a (libqb.so.0)
                #6  0x00007feef78d4c8d qb_loop_run (libqb.so.0)
                #7  0x00005585e63750f5 n/a (corosync)
                #8  0x00007feef795709b __libc_start_main (libc.so.6)
                #9  0x00005585e63757ba n/a (corosync)

                Stack trace of thread 2200715:
                #0  0x00007feef79f9720 __nanosleep (libc.so.6)
                #1  0x00007feef7a24874 usleep (libc.so.6)
                #2  0x00007feef7b2e64a n/a (libknet.so.1)
                #3  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #4  0x00007feef7a2c4cf __clone (libc.so.6)

                Stack trace of thread 2200719:
                #0  0x00007feef79f9720 __nanosleep (libc.so.6)
                #1  0x00007feef7a24874 usleep (libc.so.6)
                #2  0x00007feef7b2e43f n/a (libknet.so.1)
                #3  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #4  0x00007feef7a2c4cf __clone (libc.so.6)

                Stack trace of thread 2200717:
                #0  0x00007feef7a2c7ef epoll_wait (libc.so.6)
                #1  0x00007feef7b32486 n/a (libknet.so.1)
                #2  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #3  0x00007feef7a2c4cf __clone (libc.so.6)

                Stack trace of thread 2200718:
                #0  0x00007feef7a2c7ef epoll_wait (libc.so.6)
                #1  0x00007feef7b2f7e3 n/a (libknet.so.1)
                #2  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #3  0x00007feef7a2c4cf __clone (libc.so.6)

                Stack trace of thread 2200713:
                #0  0x00007feef7a2c7ef epoll_wait (libc.so.6)
                #1  0x00007feef7b34cb4 n/a (libknet.so.1)
                #2  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #3  0x00007feef7a2c4cf __clone (libc.so.6)

                Stack trace of thread 2200714:
                #0  0x00007feef7a2c7ef epoll_wait (libc.so.6)
                #1  0x00007feef7b35650 n/a (libknet.so.1)
                #2  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #3  0x00007feef7a2c4cf __clone (libc.so.6)

                Stack trace of thread 2200716:
                #0  0x00007feef7a2c7ef epoll_wait (libc.so.6)
                #1  0x00007feef7b2ddc0 n/a (libknet.so.1)
                #2  0x00007feef7afbfa3 start_thread (libpthread.so.0)
                #3  0x00007feef7a2c4cf __clone (libc.so.6)
 

Got a segfault today. Attached coredump (recompressed with xz for size and gz for accepted filetype in forum).

I assume you have the latest patch, provided in this thread by spirit, installed on all nodes?
 
I assume you have the latest patch, provided in this thread by spirit, installed on all nodes?
Yes, all nodes are completely up to date as of today (and corosync and/or the node restarted). The cluster has been stable ever since the first patched knet 1.11 update from pvetest, and was later upgraded to the latest version, 1.12. I've also applied the suggested knet timeout tweaks since we only have a single ring.
 
Thanks ahovda.
We are currently working with the corosync devs; they have some idea of what the problem could be. Some patches are in testing.
https://github.com/kronosnet/kronosnet/issues/261

So, it's not the same segfault as previously. I hope it's the last one. (I'm still waiting before upgrading to Proxmox 6.)
 
FYI, the latest info from the issue:


So we hope to get a real fix soon; in the meantime you could adapt
/etc/pve/corosync.conf to use a reduced netmtu:

Code:
totem {
  netmtu: 1446
  ...
}

This would be for a real detected PMTU of 1500; for other values, just reduce the detected MTU by at least 54, e.g. netmtu 8946 for a 9000 MTU network.

I'll update this report here as soon as we have more information.
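For example, on a standard 1500 MTU network the workaround could be applied roughly like this (just a sketch; remember to increment config_version when editing the PVE-distributed file and restart corosync on each node afterwards):

Code:
# /etc/pve/corosync.conf (excerpt)
totem {
  # bump config_version whenever this file is edited
  netmtu: 1446
  ...
}

# then, on each node:
systemctl restart corosync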
 
My 2 cents on this.
Background: since the previous Proxmox version required multicast, and my network (OVH without vRack) did not support it, I built the cluster on top of a TINC meshed network. This had been working flawlessly since Proxmox 2.x. The cluster is mainly there for centralized administration; no HA or migration.

I upgraded to PVE 6 yesterday, and my 3-node cluster fell apart with lots of corosync retransmits in the logs, knet link down/up events, and quorum loss.
knet/pmtud was also complaining about an MTU misconfiguration.

According to the tinc debug output, my PMTU is 1451, so I forced a netmtu in corosync.conf and rebooted the 3 nodes.

At this point, I have :
- libknet 1.13-pve1 (from repo)
- netmtu 1397

The cluster is not failing anymore and there are no corosync crashes, but I'm still seeing knet down/up events, "Token has not been received" messages, and retransmits in the logs (though far fewer than before lowering the MTU).
The tinc link is stable with no visible loss.

I'm still investigating and don't know whether those errors are significant with knet, but I did not have any errors with the PVE 5 cluster.
 
Then I edited the PVE-distributed Corosync configuration file (remember to increment config_version):
Code:
[admin@kvm1 ~]# cat /etc/pve/corosync.conf


totem {
  cluster_name: cluster1
  config_version: 6
  interface {
    linknumber: 0
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 1000
}


Essentially this sets interface link 0 to ping with an interval of 0.2 s, a timeout of 5 s, and to count a single pong as successful.

Now that the problems are basically solved, my question is whether your changes are still "best practice" for a 7-node cluster?! ;-)

My current corosync config for the totem part looks like:


Code:
totem {
  cluster_name: apollon-pm
  config_version: 9
  interface {
    bindnetaddr: 192.168.178.76
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

So I would add the lines there?
I don't get why there is that "bindnetaddr:" pointing to the IP of the very first host of the cluster in there ... but at least it works with 2.0 ;-)

So I would simply add the 4 lines to the "interface" section plus the "token" one, roughly as sketched below.
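If I understand it correctly, after the Corosync 3 upgrade the totem section would then look roughly like this (just my sketch of merging the two configs shown above; the knet_* options only exist with Corosync 3/kronosnet):

Code:
totem {
  cluster_name: apollon-pm
  # incremented, since every edit of /etc/pve/corosync.conf needs a version bump
  config_version: 10
  interface {
    bindnetaddr: 192.168.178.76
    ringnumber: 0
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 1000
}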

Can I already do this on the 2.x setup, before the update to Corosync 3?

Thank you for your support!
 
Hi,

We run corosync on the VLAN Ceph replicates on, on a redundant LACP channel, instead of a dedicated NIC. False-positive fencing events are extremely disruptive, so we continue to run with those settings in place...
Thank you, then I'll also try them that way ;-)
 
So, I also upgraded to Corosync 3 again on my PVE 5 system ... if it stays stable until Sunday, I will upgrade the first host to PVE 6 :-)
 
Hopefully this is still being looked at. I still have nodes that go offline and can be brought back by restarting corosync and then restarting pve-cluster. I have two 4-node clusters. One node in each cluster has never gone offline; they have 128 GB RAM. The other three nodes in each cluster have gone offline; they each have 1.5 TB RAM. I only mention this because the problem has been ongoing ever since the 5 -> 6 upgrade and the RAM difference is the only material difference between the servers. This could be a coincidence, but it has been going on long enough now that I think not.

I have remote console access to the servers, so when one of them goes offline (no SSH or GUI access), I can access the console. If any dev wants access, I'll gladly let them take a look around.

This problem may be mitigated, but it is NOT solved!

Code:
root@vsys06:/etc/pve# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-3-pve)
pve-manager: 6.0-9 (running version: 6.0-9/508dcee0)
pve-kernel-5.0: 6.0-9
pve-kernel-helper: 6.0-9
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-7
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-7
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-9
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve1
 
@efinley
so when one of them goes offline (no ssh or gui access), I

So, you also lose SSH access?
If yes, I don't think it's corosync-related; maybe a NIC driver bug or another kernel bug.
Do you have any logs in /var/log/daemon.log or /var/log/kern.log?
Can you send your /etc/network/interfaces config?
 