Corosync - nsscrypto: Incorrect packet size

And could you post the network config (/etc/network/interfaces) from both nodes? Something is mangling packets here; the question is who, what, and where. My next best guess would be to run tcpdump and monitor traffic on both nodes and links.
How should I run tcpdump? I simply started it, but at first glance nothing stood out.
 
Ok here is the next weird info:
I have now restarted both nodes, independently of each other. The syslog spam still occurs on one server while the other is down for its reboot. So packet loss between the two servers shouldn't be the cause if the messages appear while one of them is down. Or am I wrong?
 
Ok it has something to do with the qdevice. I stopped the corosync service on the qdevice and the log spam disappeared.

Here is the corosync.conf of the device. Maybe there is a misconfiguration.

Bash:
pi@pve-qdevice1:~ $ cat /etc/corosync/corosync.conf
# Please read the corosync.conf.5 manual page
totem {
        version: 2

        secauth: on

        # Corosync itself works without a cluster name, but DLM needs one.
        # The cluster name is also written into the VG metadata of newly
        # created shared LVM volume groups, if lvmlockd uses DLM locking.
        cluster_name: PVE-Cluster

        # crypto_cipher and crypto_hash: Used for mutual node authentication.
        # If you choose to enable this, then do remember to create a shared
        # secret with "corosync-keygen".
        # enabling crypto_cipher, requires also enabling of crypto_hash.
        # crypto works only with knet transport
        crypto_cipher: none
        crypto_hash: none
}

logging {
        # Log the source file and line where messages are being
        # generated. When in doubt, leave off. Potentially useful for
        # debugging.
        fileline: off
        # Log to standard error. When in doubt, set to yes. Useful when
        # running in the foreground (when invoking "corosync -f")
        to_stderr: yes
        # Log to a log file. When set to "no", the "logfile" option
        # must not be set.
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        # Log to the system log daemon. When in doubt, set to yes.
        to_syslog: yes
        # Log debug messages (very verbose). When in doubt, leave off.
        debug: off
        # Log messages with time stamps. When in doubt, set to hires (or on)
        #timestamp: hires
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}

quorum {
        # Enable and configure quorum subsystem (default: off)
        # see also corosync.conf.5 and votequorum.5
        provider: corosync_votequorum
}

nodelist {
        # Change/uncomment/add node sections to match cluster configuration

        node {
                # Hostname of the node
                name: pve1
                # Cluster membership node identifier
                nodeid: 1
                quorum_votes: 1
                # Address of first link
                ring0_addr: 192.168.41.10
                ring1_addr: 192.168.40.11
                # When knet transport is used it's possible to define up to 8 links
                #ring1_addr: 192.168.1.1
        }
        node {
                name: pve2
                nodeid: 2
                quorum_votes: 1
                ring0_addr: 192.168.41.20
                ring1_addr: 192.168.40.21
        }
        node {
                name: pve-qdevice1
                nodeid: 3
                quorum_votes: 1
                ring0_addr: 192.168.41.30
                ring1_addr: 192.168.40.30
        }
        # ...
}
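To compare the crypto-related totem settings shown above with those on the cluster nodes, a small helper along these lines could be used. This is only a sketch: the function name `totem_crypto_settings` is hypothetical, and the default path is simply the file shown in the listing.

```shell
# Sketch: print the crypto-related totem settings from a corosync.conf-style file.
# The function name is hypothetical; the default path matches the listing above.
totem_crypto_settings() {
    conf="${1:-/etc/corosync/corosync.conf}"
    # Keep only real "key: value" lines for the listed keys (comment lines
    # start with '#' and therefore never match), then strip the indentation.
    grep -E '^[[:space:]]*(secauth|crypto_model|crypto_cipher|crypto_hash):' "$conf" |
        sed 's/^[[:space:]]*//'
}
```

Running it on every host that has a corosync.conf and comparing the output makes it easy to spot a host whose cipher or hash setting differs from the others.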
 
Hi, my name is Fabio and I am one of the corosync/knet maintainers looking at this issue together with Fabian.

What is odd about this report is that the cluster forms without any problems, yet at the same time we see packets being rejected because they are either too small or too big. The double log you see now with debug enabled is perfectly fine. It's the same error reported by different code paths inside knet.

First, I would love it if you could capture a tcpdump on both nodes on the corosync interface. A few seconds while the problem is happening should be more than enough. Please use verbose options in tcpdump.
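For anyone unsure how to run such a capture, here is a minimal sketch. It only prints a tcpdump command line, so it can be reviewed before being run as root. The interface name `vmbr0`, the 10-second duration, and UDP port 5405 (the usual corosync default; check which port your links actually use) are assumptions and do not come from this thread. The function name is hypothetical as well.

```shell
# Sketch: print a tcpdump command for a short, verbose capture of corosync traffic.
# Assumptions: the caller supplies the interface name; UDP port 5405 and a
# 10-second duration are defaults chosen for illustration only.
corosync_capture_cmd() {
    iface="$1"
    port="${2:-5405}"
    seconds="${3:-10}"
    # timeout ends the capture after the given number of seconds; -nn skips
    # name lookups, -vv enables verbose decoding, -X also dumps the payload
    # in hex and ASCII.
    printf 'timeout %s tcpdump -i %s -nn -vv -X udp port %s\n' "$seconds" "$iface" "$port"
}

# Example: show the command for a hypothetical interface called vmbr0.
corosync_capture_cmd vmbr0
```

Running the printed command on both nodes at the same time, once per corosync link, gives two views of the same traffic that can then be compared.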

Next, I would like you to please try one thing for me. In corosync.conf, in the totem section, please add: crypto_model: openssl

The section would look like:
totem {
....
crypto_model: openssl
}

and restart both nodes, one at a time. If the problem disappears, then we have at least isolated part of the problem to the NSS code. Otherwise, tcpdump can help us investigate further.

Thanks
Fabio
 
Hi again, I just saw your last post about qdevice config. That might be the problem actually. I will let Fabian look into it.

Fabio
 
Is corosync running on the QDevice host? That could be an issue, yes. Or do you mean you only stopped corosync-qnetd.service?
 
The QDevice is an RPi. QDevice and QNetd are installed on it. I simply stopped the corosync service, nothing else. The setup worked before the migration (I think; I hadn't looked at the syslog much before that). Maybe it was caused by a qdevice update or something similar.
 
On the RPi the corosync service is not supposed to run. The whole point of the qdevice/qnetd feature is to have a tie-breaker vote that is not running the full corosync stack ;) The config on the RPi probably had the two other nodes but crypto disabled, so it was sending plain traffic to the actual cluster, which was expecting encrypted traffic.
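A quick way to check a QDevice host for this situation is to ask systemd which of the two units is active. The following is a sketch under assumptions: the function name `qnetd_host_status` and the `SYSTEMCTL` override variable are hypothetical, and the unit names are the ones mentioned in this thread (`corosync` and `corosync-qnetd`).

```shell
# Sketch: report whether this host runs the full corosync stack or only qnetd.
# SYSTEMCTL is a hypothetical override for testing; by default the real
# systemctl is used.
qnetd_host_status() {
    sc="${SYSTEMCTL:-systemctl}"
    # "is-active" prints the unit state and exits non-zero when the unit is
    # not active, so the exit status is deliberately ignored here.
    corosync_state=$("$sc" is-active corosync 2>/dev/null || true)
    qnetd_state=$("$sc" is-active corosync-qnetd 2>/dev/null || true)
    if [ "$corosync_state" = "active" ]; then
        echo "problem: corosync is running on this host"
    elif [ "$qnetd_state" = "active" ]; then
        echo "ok: only corosync-qnetd is running"
    else
        echo "note: corosync-qnetd is not active"
    fi
}
```

If the first message appears on the QDevice host, `systemctl disable --now corosync` would stop the service and keep it from starting at boot; removing the package altogether is another option.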
 
Thank you, Fabian. I removed corosync and everything is still working, but now without the log spam. I have also discussed it a little more in the GitHub issue.

Kind regards,
Dario
 
