[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

Hi Fabio, we have some hardware coming in this week for a new production cluster, so we're more than happy to do whatever we can to help fix this issue.

Attached are the logs from all three hosts from 12am Saturday morning until I manually restarted corosync on scramjet around midday. Scramjet is the host that experienced the FPE at 02:49:31.

Let me know if you need anything else.

Thanks
 


@WesC thanks for the logs.

What is interesting to notice in the logs is that it is always node1 (scramjet) experiencing the link flapping; there is no link flapping between node2 and node3.

Does scramjet have a specific workload that the other nodes don't have? (I am no Proxmox expert; apologies if this is supposed to be a well-known configuration ;)).

Out of curiosity: is scramjet connected to the same network switch as the other 2 nodes? Is it the same hardware class as the other nodes? Would it be possible to check for network packet loss on the connections between node1 and node2/node3?

I can see at least that we are not experiencing MTU changes, so that's good; at least I know which part of the code needs tuning (and thanks for the feedback here).

Thanks
Fabio
 
In addition to the questions that @Fabio M. Di Nitto asked: IF THIS IS A TEST CLUSTER, could you try editing /etc/pve/corosync.conf to set "secauth" to "off", bump the contained "config_version" by 1, and then restart corosync on all three nodes? This disables all encryption and authentication (so please don't do it on a production system!). It would be interesting to see whether the additional overhead for that is a possible culprit or not. The same goes for changing the transport to the legacy udpu one altogether (this also requires secauth = off).
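For clarity, a rough sketch of how the relevant part of /etc/pve/corosync.conf would look after that change (cluster name and version numbers here are just placeholders, use your own values):

Code:
totem {
  cluster_name: yourcluster
  config_version: 4        # whatever your current value is, plus 1
  secauth: off             # disables encryption/authentication - test clusters only!
  # transport: udpu        # optionally, also try the legacy transport (requires secauth: off)
  ...
}

and then on each node:

Code:
systemctl restart corosync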
 
@WesC thanks for the logs.

What is interesting to notice in the logs is that it is always node1 (scramjet) experiencing the link flapping; there is no link flapping between node2 and node3.

Does scramjet have a specific workload that the other nodes don't have? (I am no Proxmox expert; apologies if this is supposed to be a well-known configuration ;)).

I think previously, on the original libknet version, other hosts were flapping too. I can go back further in the logs to find out if needed. The workload on scramjet isn't very different in nature, but it is higher: being an EPYC system with more RAM than the other hosts, it'll typically run more VMs at any given time.

Out of curiosity: is scramjet connected to the same network switch as the other 2 nodes? Is it the same hardware class as the other nodes? Would it be possible to check for network packet loss on the connections between node1 and node2/node3?

They are all connected to the same switch, which was replaced with new hardware when we first started seeing the PMTUD issues (the previous switch had other issues, so we decided to just buy a new, modern one). I don't see any signs of packet loss on scramjet's or sledgehammer's interfaces, but I do see some dropped packets on sanctuary. I'm not sure when those occurred; they could be old and might predate some network cable replacements that were also made. I'll document the counters here, so that if we see another loss of quorum in the next few days I can compare the new values to the current ones:

Code:
        RX packets 519626933  bytes 444646171568 (414.1 GiB)
        RX errors 0  dropped 4765  overruns 0  frame 0
        TX packets 301465625  bytes 29353892481 (27.3 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

        RX packets 5114043882  bytes 7575038969366 (6.8 TiB)
        RX errors 0  dropped 9817  overruns 0  frame 0
        TX packets 1993387039  bytes 2576367143729 (2.3 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
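
For my own reference I'll snapshot those counters periodically so they're easy to diff later; roughly like this (assuming the cluster-facing interface is eno1 - substitute the real interface name on each host):

Code:
# append a timestamped copy of the interface counters for later comparison
date >> /root/corosync-if-counters.log
ip -s link show dev eno1 >> /root/corosync-if-counters.log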

I wouldn't say the 3 nodes are the same hardware class; scramjet is definitely newer hardware with greater resources. The other 2 nodes are one single-socket and one dual-socket Xeon system. They're aging, but they're not too far behind scramjet, just with less RAM and a lower core count.

I can see at least that we are not experiencing MTU changes, so that's good; at least I know which part of the code needs tuning (and thanks for the feedback here).

Thanks
Fabio

With the test version of libknet Fabian posted, the PMTUD issue does appear to have subsided. Yesterday and this morning the nodes were green and quorum was maintained, which is the longest they've held this state since forming the cluster. The FPE on Saturday morning is the one exception on this version of libknet. It does seem like a step in the right direction.
 
In addition to the questions that @Fabio M. Di Nitto asked: IF THIS IS A TEST CLUSTER, could you try editing /etc/pve/corosync.conf to set "secauth" to "off", bump the contained "config_version" by 1, and then restart corosync on all three nodes? This disables all encryption and authentication (so please don't do it on a production system!). It would be interesting to see whether the additional overhead for that is a possible culprit or not. The same goes for changing the transport to the legacy udpu one altogether (this also requires secauth = off).

I did try that the night before you posted the test version of libknet, and that morning the cluster was green and not reporting the PMTUD issue. I undid this change to test that version of libknet, however.
 
@Fabio M. Di Nitto Would this also help with the errors in the top posts (or in my post)? There was nothing with MTU errors in those ... or is this something different?

Hi Apollon77, Marin Bernard's logs did show pmtud log entries, which led me to suspect their issue was the same as mine - but it's possible there are multiple things at work here (in which case I apologize for any potential thread hijacking!). You might want to grep your logs for it to confirm whether yours is the same/similar or something different. The logs you posted do look a lot like mine do after the pmtud issue manifests. You mentioned kernel crashes; logs of that, or of the time leading up to it, if available, would help determine whether it's the same issue or multiple issues as well.
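
Something along these lines should show whether your logs contain the same pmtud entries (just a sketch - adjust the log paths to wherever your hosts keep them):

Code:
grep -i pmtud /var/log/syslog
zgrep -i pmtud /var/log/syslog.*.gz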

In fact it is a home network for me, all Gigabit, and I also saw hanging SSH connections while the rest of the network and other connections worked just fine.

Not wanting to jump to any conclusions, but in my experience that specific symptom - if it happens when initially trying to log in to a host via ssh - is often an indication of IO contention and stalling, usually (in my case) from a dying-but-not-quite-dead hard drive. That's not the only thing that can cause it, though.

I'd wait around and try ssh'ing to my hosts when this failure happens to see if I experience the same hanging, but they seem to love to do it between 3am and 6am...
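
If you want to rule that out, something like the following while the hang is happening should show it (a rough sketch; replace /dev/sda with your actual drives):

Code:
# sustained high %util / await on one disk points at IO stalling
iostat -x 5
# reallocated or pending sectors are the usual signs of a dying drive
smartctl -a /dev/sda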
 
I'm also seeing a much more stable PVE cluster with http://download.proxmox.com/temp/libknet1_1.10-pve2~test1_amd64.deb installed. I've got a 5-node cluster in a single VLAN, and corosync was frequently "splitting" into subsets of nodes, e.g.:

Code:
root@maia# pvecm status
Quorum information
------------------
Date: Mon Aug 5 16:47:44 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000006
Ring ID: 4/586844
Quorate: No

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 2
Quorum: 3 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000004 1 192.168.1.60
0x00000006 1 192.168.1.80 (local)

Errors were being logged every 3 seconds, at a rate of several thousand per hour. The errors were all very similar:

Code:
<28>Aug 5 16:29:58 hera corosync[408628]: [KNET ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
syslog_message:
[KNET ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet

Since installing libknet1_1.10-pve2~test1 the errors have stopped completely. All hosts are logging to Kibana, and a graph shows a complete absence of errors since the package was installed at 4:30pm. This is also the longest the cluster has stayed in sync since the upgrade to the Proxmox 6 beta. I'll report back if it regresses, but it's looking good so far.

I've attached my corosync.conf. This is not a production cluster, so if you need me to run any tests or want a copy of the Kibana logs, let me know.
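
In case it helps anyone else trying the test package, this is roughly how I installed it (filename taken from the URL above; adjust if a newer test build gets posted):

Code:
wget http://download.proxmox.com/temp/libknet1_1.10-pve2~test1_amd64.deb
dpkg -i libknet1_1.10-pve2~test1_amd64.deb
systemctl restart corosync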
 


Hi Apollon77, Marin Bernard's logs did show pmtud log entries, which led me to suspect their issue was the same as mine - but it's possible there are multiple things at work here (in which case I apologize for any potential thread hijacking!). You might want to grep your logs for it to confirm whether yours is the same/similar or something different. The logs you posted do look a lot like mine do after the pmtud issue manifests.

I did, and yes, I had some such lines in my logs too, though not many ... but exactly on those days ...
Code:
root@pm6:/var/log# zgrep MTU syslog.*.gz
syslog.5.gz:Jul 31 07:26:26 pm6 corosync[1617]:   [KNET  ] pmtud: PMTUD link change for host: 6 link: 0 from 470 to 1366
syslog.5.gz:Jul 31 07:26:27 pm6 corosync[1617]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 470 to 1366
syslog.5.gz:Jul 31 07:26:27 pm6 corosync[1617]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 470 to 1366
syslog.5.gz:Jul 31 07:26:27 pm6 corosync[1617]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366
syslog.5.gz:Jul 31 07:26:27 pm6 corosync[1617]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 470 to 1366
syslog.5.gz:Jul 31 07:26:27 pm6 corosync[1617]:   [KNET  ] pmtud: Global data MTU changed to: 1366
syslog.5.gz:Jul 31 08:02:21 pm6 corosync[1617]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 3 link 0 but the other node is not acknowledging packets of this size.
syslog.5.gz:Jul 31 08:02:21 pm6 corosync[1617]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
syslog.5.gz:Jul 31 08:02:21 pm6 corosync[1617]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 1366 to 1350
syslog.5.gz:Jul 31 08:02:21 pm6 corosync[1617]:   [KNET  ] pmtud: Global data MTU changed to: 1350
syslog.5.gz:Jul 31 08:02:51 pm6 corosync[1617]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 1350 to 1366
syslog.5.gz:Jul 31 08:02:51 pm6 corosync[1617]:   [KNET  ] pmtud: Global data MTU changed to: 1366
syslog.5.gz:Jul 31 08:08:58 pm6 corosync[1617]:   [KNET  ] pmtud: PMTUD link change for host: 7 link: 0 from 470 to 1366
syslog.5.gz:Jul 31 11:03:09 pm6 corosync[1617]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
syslog.5.gz:Jul 31 11:03:09 pm6 corosync[1617]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
syslog.5.gz:Jul 31 11:03:09 pm6 corosync[1617]:   [KNET  ] pmtud: Global data MTU changed to: 1350
syslog.5.gz:Jul 31 11:03:39 pm6 corosync[1617]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1350 to 1366
syslog.5.gz:Jul 31 11:03:39 pm6 corosync[1617]:   [KNET  ] pmtud: Global data MTU changed to: 1366
syslog.5.gz:Jul 31 15:23:54 pm6 corosync[1617]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
syslog.5.gz:Jul 31 15:23:54 pm6 corosync[1617]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
syslog.5.gz:Jul 31 15:23:54 pm6 corosync[1617]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1366 to 1350
syslog.5.gz:Jul 31 15:23:54 pm6 corosync[1617]:   [KNET  ] pmtud: Global data MTU changed to: 1350
syslog.5.gz:Jul 31 15:24:24 pm6 corosync[1617]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1350 to 1366
syslog.5.gz:Jul 31 15:24:24 pm6 corosync[1617]:   [KNET  ] pmtud: Global data MTU changed to: 1366
syslog.5.gz:Binary file (standard input) matches
syslog.6.gz:Jul 30 23:12:50 pm6 corosync[9920]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 470 to 1366
syslog.6.gz:Jul 30 23:12:50 pm6 corosync[9920]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 470 to 1366
syslog.6.gz:Jul 30 23:12:50 pm6 corosync[9920]:   [KNET  ] pmtud: Global data MTU changed to: 1366
syslog.6.gz:Jul 30 23:12:51 pm6 corosync[9920]:   [KNET  ] pmtud: PMTUD link change for host: 6 link: 0 from 470 to 1366
syslog.6.gz:Jul 30 23:12:51 pm6 corosync[9920]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 470 to 1366
syslog.6.gz:Jul 30 23:12:53 pm6 corosync[9920]:   [KNET  ] pmtud: PMTUD link change for host: 7 link: 0 from 470 to 1366
syslog.6.gz:Jul 30 23:12:54 pm6 corosync[9920]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366

So it could be connected; yes, you are right.

You mentioned kernel crashes; logs of that, or of the time leading up to it, if available, would help determine whether it's the same issue or multiple issues as well.

In the logs I saw either nothing or NULL bytes in the syslog, and then the watchdog which I have on those machines restarted them. This happened three times on three different machines ... two at the same time and one some hours later. But in fact the logs did not have any real kernel panic message or similar in them for those crashes.

I'm now back on PVE 5.4 with corosync 2 for the moment and am monitoring this issue and the glusterfs packaging issue, because both are blocking me. Because of upcoming vacations, I think I will do the next real try after the vacation in September :-) I cannot afford an unstable system during my absence.
But now that I know how to revert the corosync 3 experiment, I would just need a PVE5-compatible version of the fix and I could also try it.
 
Hi all,

Looks like I've got the same issue as you all - corosync failing randomly (but roughly every 12 - 48 hours), causing various management and connectivity issues.

If there are any logs that I can provide to shed more light on the problem, please shout.

A couple of things about my cluster setup:
4-node cluster
Running Ceph with 28 HDDs
About 50 - 100 VMs
All over a single flat 1G network (not ideal, but it has not given any issues in the past)

I have noticed that it's more likely to fail in my case when there's a lot of traffic on the system (e.g. backups, migrations, etc.), and I have also noticed a lot of KNET messages in my logs.

Thanks for investigating and shout if I can be of any help.
 
Unfortunately my cluster has failed again even with pve2-test1. The cluster split into two subsets this time.

First subset, two nodes.

Code:
root@hera:~# pvecm status
Quorum information
------------------
Date: Wed Aug 7 00:49:29 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000004
Ring ID: 1/604052
Quorate: No

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 2
Quorum: 3 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.150
0x00000004 1 192.168.1.60 (local)

Second subset, three nodes.

Code:
root@zeus:~# pvecm status
Quorum information
------------------
Date: Wed Aug 7 00:49:34 2019
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 2/603936
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 3
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 192.168.1.100 (local)
0x00000005 1 192.168.1.70
0x00000006 1 192.168.1.80

Looking at the logs, the split occurred around 8pm. The knet error says "link down".

Code:
August 6th 2019, 20:03:42.878 [KNET ] link: host: 1 link: 0 is down syslog_message: [KNET ] link: host: 1 link: 0 is down @version:1 syslog_pid:1939599 @timestamp:August 6th 2019, 20:03:42.878 syslog_program:corosync syslog_hostname:leto priority:30 severity_label:informational _id:EhZiZmwBTsBZz8olcEEt _type:doc _index:logstash-2019.08.06
August 6th 2019, 20:03:42.879 [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) @version:1 syslog_pid:1939599 @timestamp:August 6th 2019, 20:03:42.879 syslog_program:corosync syslog_hostname:leto priority:30 severity_label:informational _id:FBZiZmwBTsBZz8olcEEt _type:doc _index:logstash-2019.08.06 _score: -

I can't see any other network errors in the logs around that time.
 
I also have instability in my 3-node cluster, recently upgraded from 5.4 to 6.0.
Corosync got killed on two nodes during the past few days.

This morning, one of the nodes dropped out of the cluster. The log showed a TOTEM "FAILED TO RECEIVE" error.
I got it back in by adding a second ring (on my InfiniBand network), as I suspected some network-related problem.

A few minutes ago, it was out of the cluster again, with a very strange status in the GUI showing "standalone node" at the top of the list of 3 nodes!

I'm switching to libknet 1.10 and will report any change...
 
Hi @Fabio, @fabian,

This morning sanctuary dropped out of the cluster with a divide error; load would have been negligible at the time:
Code:
Aug  7 00:10:37 sanctuary kernel: [739314.684408] show_signal: 6 callbacks suppressed
Aug  7 00:10:37 sanctuary kernel: [739314.684410] traps: corosync[2533523] trap divide error ip:xxxxxxxx38e6 sp:xxxxxxxxea50 error:0 in libknet.so.1.2.0[xxxxxxxx8000+13000]
Aug  7 00:10:37 sanctuary pmxcfs[217081]: [quorum] crit: quorum_dispatch failed: 2
Aug  7 00:10:37 sanctuary pmxcfs[217081]: [status] notice: node lost quorum

I can post full logs later when I get to the office if required.

edit: restarting corosync on sanctuary didn't restore the cluster; restarting corosync on all hosts after that first attempt also did not restore it. I'll dig deeper when I get to the office.

edit2: The only way to get sanctuary back into the cluster was to reboot it. I can't see anything that's obvious (to me) in the logs explaining why this was the case. As far as I can see corosync returned to quorum, but then other components claimed there was no quorum:

Code:
Aug  7 09:01:40 sanctuary systemd[1]: Starting Corosync Cluster Engine... 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [MAIN  ] Corosync Cluster Engine 3.0.2-dirty starting up 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [TOTEM ] Initializing transport (Kronosnet). 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [TOTEM ] kronosnet crypto initialized: aes256/sha256 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [TOTEM ] totemknet initialized 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [SERV  ] Service engine loaded: corosync configuration map access [0] 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [QB    ] server name: cmap 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [SERV  ] Service engine loaded: corosync configuration service [1] 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [QB    ] server name: cfg 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2] 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [QB    ] server name: cpg 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [SERV  ] Service engine loaded: corosync profile loading service [4] 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6] 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [WD    ] Watchdog not enabled by configuration 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [WD    ] resource load_15min missing a recovery key. 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [WD    ] resource memory_used missing a recovery key. 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [WD    ] no resources configured. 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [SERV  ] Service engine loaded: corosync watchdog service [7] 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [QUORUM] Using quorum provider corosync_votequorum 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5] 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [QB    ] server name: votequorum 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3] 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [QB    ] server name: quorum 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0) 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 2 has no active links 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0) 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 1 has no active links 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1) 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 1 has no active links 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1) 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 1 has no active links 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 0) 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 3 has no active links 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1) 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 3 has no active links 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [TOTEM ] A new membership (2:317588) was formed. Members joined: 2 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1) 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [KNET  ] host: host: 3 has no active links 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [CPG   ] downlist left_list: 0 received 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [QUORUM] Members[1]: 2 
Aug  7 09:01:40 sanctuary corosync[3771614]:   [MAIN  ] Completed service synchronization, ready to provide service. 
Aug  7 09:01:40 sanctuary systemd[1]: Started Corosync Cluster Engine. 
Aug  7 09:01:40 sanctuary pvedaemon[2921333]: <root@pam> end task UPID:sanctuary:00398CCA:0498C10F:5D4A06D3:srvrestart:corosync:root@pam: OK 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [KNET  ] rx: host: 3 link: 0 is up 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [KNET  ] rx: host: 1 link: 0 is up 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1) 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1) 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [TOTEM ] A new membership (1:317592) was formed. Members joined: 1 3 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [CPG   ] downlist left_list: 0 received 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [CPG   ] downlist left_list: 0 received 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [CPG   ] downlist left_list: 0 received 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [QUORUM] This node is within the primary component and will provide service. 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [QUORUM] Members[3]: 1 2 3 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [MAIN  ] Completed service synchronization, ready to provide service. 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 470 to 1366 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 470 to 1366 
Aug  7 09:01:42 sanctuary corosync[3771614]:   [KNET  ] pmtud: Global data MTU changed to: 1366 
Aug  7 09:01:53 sanctuary pveproxy[3612179]: proxy detected vanished client connection 
Aug  7 09:01:53 sanctuary pvedaemon[2943900]: <root@pam> successful auth for user 'root@pam' 
Aug  7 09:02:00 sanctuary systemd[1]: Starting Proxmox VE replication runner... 
Aug  7 09:02:01 sanctuary pvesr[3771935]: trying to acquire cfs lock 'file-replication_cfg' ... 
Aug  7 09:02:02 sanctuary pvesr[3771935]: trying to acquire cfs lock 'file-replication_cfg' ... 
Aug  7 09:02:03 sanctuary pvesr[3771935]: trying to acquire cfs lock 'file-replication_cfg' ... 
Aug  7 09:02:04 sanctuary pvesr[3771935]: trying to acquire cfs lock 'file-replication_cfg' ... 
Aug  7 09:02:05 sanctuary pvesr[3771935]: trying to acquire cfs lock 'file-replication_cfg' ... 
Aug  7 09:02:06 sanctuary pvesr[3771935]: trying to acquire cfs lock 'file-replication_cfg' ... 
Aug  7 09:02:07 sanctuary pvesr[3771935]: trying to acquire cfs lock 'file-replication_cfg' ... 
Aug  7 09:02:08 sanctuary pvesr[3771935]: trying to acquire cfs lock 'file-replication_cfg' ... 
Aug  7 09:02:09 sanctuary pvesr[3771935]: trying to acquire cfs lock 'file-replication_cfg' ... 
Aug  7 09:02:10 sanctuary pvesr[3771935]: error with cfs lock 'file-replication_cfg': no quorum! 
Aug  7 09:02:10 sanctuary systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a 
Aug  7 09:02:10 sanctuary systemd[1]: pvesr.service: Failed with result 'exit-code'. 
Aug  7 09:02:10 sanctuary systemd[1]: Failed to start Proxmox VE replication runner. 
Aug  7 09:03:00 sanctuary systemd[1]: Starting Proxmox VE replication runner...
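
For what it's worth, next time I'll try restarting pmxcfs together with corosync before resorting to a reboot, since from the log above it looks like corosync regained quorum but the other components didn't pick it up. Just an idea, not a confirmed fix:

Code:
# restart the cluster filesystem (pmxcfs) as well as corosync
systemctl restart corosync pve-cluster
# then check quorum again
pvecm status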
 
I'm experiencing the same symptoms as described by others in this thread in my recently-upgraded PVE 6 3-node cluster.

Restarting corosync seems to resolve the issue. Prior to restarting corosync today, I noticed that the corosync process was running at 100% CPU.
 
Corosync got killed (again) on one node of my 3-node cluster.

Code:
root@r510a:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Sun 2019-08-11 21:50:48 UTC; 9h ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 3105318 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=FPE)
 Main PID: 3105318 (code=killed, signal=FPE)

août 11 21:50:47 r510a corosync[3105318]:   [KNET  ] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery.
août 11 21:50:47 r510a corosync[3105318]:   [KNET  ] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery.
août 11 21:50:47 r510a corosync[3105318]:   [KNET  ] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery.
août 11 21:50:48 r510a corosync[3105318]:   [KNET  ] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery.
août 11 21:50:48 r510a corosync[3105318]:   [KNET  ] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery.
août 11 21:50:48 r510a corosync[3105318]:   [KNET  ] link: host: 5 link: 0 is down
août 11 21:50:48 r510a corosync[3105318]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
août 11 21:50:48 r510a corosync[3105318]:   [KNET  ] host: host: 5 has no active links
août 11 21:50:48 r510a systemd[1]: corosync.service: Main process exited, code=killed, status=8/FPE
août 11 21:50:48 r510a systemd[1]: corosync.service: Failed with result 'signal'.

On https://forum.proxmox.com/threads/upgarding-2-nodes-to-corosync3.56457/ I've seen that the MTU problem could be related to my InfiniBand secondary ring. So I've updated libknet and libnozzle to the 1.10-pve2 versions and will report if other problems occur...
 
I'm also experiencing corosync issues on clusters we've upgraded to PVE 6.0. We tried changing the protocol from UDP to SCTP and defining netmtu (which we subsequently discovered is no longer used in Corosync 3). We've upgraded one cluster to libknet1_1.10-pve2~test1_amd64.deb but still see error messages in /var/log/syslog.
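
For reference, the UDP-to-SCTP change was made via the knet_transport option - roughly like this, if I remember the placement correctly (check corosync.conf(5) before copying):

Code:
totem {
  ...
  interface {
    linknumber: 0
    knet_transport: sctp   # default is udp
  }
}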

The following document nicely summarises the changes in Corosync 3:
http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf

The nodes haven't rebooted since increasing the token timeout to 10,000 ms, but I'm pretty sure it's simply a matter of time.

We only installed libknet1 from testing as the other packages don't appear to be installed:
Code:
[root@kvm1 ~]# dpkg -l | grep -e libknet -e libnozzle
ii  libknet1:amd64                       1.10-pve2~test1                         amd64        kronosnet core switching implementation


Herewith our Corosync configuration file:
Code:
[root@kvm1 ~]# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: kvm1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 1.1.7.9
  }

  node {
    name: kvm2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 1.1.7.10
  }

  node {
    name: kvm3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 1.1.7.11
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster1
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 10000
}

Herewith the logs from today, after having installed libknet1_1.10-pve2 yesterday:
Code:
[root@kvm1 ~]# grep -i corosync /var/log/syslog
Aug 12 01:29:08 kvm1 corosync[689090]:   [TOTEM ] Retransmit List: 1f112
Aug 12 01:30:34 kvm1 corosync[689090]:   [KNET  ] link: host: 2 link: 0 is down
Aug 12 01:30:34 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 01:30:34 kvm1 corosync[689090]:   [KNET  ] host: host: 2 has no active links
Aug 12 01:30:40 kvm1 corosync[689090]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 12 01:30:40 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 01:30:53 kvm1 corosync[689090]:   [TOTEM ] Retransmit List: 1f34b
Aug 12 02:32:37 kvm1 corosync[689090]:   [TOTEM ] Token has not been received in 382 ms
Aug 12 02:32:38 kvm1 corosync[689090]:   [TOTEM ] Retransmit List: 251b4
Aug 12 02:34:23 kvm1 corosync[689090]:   [KNET  ] link: host: 2 link: 0 is down
Aug 12 02:34:23 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 02:34:23 kvm1 corosync[689090]:   [KNET  ] host: host: 2 has no active links
Aug 12 02:34:29 kvm1 corosync[689090]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 12 02:34:29 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 02:34:30 kvm1 corosync[689090]:   [TOTEM ] Retransmit List: 2539f
Aug 12 02:52:19 kvm1 corosync[689090]:   [KNET  ] link: host: 2 link: 0 is down
Aug 12 02:52:19 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 02:52:19 kvm1 corosync[689090]:   [KNET  ] host: host: 2 has no active links
Aug 12 02:52:25 kvm1 corosync[689090]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 12 02:52:25 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 03:01:34 kvm1 corosync[689090]:   [KNET  ] link: host: 2 link: 0 is down
Aug 12 03:01:34 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 03:01:34 kvm1 corosync[689090]:   [KNET  ] host: host: 2 has no active links
Aug 12 03:01:40 kvm1 corosync[689090]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 12 03:01:40 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 03:15:57 kvm1 corosync[689090]:   [TOTEM ] Retransmit List: 290bd
Aug 12 05:02:03 kvm1 corosync[689090]:   [KNET  ] link: host: 2 link: 0 is down
Aug 12 05:02:03 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 05:02:03 kvm1 corosync[689090]:   [KNET  ] host: host: 2 has no active links
Aug 12 05:02:09 kvm1 corosync[689090]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 12 05:02:09 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 05:30:09 kvm1 corosync[689090]:   [KNET  ] link: host: 2 link: 0 is down
Aug 12 05:30:15 kvm1 corosync[689090]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 12 05:30:15 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 05:30:16 kvm1 corosync[689090]:   [TOTEM ] Token has not been received in 382 ms
Aug 12 07:16:44 kvm1 corosync[689090]:   [TOTEM ] Retransmit List: 3ff63
Aug 12 07:50:27 kvm1 corosync[689090]:   [TOTEM ] Token has not been received in 7987 ms
Aug 12 07:50:29 kvm1 corosync[689090]:   [TOTEM ] Retransmit List: 43237
Aug 12 07:51:46 kvm1 corosync[689090]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 8988 bytes for host 2 link 0 but the other node is not acknowledging packets of this size.
Aug 12 07:51:46 kvm1 corosync[689090]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
Aug 12 07:51:46 kvm1 corosync[689090]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 8854 to 8838
Aug 12 07:51:46 kvm1 corosync[689090]:   [KNET  ] pmtud: Global data MTU changed to: 8838
Aug 12 07:52:16 kvm1 corosync[689090]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 8838 to 8854
Aug 12 07:52:16 kvm1 corosync[689090]:   [KNET  ] pmtud: Global data MTU changed to: 8854
Aug 12 08:07:41 kvm1 corosync[689090]:   [KNET  ] link: host: 2 link: 0 is down
Aug 12 08:07:41 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 08:07:41 kvm1 corosync[689090]:   [KNET  ] host: host: 2 has no active links
Aug 12 08:07:47 kvm1 corosync[689090]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 12 08:07:47 kvm1 corosync[689090]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 08:07:47 kvm1 corosync[689090]:   [TOTEM ] Token has not been received in 382 ms
Aug 12 09:28:44 kvm1 corosync[689090]:   [TOTEM ] Retransmit List: 4c6e2
Aug 12 09:30:48 kvm1 corosync[689090]:   [TOTEM ] Retransmit List: 4c981
Aug 12 09:31:02 kvm1 corosync[689090]:   [TOTEM ] Token has not been received in 7987 ms
 
Still got a SEGV in corosync today, even with the updated libknet1:

Code:
root@dev-proxmox15:~# dpkg -l | grep libknet
ii  libknet1:amd64                       1.10-pve2                       amd64        kronosnet core switching implementation

Code:
Aug 15 08:03:47 dev-proxmox15 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Aug 15 08:03:47 dev-proxmox15 systemd[1]: corosync.service: Failed with result 'signal'.

That SEGV somehow started a slow collapse of the whole cluster (15 nodes), with none of the nodes seeing each other, until I stopped the corosync/pve-cluster services, pushed a new version of corosync.conf to all of them, and started the services again through Salt.
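
Roughly what the recovery looked like from the Salt master (the minion glob and the salt:// path are just placeholders for what we use):

Code:
salt 'dev-proxmox*' cmd.run 'systemctl stop pve-cluster corosync'
salt 'dev-proxmox*' cp.get_file salt://proxmox/corosync.conf /etc/corosync/corosync.conf
salt 'dev-proxmox*' cmd.run 'systemctl start corosync pve-cluster'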

Code:
root@dev-proxmox15:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.18-1-pve)
pve-manager: 6.0-5 (running version: 6.0-5/f8a710d7)
pve-kernel-5.0: 6.0-6
pve-kernel-helper: 6.0-6
pve-kernel-4.15: 5.4-7
pve-kernel-5.0.18-1-pve: 5.0.18-1
pve-kernel-4.15.18-19-pve: 4.15.18-45
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve2
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-3
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-6
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-6
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
 

I will expand the log a bit:
Code:
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [TOTEM ] A new membership (1:82856) was formed. Members
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [QUORUM] Members[15]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Aug 15 05:44:17 dev-proxmox15 corosync[31144]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [TOTEM ] A new membership (1:82860) was formed. Members
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [QUORUM] Members[15]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Aug 15 06:52:59 dev-proxmox15 corosync[31144]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [TOTEM ] A new membership (1:82864) was formed. Members
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [CPG   ] downlist left_list: 0 received
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [QUORUM] Members[15]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Aug 15 07:33:30 dev-proxmox15 corosync[31144]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 15 08:03:47 dev-proxmox15 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Aug 15 08:03:47 dev-proxmox15 systemd[1]: corosync.service: Failed with result 'signal'.
 