[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

Ahhh... while no one likes to see problems, I'm actually glad I just spotted this thread. Same issues, started with the v6 upgrade from v5. I thought my problem was because my cluster (which actually started on v3, with a rebuild to v4 and then upgrades from there) is running in multiple locations with a layer 2 VPN tunnel between them.

Exact same symptoms others are reporting: cluster quorum goes, and restarting the corosync service fixes it -- sometimes I have to restart it on one node, sometimes on more or even all nodes to get it back. Quorum then lasts anywhere from 5-10 minutes to 5-10 hours. It hasn't lasted an entire day since the upgrade.

My cluster is "production", but it's production for my own company, with nothing on it other than some sites being monitored via omd/check_mk, so I can try things.

Is there any way to "tune" the timing on corosync? My guess is that this is a factor -- at least in mine.
 
I don't know if that 'increase' in token time is an actual solution.
We didn't actually experience this on Proxmox VE 4; we only started getting it on VE 6.

So it's better to find the actual cause of this, and since more people are reporting it, a solution should be found.
 

There are known bugs, mainly with MTU auto-detection.

A new version is coming with bug fixes:
https://github.com/kronosnet/kronosnet/pull/245
I think the Proxmox team will release it in the coming days.


About the token: it should be auto-computed in corosync3/knet. For me, in corosync2 with udpu (so unicast), it needed to be increased for clusters with > 10 nodes (even with low-latency switches).
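(If you do want to experiment with the timing while waiting for the fix, the knob is the token timeout in the totem section of /etc/pve/corosync.conf -- a minimal sketch, the 10000 ms value is just an example, and remember to bump config_version so the change propagates:)

Code:
totem {
  # keep the existing options as they are; just add/raise the token timeout (milliseconds)
  token: 10000
}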
 
One more corosync crash (with 1.10_pve2)... I hope the upcoming bug fixes will help!

Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Thu 2019-08-22 12:29:22 UTC; 2h 55min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 3423798 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=FPE)
 Main PID: 3423798 (code=killed, signal=FPE)

août 22 12:13:23 proxmox72 corosync[3423798]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
août 22 12:13:23 proxmox72 corosync[3423798]:   [KNET  ] pmtud: Global data MTU changed to: 1366
août 22 12:29:13 proxmox72 corosync[3423798]:   [KNET  ] link: host: 4 link: 0 is down
août 22 12:29:13 proxmox72 corosync[3423798]:   [KNET  ] host: host: 4 (passive) best link: 1 (pri: 1)
août 22 12:29:13 proxmox72 corosync[3423798]:   [KNET  ] pmtud: Global data MTU changed to: 65382
août 22 12:29:19 proxmox72 corosync[3423798]:   [KNET  ] rx: host: 4 link: 0 is up
août 22 12:29:19 proxmox72 corosync[3423798]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
août 22 12:29:19 proxmox72 corosync[3423798]:   [KNET  ] pmtud: Global data MTU changed to: 1366
août 22 12:29:22 proxmox72 systemd[1]: corosync.service: Main process exited, code=killed, status=8/FPE
août 22 12:29:22 proxmox72 systemd[1]: corosync.service: Failed with result 'signal'.
 
Yes, maybe the new patch will fix it.
Your log is interesting: we can see that corosync switched to link1 because link0 was down (was it really down?).
I still don't know if an MTU of 65000 is supported (there was a known bug with MTU 65000).
Then it goes back to link0 again,
and 3 seconds later it crashes.
(Not sure why it's crashing -- maybe other nodes are flooding too, maybe because of the 65000 MTU, ...)

I have never had a problem with corosync3 in 6 months of testing, but I'm running a 2-link LACP bond, so the links never go down
(my interface MTU is 1500, and corosync reports "pmtud: Global data MTU changed to: 1446").
 
The 10000 value for token seems to have had some effect, but I'm still losing the cluster. MTU sounds interesting... I've checked all of the interfaces/tunnels this traffic passes through and everything is 1500.
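(For what it's worth, the way I double-check the effective path MTU between nodes is with non-fragmenting pings -- the interface and host names below are just placeholders for my setup:)

Code:
# interface MTU as configured on this node
ip link show bond0 | grep mtu
# 1472 bytes of ICMP payload + 28 bytes of ICMP/IP headers = 1500; -M do forbids fragmentation
ping -M do -s 1472 -c 3 other-node
# if this reports "message too long" or times out, something along the path has a smaller MTU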
 
Meanwhile... why not ask systemd to restart corosync when it crashes?

Restart=on-failure

To my knowledge, link0 was not down


Here is a full syslog during the same period:

Code:
Aug 22 12:29:00 proxmox72 systemd[1]: Starting Proxmox VE replication runner...
Aug 22 12:29:00 proxmox72 systemd[1]: pvesr.service: Succeeded.
Aug 22 12:29:00 proxmox72 systemd[1]: Started Proxmox VE replication runner.
Aug 22 12:29:13 proxmox72 corosync[3423798]:   [KNET  ] link: host: 4 link: 0 is down
Aug 22 12:29:13 proxmox72 corosync[3423798]:   [KNET  ] host: host: 4 (passive) best link: 1 (pri: 1)
Aug 22 12:29:13 proxmox72 corosync[3423798]:   [KNET  ] pmtud: Global data MTU changed to: 65382
Aug 22 12:29:19 proxmox72 corosync[3423798]:   [KNET  ] rx: host: 4 link: 0 is up
Aug 22 12:29:19 proxmox72 corosync[3423798]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 22 12:29:19 proxmox72 corosync[3423798]:   [KNET  ] pmtud: Global data MTU changed to: 1366
Aug 22 12:29:21 proxmox72 kernel: [330697.957774] traps: corosync[3423834] trap divide error ip:7eff1bb3b8c6 sp:7eff0ff1ea50 error:0 in libknet.so.1.2.0[7eff1bb30000+13000]
Aug 22 12:29:22 proxmox72 systemd[1]: corosync.service: Main process exited, code=killed, status=8/FPE
Aug 22 12:29:22 proxmox72 systemd[1]: corosync.service: Failed with result 'signal'.
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [quorum] crit: quorum_dispatch failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [status] notice: node lost quorum
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [dcdb] crit: cpg_dispatch failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [dcdb] crit: cpg_leave failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [confdb] crit: cmap_dispatch failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [status] crit: cpg_dispatch failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [status] crit: cpg_leave failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [quorum] crit: quorum_initialize failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [quorum] crit: can't initialize service
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [confdb] crit: cmap_initialize failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [confdb] crit: can't initialize service
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [dcdb] notice: start cluster connection
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [dcdb] crit: cpg_initialize failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [dcdb] crit: can't initialize service
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [status] notice: start cluster connection
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [status] crit: cpg_initialize failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [status] crit: can't initialize service
Aug 22 12:29:28 proxmox72 pmxcfs[252725]: [quorum] crit: quorum_initialize failed: 2
Aug 22 12:29:28 proxmox72 pmxcfs[252725]: [confdb] crit: cmap_initialize failed: 2
Aug 22 12:29:28 proxmox72 pmxcfs[252725]: [dcdb] crit: cpg_initialize failed: 2
Aug 22 12:29:28 proxmox72 pmxcfs[252725]: [status] crit: cpg_initialize failed: 2

As you can see, no error is reported on the ethernet link (which is a bond0 composed of 2 ethernet ports).
We have a divide by zero...
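(Side note: the trap line already tells us where to look -- the offset inside the library is the instruction pointer minus the mapping base, 0x7eff1bb3b8c6 - 0x7eff1bb30000 = 0xb8c6. With the libknet1 debug symbols installed, something like this should name the faulting function; the library path may differ on your system:)

Code:
addr2line -f -e /usr/lib/x86_64-linux-gnu/libknet.so.1.2.0 0xb8c6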
 
The last failure was this morning, all servers down.
I updated 2 hours ago, all good, nothing to report yet ;-)
 
Restart=on-failure

How did I NOT think of this ... *sigh* ... thanks!
I'm going to do that after taking libknet1-1.11-pve1 for a spin.

That being said, I installed libknet1-1.11-pve1 from non-sub around 12 hours ago and the cluster has not fallen apart yet, which is the longest corosync3 has survived since I upgraded. So while I hate being optimistic too soon, that might be the fix we're all waiting for...
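(For anyone else wanting to confirm which libknet they are actually running before and after the update -- assuming the pve-no-subscription repository is enabled:)

Code:
# show installed vs. available versions
apt policy libknet1
# pull the update
apt update && apt install libknet1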
 
No cluster crash yet, 22 hours after knet 1.11.

So here is my systemd change on the most problematic server (I guess it's best to set it on all servers later, if needed):
nano /lib/systemd/system/corosync.service
Add under [Service]:
Restart=on-failure
Save
systemctl daemon-reload
systemctl restart corosync.service

I'm not sure if this gets overwritten by the next corosync update or not, or if I even need it now that knet is updated.
 

The best way with systemd is to do an override:
simply create
/etc/systemd/system/corosync.service.d/override.conf
with
[Service]
Restart=on-failure

You can check in the corosync logs whether a restart has occurred (journalctl -u corosync).
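(For reference, roughly the full sequence, using the file name given above:)

Code:
mkdir -p /etc/systemd/system/corosync.service.d
cat > /etc/systemd/system/corosync.service.d/override.conf <<'EOF'
[Service]
Restart=on-failure
EOF
systemctl daemon-reload
# confirm the drop-in is active
systemctl show corosync -p Restart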
 
do you have logs? 'cat /var/log/daemon.log|grep corosync' or journalctl -u corosync

I got 11/SEGV on 2 nodes

Code:
Aug 28 09:55:48 node-29 corosync[1684]:   [TOTEM ] Retransmit List: f2512 f2513 f2514
Aug 28 10:06:54 node-29 corosync[1684]:   [TOTEM ] Retransmit List: f390c
Aug 28 18:19:24 node-29 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Aug 28 18:19:24 node-29 systemd[1]: corosync.service: Failed with result 'signal'.


https://forum.proxmox.com/threads/pve6-0-5-corosync3-segvaults-randomly-on-nodes.56903/

or

https://bugzilla.proxmox.com/show_bug.cgi?id=2326
 
@astnwt @Fusel

do you have some more info about the segfault in /var/log/kernel.log? (or #dmesg)

Also, maybe you could try to install
#apt install systemd-coredump

It should log info about the segfault in
/var/lib/systemd/coredump/
which you can then read with the command
#coredumpctl info
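(And if gdb is available, you can usually pull a backtrace straight from the dump -- a sketch, assuming a corosync coredump has been captured:)

Code:
apt install gdb
# open the most recent corosync coredump in gdb
coredumpctl gdb corosync
# then inside gdb:
#   bt full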

Hey spirit,

there were no other entries in /var/log/kernel.log or dmesg

I have installed systemd-coredump
 
