[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

Stewart Flood · Aug 21, 2019

Ahhh... while no one likes to see problems, I'm actually glad I just spotted this thread. Same issues, started with the v6 upgrade from v5. I thought my problem was because my cluster (which actually started on v3, with a rebuild to v4, then upgrades from there) is running in multiple locations with a layer 2 VPN tunnel between them.

Same exact symptoms others are reporting: cluster quorum goes, restarting corosync service fixes it -- sometimes having to restart it on one node, sometimes on more or even all nodes to get it back. Lasts randomly anywhere from 5-10 minutes to 5-10 hours. Hasn't lasted an entire day since the upgrade.

My cluster is "production", but it is production for my company with no other customers on it other than some sites being monitored via omd/check_mk, so I can try things.

Is there any way to "tune" the timing on corosync? My guess is that this is a factor -- at least in mine.

spirit · Aug 21, 2019

Stewart Flood said:
Is there any way to "tune" the timing on corosync? My guess is that this is a factor -- at least in mine.

You can increase the timeout in corosync.conf with

totem {
....
token: 10000
}

(10s in this example)
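
A rough sketch of how that change could be applied. The helper name set_totem_token is made up for illustration; it is not a corosync or Proxmox tool. It rewrites the totem block of a corosync.conf so that it carries the requested token value and increments config_version, on the assumption that the version bump is what lets the other nodes accept the new file. It assumes the usual one-key-per-line layout and prints the result to stdout rather than editing in place.

```shell
#!/bin/sh
# Hypothetical helper: print FILE with totem.token set to TOKEN_MS and
# totem.config_version incremented by one. The input file is not modified.
set_totem_token() {
    conf=$1
    token_ms=$2
    awk -v token="$token_ms" '
        # Entering the top-level totem block.
        $1 == "totem" && $2 == "{" { depth = 1; print; next }
        depth > 0 {
            if (index($0, "{")) depth++          # nested block, e.g. interface { ... }
            if (index($0, "}")) {
                depth--
                # Closing brace of totem itself: add the new token line first.
                if (depth == 0) print "  token: " token
            } else if (depth == 1 && $1 == "token:") {
                next                             # drop any previous token setting
            } else if (depth == 1 && $1 == "config_version:") {
                $0 = "  config_version: " ($2 + 1)
            }
        }
        { print }
    ' "$conf"
}

# Possible use on a Proxmox node (path assumed; review the output first):
#   set_totem_token /etc/pve/corosync.conf 10000 > /etc/pve/corosync.conf.new
#   mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
```

Writing to a separate file first leaves room to check the result before it replaces the live configuration.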

Stewart Flood · Aug 21, 2019

I'll try that. Assistance much appreciated!

Jema · Aug 21, 2019

I don't know if that increase in token time is an actual solution.
We didn't experience this on Proxmox VE 4; we only started getting it on VE 6.

So it's better to find the actual cause, and since more people are reporting it, a solution needs to be found.

spirit · Aug 22, 2019

Jema said:
So it's better to find the actual cause, and since more people are reporting it, a solution needs to be found.

There are known bugs, mainly with MTU auto-detection.

A new version is coming with bug fixes:
https://github.com/kronosnet/kronosnet/pull/245
I think the Proxmox team will release it in the coming days.

About the token: it should be computed automatically in corosync 3 / knet. In my experience with corosync 2 and udpu (unicast), it needs to be increased for clusters with more than 10 nodes (even with low-latency switches).

cquest · Aug 22, 2019

One more corosync crash (with 1.10_pve2)... I hope the upcoming bug fixes will help!

Code:

● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Thu 2019-08-22 12:29:22 UTC; 2h 55min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 3423798 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=FPE)
 Main PID: 3423798 (code=killed, signal=FPE)

août 22 12:13:23 proxmox72 corosync[3423798]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
août 22 12:13:23 proxmox72 corosync[3423798]:   [KNET  ] pmtud: Global data MTU changed to: 1366
août 22 12:29:13 proxmox72 corosync[3423798]:   [KNET  ] link: host: 4 link: 0 is down
août 22 12:29:13 proxmox72 corosync[3423798]:   [KNET  ] host: host: 4 (passive) best link: 1 (pri: 1)
août 22 12:29:13 proxmox72 corosync[3423798]:   [KNET  ] pmtud: Global data MTU changed to: 65382
août 22 12:29:19 proxmox72 corosync[3423798]:   [KNET  ] rx: host: 4 link: 0 is up
août 22 12:29:19 proxmox72 corosync[3423798]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
août 22 12:29:19 proxmox72 corosync[3423798]:   [KNET  ] pmtud: Global data MTU changed to: 1366
août 22 12:29:22 proxmox72 systemd[1]: corosync.service: Main process exited, code=killed, status=8/FPE
août 22 12:29:22 proxmox72 systemd[1]: corosync.service: Failed with result 'signal'.

spirit · Aug 22, 2019

cquest said:
One more corosync crash (with 1.10_pve2)... I hope the upcoming bug fixes will help!

Yes, maybe the new patch will fix it.

Your log is interesting: we can see that corosync switched to link 1 because link 0 was down (was it really down?).
I still don't know whether an MTU of 65000 is supported (there was a known bug with MTU 65000).
Then, 6 seconds later, it goes back to link 0 again,
and 3 seconds after that it crashes.
(Not sure why it's crashing; maybe other nodes are flooding too, maybe it's because of the MTU of 65000...)

I have never had a problem with corosync 3 in 6 months of testing, but I use 2-link LACP, so my links never go down.
(My interface MTU is 1500, and corosync reports "pmtud: Global data MTU changed to: 1446".)
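
To spot this pattern in other logs, something like the filter below might help. It is only a sketch: flag_large_pmtud is a hypothetical name, and the default limit of 1500 is an assumption that matches the interface MTU mentioned above. It reads corosync log text on stdin and prints only the pmtud lines whose reported MTU is larger than the limit.

```shell
#!/bin/sh
# Hypothetical filter: print knet pmtud lines reporting an MTU above LIMIT.
# LIMIT defaults to 1500 (assumed interface MTU); pass another value if needed.
flag_large_pmtud() {
    limit=${1:-1500}
    awk -v limit="$limit" '
        /pmtud: Global data MTU changed to:/ {
            mtu = $NF + 0          # last field is the MTU value
            if (mtu > limit) print
        }
    '
}

# Possible use:
#   journalctl -u corosync | flag_large_pmtud 1500
```

With the log posted earlier in the thread, only the line reporting 65382 would be printed.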

Stewart Flood · Aug 22, 2019

The 10000 value for token seems to have had some effect, but I'm still losing the cluster. MTU sounds interesting... I've checked all of the interfaces and tunnels this passes through, and everything is 1500.

cquest · Aug 23, 2019

Meanwhile... why not ask systemd to restart corosync when it crashes?

Restart=on-failure

spirit said:
Yes, maybe the new patch will fix it.

Your log is interesting: we can see that corosync switched to link 1 because link 0 was down (was it really down?).
I still don't know whether an MTU of 65000 is supported (there was a known bug with MTU 65000).
Then, 6 seconds later, it goes back to link 0 again,
and 3 seconds after that it crashes.
(Not sure why it's crashing; maybe other nodes are flooding too, maybe it's because of the MTU of 65000...)

I have never had a problem with corosync 3 in 6 months of testing, but I use 2-link LACP, so my links never go down.
(My interface MTU is 1500, and corosync reports "pmtud: Global data MTU changed to: 1446".)

To my knowledge, link0 was not down.

Here is a full syslog during the same period:

Code:

Aug 22 12:29:00 proxmox72 systemd[1]: Starting Proxmox VE replication runner...
Aug 22 12:29:00 proxmox72 systemd[1]: pvesr.service: Succeeded.
Aug 22 12:29:00 proxmox72 systemd[1]: Started Proxmox VE replication runner.
Aug 22 12:29:13 proxmox72 corosync[3423798]:   [KNET  ] link: host: 4 link: 0 is down
Aug 22 12:29:13 proxmox72 corosync[3423798]:   [KNET  ] host: host: 4 (passive) best link: 1 (pri: 1)
Aug 22 12:29:13 proxmox72 corosync[3423798]:   [KNET  ] pmtud: Global data MTU changed to: 65382
Aug 22 12:29:19 proxmox72 corosync[3423798]:   [KNET  ] rx: host: 4 link: 0 is up
Aug 22 12:29:19 proxmox72 corosync[3423798]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 22 12:29:19 proxmox72 corosync[3423798]:   [KNET  ] pmtud: Global data MTU changed to: 1366
Aug 22 12:29:21 proxmox72 kernel: [330697.957774] traps: corosync[3423834] trap divide error ip:7eff1bb3b8c6 sp:7eff0ff1ea50 error:0 in libknet.so.1.2.0[7eff1bb30000+13000]
Aug 22 12:29:22 proxmox72 systemd[1]: corosync.service: Main process exited, code=killed, status=8/FPE
Aug 22 12:29:22 proxmox72 systemd[1]: corosync.service: Failed with result 'signal'.
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [quorum] crit: quorum_dispatch failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [status] notice: node lost quorum
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [dcdb] crit: cpg_dispatch failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [dcdb] crit: cpg_leave failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [confdb] crit: cmap_dispatch failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [status] crit: cpg_dispatch failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [status] crit: cpg_leave failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [quorum] crit: quorum_initialize failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [quorum] crit: can't initialize service
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [confdb] crit: cmap_initialize failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [confdb] crit: can't initialize service
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [dcdb] notice: start cluster connection
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [dcdb] crit: cpg_initialize failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [dcdb] crit: can't initialize service
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [status] notice: start cluster connection
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [status] crit: cpg_initialize failed: 2
Aug 22 12:29:22 proxmox72 pmxcfs[252725]: [status] crit: can't initialize service
Aug 22 12:29:28 proxmox72 pmxcfs[252725]: [quorum] crit: quorum_initialize failed: 2
Aug 22 12:29:28 proxmox72 pmxcfs[252725]: [confdb] crit: cmap_initialize failed: 2
Aug 22 12:29:28 proxmox72 pmxcfs[252725]: [dcdb] crit: cpg_initialize failed: 2
Aug 22 12:29:28 proxmox72 pmxcfs[252725]: [status] crit: cpg_initialize failed: 2

As you can see, no error is reported on the Ethernet link (which is bond0, composed of 2 Ethernet ports).
We have a divide by zero in libknet...
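
The kernel "traps:" line also gives enough to work out where inside the library the fault happened: the instruction pointer (ip:) minus the load address in the square brackets after the library name. The sketch below does that subtraction; trap_offset is a hypothetical helper name, and it assumes the line format shown in the syslog above. For that line, 0x7eff1bb3b8c6 - 0x7eff1bb30000 gives 0xb8c6.

```shell
#!/bin/sh
# Hypothetical helper: for each kernel "traps:" line on stdin, print the
# fault offset inside the mapped library (ip minus the mapping base).
trap_offset() {
    while IFS= read -r line; do
        # Instruction pointer, e.g. "ip:7eff1bb3b8c6"
        ip=$(printf '%s\n' "$line" | sed -n 's/.* ip:\([0-9a-f]*\) .*/\1/p')
        # Mapping base from the trailing "[base+size]"
        base=$(printf '%s\n' "$line" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*\]$/\1/p')
        [ -n "$ip" ] && [ -n "$base" ] || continue
        printf '0x%x\n' $(( 0x$ip - 0x$base ))
    done
}

# Possible use:
#   dmesg | grep 'traps: corosync' | trap_offset
```

An offset like this could be passed along in a bug report, or resolved to a function name if debug symbols for that exact libknet build are available.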

spirit · Aug 26, 2019

Hi,
The Proxmox team has pushed the patched

libknet1-1.11-pve1

to the no-subscription repo.

Can you try updating and see if it helps?

elmacus · Aug 26, 2019

The last failure was this morning, with all servers down.
I updated 2 hours ago; all good, nothing to report yet ;-)

astnwt · Aug 27, 2019

cquest said:
Restart=on-failure

How did I NOT think of this ... *sigh* ... thanks!
I'm going to do that after taking libknet1-1.11-pve1 for a spin.

That being said, I installed libknet1-1.11-pve1 from the no-subscription repo around 12 hours ago and the cluster has not fallen apart yet, which is the longest corosync 3 has survived since I updated. So while I hate being optimistic too soon, that might be the fix we're all waiting for...

elmacus · Aug 27, 2019

No cluster crash yet, 22 hours after updating to knet 1.11.

Here is my systemd change on the most problematic server (I guess it's best to set it on all servers later, if needed):
nano /lib/systemd/system/corosync.service
Add this line in the [Service] section:
Restart=on-failure
Save
systemctl daemon-reload
systemctl restart corosync.service

I'm not sure whether this gets overwritten by the next corosync update, or whether I even need it now that knet is updated.

spirit · Aug 27, 2019

elmacus said:
No cluster crash yet, 22 hours after updating to knet 1.11.

Here is my systemd change on the most problematic server (I guess it's best to set it on all servers later, if needed):
nano /lib/systemd/system/corosync.service
Add this line in the [Service] section:
Restart=on-failure
Save
systemctl daemon-reload
systemctl restart corosync.service

I'm not sure whether this gets overwritten by the next corosync update, or whether I even need it now that knet is updated.

The best way with systemd is to use an override.
Simply create
/etc/systemd/system/corosync.service.d/override.conf
with
[Service]
Restart=on-failure

You can check the corosync logs in the journal to see whether a restart has occurred (journalctl -u corosync).
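
A minimal sketch of creating that drop-in from a shell. The function name write_corosync_override is made up for illustration, and the optional directory argument exists only so the function can be tried outside /etc. Note that systemd section names are case-sensitive, so the header is written as [Service].

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that makes systemd restart corosync
# after a crash. DIR defaults to the standard drop-in directory for the unit.
write_corosync_override() {
    dropin_dir=${1:-/etc/systemd/system/corosync.service.d}
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
Restart=on-failure
EOF
}

# Possible use, as root:
#   write_corosync_override && systemctl daemon-reload
#   systemctl show -p Restart corosync    # check that the setting is active
```

Unlike an edit under /lib/systemd/system, a file under /etc/systemd/system is not replaced when the corosync package is updated. Running "systemctl edit corosync" should produce the same override.conf interactively.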

astnwt · Aug 28, 2019

astnwt said:
So while I hate being optimistic too soon, that might be the fix we're all waiting for...

nope, the cluster broke :-/ back to the drawing board

cquest · Aug 28, 2019

No corosync problems in my syslog... the fix seems good so far.

spirit · Aug 28, 2019

astnwt said:
nope, the cluster broke :-/ back to the drawing board

Do you have logs? Either 'grep corosync /var/log/daemon.log' or 'journalctl -u corosync'.
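
When going through those logs, the lines that matter most are the ones where systemd reports how the main process died. A small sketch for pulling out just the exit status from such lines (corosync_exit_signals is a hypothetical name; the pattern assumes the systemd message format seen earlier in this thread):

```shell
#!/bin/sh
# Hypothetical filter: read log text on stdin and print the "status=N/SIGNAL"
# part of every corosync.service crash line, one per line, e.g. "8/FPE".
corosync_exit_signals() {
    sed -n 's|.*corosync\.service: Main process exited, code=killed, status=\([0-9]*/[A-Z]*\).*|\1|p'
}

# Possible use:
#   journalctl -u corosync | corosync_exit_signals
```

A status of 8/FPE would point at the divide error seen before the knet update, while other signals suggest a different problem.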

Fusel · Aug 28, 2019

spirit said:
Do you have logs? Either 'grep corosync /var/log/daemon.log' or 'journalctl -u corosync'.

I got 11/SEGV on 2 nodes:

Code:

Aug 28 09:55:48 node-29 corosync[1684]:   [TOTEM ] Retransmit List: f2512 f2513 f2514
Aug 28 10:06:54 node-29 corosync[1684]:   [TOTEM ] Retransmit List: f390c
Aug 28 18:19:24 node-29 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Aug 28 18:19:24 node-29 systemd[1]: corosync.service: Failed with result 'signal'.

https://forum.proxmox.com/threads/pve6-0-5-corosync3-segvaults-randomly-on-nodes.56903/

or

https://bugzilla.proxmox.com/show_bug.cgi?id=2326

spirit · Aug 29, 2019

@astnwt @Fusel

Do you have any more info about the segfault in /var/log/kern.log (or in dmesg)?

Also,

maybe you could try installing systemd-coredump:
# apt install systemd-coredump

It should store information about the segfault in
/var/lib/systemd/coredump/
and show it with the command
# coredumpctl info

Fusel · Aug 29, 2019

spirit said:
@astnwt @Fusel

Do you have any more info about the segfault in /var/log/kern.log (or in dmesg)?

Also,

maybe you could try installing systemd-coredump:
# apt install systemd-coredump

It should store information about the segfault in
/var/lib/systemd/coredump/
and show it with the command
# coredumpctl info

Hey spirit,

There were no other entries in /var/log/kern.log or dmesg.

I have installed systemd-coredump.
