Corosync service dead - [TOTEM ] Retransmit List

Hoang.NV

New Member
Aug 28, 2019
Hello,

I have an HA cluster running Proxmox 6.0-4 (Corosync 3 using Kronosnet). The cluster has 7 nodes.

After some time, one node in the cluster reports that the corosync service is dead. Detailed log:

Sep 20 00:07:19 node01 corosync[2175]: [TOTEM ] Retransmit List: 12c29b8
Sep 20 00:10:19 node01 corosync[2175]: [TOTEM ] Retransmit List: 12c33cc
Sep 20 00:58:29 node01 corosync[2175]: [TOTEM ] Retransmit List: 12cd42b
Sep 20 02:26:39 node01 corosync[2175]: [TOTEM ] Retransmit List: 12df8e3
Sep 20 06:21:29 node01 corosync[2175]: [TOTEM ] Retransmit List: 13107cc
Sep 20 07:41:29 node01 corosync[2175]: [TOTEM ] Retransmit List: 13212be
Sep 20 08:27:19 node01 corosync[2175]: [TOTEM ] Retransmit List: 132abe9
Sep 20 09:45:40 node01 corosync[2175]: [TOTEM ] Retransmit List: 133b0de
Sep 20 12:40:20 node01 corosync[2175]: [TOTEM ] Retransmit List: 135f362
Sep 20 12:55:40 node01 corosync[2175]: [TOTEM ] Retransmit List: 1362679
Sep 20 16:03:41 node01 corosync[2175]: [TOTEM ] Retransmit List: 138985b
Sep 20 16:17:26 node01 corosync[2175]: [TOTEM ] Retransmit List: 138c66c
Sep 20 16:35:11 node01 corosync[2175]: [TOTEM ] Retransmit List: 139018e
Sep 20 17:15:31 node01 corosync[2175]: [TOTEM ] Retransmit List: 139881b
Sep 20 17:50:11 node01 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Sep 20 17:50:11 node01 systemd[1]: corosync.service: Failed with result 'signal'.

What is happening in my network? Could you help me?
 
Hi,
there are 2 known bugs in corosync that make it crash.

One bug has already been fixed (package libknet 1.12).
Can you post the output of:

# pveversion -v

And if you have updated libknet, please also restart the corosync service, because that is not done automatically.


There is still another crash bug in libknet 1.12 (less frequent and much harder to reproduce, but still a segfault); the Proxmox devs and corosync devs are working on it:
https://forum.proxmox.com/threads/pve-5-4-11-corosync-3-x-major-issues.56124/page-12
https://github.com/kronosnet/kronosnet/issues/261


If you use HA, please disable it for now.
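
As a minimal sketch of the check described above (not an official Proxmox tool), the script below compares the installed libknet1 version against 1.12, the release mentioned as fixing the first crash. The function names `version_lt` and `check_libknet` are made up for this example, and it assumes GNU `sort -V` and `dpkg-query` are available, which should be the case on a Debian-based Proxmox node.

```shell
#!/bin/sh
# version_lt A B: succeeds when A sorts strictly before B in version order.
# Relies on GNU sort -V, which approximates Debian version ordering
# for simple version strings such as "1.10-pve1".
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# check_libknet: hypothetical helper that prints advice based on the
# installed libknet1 package version.
check_libknet() {
    installed=$(dpkg-query -W -f='${Version}' libknet1) || return 1
    if version_lt "$installed" "1.12"; then
        echo "libknet1 $installed is older than 1.12: upgrade, then restart corosync"
    else
        echo "libknet1 $installed is 1.12 or newer"
    fi
}
```

Calling `check_libknet` on a node only reads the package database. Note that it reports the installed package version; a corosync process started before the upgrade keeps using the old library until the service is restarted.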
 
Hi spirit,

Thanks for your reply.

This is my pveversion output:

# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
 
>>libknet1: 1.10-pve1

You should really upgrade this one (and restart the corosync service afterwards).
It should help a lot.

To upgrade:

If you don't have a subscription, don't forget to switch to the no-subscription repository:
https://pve.proxmox.com/wiki/Package_Repositories

Then run apt-get update and apt dist-upgrade.
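
The upgrade steps above could be scripted roughly as follows. This is only a sketch: the `run` wrapper and the `DRY_RUN` switch are invented for this example so the sequence can be previewed before anything is changed, and it assumes the repositories are already configured as described on the wiki page.

```shell
#!/bin/sh
# run CMD...: execute the command, or only print it when DRY_RUN=1.
# DRY_RUN is a hypothetical convenience switch, not a Proxmox feature.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# upgrade_and_restart: refresh the package lists, upgrade, then restart
# corosync so the running daemon picks up the new libknet1.
# Stops at the first command that fails.
upgrade_and_restart() {
    run apt-get update &&
    run apt-get dist-upgrade &&
    run systemctl restart corosync &&
    run systemctl status corosync --no-pager
}
```

`DRY_RUN=1 upgrade_and_restart` prints the four commands without running them; without `DRY_RUN` the function needs to run as root.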
 
Hi Spirit,

I will schedule a maintenance window and update the libknet1 package. Before restarting the service, I will turn off HA.

When I restart the corosync service (to load the new libknet1 package) one node at a time on all 7 nodes in the cluster, are there any risks?

Thank you!
 
Hi guys,
I'm having the same issue on Proxmox 6 with corosync 3.
Almost all nodes are logging this message: "corosync[2175]: [TOTEM ] Retransmit List".
This morning one of them became unresponsive after many of these messages and rebooted itself.

I see that libknet1 has been upgraded to 1.15-pve1.

Here is the pveversion -v output:

proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph: 14.2.8-pve1
ceph-fuse: 14.2.8-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
I'm having the same issue too: 7 Proxmox nodes with corosync 3.0.4.
5 of the 7 nodes rebooted themselves, showing the same message: [ TOTEM ] Retransmit List: 33

Here is the pveversion output:

# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.55-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-5
pve-kernel-helper: 6.2-5
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-4-pve: 5.0.21-9
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.10-pve1
ceph-fuse: 14.2.10-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-1
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-10
pve-cluster: 6.1-8
pve-container: 3.1-13
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-2
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.1.0-1
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1
 
@myman03

Do you see a lot of retransmits, or only 1 or 2 per day?

Do you have a dedicated network link for corosync? What is your physical switch hardware?

Can you send the output of:

#corosync-cmapctl -m stats


Retransmits with only 7 nodes are strange (a bad or slow node, or an overloaded network, could explain them).
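
To answer the first question with numbers, one option is to count the retransmit messages per day from the journal. A rough sketch, assuming the syslog-style line format shown in the first post (month, day, time, host, process, message); the function name `count_retransmits_per_day` is made up for this example.

```shell
#!/bin/sh
# Reads syslog-style log lines on stdin and prints "<month> <day> <count>"
# for every day with at least one "[TOTEM ] Retransmit List" message.
# Output is sorted as plain text, so it is not chronological across months.
count_retransmits_per_day() {
    awk '/\[ *TOTEM *\] Retransmit List/ { n[$1 " " $2]++ }
         END { for (d in n) print d, n[d] }' | sort
}
```

For example, `journalctl -u corosync --since "7 days ago" | count_retransmits_per_day` would feed the last week of corosync messages through the counter. A handful per day, as in the first post, looks very different from hundreds per hour.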
 
