Random reboot due to multicast hash table full

aychprox

Renowned Member
Oct 27, 2015
Hi,

I am experiencing random reboots on nodes in a cluster. After much screening of the logs I found these messages in kernel.log:

Feb 22 16:31:06 node6 kernel: [2486811.543673] fwbr104i0: Multicast hash table maximum of 512 reached, disabling snooping: fwln104o0
Feb 22 16:31:06 node6 kernel: [2486811.543683] fwbr105i0: Multicast hash table maximum of 512 reached, disabling snooping: fwln105o0
Feb 22 16:31:06 node6 kernel: [2486811.543689] fwbr106i0: Multicast hash table maximum of 512 reached, disabling snooping: fwln106o0

May I know whether this is a kernel issue or something else?
I hope the gurus here can point me in the right direction regarding these random reboots.
Thanks.
 
Can you post the syslog from when this message appears up to the next boot? (maybe on pastebin or something similar)
 
Ok I had a look at your log:

Feb 22 16:31:06 node6 kernel: [2486811.543673] fwbr104i0: Multicast hash table maximum of 512 reached, disabling snooping: fwln104o0

here the multicast problems you rightly mentioned start
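
If you want to see what the kernel is working with here, you can inspect the bridge's multicast snooping settings via sysfs. A quick sketch; the bridge name fwbr104i0 is taken from your log and the standard Linux bridge sysfs layout is assumed:

# is multicast snooping enabled on the firewall bridge? (1 = yes)
cat /sys/class/net/fwbr104i0/bridge/multicast_snooping

# current maximum size of the multicast hash table (512 by default, as in your log)
cat /sys/class/net/fwbr104i0/bridge/hash_max

# raise the limit at runtime, e.g. to 4096 (a power of two; not persistent across reboots)
echo 4096 > /sys/class/net/fwbr104i0/bridge/hash_max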

Feb 22 16:34:28 node6 corosync[2025]: [TOTEM ] Retransmit List: 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 1 2 3 4 5 6 7 8 9 a b c d

beginning of the corosync communication problems. Corosync reports here that some messages sent over the network were not acknowledged by the other nodes within the expected time frame.
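
When you see retransmit lists like this it is worth checking how corosync sees its ring at that moment. A hedged example (run on any node; the exact output wording differs between corosync versions):

# status of the totem ring(s), shows whether a ring is marked FAULTY
corosync-cfgtool -s

# corosync's own view of quorum and membership
corosync-quorumtool -s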

Feb 22 16:34:40 node6 corosync[2025]: [TOTEM ] A new membership (**MASK IP**:42596) was formed. Members left: 7 1 2 3

since the cluster communication is broken (partition), this node and the other nodes it managed to contact form a new membership

Feb 22 16:34:50 node6 pmxcfs[1943]: [status] notice: node lost quorum

this node notices that the total number of members in its partition is not enough to provide a quorum. The cluster file system is switched to read-only.
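
On a Proxmox VE node you can see the same quorum information with pvecm, for example:

# expected votes, total votes and whether this partition is quorate
pvecm status

# quick list of the members this node currently sees
pvecm nodes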

Feb 22 16:34:50 node6 pve-ha-lrm[2150]: lost lock 'ha_agent_node6_lock - cfs lock update failed - Operation not permitted
Feb 22 16:34:50 node6 pve-ha-lrm[2150]: status change active => lost_agent_lock

the node loses the lock it holds on its HA resources (the lock can no longer be updated through the now read-only cluster file system), which allows other cluster members to take over
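
You can follow the HA side of this with the ha-manager CLI. A small sketch, matching the pve-ha-lrm/pve-ha-crm services from your log:

# overview of the HA manager: LRM state per node and configured resources
ha-manager status

# service state of the local resource manager / cluster resource manager
systemctl status pve-ha-lrm pve-ha-crm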

rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="1858" x-info="http://www.rsyslog.com"] start
Feb 22 16:42:04 node6 systemd-modules-load[448]: Module 'fuse' is builtin
Feb 22 16:42:04 node6 systemd-modules-load[448]: Inserted module 'ipmi_devintf'
Feb 22 16:42:04 node6 systemd-modules-load[448]: Inserted module 'ipmi_powe

here the node boots up again (kernel reboot)

probably your node self-fenced with a reboot after it finally noticed it had lost quorum
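
To confirm the fencing path you can look at the watchdog that PVE HA uses for self-fencing. A hedged example; softdog is assumed as the watchdog driver, which is the default unless a hardware watchdog was configured:

# the watchdog multiplexer that pve-ha-lrm arms when HA is active
systemctl status watchdog-mux

# which watchdog driver is loaded (softdog by default)
lsmod | grep softdog

# watchdog-mux messages from the previous boot, i.e. the one that got fenced
# (needs persistent journald storage)
journalctl -b -1 -u watchdog-mux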


my advice:
* check that multicast is working properly in your cluster. Corosync is *fussy* about network quality (see the omping example below).

see http://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network on how to test your network with corosync
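
The test from that wiki page boils down to running omping between all nodes at the same time. A sketch with placeholder hostnames node1 node2 node3; replace them with your real node names or IPs:

# ~10 seconds of rapid multicast probes; run this simultaneously on every node
omping -c 10000 -i 0.001 -F -q node1 node2 node3

# longer test (about 10 minutes) to catch IGMP snooping querier timeouts
omping -c 600 -i 1 -q node1 node2 node3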
 
