Network caused Quorum errors? - Free'ing ringbuffer - Corosync

Gandalf1783

Member
Oct 4, 2022
6
1
8
Germany
Hey yall,

my 2 Node Cluster started to do weird things I cant explain - at least for now.

All nodes run on the latest 7.3.4 release.

Serv1 seems to be offline? It can be pinged, and I can open terminals (even from VMs), view stats etc.
It is marked with a question mark from both webinterfaces. It also marks all VMs with a question mark.

No statistics are currently gathered (diagrams at stats are empty, even though CPU/RAM Readout bars are working)

"pvecm status" outputs following (executed on serv1):
Code:
Cluster information
-------------------
Name:             KOSMOS
Config Version:   8
Transport:        knet
Secure auth:      on


Quorum information
------------------
Date:             Fri Dec 30 15:27:22 2022
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.169
Quorate:          Yes


Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2 
Flags:            Quorate


Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.5.11 (local)
0x00000002          1 192.168.5.12

It outputs the same on serv2.

However, these are the entries on serv1 (journalctl -xef):
Code:
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] IPC credentials authenticated (/dev/shm/qb-1832-102925-33-KsnQTJ/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] connecting to client [102925]
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] connection created
Dec 30 15:27:22 serv1 corosync[1832]:   [CMAP  ] lib_init_fn: conn=0x55b86e65f3b0
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] IPC credentials authenticated (/dev/shm/qb-1832-102925-34-Qcej7I/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] connecting to client [102925]
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] connection created
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] lib_init_fn: conn=0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] got quorum_type request on 0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] IPC credentials authenticated (/dev/shm/qb-1832-102925-35-UwcbFI/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] connecting to client [102925]
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] connection created
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] IPC credentials authenticated (/dev/shm/qb-1832-102925-36-MmGwjH/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] connecting to client [102925]
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] connection created
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] got quorate request on 0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] got trackstart request on 0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] sending initial status to 0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] sending nodelist notification to 0x55b86e642810, length = 72
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] sending quorum notification to 0x55b86e642810, length = 56/64
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] got trackstop request on 0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] sending initial status to 0x55b86e64c030
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] Sending nodelist callback. ring_id = 1.169
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] Sending quorum callback, quorate = 1
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] got getinfo request on 0x55b86e64c030 for node 1
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] getinfo response error: 1
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] got getinfo request on 0x55b86e64c030 for node 1
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] getinfo response error: 1
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] got getinfo request on 0x55b86e64c030 for node 2
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] getinfo response error: 1
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] HUP conn (/dev/shm/qb-1832-102925-33-KsnQTJ/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] qb_ipcs_disconnect(/dev/shm/qb-1832-102925-33-KsnQTJ/qb) state:2
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_closed()
Dec 30 15:27:22 serv1 corosync[1832]:   [CMAP  ] exit_fn for conn=0x55b86e65f3b0
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_destroyed()
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-33-KsnQTJ/qb-response-cmap-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-33-KsnQTJ/qb-event-cmap-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-33-KsnQTJ/qb-request-cmap-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] HUP conn (/dev/shm/qb-1832-102925-34-Qcej7I/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] qb_ipcs_disconnect(/dev/shm/qb-1832-102925-34-Qcej7I/qb) state:2
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_closed()
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] lib_exit_fn: conn=0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_destroyed()
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-34-Qcej7I/qb-response-quorum-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-34-Qcej7I/qb-event-quorum-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-34-Qcej7I/qb-request-quorum-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] HUP conn (/dev/shm/qb-1832-102925-35-UwcbFI/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] qb_ipcs_disconnect(/dev/shm/qb-1832-102925-35-UwcbFI/qb) state:2
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_closed()
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_destroyed()
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-35-UwcbFI/qb-response-cfg-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-35-UwcbFI/qb-event-cfg-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-35-UwcbFI/qb-request-cfg-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] HUP conn (/dev/shm/qb-1832-102925-36-MmGwjH/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] qb_ipcs_disconnect(/dev/shm/qb-1832-102925-36-MmGwjH/qb) state:2
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_closed()
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_destroyed()
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-36-MmGwjH/qb-response-votequorum-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-36-MmGwjH/qb-event-votequorum-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-36-MmGwjH/qb-request-votequorum-header
Dec 30 15:27:23 serv1 kernel: nfs: server 192.168.1.24 not responding, still trying
Dec 30 15:27:29 serv1 corosync[1832]:   [KNET  ] pmtud: Starting PMTUD for host: 2 link: 0
Dec 30 15:27:29 serv1 corosync[1832]:   [KNET  ] udp: detected kernel MTU: 1500
Dec 30 15:27:29 serv1 corosync[1832]:   [KNET  ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397
Dec 30 15:27:59 serv1 corosync[1832]:   [KNET  ] pmtud: Starting PMTUD for host: 2 link: 0
Dec 30 15:27:59 serv1 corosync[1832]:   [KNET  ] udp: detected kernel MTU: 1500
Dec 30 15:27:59 serv1 corosync[1832]:   [KNET  ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397

It seems to have a network-issue where it disconnects or the protocol does weird stuff?
I cannot figure out what proxmox currently does, especially since the HA-Tab says "active" on both servers.
 
Also, corosync outputs these messages (service corosync status) continously in a cycle of exactly 1 minute:

Code:
Dec 30 15:30:59 serv1 corosync[1832]:   [KNET  ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397
Dec 30 15:31:29 serv1 corosync[1832]:   [KNET  ] pmtud: Starting PMTUD for host: 2 link: 0
Dec 30 15:31:29 serv1 corosync[1832]:   [KNET  ] udp: detected kernel MTU: 1500
Dec 30 15:31:29 serv1 corosync[1832]:   [KNET  ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397
Dec 30 15:31:59 serv1 corosync[1832]:   [KNET  ] pmtud: Starting PMTUD for host: 2 link: 0
Dec 30 15:31:59 serv1 corosync[1832]:   [KNET  ] udp: detected kernel MTU: 1500
Dec 30 15:31:59 serv1 corosync[1832]:   [KNET  ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397
Dec 30 15:32:29 serv1 corosync[1832]:   [KNET  ] pmtud: Starting PMTUD for host: 2 link: 0
Dec 30 15:32:29 serv1 corosync[1832]:   [KNET  ] udp: detected kernel MTU: 1500
Dec 30 15:32:29 serv1 corosync[1832]:   [KNET  ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397


EDIT:
Forgot to add info about the network:
The 2 onboard NICs are just passthrough for the VMs and dont have any IP associated. Just bridging.
The 3rd, 10G Nic from Mellanox (10G Ethernet over SFP) both connect to a Dell Switch and have static IPs (the 192.168.5.X which you see in the logs).
Im accessing the Webinterface over the LAN-network, which is one of the onboard nics, which gets passed through pfsense (on a VM) to the 10G interfaces.
 
Last edited: