Network caused Quorum errors? - Free'ing ringbuffer - Corosync

Gandalf1783

New Member
Oct 4, 2022
6
1
3
Germany
Hey yall,

my 2 Node Cluster started to do weird things I cant explain - at least for now.

All nodes run on the latest 7.3.4 release.

Serv1 seems to be offline? It can be pinged, and I can open terminals (even from VMs), view stats etc.
It is marked with a question mark from both webinterfaces. It also marks all VMs with a question mark.

No statistics are currently gathered (diagrams at stats are empty, even though CPU/RAM Readout bars are working)

"pvecm status" outputs following (executed on serv1):
Code:
Cluster information
-------------------
Name:             KOSMOS
Config Version:   8
Transport:        knet
Secure auth:      on


Quorum information
------------------
Date:             Fri Dec 30 15:27:22 2022
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.169
Quorate:          Yes


Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2 
Flags:            Quorate


Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.5.11 (local)
0x00000002          1 192.168.5.12

It outputs the same on serv2.

However, these are the entries on serv1 (journalctl -xef):
Code:
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] IPC credentials authenticated (/dev/shm/qb-1832-102925-33-KsnQTJ/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] connecting to client [102925]
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] connection created
Dec 30 15:27:22 serv1 corosync[1832]:   [CMAP  ] lib_init_fn: conn=0x55b86e65f3b0
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] IPC credentials authenticated (/dev/shm/qb-1832-102925-34-Qcej7I/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] connecting to client [102925]
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] connection created
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] lib_init_fn: conn=0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] got quorum_type request on 0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] IPC credentials authenticated (/dev/shm/qb-1832-102925-35-UwcbFI/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] connecting to client [102925]
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] connection created
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] IPC credentials authenticated (/dev/shm/qb-1832-102925-36-MmGwjH/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] connecting to client [102925]
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] connection created
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] got quorate request on 0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] got trackstart request on 0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] sending initial status to 0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] sending nodelist notification to 0x55b86e642810, length = 72
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] sending quorum notification to 0x55b86e642810, length = 56/64
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] got trackstop request on 0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] sending initial status to 0x55b86e64c030
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] Sending nodelist callback. ring_id = 1.169
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] Sending quorum callback, quorate = 1
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] got getinfo request on 0x55b86e64c030 for node 1
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] getinfo response error: 1
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] got getinfo request on 0x55b86e64c030 for node 1
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] getinfo response error: 1
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] got getinfo request on 0x55b86e64c030 for node 2
Dec 30 15:27:22 serv1 corosync[1832]:   [VOTEQ ] getinfo response error: 1
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] HUP conn (/dev/shm/qb-1832-102925-33-KsnQTJ/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] qb_ipcs_disconnect(/dev/shm/qb-1832-102925-33-KsnQTJ/qb) state:2
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_closed()
Dec 30 15:27:22 serv1 corosync[1832]:   [CMAP  ] exit_fn for conn=0x55b86e65f3b0
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_destroyed()
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-33-KsnQTJ/qb-response-cmap-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-33-KsnQTJ/qb-event-cmap-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-33-KsnQTJ/qb-request-cmap-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] HUP conn (/dev/shm/qb-1832-102925-34-Qcej7I/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] qb_ipcs_disconnect(/dev/shm/qb-1832-102925-34-Qcej7I/qb) state:2
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_closed()
Dec 30 15:27:22 serv1 corosync[1832]:   [QUORUM] lib_exit_fn: conn=0x55b86e642810
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_destroyed()
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-34-Qcej7I/qb-response-quorum-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-34-Qcej7I/qb-event-quorum-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-34-Qcej7I/qb-request-quorum-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] HUP conn (/dev/shm/qb-1832-102925-35-UwcbFI/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] qb_ipcs_disconnect(/dev/shm/qb-1832-102925-35-UwcbFI/qb) state:2
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_closed()
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_destroyed()
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-35-UwcbFI/qb-response-cfg-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-35-UwcbFI/qb-event-cfg-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-35-UwcbFI/qb-request-cfg-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] HUP conn (/dev/shm/qb-1832-102925-36-MmGwjH/qb)
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] qb_ipcs_disconnect(/dev/shm/qb-1832-102925-36-MmGwjH/qb) state:2
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_closed()
Dec 30 15:27:22 serv1 corosync[1832]:   [MAIN  ] cs_ipcs_connection_destroyed()
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-36-MmGwjH/qb-response-votequorum-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-36-MmGwjH/qb-event-votequorum-header
Dec 30 15:27:22 serv1 corosync[1832]:   [QB    ] Free'ing ringbuffer: /dev/shm/qb-1832-102925-36-MmGwjH/qb-request-votequorum-header
Dec 30 15:27:23 serv1 kernel: nfs: server 192.168.1.24 not responding, still trying
Dec 30 15:27:29 serv1 corosync[1832]:   [KNET  ] pmtud: Starting PMTUD for host: 2 link: 0
Dec 30 15:27:29 serv1 corosync[1832]:   [KNET  ] udp: detected kernel MTU: 1500
Dec 30 15:27:29 serv1 corosync[1832]:   [KNET  ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397
Dec 30 15:27:59 serv1 corosync[1832]:   [KNET  ] pmtud: Starting PMTUD for host: 2 link: 0
Dec 30 15:27:59 serv1 corosync[1832]:   [KNET  ] udp: detected kernel MTU: 1500
Dec 30 15:27:59 serv1 corosync[1832]:   [KNET  ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397

It seems to have a network-issue where it disconnects or the protocol does weird stuff?
I cannot figure out what proxmox currently does, especially since the HA-Tab says "active" on both servers.
 
Also, corosync outputs these messages (service corosync status) continously in a cycle of exactly 1 minute:

Code:
Dec 30 15:30:59 serv1 corosync[1832]:   [KNET  ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397
Dec 30 15:31:29 serv1 corosync[1832]:   [KNET  ] pmtud: Starting PMTUD for host: 2 link: 0
Dec 30 15:31:29 serv1 corosync[1832]:   [KNET  ] udp: detected kernel MTU: 1500
Dec 30 15:31:29 serv1 corosync[1832]:   [KNET  ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397
Dec 30 15:31:59 serv1 corosync[1832]:   [KNET  ] pmtud: Starting PMTUD for host: 2 link: 0
Dec 30 15:31:59 serv1 corosync[1832]:   [KNET  ] udp: detected kernel MTU: 1500
Dec 30 15:31:59 serv1 corosync[1832]:   [KNET  ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397
Dec 30 15:32:29 serv1 corosync[1832]:   [KNET  ] pmtud: Starting PMTUD for host: 2 link: 0
Dec 30 15:32:29 serv1 corosync[1832]:   [KNET  ] udp: detected kernel MTU: 1500
Dec 30 15:32:29 serv1 corosync[1832]:   [KNET  ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397


EDIT:
Forgot to add info about the network:
The 2 onboard NICs are just passthrough for the VMs and dont have any IP associated. Just bridging.
The 3rd, 10G Nic from Mellanox (10G Ethernet over SFP) both connect to a Dell Switch and have static IPs (the 192.168.5.X which you see in the logs).
Im accessing the Webinterface over the LAN-network, which is one of the onboard nics, which gets passed through pfsense (on a VM) to the 10G interfaces.
 
Last edited:

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!