Corosync loses cluster membership

Michael Gusek

New Member
Aug 9, 2016
Hi

we are running a cluster with 16 nodes. One of these nodes regularly and reproducibly loses its cluster membership. Here are the relevant logs:
<29>2016-08-09T12:11:51.766421+02:00 pve-sm-prod-03 corosync[12928]: [TOTEM ] A new membership (10.1.1.175:114512) was formed. Members left: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16
<29>2016-08-09T12:11:51.766640+02:00 pve-sm-prod-03 corosync[12928]: [TOTEM ] Failed to receive the leave message. failed: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16
<29>2016-08-09T12:11:51.766758+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: members: 3/2126
<29>2016-08-09T12:11:51.766886+02:00 pve-sm-prod-03 pmxcfs[2126]: [status] notice: members: 3/2126
<29>2016-08-09T12:11:51.766996+02:00 pve-sm-prod-03 corosync[12928]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
<29>2016-08-09T12:11:51.767104+02:00 pve-sm-prod-03 corosync[12928]: [QUORUM] Members[1]: 3
<29>2016-08-09T12:11:51.767213+02:00 pve-sm-prod-03 corosync[12928]: [MAIN ] Completed service synchronization, ready to provide service.
<29>2016-08-09T12:11:51.767320+02:00 pve-sm-prod-03 pmxcfs[2126]: [status] notice: node lost quorum
<29>2016-08-09T12:11:51.866594+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: cpg_send_message retried 1 times
<26>2016-08-09T12:11:51.866855+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] crit: received write while not quorate - trigger resync
<26>2016-08-09T12:11:51.866995+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] crit: leaving CPG group
<27>2016-08-09T12:11:51.867121+02:00 pve-sm-prod-03 pve-ha-lrm[2212]: unable to write lrm status file - closing file '/etc/pve/nodes/pve-sm-prod-03/lrm_status.tmp.2212' failed - Operation not permitted
<30>2016-08-09T12:11:52.086956+02:00 pve-sm-prod-03 lxcfs[1883]: Internal error: truncated write to cache
<29>2016-08-09T12:11:52.678263+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: start cluster connection
<29>2016-08-09T12:11:52.678487+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: members: 3/2126
<29>2016-08-09T12:11:52.678604+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: all data is up to date

If the corosync service is restarted, the node loses its membership again after a few minutes. Rebooting the entire server did not help either. How can this be fixed?
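
In case it helps, here is what we can capture on the node the moment it drops out (just a sketch using the standard corosync 2.x / PVE 4.x tools, no output attached yet): pvecm status for the cluster/quorum view, corosync-quorumtool -s for the quorum details, and corosync-cfgtool -s for the totem ring status (a healthy ring reports "ring 0 active with no faults").

# pvecm status
# corosync-quorumtool -s
# corosync-cfgtool -s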

Micha

# pveversion -v
proxmox-ve: 4.2-56 (running kernel: 4.4.13-1-pve)
pve-manager: 4.2-17 (running version: 4.2-17/e1400248)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.13-1-pve: 4.4.13-56
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-43
qemu-server: 4.0-85
pve-firmware: 1.1-8
libpve-common-perl: 4.0-71
libpve-access-control: 4.0-18
libpve-storage-perl: 4.0-56
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6-1
pve-container: 1.0-71
pve-firewall: 2.0-29
pve-ha-manager: 1.0-32
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5.7-pve10~bpo80
openvswitch-switch: 2.5.0-1
 
Hi,

could it be that there is a network problem, e.g. the NIC is acting up?

Please test multicast.

It should be enough to run this with 3 servers; one of them should be the problem server.

omping -c 10000 -i 0.001 -F -q node1 node2 node3
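
If the short burst above is clean, a longer run of about ten minutes is worth doing as well, since IGMP snooping without an active querier typically stops forwarding multicast a few minutes after the group join; that would match a node dropping out a few minutes after a corosync restart (same node names as above):

omping -c 600 -i 1 -q node1 node2 node3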
 
Ok, layer 8 problem. Output:

pve-sm-prod-01 : joined (S,G) = (*, 232.43.211.234), pinging
pve-sm-prod-03 : waiting for response msg
pve-sm-prod-03 : waiting for response msg
pve-sm-prod-03 : joined (S,G) = (*, 232.43.211.234), pinging
pve-sm-prod-01 : given amount of query messages was sent
pve-sm-prod-03 : waiting for response msg
pve-sm-prod-03 : server told us to stop

pve-sm-prod-01 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.063/0.094/0.519/0.022
pve-sm-prod-01 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.084/0.117/0.528/0.024
pve-sm-prod-03 : unicast, xmt/rcv/%loss = 9682/9682/0%, min/avg/max/std-dev = 0.063/0.087/0.389/0.015
pve-sm-prod-03 : multicast, xmt/rcv/%loss = 9682/9682/0%, min/avg/max/std-dev = 0.083/0.109/1.068/0.019

pve-sm-prod-03 is the faulty one
 
Is openvswitch also installed on the other nodes?

What does the log look like on the other servers at the time of the failure (one log is enough)?
 
Omping looks good.
 
Openvswitch is installed (and configured) on all nodes. Here is the log excerpt from when the node drops out:
<29>2016-08-10T13:51:21.645994+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: received log
<29>2016-08-10T13:51:42.244135+02:00 pve-sm-prod-17 corosync[3533]: [TOTEM ] A new membership (10.1.1.173:142296) was formed. Members left: 3
<29>2016-08-10T13:51:42.244319+02:00 pve-sm-prod-17 corosync[3533]: [TOTEM ] Failed to receive the leave message. failed: 3
<29>2016-08-10T13:51:42.251414+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: members: 1/22832, 2/5166, 4/14083, 5/2143, 6/1975, 7/3758, 8/1418, 9/3218, 10/1426, 11/2910, 12/1420, 13/4091, 14/1435, 15/1402, 16/6001, 17/3514
<29>2016-08-10T13:51:42.251549+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: starting data syncronisation
<29>2016-08-10T13:51:42.251647+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: members: 1/22832, 2/5166, 4/14083, 5/2143, 6/1975, 7/3758, 8/1418, 9/3218, 10/1426, 11/2910, 12/1420, 13/4091, 14/1435, 15/1402, 16/6001, 17/3514
<29>2016-08-10T13:51:42.251739+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: starting data syncronisation
<29>2016-08-10T13:51:42.253831+02:00 pve-sm-prod-17 corosync[3533]: [QUORUM] Members[16]: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17
<29>2016-08-10T13:51:42.253962+02:00 pve-sm-prod-17 corosync[3533]: [MAIN ] Completed service synchronization, ready to provide service.
<29>2016-08-10T13:51:55.147153+02:00 pve-sm-prod-17 corosync[3533]: [TOTEM ] A new membership (10.1.1.173:142300) was formed. Members
<29>2016-08-10T13:51:55.157519+02:00 pve-sm-prod-17 corosync[3533]: [QUORUM] Members[16]: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17
<29>2016-08-10T13:51:55.157648+02:00 pve-sm-prod-17 corosync[3533]: [MAIN ] Completed service synchronization, ready to provide service.
<29>2016-08-10T13:51:55.157805+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: received sync request (epoch 1/22832/0000030A)
<29>2016-08-10T13:51:55.158780+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: received sync request (epoch 1/22832/0000030A)
<29>2016-08-10T13:51:55.178253+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: received all states
<29>2016-08-10T13:51:55.178383+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: leader is 1/22832
<29>2016-08-10T13:51:55.178482+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: synced members: 1/22832, 2/5166, 4/14083, 5/2143, 6/1975, 7/3758, 8/1418, 9/3218, 10/1426, 11/2910, 12/1420, 13/4091, 14/1435, 15/1402, 16/6001, 17/3514
<29>2016-08-10T13:51:55.178577+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: all data is up to date
<29>2016-08-10T13:51:55.178670+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: dfsm_deliver_queue: queue length 16
<29>2016-08-10T13:51:55.253184+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: received all states
<29>2016-08-10T13:51:55.253919+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: all data is up to date
<29>2016-08-10T13:51:55.254045+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: dfsm_deliver_queue: queue length 277
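
Since openvswitch is in use on all nodes, one thing that might be worth ruling out (a guess on our side, not something the logs show) is multicast snooping on the OVS bridges: with snooping enabled and no querier on the network, the bridge can silently stop forwarding the corosync multicast traffic after a few minutes. The bridge name vmbr0 below is only an example:

# ovs-vsctl list bridge vmbr0 | grep mcast_snooping_enable
# ovs-vsctl set bridge vmbr0 mcast_snooping_enable=false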
 
Hello Wolfgang,

do you need any further information? We would also be willing to make use of additional paid support in order to solve this problem permanently.
 
