Corosync loses cluster membership

Michael Gusek

New Member
Aug 9, 2016
Hi

we are running a cluster with 16 nodes. One of these nodes regularly and reproducibly loses its cluster membership. Here are the relevant logs:
<29>2016-08-09T12:11:51.766421+02:00 pve-sm-prod-03 corosync[12928]: [TOTEM ] A new membership (10.1.1.175:114512) was formed. Members left: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16
<29>2016-08-09T12:11:51.766640+02:00 pve-sm-prod-03 corosync[12928]: [TOTEM ] Failed to receive the leave message. failed: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16
<29>2016-08-09T12:11:51.766758+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: members: 3/2126
<29>2016-08-09T12:11:51.766886+02:00 pve-sm-prod-03 pmxcfs[2126]: [status] notice: members: 3/2126
<29>2016-08-09T12:11:51.766996+02:00 pve-sm-prod-03 corosync[12928]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
<29>2016-08-09T12:11:51.767104+02:00 pve-sm-prod-03 corosync[12928]: [QUORUM] Members[1]: 3
<29>2016-08-09T12:11:51.767213+02:00 pve-sm-prod-03 corosync[12928]: [MAIN ] Completed service synchronization, ready to provide service.
<29>2016-08-09T12:11:51.767320+02:00 pve-sm-prod-03 pmxcfs[2126]: [status] notice: node lost quorum
<29>2016-08-09T12:11:51.866594+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: cpg_send_message retried 1 times
<26>2016-08-09T12:11:51.866855+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] crit: received write while not quorate - trigger resync
<26>2016-08-09T12:11:51.866995+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] crit: leaving CPG group
<27>2016-08-09T12:11:51.867121+02:00 pve-sm-prod-03 pve-ha-lrm[2212]: unable to write lrm status file - closing file '/etc/pve/nodes/pve-sm-prod-03/lrm_status.tmp.2212' failed - Operation not permitted
<30>2016-08-09T12:11:52.086956+02:00 pve-sm-prod-03 lxcfs[1883]: Internal error: truncated write to cache
<29>2016-08-09T12:11:52.678263+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: start cluster connection
<29>2016-08-09T12:11:52.678487+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: members: 3/2126
<29>2016-08-09T12:11:52.678604+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: all data is up to date

If the corosync service is restarted, the node loses its membership again after a few minutes. Rebooting the entire server did not help either. How can this be fixed?
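
In case it helps, here is what we can capture on the node the moment it drops out (just a sketch using the standard corosync 2.x / PVE 4.x tools, no output attached yet): pvecm status for the cluster/quorum view, corosync-quorumtool -s for the quorum details, and corosync-cfgtool -s for the totem ring status (a healthy ring reports "ring 0 active with no faults").

# pvecm status
# corosync-quorumtool -s
# corosync-cfgtool -s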

Micha

# pveversion -v
proxmox-ve: 4.2-56 (running kernel: 4.4.13-1-pve)
pve-manager: 4.2-17 (running version: 4.2-17/e1400248)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.13-1-pve: 4.4.13-56
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-43
qemu-server: 4.0-85
pve-firmware: 1.1-8
libpve-common-perl: 4.0-71
libpve-access-control: 4.0-18
libpve-storage-perl: 4.0-56
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6-1
pve-container: 1.0-71
pve-firewall: 2.0-29
pve-ha-manager: 1.0-32
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5.7-pve10~bpo80
openvswitch-switch: 2.5.0-1
 
Hi,

could it be that there is a network problem, e.g. the NIC is acting up?

Please test multicast.

It should be enough to run this with 3 servers; one of them should be the problem server.

omping -c 10000 -i 0.001 -F -q node1 node2 node3
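
If the short burst above is clean, a longer run of about ten minutes is worth doing as well, since IGMP snooping without an active querier typically stops forwarding multicast a few minutes after the group join; that would match a node dropping out a few minutes after a corosync restart (same node names as above):

omping -c 600 -i 1 -q node1 node2 node3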
 
Ok, layer 8 problem. Output:

pve-sm-prod-01 : joined (S,G) = (*, 232.43.211.234), pinging
pve-sm-prod-03 : waiting for response msg
pve-sm-prod-03 : waiting for response msg
pve-sm-prod-03 : joined (S,G) = (*, 232.43.211.234), pinging
pve-sm-prod-01 : given amount of query messages was sent
pve-sm-prod-03 : waiting for response msg
pve-sm-prod-03 : server told us to stop

pve-sm-prod-01 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.063/0.094/0.519/0.022
pve-sm-prod-01 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.084/0.117/0.528/0.024
pve-sm-prod-03 : unicast, xmt/rcv/%loss = 9682/9682/0%, min/avg/max/std-dev = 0.063/0.087/0.389/0.015
pve-sm-prod-03 : multicast, xmt/rcv/%loss = 9682/9682/0%, min/avg/max/std-dev = 0.083/0.109/1.068/0.019

pve-sm-prod-03 is the faulty one
 
Is openvswitch also installed on the other nodes?

What does the log look like on the other servers at the time of the failure (one log is enough)?
 
Omping looks good.
 
Openvswitch is installed (and configured) on all nodes. Here is the log excerpt from when the node drops out:
<29>2016-08-10T13:51:21.645994+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: received log
<29>2016-08-10T13:51:42.244135+02:00 pve-sm-prod-17 corosync[3533]: [TOTEM ] A new membership (10.1.1.173:142296) was formed. Members left: 3
<29>2016-08-10T13:51:42.244319+02:00 pve-sm-prod-17 corosync[3533]: [TOTEM ] Failed to receive the leave message. failed: 3
<29>2016-08-10T13:51:42.251414+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: members: 1/22832, 2/5166, 4/14083, 5/2143, 6/1975, 7/3758, 8/1418, 9/3218, 10/1426, 11/2910, 12/1420, 13/4091, 14/1435, 15/1402, 16/6001, 17/3514
<29>2016-08-10T13:51:42.251549+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: starting data syncronisation
<29>2016-08-10T13:51:42.251647+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: members: 1/22832, 2/5166, 4/14083, 5/2143, 6/1975, 7/3758, 8/1418, 9/3218, 10/1426, 11/2910, 12/1420, 13/4091, 14/1435, 15/1402, 16/6001, 17/3514
<29>2016-08-10T13:51:42.251739+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: starting data syncronisation
<29>2016-08-10T13:51:42.253831+02:00 pve-sm-prod-17 corosync[3533]: [QUORUM] Members[16]: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17
<29>2016-08-10T13:51:42.253962+02:00 pve-sm-prod-17 corosync[3533]: [MAIN ] Completed service synchronization, ready to provide service.
<29>2016-08-10T13:51:55.147153+02:00 pve-sm-prod-17 corosync[3533]: [TOTEM ] A new membership (10.1.1.173:142300) was formed. Members
<29>2016-08-10T13:51:55.157519+02:00 pve-sm-prod-17 corosync[3533]: [QUORUM] Members[16]: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17
<29>2016-08-10T13:51:55.157648+02:00 pve-sm-prod-17 corosync[3533]: [MAIN ] Completed service synchronization, ready to provide service.
<29>2016-08-10T13:51:55.157805+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: received sync request (epoch 1/22832/0000030A)
<29>2016-08-10T13:51:55.158780+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: received sync request (epoch 1/22832/0000030A)
<29>2016-08-10T13:51:55.178253+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: received all states
<29>2016-08-10T13:51:55.178383+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: leader is 1/22832
<29>2016-08-10T13:51:55.178482+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: synced members: 1/22832, 2/5166, 4/14083, 5/2143, 6/1975, 7/3758, 8/1418, 9/3218, 10/1426, 11/2910, 12/1420, 13/4091, 14/1435, 15/1402, 16/6001, 17/3514
<29>2016-08-10T13:51:55.178577+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: all data is up to date
<29>2016-08-10T13:51:55.178670+02:00 pve-sm-prod-17 pmxcfs[3514]: [dcdb] notice: dfsm_deliver_queue: queue length 16
<29>2016-08-10T13:51:55.253184+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: received all states
<29>2016-08-10T13:51:55.253919+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: all data is up to date
<29>2016-08-10T13:51:55.254045+02:00 pve-sm-prod-17 pmxcfs[3514]: [status] notice: dfsm_deliver_queue: queue length 277
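
Since openvswitch is in use on all nodes, one thing that might be worth ruling out (a guess on our side, not something the logs show) is multicast snooping on the OVS bridges: with snooping enabled and no querier on the network, the bridge can silently stop forwarding the corosync multicast traffic after a few minutes. The bridge name vmbr0 below is only an example:

# ovs-vsctl list bridge vmbr0 | grep mcast_snooping_enable
# ovs-vsctl set bridge vmbr0 mcast_snooping_enable=false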
 
Hello Wolfgang,

do you need any further information? We would also be willing to make use of additional paid support in order to solve this problem permanently.
 
