Hi
we are running a cluster with 16 nodes. One of these nodes regularly and reproducibly loses its cluster membership. Here are the relevant logs:
<29>2016-08-09T12:11:51.766421+02:00 pve-sm-prod-03 corosync[12928]: [TOTEM ] A new membership (10.1.1.175:114512) was formed. Members left: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16
<29>2016-08-09T12:11:51.766640+02:00 pve-sm-prod-03 corosync[12928]: [TOTEM ] Failed to receive the leave message. failed: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16
<29>2016-08-09T12:11:51.766758+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: members: 3/2126
<29>2016-08-09T12:11:51.766886+02:00 pve-sm-prod-03 pmxcfs[2126]: [status] notice: members: 3/2126
<29>2016-08-09T12:11:51.766996+02:00 pve-sm-prod-03 corosync[12928]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
<29>2016-08-09T12:11:51.767104+02:00 pve-sm-prod-03 corosync[12928]: [QUORUM] Members[1]: 3
<29>2016-08-09T12:11:51.767213+02:00 pve-sm-prod-03 corosync[12928]: [MAIN ] Completed service synchronization, ready to provide service.
<29>2016-08-09T12:11:51.767320+02:00 pve-sm-prod-03 pmxcfs[2126]: [status] notice: node lost quorum
<29>2016-08-09T12:11:51.866594+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: cpg_send_message retried 1 times
<26>2016-08-09T12:11:51.866855+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] crit: received write while not quorate - trigger resync
<26>2016-08-09T12:11:51.866995+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] crit: leaving CPG group
<27>2016-08-09T12:11:51.867121+02:00 pve-sm-prod-03 pve-ha-lrm[2212]: unable to write lrm status file - closing file '/etc/pve/nodes/pve-sm-prod-03/lrm_status.tmp.2212' failed - Operation not permitted
<30>2016-08-09T12:11:52.086956+02:00 pve-sm-prod-03 lxcfs[1883]: Internal error: truncated write to cache
<29>2016-08-09T12:11:52.678263+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: start cluster connection
<29>2016-08-09T12:11:52.678487+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: members: 3/2126
<29>2016-08-09T12:11:52.678604+02:00 pve-sm-prod-03 pmxcfs[2126]: [dcdb] notice: all data is up to date
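The TOTEM line above can be read as follows: node 3 sees all 15 peers leave at once and forms a single-member membership, i.e. it is this node that lost connectivity to the rest of the cluster (with corosync 2.x on PVE 4.x that usually points at a multicast problem on this node's link). A quick sanity check of that reading, using the sample log line (illustrative one-liner, not part of our setup):

```shell
# Extract the list of departed node IDs from the corosync TOTEM line
# (sample line copied from the logs above).
line='corosync[12928]: [TOTEM ] A new membership (10.1.1.175:114512) was formed. Members left: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16'

# Strip everything up to and including "Members left: ".
left=${line#*Members left: }

# Count the departed members: 15 of 16 nodes, everyone except node 3.
count=$(echo "$left" | wc -w)
echo "$count peers left at once"   # all peers gone => this node is the isolated one
```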
If the corosync service is restarted, the node loses its membership again after a few minutes. Rebooting the entire server did not help either. How can this be fixed?
Micha
# pveversion -v
proxmox-ve: 4.2-56 (running kernel: 4.4.13-1-pve)
pve-manager: 4.2-17 (running version: 4.2-17/e1400248)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.13-1-pve: 4.4.13-56
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-43
qemu-server: 4.0-85
pve-firmware: 1.1-8
libpve-common-perl: 4.0-71
libpve-access-control: 4.0-18
libpve-storage-perl: 4.0-56
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6-1
pve-container: 1.0-71
pve-firewall: 2.0-29
pve-ha-manager: 1.0-32
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5.7-pve10~bpo80
openvswitch-switch: 2.5.0-1