Cluster Node goes offline

rene.losert
Jul 20, 2017
Good evening,

I've been through far more complex setups before, but at home I'm seeing a very curious symptom:

A quick look at the topology:

hpve1: Cluster Master (Cluster: 'HomeCluster', LACP Bond, 192.168.1.201)
hpve2: Cluster Node (LACP Bond, 192.168.1.202)

When I restart the corosync service on both systems, everything looks fine for 2-3 minutes, but then hpve2 always drops out of the cluster.
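
For reference, this is roughly what I run (standard PVE service names; the journalctl call is just my way of watching the membership afterwards):

Code:
# restart the cluster stack on both nodes
systemctl restart corosync pve-cluster

# then watch quorum/membership
pvecm status
journalctl -u corosync -f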

I found the following in the syslog - but how do I fix the problem?

hpve1:
Code:
Jul 20 01:04:01 hpve1 systemd[1]: Starting Proxmox VE replication runner...
Jul 20 01:04:15 hpve1 systemd[1]: Started Proxmox VE replication runner.
Jul 20 01:04:17 hpve1 pvestatd[4780]: status update time (7.024 seconds)
Jul 20 01:04:27 hpve1 corosync[22657]: error   [TOTEM ] FAILED TO RECEIVE
Jul 20 01:04:27 hpve1 corosync[22657]:  [TOTEM ] FAILED TO RECEIVE
Jul 20 01:04:28 hpve1 corosync[22657]: notice  [TOTEM ] A new membership (192.168.1.201:3444) was formed. Members left: 2
Jul 20 01:04:28 hpve1 corosync[22657]: notice  [TOTEM ] Failed to receive the leave message. failed: 2
Jul 20 01:04:28 hpve1 corosync[22657]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 20 01:04:28 hpve1 corosync[22657]: notice  [QUORUM] Members[1]: 1
Jul 20 01:04:28 hpve1 corosync[22657]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 20 01:04:28 hpve1 corosync[22657]:  [TOTEM ] A new membership (192.168.1.201:3444) was formed. Members left: 2
Jul 20 01:04:28 hpve1 corosync[22657]:  [TOTEM ] Failed to receive the leave message. failed: 2
Jul 20 01:04:28 hpve1 corosync[22657]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 20 01:04:28 hpve1 corosync[22657]:  [QUORUM] Members[1]: 1
Jul 20 01:04:28 hpve1 corosync[22657]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 20 01:04:28 hpve1 pmxcfs[12597]: [status] notice: node lost quorum
Jul 20 01:04:28 hpve1 pmxcfs[12597]: [dcdb] notice: members: 1/12597
Jul 20 01:04:28 hpve1 pmxcfs[12597]: [status] notice: members: 1/12597
Jul 20 01:04:28 hpve1 pmxcfs[12597]: [dcdb] crit: received write while not quorate - trigger resync
Jul 20 01:04:28 hpve1 pmxcfs[12597]: [dcdb] crit: leaving CPG group
Jul 20 01:04:28 hpve1 pve-ha-lrm[4805]: unable to write lrm status file - unable to open file '/etc/pve/nodes/hpve1/lrm_status.tmp.4805' - Permission denied
Jul 20 01:04:29 hpve1 pmxcfs[12597]: [dcdb] notice: start cluster connection
Jul 20 01:04:29 hpve1 pmxcfs[12597]: [dcdb] notice: members: 1/12597
Jul 20 01:04:29 hpve1 pmxcfs[12597]: [dcdb] notice: all data is up to date
Jul 20 01:05:00 hpve1 systemd[1]: Starting Proxmox VE replication runner...
Jul 20 01:05:04 hpve1 systemd[1]: Started Proxmox VE replication runner.

hpve2:
Code:
Jul 20 01:04:27 hpve2 corosync[3908]:  [TOTEM ] Retransmit List: 3a5 3a6 3a7
Jul 20 01:04:27 hpve2 corosync[3908]:  [TOTEM ] Retransmit List: 3a5 3a6 3a7
Jul 20 01:04:27 hpve2 corosync[3908]:  [TOTEM ] Retransmit List: 3a5 3a6 3a7
Jul 20 01:04:27 hpve2 corosync[3908]:  [TOTEM ] A new membership (192.168.1.202:3444) was formed. Members left: 1
Jul 20 01:04:27 hpve2 corosync[3908]:  [TOTEM ] Failed to receive the leave message. failed: 1
Jul 20 01:04:27 hpve2 corosync[3908]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 20 01:04:27 hpve2 corosync[3908]:  [QUORUM] Members[1]: 2
Jul 20 01:04:27 hpve2 corosync[3908]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 20 01:04:28 hpve2 pvestatd[3146]: could not activate storage 'LocalSpace', zfs error: cannot import 'rpool': no such pool available
Jul 20 01:04:28 hpve2 corosync[3908]: notice  [TOTEM ] A new membership (192.168.1.202:3448) was formed. Members
Jul 20 01:04:28 hpve2 corosync[3908]:  [TOTEM ] A new membership (192.168.1.202:3448) was formed. Members
Jul 20 01:04:28 hpve2 corosync[3908]: notice  [QUORUM] Members[1]: 2
Jul 20 01:04:28 hpve2 corosync[3908]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 20 01:04:28 hpve2 corosync[3908]:  [QUORUM] Members[1]: 2
Jul 20 01:04:28 hpve2 corosync[3908]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 20 01:04:30 hpve2 corosync[3908]: notice  [TOTEM ] A new membership (192.168.1.202:3452) was formed. Members
Jul 20 01:04:30 hpve2 corosync[3908]:  [TOTEM ] A new membership (192.168.1.202:3452) was formed. Members
Jul 20 01:04:30 hpve2 corosync[3908]: notice  [QUORUM] Members[1]: 2
Jul 20 01:04:30 hpve2 corosync[3908]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 20 01:04:30 hpve2 corosync[3908]:  [QUORUM] Members[1]: 2
Jul 20 01:04:30 hpve2 corosync[3908]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 20 01:04:31 hpve2 corosync[3908]: notice  [TOTEM ] A new membership (192.168.1.202:3456) was formed. Members
Jul 20 01:04:31 hpve2 corosync[3908]:  [TOTEM ] A new membership (192.168.1.202:3456) was formed. Members
Jul 20 01:04:31 hpve2 corosync[3908]: notice  [QUORUM] Members[1]: 2
Jul 20 01:04:31 hpve2 corosync[3908]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 20 01:04:31 hpve2 corosync[3908]:  [QUORUM] Members[1]: 2
Jul 20 01:04:31 hpve2 corosync[3908]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 20 01:04:37 hpve2 pvestatd[3146]: could not activate storage 'LocalSpace', zfs error: cannot import 'rpool': no such pool available
Jul 20 01:04:47 hpve2 pvestatd[3146]: could not activate storage 'LocalSpace', zfs error: cannot import 'rpool': no such pool available
Jul 20 01:04:57 hpve2 pvestatd[3146]: could not activate storage 'LocalSpace', zfs error: cannot import 'rpool': no such pool available
Jul 20 01:05:00 hpve2 systemd[1]: Starting Proxmox VE replication runner...
Jul 20 01:05:02 hpve2 systemd[1]: Started Proxmox VE replication runner.
Jul 20 01:05:07 hpve2 pvestatd[3146]: could not activate storage 'LocalSpace', zfs error: cannot import 'rpool': no such pool available

Is it really because the node can't find that one ZFS pool?

I suspect the cause is on the master, but where do I start here?

Regards,
René
 
FYI: There is no master node - all nodes are equal peers.

The corosync problem is most likely multicast-related, see:

https://pve.proxmox.com/wiki/Multicast_notes

I would test that first with omping.
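
A minimal test could look like this (run in parallel on both nodes; the hostnames are obviously yours, the flags follow the wiki page above):

Code:
# short multicast test, ~10 seconds
omping -c 10000 -i 0.001 -F -q hpve1 hpve2

# longer test (~10 minutes) to catch IGMP snooping timeouts
omping -c 600 -i 1 -q hpve1 hpve2

If the long test starts dropping multicast packets after a few minutes, IGMP snooping on the switch is the usual suspect.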

Right, I forgot :)

"HP - ProCurve
HP Procurve switches, by default, has disabled IGMP on all vlans as can be seen by this config snippet:"

Unfortunately my ProCurve is only smart-managed, i.e. no IGMP. Since I only have 2 nodes here anyway, I'll connect them directly to each other tonight and set up a dedicated cluster network (rough sketch below).
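
The plan, roughly sketched; the interface name eth2 and the 10.10.10.0/30 subnet are just placeholders for the direct link:

Code:
# /etc/network/interfaces on hpve1 (hpve2 mirrors this with .2)
auto eth2
iface eth2 inet static
        address 10.10.10.1
        netmask 255.255.255.252

Corosync would then use this subnet as its ring instead of the LACP bond.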

Thanks for the pointer.

Best regards,
René


_____
Edit:

As described, a direct connection solved my problem :)

A new switch is already on order ;-)

Best regards
 