4-node cluster - automatic reboots of nodes

Uwe

New Member
Jul 19, 2019
Hello everyone,

we are running a Proxmox cluster with 4 hardware nodes (HP DL360 G9).
So far everything is fine!

As shared storage for the VMs we use an HP 3PAR with a separate SAN volume for each node. These volumes are kept in sync with GlusterFS.
That works very well so far.

I am now updating the nodes to newer Proxmox versions.
With the first one that still went fine...

The cluster was running "5.0-23"; the update comes from the enterprise repo and brings the nodes to "5.4.11".

My procedure:
- move all VMs of the node to other nodes
- take the node out of the GlusterFS cluster
- update the node
- reboot the node
- check everything
- add the node back to the GlusterFS cluster
- move the VMs back onto it


As mentioned, this worked fine for the first two nodes.

I always spread the work over several days so that the cluster volume has time to sync.

Today it was the third node's turn, and something disastrous happened!
When that node restarted, 2 other nodes (one on the old version, one on the new) suddenly restarted as well!
All VMs running on them went offline as a result.

Shortly before that, the syslog shows the quorum being renegotiated, since the one node is down, and then suddenly just the reboot. As if someone had simply pressed the reset button. :-/


Can anyone explain this? I obviously don't want to go through it again with the last node that still needs updating. :-/

What other information do you need?

Many thanks and best regards
Uwe
 
Is HA active, and are resources defined on the nodes? Otherwise, please post the log from that time period.
 
Hello,

HA is active, but currently no VMs are configured in it, because live migration with HA active does not work at the moment. HA does trigger the migration, but nothing happens. Manual attempts to migrate a VM with HA active fail with error 255.
If I remove the VM from HA, live migration works without problems.

For now I have put that down to the large version differences within the cluster, since there were no such problems before the update.

Here is a log excerpt from one of the nodes that restarted automatically.

At first you can see the quorum being renegotiated. That started roughly when I began updating "pve00". Fair enough, that is to be expected.

At 11:51 the reboot of pve00 started, and shortly afterwards pve02 restarted as well (see log). :-/

MANY THANKS in advance for the support!


Code:
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]: notice  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]: notice  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]: notice  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]: notice  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]: notice  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]: notice  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]: notice  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]: notice  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] entering GATHER state from 11(merge during join).
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] entering GATHER state from 11(merge during join).
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Creating commit token because I am the rep.
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Creating commit token because I am the rep.
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Saving state aru 5 high seq received 5
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Saving state aru 5 high seq received 5
Jul 19 11:48:38 pve02 corosync[3436]: debug   [MAIN  ] Storing new sequence id for ring c4c
Jul 19 11:48:38 pve02 corosync[3436]:  [MAIN  ] Storing new sequence id for ring c4c
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] entering COMMIT state.
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] entering COMMIT state.
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] got commit token
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] entering RECOVERY state.
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] got commit token
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] entering RECOVERY state.
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] TRANS [0] member 172.22.10.156:
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] TRANS [1] member 172.22.10.157:
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] position [0] member 172.22.10.156:
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] previous ring seq c48 rep 172.22.10.156
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] aru 5 high delivered 5 received flag 1
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] position [1] member 172.22.10.157:
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] previous ring seq c48 rep 172.22.10.156
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] aru 5 high delivered 5 received flag 1
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] TRANS [0] member 172.22.10.156:
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Did not need to originate any messages in recovery.
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] TRANS [1] member 172.22.10.157:
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] position [0] member 172.22.10.156:
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] previous ring seq c48 rep 172.22.10.156
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] aru 5 high delivered 5 received flag 1
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] position [1] member 172.22.10.157:
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] previous ring seq c48 rep 172.22.10.156
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] aru 5 high delivered 5 received flag 1
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] got commit token
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Sending initial ORF token
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Did not need to originate any messages in recovery.
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] got commit token
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Sending initial ORF token
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] install seq 0 aru 0 high seq received 0
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] install seq 0 aru 0 high seq received 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] install seq 0 aru 0 high seq received 0
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] install seq 0 aru 0 high seq received 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] install seq 0 aru 0 high seq received 0
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] install seq 0 aru 0 high seq received 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] install seq 0 aru 0 high seq received 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Resetting old ring state
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] recovery to regular 1-0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Can't find UDPU member 172.22.10.158 (should be marked as inactive)
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [MAIN  ] Member left: r(0) ip(172.22.10.158)
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] waiting_trans_ack changed to 1
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] install seq 0 aru 0 high seq received 0
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] entering OPERATIONAL state.
Jul 19 11:48:38 pve02 corosync[3436]: notice  [TOTEM ] A new membership (172.22.10.156:3148) was formed. Members left: 4
Jul 19 11:48:38 pve02 corosync[3436]: notice  [TOTEM ] Failed to receive the leave message. failed: 4
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Resetting old ring state
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] recovery to regular 1-0
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Can't find UDPU member 172.22.10.158 (should be marked as inactive)
Jul 19 11:48:38 pve02 corosync[3436]:  [MAIN  ] Member left: r(0) ip(172.22.10.158)
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] waiting_trans_ack changed to 1
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] entering OPERATIONAL state.
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] A new membership (172.22.10.156:3148) was formed. Members left: 4
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Failed to receive the leave message. failed: 4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [SYNC  ] Committing synchronization for corosync configuration map access
Jul 19 11:48:38 pve02 corosync[3436]: debug   [CMAP  ] Not first sync -> no action
Jul 19 11:48:38 pve02 corosync[3436]:  [SYNC  ] Committing synchronization for corosync configuration map access
Jul 19 11:48:38 pve02 corosync[3436]:  [CMAP  ] Not first sync -> no action
Jul 19 11:48:38 pve02 corosync[3436]: debug   [CPG   ] got joinlist message from node 0x3
Jul 19 11:48:38 pve02 corosync[3436]:  [CPG   ] got joinlist message from node 0x3
Jul 19 11:48:38 pve02 corosync[3436]: debug   [CPG   ] comparing: sender r(0) ip(172.22.10.156) ; members(old:2 left:0)
Jul 19 11:48:38 pve02 corosync[3436]: debug   [CPG   ] comparing: sender r(0) ip(172.22.10.157) ; members(old:2 left:0)
Jul 19 11:48:38 pve02 corosync[3436]: debug   [CPG   ] chosen downlist: sender r(0) ip(172.22.10.156) ; members(old:2 left:0)
Jul 19 11:48:38 pve02 corosync[3436]: debug   [CPG   ] got joinlist message from node 0x2
Jul 19 11:48:38 pve02 corosync[3436]:  [CPG   ] comparing: sender r(0) ip(172.22.10.156) ; members(old:2 left:0)
Jul 19 11:48:38 pve02 corosync[3436]: debug   [SYNC  ] Committing synchronization for corosync cluster closed process group service v1.01
Jul 19 11:48:38 pve02 corosync[3436]: debug   [CPG   ] joinlist_messages[0] group:pve_dcdb_v1\x00, ip:r(0) ip(172.22.10.156) , pid:3396
Jul 19 11:48:38 pve02 corosync[3436]: debug   [CPG   ] joinlist_messages[1] group:pve_kvstore_v1\x00, ip:r(0) ip(172.22.10.156) , pid:3396
Jul 19 11:48:38 pve02 corosync[3436]: debug   [CPG   ] joinlist_messages[2] group:pve_dcdb_v1\x00, ip:r(0) ip(172.22.10.157) , pid:4019
Jul 19 11:48:38 pve02 corosync[3436]: debug   [CPG   ] joinlist_messages[3] group:pve_kvstore_v1\x00, ip:r(0) ip(172.22.10.157) , pid:4019
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] Sending nodelist callback. ring_id = 2/3148
Jul 19 11:48:38 pve02 corosync[3436]:  [CPG   ] comparing: sender r(0) ip(172.22.10.157) ; members(old:2 left:0)
Jul 19 11:48:38 pve02 corosync[3436]:  [CPG   ] chosen downlist: sender r(0) ip(172.22.10.156) ; members(old:2 left:0)
Jul 19 11:48:38 pve02 corosync[3436]:  [CPG   ] got joinlist message from node 0x2
Jul 19 11:48:38 pve02 corosync[3436]:  [SYNC  ] Committing synchronization for corosync cluster closed process group service v1.01
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] got nodeinfo message from cluster node 2
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 4 flags: 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] total_votes=2, expected_votes=4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] node 1 state=2, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] node 2 state=1, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] node 3 state=1, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]:  [CPG   ] joinlist_messages[0] group:pve_dcdb_v1\x00, ip:r(0) ip(172.22.10.156) , pid:3396
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] node 4 state=2, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] got nodeinfo message from cluster node 2
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Jul 19 11:48:38 pve02 corosync[3436]:  [CPG   ] joinlist_messages[1] group:pve_kvstore_v1\x00, ip:r(0) ip(172.22.10.156) , pid:3396
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] got nodeinfo message from cluster node 3
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 4 flags: 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] got nodeinfo message from cluster node 3
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Jul 19 11:48:38 pve02 corosync[3436]:  [CPG   ] joinlist_messages[2] group:pve_dcdb_v1\x00, ip:r(0) ip(172.22.10.157) , pid:4019
Jul 19 11:48:38 pve02 corosync[3436]: debug   [SYNC  ] Committing synchronization for corosync vote quorum service v1.0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] total_votes=2, expected_votes=4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] node 1 state=2, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] node 2 state=1, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] node 3 state=1, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] node 4 state=2, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]: notice  [QUORUM] Members[2]: 2 3
Jul 19 11:48:38 pve02 corosync[3436]: debug   [QUORUM] sending quorum notification to (nil), length = 56
Jul 19 11:48:38 pve02 corosync[3436]: debug   [VOTEQ ] Sending quorum callback, quorate = 0
Jul 19 11:48:38 pve02 corosync[3436]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] waiting_trans_ack changed to 0
Jul 19 11:48:38 pve02 corosync[3436]:  [CPG   ] joinlist_messages[3] group:pve_kvstore_v1\x00, ip:r(0) ip(172.22.10.157) , pid:4019
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] Sending nodelist callback. ring_id = 2/3148
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] got nodeinfo message from cluster node 2
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 4 flags: 0
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] total_votes=2, expected_votes=4
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] node 1 state=2, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] node 2 state=1, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] node 3 state=1, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] node 4 state=2, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] got nodeinfo message from cluster node 2
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] got nodeinfo message from cluster node 3
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 4 flags: 0
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] got nodeinfo message from cluster node 3
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Jul 19 11:48:38 pve02 corosync[3436]:  [SYNC  ] Committing synchronization for corosync vote quorum service v1.0
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] total_votes=2, expected_votes=4
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] node 1 state=2, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] node 2 state=1, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] node 3 state=1, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] node 4 state=2, votes=1, expected=4
Jul 19 11:48:38 pve02 corosync[3436]:  [QUORUM] Members[2]: 2 3
Jul 19 11:48:38 pve02 corosync[3436]:  [QUORUM] sending quorum notification to (nil), length = 56
Jul 19 11:48:38 pve02 corosync[3436]:  [VOTEQ ] Sending quorum callback, quorate = 0
Jul 19 11:48:38 pve02 corosync[3436]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] waiting_trans_ack changed to 0
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] entering GATHER state from 11(merge during join).
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] entering GATHER state from 11(merge during join).
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] entering GATHER state from 11(merge during join).
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] entering GATHER state from 11(merge during join).
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] entering GATHER state from 11(merge during join).
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] entering GATHER state from 11(merge during join).
Jul 19 11:52:04 pve02 systemd[1]: Started Create list of required static device nodes for the current kernel.
Jul 19 11:52:04 pve02 kernel: [    0.000000] Linux version 4.10.15-1-pve (root@stretchbuild) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP PVE 4.10.15-15 (Fri, 23 Jun 2017 08:57:55 +0200) ()
Jul 19 11:52:04 pve02 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.10.15-1-pve root=/dev/mapper/pve-root ro quiet
Jul 19 11:52:04 pve02 systemd[1]: Starting Create Static Device Nodes in /dev...
Jul 19 11:52:04 pve02 kernel: [    0.000000] KERNEL supported cpus:
Jul 19 11:52:04 pve02 systemd[1]: Mounted POSIX Message Queue File System.
Jul 19 11:52:04 pve02 systemd[1]: Mounted Debug File System.
Jul 19 11:52:04 pve02 kernel: [    0.000000]   Intel GenuineIntel
Jul 19 11:52:04 pve02 systemd[1]: Mounted Huge Pages File System.
Jul 19 11:52:04 pve02 kernel: [    0.000000]   AMD AuthenticAMD
Jul 19 11:52:04 pve02 kernel: [    0.000000]   Centaur CentaurHauls
Jul 19 11:52:04 pve02 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jul 19 11:52:04 pve02 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jul 19 11:52:04 pve02 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jul 19 11:52:04 pve02 systemd[1]: Started Remount Root and Kernel File Systems.
Jul 19 11:52:04 pve02 kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Jul 19 11:52:04 pve02 systemd[1]: Starting Flush Journal to Persistent Storage...
Jul 19 11:52:04 pve02 kernel: [    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Jul 19 11:52:04 pve02 kernel: [    0.000000] e820: BIOS-provided physical RAM map:
Jul 19 11:52:04 pve02 systemd[1]: Starting Load/Save Random Seed...
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000093fff] usable
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000094000-0x000000000009ffff] reserved
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000005a7a0fff] usable
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x000000005a7a1000-0x000000005b5e0fff] reserved
Jul 19 11:52:04 pve02 systemd-modules-load[589]: Inserted module 'iscsi_tcp'
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x000000005b5e1000-0x00000000790fefff] usable
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x00000000790ff000-0x00000000791fefff] reserved
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x00000000791ff000-0x000000007b5fefff] ACPI NVS
Jul 19 11:52:04 pve02 systemd[1]: Started Flush Journal to Persistent Storage.
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x000000007b5ff000-0x000000007b7fefff] ACPI data
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x000000007b7ff000-0x000000007b7fffff] usable
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x000000007b800000-0x000000008fffffff] reserved
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x00000000ff800000-0x00000000ffffffff] reserved
Jul 19 11:52:04 pve02 kernel: [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000607fffffff] usable
Jul 19 11:52:04 pve02 systemd[1]: Started Load/Save Random Seed.
Jul 19 11:52:04 pve02 kernel: [    0.000000] NX (Execute Disable) protection: active
Jul 19 11:52:04 pve02 kernel: [    0.000000] SMBIOS 2.8 present.
Jul 19 11:52:04 pve02 kernel: [    0.000000] DMI: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
Jul 19 11:52:04 pve02 kernel: [    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Jul 19 11:52:04 pve02 kernel: [    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
Jul 19 11:52:04 pve02 kernel: [    0.000000] e820: last_pfn = 0x6080000 max_arch_pfn = 0x400000000
Jul 19 11:52:04 pve02 systemd[1]: Mounted RPC Pipe File System.
Jul 19 11:52:04 pve02 kernel: [    0.000000] MTRR default type: write-back
Jul 19 11:52:04 pve02 kernel: [    0.000000] MTRR fixed ranges enabled:
Jul 19 11:52:04 pve02 kernel: [    0.000000]   00000-9FFFF write-back
Jul 19 11:52:04 pve02 kernel: [    0.000000]   A0000-BFFFF uncachable
Jul 19 11:52:04 pve02 kernel: [    0.000000]   C0000-DFFFF write-protect
Jul 19 11:52:04 pve02 kernel: [    0.000000]   E0000-EFFFF uncachable
Jul 19 11:52:04 pve02 kernel: [    0.000000]   F0000-FFFFF write-protect
Jul 19 11:52:04 pve02 systemd[1]: Started LVM2 metadata daemon.
Jul 19 11:52:04 pve02 kernel: [    0.000000] MTRR variable ranges enabled:
Jul 19 11:52:04 pve02 kernel: [    0.000000]   0 base 000080000000 mask 3FFF80000000 uncachable
Jul 19 11:52:04 pve02 kernel: [    0.000000]   1 base 006080000000 mask 3FFF80000000 uncachable
Jul 19 11:52:04 pve02 kernel: [    0.000000]   2 base 008000000000 mask 3F8000000000 uncachable
Jul 19 11:52:04 pve02 kernel: [    0.000000]   3 base 010000000000 mask 3F0000000000 uncachable
Jul 19 11:52:04 pve02 kernel: [    0.000000]   4 base 038000000000 mask 3FC000000000 uncachable
Jul 19 11:52:04 pve02 kernel: [    0.000000]   5 base 00007C000000 mask 3FFFFC000000 uncachable
Jul 19 11:52:04 pve02 systemd-modules-load[589]: Inserted module 'ib_iser'
Jul 19 11:52:04 pve02 kernel: [    0.000000]   6 base 00007FC00000 mask 3FFFFFC00000 uncachable
Jul 19 11:52:04 pve02 kernel: [    0.000000]   7 disabled
Jul 19 11:52:04 pve02 kernel: [    0.000000]   8 disabled
Jul 19 11:52:04 pve02 kernel: [    0.000000]   9 disabled
 

Attachment: ha.PNG (40.2 KB)

It looks as if the corosync network is overloaded. Is it a dedicated network, (physically) separated from the VM traffic?
 
That would really surprise me.

Yes, we have all of that separated and use bonds on the hosts' NICs.

We have a 2x 1GE bond for admin access (SSH and the web interface),
then 2x 1GE ONLY for corosync,
2x 10GE for the VMs' internal and Internet traffic, and 2x 10GE connecting the VMs to our NetApp in case a VM needs external shared storage.

As mentioned, all of them are active/passive bonds.

Overloading the 1GE corosync link would be quite something... :-/

The GlusterFS sync runs over the 10GE StorageLAN link.
 
Normally the 'Retransmit List ...' message appears when there are problems with the network.
Please post the corosync config (/etc/pve/corosync.conf), the '/etc/hosts' file and, if possible, '/etc/network/interfaces'. (Careful if they contain public IPs!)
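
To get a feel for how dense the retransmit messages are, they can be counted per second. The following is only a minimal sketch: the sample lines are copied from the log excerpt posted earlier in the thread, while the function name `retransmits_per_second` and the regular expression are illustrative assumptions, not part of corosync or Proxmox. Note that this syslog contains most messages twice (once with a `debug`/`notice` prefix, once without), so the counts overstate the number of actual events.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt posted earlier in the thread.
SAMPLE_LOG = """\
Jul 19 11:48:38 pve02 corosync[3436]: debug   [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]: notice  [TOTEM ] Retransmit List: 3 4
Jul 19 11:48:38 pve02 corosync[3436]:  [TOTEM ] Retransmit List 2
Jul 19 11:48:38 pve02 corosync[3436]: notice  [TOTEM ] A new membership (172.22.10.156:3148) was formed. Members left: 4
"""

# Syslog timestamp (one-second resolution), then the TOTEM retransmit marker.
RETRANSMIT = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) .*\[TOTEM \] Retransmit List"
)


def retransmits_per_second(log_text: str) -> Counter:
    """Count 'Retransmit List' lines per syslog timestamp (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RETRANSMIT.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for second, count in retransmits_per_second(SAMPLE_LOG).items():
        print(second, count)
```

Fed with the full syslog instead of `SAMPLE_LOG`, the same function would show whether the retransmits only appear around the node reboot or were already present beforehand.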
 
But would that have such an effect? That is, when 1 host restarts, other nodes restart along with it?
The odd thing is that pve02 is still on the old Proxmox version, whereas pve04 was also restarted automatically but is already on the new version.
pve06 is also on the new version, but it stayed online.


Code:
root@pve02:~# cat /etc/pve/corosync.conf
logging {
  debug: on
  to_syslog: yes
}

nodelist {
  node {
    name: pve06
    nodeid: 4
    quorum_votes: 1
    ring0_addr: pve06-sync.ak.hadcs.de
  }

  node {
    name: pve04
    nodeid: 3
    quorum_votes: 1
    ring0_addr: pve04-sync.ak.hadcs.de
  }

  node {
    name: pve02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: pve02-sync.ak.hadcs.de
  }

  node {
    name: pve00
    nodeid: 1
    quorum_votes: 1
    ring0_addr: pve00-sync.ak.hadcs.de
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: sdp-pve-cluster
  config_version: 10
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2
}





root@pve02:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
172.22.10.151 pve02.ak.hadcs.de pve02 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

#corosync hosts
172.22.10.155 pve00-sync.ak.hadcs.de pve00-sync
172.22.10.156 pve02-sync.ak.hadcs.de pve02-sync
172.22.10.157 pve04-sync.ak.hadcs.de pve04-sync
172.22.10.158 pve06-sync.ak.hadcs.de pve06-sync









root@pve02:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage part of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto eno3
iface eno3 inet manual

auto eno4
iface eno4 inet manual

auto eno49
iface eno49 inet manual

auto eno50
iface eno50 inet manual

auto ens2f0
iface ens2f0 inet manual

auto ens2f1
iface ens2f1 inet manual

auto bond0
iface bond0 inet static
        address  172.22.10.151
        netmask  255.255.255.0
        gateway  172.22.10.1
        slaves eno1 eno2
        bond_miimon 100
        bond_mode active-backup
#Admin Access
iface bond0 inet6 static
        address  fd6d:05f8:67e9:0010:172:22:10:151
        netmask  64
        gateway  fd6d:05f8:67e9:0010:172:22:10:1

auto bond1
iface bond1 inet static
        address  172.22.10.156
        netmask  255.255.255.0
        slaves eno3 eno4
        bond_miimon 100
        bond_mode active-backup
#Corosync

auto bond2
iface bond2 inet manual
        slaves eno49 eno50
        bond_miimon 100
        bond_mode active-backup
#Trunk AdminLAN + KoppelLAN

auto bond3
iface bond3 inet manual
        slaves ens2f0 ens2f1
        bond_miimon 100
        bond_mode active-backup
#Trunk StorageLAN

auto vmbr0
iface vmbr0 inet manual
        bridge_ports bond2
        bridge_stp off
        bridge_fd 0
        bridge_vlan_aware yes
#Trunk AL+KL

auto vmbr1
iface vmbr1 inet manual
        bridge_ports bond3
        bridge_stp off
        bridge_fd 0
        bridge_vlan_aware yes
#Trunk SL

auto vmbr1.3040
iface vmbr1.3040 inet static
        address 172.22.100.151
        netmask  255.255.255.0
        bridge_ports bond3
        bridge_stp off
        bridge_fd 0
#NFS Share+GlusterFS
 
bond0 and bond1 are in the same subnet on different NICs; this is presumably the cause of the problems.
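
The overlap can be checked mechanically. This is a small sketch using Python's standard `ipaddress` module; the addresses and netmasks are taken from the posted /etc/network/interfaces, while the dictionary `INTERFACES` and the helper `overlapping_interfaces` are made up for illustration.

```python
import ipaddress
from itertools import combinations

# Addresses and netmasks as they appear in the posted /etc/network/interfaces.
INTERFACES = {
    "bond0": "172.22.10.151/255.255.255.0",        # admin access
    "bond1": "172.22.10.156/255.255.255.0",        # corosync
    "vmbr1.3040": "172.22.100.151/255.255.255.0",  # NFS share + GlusterFS
}


def overlapping_interfaces(interfaces):
    """Return pairs of interface names whose IPv4 networks overlap (hypothetical helper)."""
    networks = {
        name: ipaddress.ip_interface(addr).network
        for name, addr in interfaces.items()
    }
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


if __name__ == "__main__":
    for a, b in overlapping_interfaces(INTERFACES):
        print(f"{a} and {b} share a network")
```

With the values above, bond0 and bond1 both fall into 172.22.10.0/24, while the StorageLAN interface sits in its own network.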
 
Hmm... but why?
Because routing then cannot tell which interface to use, since the same network is routed over 2 interfaces?

Code:
root@pve02:~# ip route show
default via 172.22.10.1 dev bond0 onlink
172.22.10.0/24 dev bond0 proto kernel scope link src 172.22.10.151
172.22.10.0/24 dev bond1 proto kernel scope link src 172.22.10.156
172.22.100.0/24 dev vmbr1.3040 proto kernel scope link src 172.22.100.151

Strange, though, that there were no such problems until now... we also have a second site with the same settings, although (still) on a different Proxmox version...
 
Thanks for the info!
That does make sense... I will rebuild this to be safe and move corosync into a different network.

What surprises me, though, is that the corosync messages only pile up like this once a node in the cluster is down and, even worse, that other nodes then simply decide to stop service and reboot themselves.

Can anyone explain that?

Maybe because corosync cannot work properly, just when it needs to, due to the identical routes, and the system then prefers to bail out entirely?
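
One possible reading of the posted log, offered as a sketch and not as a confirmed diagnosis: the VOTEQ lines report `total_votes=2, expected_votes=4` and `quorate = 0`, with `Members[2]: 2 3`. According to the posted corosync.conf, node IDs 2 and 3 are pve02 and pve04, which are exactly the two nodes that rebooted. With the default votequorum majority rule, four votes need three for quorum, so once a second node drops out of the membership the remaining two are no longer quorate. Proxmox VE's HA stack uses a watchdog to reset nodes that lose quorum; whether that watchdog was armed on these nodes is not shown in the thread and is an assumption here. The function names below are illustrative only.

```python
def votes_needed(expected_votes: int) -> int:
    """Strict majority of the expected votes (default votequorum rule)."""
    return expected_votes // 2 + 1


def is_quorate(total_votes: int, expected_votes: int) -> bool:
    """True if the visible votes reach the majority threshold."""
    return total_votes >= votes_needed(expected_votes)


# Four nodes with quorum_votes: 1 each, per the posted corosync.conf.
EXPECTED_VOTES = 4

if __name__ == "__main__":
    for total in range(EXPECTED_VOTES, 0, -1):
        state = "quorate" if is_quorate(total, EXPECTED_VOTES) else "NOT quorate"
        print(f"{total} of {EXPECTED_VOTES} votes visible: {state}")
```

In a 4-node cluster there is therefore no margin during maintenance: with one node deliberately down, losing contact with just one more node (for example through network trouble on the corosync link) already costs the rest their quorum.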
 
