No quorum after upgrade

fflorian

Member
Feb 11, 2020
Hello everyone. I am somewhat stuck with my 3-node cluster, so I am asking for your help.
A few days ago I performed the upgrade from Proxmox 5.4 to 6 according to the guide, and of course upgraded Corosync to version 3 beforehand.
Since rebooting the machines, only two of them run more or less smoothly. The third machine is still reachable on the network (it is also the backup space for the other two) and the daily backups still land on it, but it no longer gets quorum.

[Screenshot attachment: Bildschirmfoto 2020-02-11 um 08.49.32.png]
pvecm status of the problem machine:
Cluster information
-------------------
Name: pool
Config Version: 15
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Tue Feb 11 08:45:08 2020
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 2.4d698
Quorate: No

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 192.168.3.4 (local)

The corosync config files look identical on all 3 machines:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: srv1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: srv1
  }
  node {
    name: srv3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: srv3
  }
  node {
    name: srv4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: srv4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pool
  config_version: 15
  interface {
    bindnetaddr: 192.168.3.2
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

The service status looks like this:

Feb 11 08:50:31 srv4 corosync[1525]: [TOTEM ] A new membership (1.4d974) was formed. Members
Feb 11 08:50:31 srv4 corosync[1525]: [CPG ] downlist left_list: 0 received
Feb 11 08:50:31 srv4 corosync[1525]: [CPG ] downlist left_list: 0 received
Feb 11 08:50:31 srv4 corosync[1525]: [QUORUM] Members[2]: 1 3
Feb 11 08:50:31 srv4 corosync[1525]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 11 08:50:36 srv4 corosync[1525]: [TOTEM ] A new membership (1.4d980) was formed. Members
Feb 11 08:50:36 srv4 corosync[1525]: [CPG ] downlist left_list: 0 received
Feb 11 08:50:36 srv4 corosync[1525]: [CPG ] downlist left_list: 0 received
Feb 11 08:50:36 srv4 corosync[1525]: [QUORUM] Members[2]: 1 3
Feb 11 08:50:36 srv4 corosync[1525]: [MAIN ] Completed service synchronization, ready to provide service


Here is the status from another machine:
Cluster information
-------------------
Name: pool
Config Version: 15
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Tue Feb 11 08:47:18 2020
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000003
Ring ID: 1.4d7b8
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.3.2
0x00000003 1 192.168.3.5 (local)
 
Hello,

ring0_addr: srv3

can the srv3 hostname actually be resolved correctly from all nodes? With corosync 2 that was not always required thanks to multicast, but for corosync 3 it is a must.

Also, please make sure that /etc/pve/corosync.conf on the problematic host does not differ from /etc/corosync/corosync.conf.

The service status looks like this:

Could you please also post the logs / the service status from the problematic host?

journalctl -u corosync -u pve-cluster
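
A quick way to run both checks on each node could look like this (using the node names from the nodelist above; getent queries the normal NSS resolver):

# check that every peer hostname resolves to the expected cluster IP
for h in srv1 srv3 srv4; do getent hosts "$h"; done

# compare the cluster-wide config with the local copy; no output means identical
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf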
 

The corosync.conf files match as far as I can tell; here is /etc/pve/corosync.conf from srv3:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: srv1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: srv1
  }
  node {
    name: srv3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: srv3
  }
  node {
    name: srv4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: srv4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pool
  config_version: 15
  interface {
    bindnetaddr: 192.168.3.2
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

srv3 resolves correctly.

Here is the journal:
-- Logs begin at Mon 2020-02-10 14:59:55 CET, end at Tue 2020-02-11 09:22:38 CET. --
Feb 10 15:00:09 srv3 systemd[1]: Starting The Proxmox VE cluster filesystem...
Feb 10 15:00:09 srv3 pmxcfs[2015]: [quorum] crit: quorum_initialize failed: 2
Feb 10 15:00:09 srv3 pmxcfs[2015]: [quorum] crit: can't initialize service
Feb 10 15:00:09 srv3 pmxcfs[2015]: [confdb] crit: cmap_initialize failed: 2
Feb 10 15:00:09 srv3 pmxcfs[2015]: [confdb] crit: can't initialize service
Feb 10 15:00:09 srv3 pmxcfs[2015]: [dcdb] crit: cpg_initialize failed: 2
Feb 10 15:00:09 srv3 pmxcfs[2015]: [dcdb] crit: can't initialize service
Feb 10 15:00:09 srv3 pmxcfs[2015]: [status] crit: cpg_initialize failed: 2
Feb 10 15:00:09 srv3 pmxcfs[2015]: [status] crit: can't initialize service
Feb 10 15:00:10 srv3 systemd[1]: Started The Proxmox VE cluster filesystem.
Feb 10 15:00:10 srv3 systemd[1]: Starting Corosync Cluster Engine...
Feb 10 15:00:10 srv3 corosync[2028]: [MAIN ] Corosync Cluster Engine 3.0.3 starting up
Feb 10 15:00:10 srv3 corosync[2028]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlco
Feb 10 15:00:10 srv3 corosync[2028]: [MAIN ] interface section bindnetaddr is used together with nodelist. Node
Feb 10 15:00:10 srv3 corosync[2028]: [MAIN ] Please migrate config file to nodelist.
Feb 10 15:00:10 srv3 corosync[2028]: [TOTEM ] Initializing transport (Kronosnet).
Feb 10 15:00:10 srv3 corosync[2028]: [TOTEM ] kronosnet crypto initialized: aes256/sha256
Feb 10 15:00:10 srv3 corosync[2028]: [TOTEM ] totemknet initialized
Feb 10 15:00:10 srv3 corosync[2028]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-g
Feb 10 15:00:11 srv3 corosync[2028]: [SERV ] Service engine loaded: corosync configuration map access [0]
Feb 10 15:00:11 srv3 corosync[2028]: [QB ] server name: cmap
Feb 10 15:00:11 srv3 corosync[2028]: [SERV ] Service engine loaded: corosync configuration service [1]
Feb 10 15:00:11 srv3 corosync[2028]: [QB ] server name: cfg
Feb 10 15:00:11 srv3 corosync[2028]: [SERV ] Service engine loaded: corosync cluster closed process group servi
Feb 10 15:00:11 srv3 corosync[2028]: [QB ] server name: cpg
Feb 10 15:00:11 srv3 corosync[2028]: [SERV ] Service engine loaded: corosync profile loading service [4]
Feb 10 15:00:11 srv3 corosync[2028]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
Feb 10 15:00:11 srv3 corosync[2028]: [WD ] Watchdog not enabled by configuration
Feb 10 15:00:11 srv3 corosync[2028]: [WD ] resource load_15min missing a recovery key.
Feb 10 15:00:11 srv3 corosync[2028]: [WD ] resource memory_used missing a recovery key.
Feb 10 15:00:11 srv3 corosync[2028]: [WD ] no resources configured.
Feb 10 15:00:11 srv3 corosync[2028]: [SERV ] Service engine loaded: corosync watchdog service [7]
Feb 10 15:00:11 srv3 corosync[2028]: [QUORUM] Using quorum provider corosync_votequorum
Feb 10 15:00:11 srv3 corosync[2028]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 10 15:00:11 srv3 corosync[2028]: [QB ] server name: votequorum
Feb 10 15:00:11 srv3 corosync[2028]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 10 15:00:11 srv3 corosync[2028]: [QB ] server name: quorum
Feb 10 15:00:11 srv3 corosync[2028]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 0)
Feb 10 15:00:11 srv3 corosync[2028]: [KNET ] host: host: 1 has no active links
Feb 10 15:00:11 srv3 corosync[2028]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 10 15:00:11 srv3 corosync[2028]: [KNET ] host: host: 1 has no active links
Feb 10 15:00:11 srv3 corosync[2028]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 10 15:00:11 srv3 corosync[2028]: [KNET ] host: host: 1 has no active links
Feb 10 15:00:11 srv3 corosync[2028]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 0)
Feb 10 15:00:11 srv3 corosync[2028]: [KNET ] host: host: 3 has no active links
Feb 10 15:00:11 srv3 corosync[2028]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 10 15:00:11 srv3 corosync[2028]: [KNET ] host: host: 3 has no active links
Feb 10 15:00:11 srv3 corosync[2028]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 10 15:00:11 srv3 corosync[2028]: [KNET ] host: host: 3 has no active links
Feb 10 15:00:11 srv3 corosync[2028]: [TOTEM ] A new membership (2.2a51c) was formed. Members joined: 2
Feb 10 15:00:11 srv3 corosync[2028]: [CPG ] downlist left_list: 0 received
Feb 10 15:00:11 srv3 corosync[2028]: [QUORUM] Members[1]: 2
Feb 10 15:00:11 srv3 corosync[2028]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 10 15:00:11 srv3 systemd[1]: Started Corosync Cluster Engine.
Feb 10 15:00:13 srv3 corosync[2028]: [KNET ] rx: host: 1 link: 0 is up
Feb 10 15:00:13 srv3 corosync[2028]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 10 15:00:13 srv3 corosync[2028]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Feb 10 15:00:13 srv3 corosync[2028]: [KNET ] pmtud: Global data MTU changed to: 1397
Feb 10 15:00:15 srv3 pmxcfs[2015]: [status] notice: update cluster info (cluster name pool, version = 15)
Feb 10 15:00:15 srv3 corosync[2028]: [TOTEM ] Token has not been received in 1331 ms
Feb 10 15:00:18 srv3 corosync[2028]: [TOTEM ] A new membership (2.2a524) was formed. Members
Feb 10 15:00:18 srv3 corosync[2028]: [CPG ] downlist left_list: 0 received
Feb 10 15:00:18 srv3 corosync[2028]: [QUORUM] Members[1]: 2
Feb 10 15:00:18 srv3 corosync[2028]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 10 15:00:18 srv3 pmxcfs[2015]: [dcdb] notice: members: 2/2015
Feb 10 15:00:18 srv3 pmxcfs[2015]: [dcdb] notice: all data is up to date
Feb 10 15:00:18 srv3 pmxcfs[2015]: [status] notice: members: 2/2015
Feb 10 15:00:18 srv3 pmxcfs[2015]: [status] notice: all data is up to date
Feb 10 15:00:19 srv3 corosync[2028]: [TOTEM ] Token has not been received in 1238 ms
Feb 10 15:00:21 srv3 corosync[2028]: [TOTEM ] Token has not been received in 2889 ms
Feb 10 15:00:23 srv3 corosync[2028]: [TOTEM ] A new membership (2.2a530) was formed. Members
Feb 10 15:00:23 srv3 corosync[2028]: [CPG ] downlist left_list: 0 received
Feb 10 15:00:23 srv3 corosync[2028]: [QUORUM] Members[1]: 2

This then keeps repeating:

Feb 10 15:47:33 srv3 corosync[2028]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 10 15:47:34 srv3 corosync[2028]: [TOTEM ] Token has not been received in 1244 ms
Feb 10 15:47:36 srv3 corosync[2028]: [TOTEM ] Token has not been received in 2938 ms
Feb 10 15:47:38 srv3 corosync[2028]: [TOTEM ] A new membership (2.2be14) was formed. Members
Feb 10 15:47:38 srv3 corosync[2028]: [CPG ] downlist left_list: 0 received
Feb 10 15:47:38 srv3 corosync[2028]: [QUORUM] Members[1]: 2

This line:
cpg_initialize failed: 2

does that mean I have a problem with the private key?
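
One quick way to rule a key mismatch in or out would be to compare the authkey checksum on every node; identical output on all three machines means the key itself is fine:

# run on each node and compare the checksums
md5sum /etc/corosync/authkey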
 
It is and remains a mystery to me. On the faulty machine, which had been restarted about 10 times in the meantime, I stopped both the corosync service and the pve-cluster service, got access to the filesystem with "pmxcfs -l", then changed the hostnames in /etc/pve/corosync.conf to IP addresses and rebooted the server. Now everything runs.
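
In commands, roughly what that amounted to (a sketch from memory rather than an exact transcript):

# stop the cluster stack on the broken node
systemctl stop pve-cluster corosync

# start pmxcfs in local mode to get writable access to /etc/pve without quorum
pmxcfs -l

# replace the ring0_addr hostnames with the node IPs
# (the docs also say to bump config_version whenever this file is edited)
nano /etc/pve/corosync.conf

# stop the local pmxcfs again and reboot
killall pmxcfs
reboot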

But the corosync.conf now looks completely identical to before again, with hostnames instead of IP addresses... I don't understand it o_O

That is only a first result, though.
My backup is currently running, so I won't touch anything and will wait until tomorrow. Tomorrow morning another restart is due, and I will keep you posted.
 
The only solution was to change the corosync config on every server to the IP address instead of the hostname. Everything is running stably again now.
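
For reference, the nodelist then looks roughly like this (IP addresses taken from the pvecm output earlier in the thread; remember that config_version has to be increased when editing /etc/pve/corosync.conf):

nodelist {
  node {
    name: srv1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.3.2
  }
  node {
    name: srv3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.3.4
  }
  node {
    name: srv4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.3.5
  }
}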
 
