Corosync Cluster Engine does not start

Gahrt
Oct 2, 2018
Dear community, I hope you can help me. I have been stuck on the following problem for several days:

I have two servers running Proxmox (pve-manager/5.2-9/4b30e8f9, running kernel: 4.15.18-5-pve). I created a cluster on node1 (hermes) and then tried to add node2 (nike) to that cluster. Unfortunately, this always fails.

The two servers are connected via VLAN, and according to omping, multicast is possible.
hermes (10.8.0.1)
nike (10.8.0.6)
The nodes are also entered with these addresses in the hosts files.

corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: hermes
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.8.0.1
  }
  node {
    name: nike
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.8.0.6
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: brainClust
  config_version: 7
  interface {
    bindnetaddr: 10.8.0.1
    ringnumber: 0
    member {
        memberaddr: 10.8.0.1
    }
    member {
        memberaddr: 10.8.0.6
    }
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

systemctl status corosync.service
Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2018-10-02 04:57:18 UTC; 4min 51s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 3544 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=20)
 Main PID: 3544 (code=exited, status=20)
      CPU: 45ms

Oct 02 04:57:18 nike corosync[3544]: info    [WD    ] no resources configured.
Oct 02 04:57:18 nike corosync[3544]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Oct 02 04:57:18 nike corosync[3544]: notice  [QUORUM] Using quorum provider corosync_votequorum
Oct 02 04:57:18 nike corosync[3544]: crit    [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
Oct 02 04:57:18 nike corosync[3544]: error   [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'con
Oct 02 04:57:18 nike corosync[3544]: error   [MAIN  ] Corosync Cluster Engine exiting with status 20 at service.c:356
Oct 02 04:57:18 nike systemd[1]: corosync.service: Main process exited, code=exited, status=20/n/a
Oct 02 04:57:18 nike systemd[1]: Failed to start Corosync Cluster Engine.
Oct 02 04:57:18 nike systemd[1]: corosync.service: Unit entered failed state.
Oct 02 04:57:18 nike systemd[1]: corosync.service: Failed with result 'exit-code'.

pvecm add
Code:
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1538456867.sql.gz'
delete old backup '/var/lib/pve-cluster/backup/config-1538415458.sql.gz'
Job for corosync.service failed because the control process exited with error code.
starting pve-cluster failed: See "systemctl status corosync.service" and "journalctl -xe" for details.
root@nike:/etc/pve#

omping
Code:
root@nike:/etc/pve# omping nike hermes
hermes : waiting for response msg
hermes : waiting for response msg
hermes : waiting for response msg
hermes : waiting for response msg
hermes : joined (S,G) = (*, 232.43.211.234), pinging
hermes :   unicast, seq=1, size=69 bytes, dist=0, time=1.068ms
hermes : multicast, seq=1, size=69 bytes, dist=0, time=1.087ms

journalctl -xe
Code:
Oct 02 05:11:36 nike pveproxy[1781]: worker 5763 started
Oct 02 05:11:36 nike pveproxy[1781]: worker 5764 started
Oct 02 05:11:36 nike pveproxy[5763]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) a
 
Oct 02 04:57:18 nike corosync[3544]: crit [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
Oct 02 04:57:18 nike corosync[3544]: error [SERV ] Service engine 'corosync_quorum' failed to load for reason 'con
The second line is cut off and could point to the cause of the error. If possible, please post it in full.

Just to be on the safe side: for multicast tests, omping has to run on all nodes at the same time; see https://pve.proxmox.com/wiki/Multicast_notes
Please also post the output of both commands from both nodes here.

Just to rule it out: there is no firewall active between the nodes?
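
For illustration, the test commands suggested on the linked Multicast notes page look roughly like the following; the hostnames are taken from this thread, and the exact flags should be checked against the wiki. Both commands have to run on hermes and nike in parallel.
Code:
# short burst test, run on all nodes at the same time
omping -c 10000 -i 0.001 -F -q hermes nike

# longer test (about 10 minutes) to catch IGMP snooping / querier problems
omping -c 600 -i 1 -q hermes nike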
 
Thanks for the quick response. The additional information is attached below. There should be no firewall between the systems. The two servers are also connected to each other via VPN.

omping nike
Code:
root@nike:/etc/pve# omping nike hermes
hermes : waiting for response msg
hermes : waiting for response msg
hermes : waiting for response msg
hermes : waiting for response msg
hermes : waiting for response msg
hermes : joined (S,G) = (*, 232.43.211.234), pinging
hermes :   unicast, seq=1, size=69 bytes, dist=0, time=1.090ms
hermes : multicast, seq=1, size=69 bytes, dist=0, time=1.114ms
hermes :   unicast, seq=2, size=69 bytes, dist=0, time=1.068ms
hermes : multicast, seq=2, size=69 bytes, dist=0, time=1.092ms
hermes :   unicast, seq=3, size=69 bytes, dist=0, time=1.123ms
hermes : multicast, seq=3, size=69 bytes, dist=0, time=1.134ms
hermes :   unicast, seq=4, size=69 bytes, dist=0, time=1.126ms
hermes : multicast, seq=4, size=69 bytes, dist=0, time=1.151ms
hermes :   unicast, seq=5, size=69 bytes, dist=0, time=1.102ms
hermes : multicast, seq=5, size=69 bytes, dist=0, time=1.126ms
hermes :   unicast, seq=6, size=69 bytes, dist=0, time=1.086ms
hermes : multicast, seq=6, size=69 bytes, dist=0, time=1.096ms
hermes :   unicast, seq=7, size=69 bytes, dist=0, time=1.055ms
hermes : multicast, seq=7, size=69 bytes, dist=0, time=1.080ms

omping hermes
Code:
root@hermes:~# omping nike hermes
nike : waiting for response msg
nike : joined (S,G) = (*, 232.43.211.234), pinging
nike :   unicast, seq=1, size=69 bytes, dist=0, time=1.017ms
nike : multicast, seq=1, size=69 bytes, dist=0, time=1.033ms
nike :   unicast, seq=2, size=69 bytes, dist=0, time=1.095ms
nike : multicast, seq=2, size=69 bytes, dist=0, time=1.110ms
nike :   unicast, seq=3, size=69 bytes, dist=0, time=0.972ms
nike : multicast, seq=3, size=69 bytes, dist=0, time=0.987ms
nike :   unicast, seq=4, size=69 bytes, dist=0, time=1.094ms
nike : multicast, seq=4, size=69 bytes, dist=0, time=1.110ms
nike :   unicast, seq=5, size=69 bytes, dist=0, time=1.045ms
nike : multicast, seq=5, size=69 bytes, dist=0, time=1.060ms
nike :   unicast, seq=6, size=69 bytes, dist=0, time=1.042ms
nike : multicast, seq=6, size=69 bytes, dist=0, time=1.058ms
nike :   unicast, seq=7, size=69 bytes, dist=0, time=1.160ms
nike : multicast, seq=7, size=69 bytes, dist=0, time=1.175ms
nike :   unicast, seq=8, size=69 bytes, dist=0, time=0.982ms
nike : multicast, seq=8, size=69 bytes, dist=0, time=0.997ms

systemctl status corosync.service
Code:
root@nike:/etc/pve# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2018-10-02 09:06:35 UTC; 41s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 8362 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=20)
 Main PID: 8362 (code=exited, status=20)
      CPU: 49ms

Oct 02 09:06:35 nike corosync[8362]: info    [WD    ] no resources configured.
Oct 02 09:06:35 nike corosync[8362]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Oct 02 09:06:35 nike corosync[8362]: notice  [QUORUM] Using quorum provider corosync_votequorum
Oct 02 09:06:35 nike corosync[8362]: crit    [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
Oct 02 09:06:35 nike corosync[8362]: error   [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
Oct 02 09:06:35 nike corosync[8362]: error   [MAIN  ] Corosync Cluster Engine exiting with status 20 at service.c:356.
Oct 02 09:06:35 nike systemd[1]: corosync.service: Main process exited, code=exited, status=20/n/a
Oct 02 09:06:35 nike systemd[1]: Failed to start Corosync Cluster Engine.
Oct 02 09:06:35 nike systemd[1]: corosync.service: Unit entered failed state.
Oct 02 09:06:35 nike systemd[1]: corosync.service: Failed with result 'exit-code'.
 
Oct 02 09:06:35 nike corosync[8362]: error [SERV ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'

This looks like an error in the corosync.conf. Whitespace aside, my config has no member keys under `totem->interface` (that is what the explicit nodelist is for). Maybe back up the file, delete those lines, and then restart corosync.
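
A minimal sketch of that procedure, assuming the file is edited at /etc/corosync/corosync.conf as in the later posts; remember to increase config_version whenever the config changes:
Code:
# back up the current config
cp /etc/corosync/corosync.conf /root/corosync.conf.bak

# edit the file: remove the member { ... } blocks under totem->interface
# and bump config_version
nano /etc/corosync/corosync.conf

# restart corosync and check the result
systemctl restart corosync
systemctl status corosync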
 
I have now changed the config and unfortunately still get the same error!

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: hermes
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.8.0.1
  }
  node {
    name: nike
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.8.0.6
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: brainClust
  config_version: 9
  interface {
    bindnetaddr: 10.8.0.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Code:
root@nike:/etc/pve# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2018-10-02 10:14:04 UTC; 6s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 18518 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=20)
 Main PID: 18518 (code=exited, status=20)
      CPU: 46ms

Oct 02 10:14:04 nike corosync[18518]: info    [WD    ] no resources configured.
Oct 02 10:14:04 nike corosync[18518]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Oct 02 10:14:04 nike corosync[18518]: notice  [QUORUM] Using quorum provider corosync_votequorum
Oct 02 10:14:04 nike corosync[18518]: crit    [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
Oct 02 10:14:04 nike corosync[18518]: error   [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
Oct 02 10:14:04 nike corosync[18518]: error   [MAIN  ] Corosync Cluster Engine exiting with status 20 at service.c:356.
Oct 02 10:14:04 nike systemd[1]: corosync.service: Main process exited, code=exited, status=20/n/a
Oct 02 10:14:04 nike systemd[1]: Failed to start Corosync Cluster Engine.
Oct 02 10:14:04 nike systemd[1]: corosync.service: Unit entered failed state.
Oct 02 10:14:04 nike systemd[1]: corosync.service: Failed with result 'exit-code'.
 
I was able to reproduce the error by putting a wrong IP for one of the nodes into /etc/corosync/corosync.conf.

I would double-check the IPs (also against the current config): `ip a`, `ip r`, and the entries in `/etc/hosts`.
 
Could this be related to the fact that each server has two IPs, a public one and a private one that is set up by a VPN? Can I somehow tell the cluster which network adapter it should use?

ip a and ip r (public IP anonymized for security reasons)
Code:
root@nike:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP group default qlen 1000
    link/ether 4c:72:b9:bxxxxxe7:cxxxx brd ff:ff:ff:ff:ff:ff
3: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 4c:72:b9:xx:xxxx:xxxx brd ff:ff:ff:ff:ff:ff
    inet 46.105.xx.xxx/24 brd 46.105.xx.xxx scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 2001:41d0:xx:xxxx::/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::4e72xxxxx:feb0:exxxxc/64 scope link
       valid_lft forever preferred_lft forever
4: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 100
    link/none
    inet 10.8.0.6 peer 10.8.0.5/32 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::c1bd:1744:2c48:30ff/64 scope link stable-privacy
       valid_lft forever preferred_lft forever

root@nike:~# ip r
default via 46.105.xx.xxx dev vmbr0 proto static
10.8.0.0/24 via 10.8.0.5 dev tun0
10.8.0.1 via 10.8.0.5 dev tun0
10.8.0.5 dev tun0 proto kernel scope link src 10.8.0.6
46.105.xx.x/24 dev vmbr0 proto kernel scope link src 46.105.xx.xxx
192.168.10.0/24 via 10.8.0.5 dev tun0
192.168.20.0/24 via 10.8.0.5 dev tun0

hosts file, set up manually like this
Code:
root@nike:~# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain localhost
10.8.0.6        ns382342.ip-46-105-99.eu        nike
10.8.0.1 hermes hermes
 
What kind of VPN are you using? (OpenVPN, by any chance, with tun/layer 3 devices?) In that case I doubt that the multicast packets would get sent over the OpenVPN link.

On another note: corosync is very sensitive to latency, and we would not recommend running it over a VPN.

I would test with tcpdump which interface the corosync traffic gets sent out on and whether it arrives on the other node. For a 2-node cluster you could also consider running in unicast mode; see https://pve.proxmox.com/wiki/Multicast_notes

Would be grateful if you let us know whether it works out!
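
As a rough sketch of the tcpdump check (the interface name tun0 is taken from the `ip a` output above; corosync uses UDP ports 5404/5405 by default), run on both nodes and compare:
Code:
# watch for corosync traffic on the VPN interface
tcpdump -i tun0 -n 'udp port 5404 or udp port 5405'

And switching to unicast would, as described on the linked wiki page, roughly mean adding a transport line to the totem section of corosync.conf and bumping config_version (the version number below is only an example):
Code:
totem {
  # keep the existing settings, add this line and increase config_version
  config_version: 10
  transport: udpu
}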
 
Yes, it is OpenVPN and it is set to "tun". Everything in the internal IP ranges (10.** and 192.**) goes over the VPN.
Unfortunately, the hoster offers no other option than working with OpenVPN.
I will have a look at tcpdump. We only have 2 nodes at the moment, but that will be extended soon.
 
Oops, I missed that this is the German forum, sorry.

OpenVPN also introduces yet another single point of failure: the OpenVPN server.
Potentially tinc or wireguard would be an alternative (although here, too, I would assume that multicast becomes problematic).
 
