PVE 4 / All cluster nodes reboot unexpectedly

gugpat · Aug 17, 2016

Hello and good morning,

I have been running a newly installed PVE 4 cluster since last week.
It has 3 nodes in total (2x HP servers, 1x Dell) plus a Synology 1515+ SAN that provides NFS storage.
The nodes and the Synology are each connected to a single switch via a bond.

Setup went without incident, and the system performed well and ran without problems.

This morning at 6:20, all nodes suddenly rebooted at the same time?!

Unfortunately, the system logs in the GUI only show entries from after the reboot.
What could be the cause?

Does anyone have an idea? Thanks.

dcsapak · Aug 17, 2016

Maybe a multicast problem?
See https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quorum_and_cluster_issues

gugpat · Aug 17, 2016

Thanks. A question about the first point right away:
"- Ensure all the nodes are in the same subnet."

Each node has 4 network interfaces, all with the same configuration:
bond0: eth0, eth1 (192.168.3.X) -> SAN in the same subnet
bond1: eth0, eth1 (192.168.4.X)
vmbr0: bond1 (192.168.2.X) / default gateway defined here (192.168.2.254) > for VMs

The bonds are configured with LACP, and the switch (TP-Link TL-SG3424) is configured the same way.

Or have I already designed or set this up wrong?

//update:
ping -f to the .3.x and .2.x addresses runs fine with 0% packet loss, but to .4.x it shows 100%?! Have I misunderstood something about the bond configuration?

In the hosts file the nodes know each other via .2.x, and the nodes were also added to the cluster over this subnet.
- 192.168.2.x pve2.private.local pve2 pvelocalhost
- pvecm add 192.168.2.x

dcsapak · Aug 17, 2016

Could you post the contents of /etc/network/interfaces and /etc/hosts here?

bond1 and bond0 are both on eth0/eth1? (I hope that is just a typo.)

And why is there an IP on both the bond and vmbr0? What is the purpose?

gugpat · Aug 17, 2016

Of course that was a typo (copy & paste...).
I didn't know that bonds can also be left without an IP. Apart from that, there is no special purpose.
Would it be advisable to connect the storage through a bridge as well?

- Are there no logs that record why a reboot was triggered? The action has to come from somewhere, e.g. a watchdog?

Here is /etc/network/interfaces (pve1):

auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

iface eth2 inet manual

iface eth3 inet manual

auto bond0
iface bond0 inet static
    address 192.168.3.222
    netmask 255.255.255.0
    slaves eth2 eth3
    bond_miimon 100
    bond_mode 802.3ad
#SAN

auto bond1
iface bond1 inet static
    address 192.168.4.222
    netmask 255.255.255.0
    slaves eth0 eth1
    bond_miimon 100
    bond_mode 802.3ad
#VM

auto vmbr0
iface vmbr0 inet static
    address 192.168.2.222
    netmask 255.255.255.0
    gateway 192.168.2.254
    bridge_ports bond1
    bridge_stp off
    bridge_fd 0
#VirtualGuests

Here is /etc/hosts (pve1):
127.0.0.1 localhost.localdomain localhost
192.168.2.222 pve1.private.local pve1 pvelocalhost
192.168.2.223 pve2.private.local pve2 pvelocalhost
192.168.2.224 pve3.private.local pve3 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

dcsapak · Aug 18, 2016

Hmm, the journal may not be persistent.
Run the following commands to have the log stored on disk:

Code:

mkdir /var/log/journal
systemctl restart systemd-journald.service
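
As a follow-up sketch: with journald's default `Storage=auto`, logs are written to disk only if `/var/log/journal` exists, so after the two commands above the messages from before a crash should survive the next reboot. The helper name `journal_is_persistent` is made up for illustration and does not come from the thread. `journalctl --list-boots` and `journalctl -b -1` are standard journalctl options.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): succeed if the journal directory
# exists, which is the condition journald's default Storage=auto looks for.
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

# Usage on a node (commented out so the snippet has no side effects):
#   journal_is_persistent && journalctl --list-boots
#   journalctl -b -1 -e     # jump to the end of the previous boot's log
```

Once a node has rebooted again, `journalctl -b -1 -e` on that node should show the last messages written before the reboot.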

gugpat · Aug 18, 2016

OK, done. I got neither an error nor a success message. Does the journal correspond to the syslog in the GUI?

I removed the .4.x IP addresses from the bond and rebooted the nodes. No further incidents so far.
I'll keep watching...
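
For illustration only, the bond1 stanza on pve1 after such a change might look like the fragment below. This is a sketch under assumptions: the thread does not show the edited file, so it assumes that only the `address` and `netmask` lines were dropped and the method was switched from `static` to `manual`, with everything else taken unchanged from the configuration posted earlier.

```
auto bond1
iface bond1 inet manual
    slaves eth0 eth1
    bond_miimon 100
    bond_mode 802.3ad
#VM
```

In this layout the only address on that uplink is the 192.168.2.x one on vmbr0, which uses bond1 as its bridge port.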

dcsapak · Aug 18, 2016

gugpat said:
Does the journal correspond to the syslog in the GUI?

Exactly.

gugpat said:
I removed the .4.x IP addresses from the bond and rebooted the nodes. No further incidents so far.
I'll keep watching...

OK. If it happens again, simply reply in this thread with the relevant log entries.

gugpat · Aug 19, 2016

OK, the servers just rebooted again.

Below are the syslog entries from before the reboot, with timestamps shown for the first and last entry on each node:

Aug 19 15:34:48 pve1 kernel: vmbr0: received packet on bond1 with own address as source address
....
pve1 kernel: net_ratelimit: 1 callbacks suppressed
pve1 kernel: vmbr0: received packet on bond1 with own address as source address
pve1 corosync[1930]: [TOTEM ] A processor failed, forming new configuration.
pve1 kernel: vmbr0: received packet on bond1 with own address as source address
...
pve1 kernel: net_ratelimit: 2 callbacks suppressed
pve1 kernel: vmbr0: received packet on bond1 with own address as source address
...
Aug 19 15:35:45 pve1 kernel: vmbr0: received packet on bond1 with own address as source address
-- Reboot --
===================================================================
Aug 19 15:34:49 pve2 kernel: vmbr0: received packet on bond1 with own address as source address
...
pve2 corosync[1934]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
pve2 corosync[1934]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
pve2 corosync[1934]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
pve2 kernel: net_ratelimit: 93 callbacks suppressed
pve2 kernel: vmbr0: received packet on bond1 with own address as source address
...
pve2 corosync[1934]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
pve2 corosync[1934]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Aug 19 15:35:23 pve2 corosync[1934]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
-- Reboot --
===================================================================
Aug 19 15:34:50 pve3 kernel: vmbr0: received packet on bond1 with own address as source address
...
pve3 corosync[2115]: [TOTEM ] A processor failed, forming new configuration.
pve3 kernel: net_ratelimit: 26 callbacks suppressed
pve3 kernel: vmbr0: received packet on bond1 with own address as source address
...
pve3 kernel: net_ratelimit: 92 callbacks suppressed
pve3 kernel: vmbr0: received packet on bond1 with own address as source address
...
pve3 kernel: net_ratelimit: 93 callbacks suppressed
pve3 kernel: vmbr0: received packet on bond1 with own address as source address
pve3 corosync[2115]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
pve3 kernel: net_ratelimit: 94 callbacks suppressed
pve3 kernel: vmbr0: received packet on bond1 with own address as source address
...
pve3 corosync[2115]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
pve3 corosync[2115]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Aug 19 15:35:24 pve3 corosync[2115]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
-- Reboot --
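
To compare the nodes more easily, excerpts like the ones above can be tallied per host. The following is a minimal sketch, not a diagnosis: the function name `summarize_cluster_log` is hypothetical, and it only counts the two message types that appear in this post (the bridge's "own address as source address" warning and corosync's TOTEM/MAIN messages) in syslog-style text read from stdin.

```shell
#!/bin/sh
# Hypothetical helper: count, per host, the two message types seen in the
# excerpts above. Reads syslog-style text on stdin, with or without timestamps.
summarize_cluster_log() {
    awk '
        # The hostname is the field right before "kernel:" or "corosync[PID]:".
        {
            host = ""
            for (i = 2; i <= NF; i++) {
                if ($i == "kernel:" || $i ~ /^corosync\[[0-9]+\]:$/) { host = $(i - 1); break }
            }
            if (host == "") next   # skip "...", "-- Reboot --" and separator lines
        }
        /received packet on .* with own address as source address/ { own[host]++; seen[host] = 1 }
        /corosync\[[0-9]+\]: \[(TOTEM|MAIN) *\]/ { coro[host]++; seen[host] = 1 }
        END {
            for (h in seen) printf "%s own_address=%d corosync=%d\n", h, own[h], coro[h]
        }
    ' | sort
}
```

On a node with a persistent journal, one way to feed it would be `journalctl -b -1 | summarize_cluster_log`. The counts only show how often each message type occurred per node, and rate-limited kernel messages ("callbacks suppressed") are not included.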
