[SOLVED] PMX cluster not OK in WebUI, but works normally

Corosync starts, but pmxcfs refuses to start.
Here is an excerpt from the journal; maybe there is something interesting in it:
Mar 12 16:04:30 vm-1 corosync[2678781]: notice [MAIN ] Node was shut down by a signal
Mar 12 16:04:30 vm-1 corosync[2678781]: [MAIN ] Node was shut down by a signal
Mar 12 16:04:30 vm-1 systemd[1]: Stopping Corosync Cluster Engine...
Mar 12 16:04:30 vm-1 corosync[2678781]: notice [SERV ] Unloading all Corosync service engines.
Mar 12 16:04:30 vm-1 corosync[2678781]: info [QB ] withdrawing server sockets
Mar 12 16:04:30 vm-1 corosync[2678781]: notice [SERV ] Service engine unloaded: corosync vote quorum service v1.0
Mar 12 16:04:30 vm-1 corosync[2678781]: [SERV ] Unloading all Corosync service engines.
Mar 12 16:04:30 vm-1 corosync[2678781]: info [QB ] withdrawing server sockets
Mar 12 16:04:30 vm-1 corosync[2678781]: notice [SERV ] Service engine unloaded: corosync configuration map access
Mar 12 16:04:30 vm-1 corosync[2678781]: info [QB ] withdrawing server sockets
Mar 12 16:04:30 vm-1 corosync[2678781]: notice [SERV ] Service engine unloaded: corosync configuration service
Mar 12 16:04:30 vm-1 corosync[2678781]: [QB ] withdrawing server sockets
Mar 12 16:04:30 vm-1 corosync[2678781]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0
Mar 12 16:04:30 vm-1 corosync[2678781]: [QB ] withdrawing server sockets
Mar 12 16:04:30 vm-1 corosync[2678781]: [SERV ] Service engine unloaded: corosync configuration map access
Mar 12 16:04:30 vm-1 corosync[2678781]: [QB ] withdrawing server sockets
Mar 12 16:04:30 vm-1 corosync[2678781]: [SERV ] Service engine unloaded: corosync configuration service
Mar 12 16:04:30 vm-1 corosync[2678781]: info [QB ] withdrawing server sockets
Mar 12 16:04:30 vm-1 corosync[2678781]: notice [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 12 16:04:30 vm-1 corosync[2678781]: [QB ] withdrawing server sockets
Mar 12 16:04:30 vm-1 corosync[2678781]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 12 16:04:30 vm-1 corosync[2678781]: info [QB ] withdrawing server sockets
Mar 12 16:04:30 vm-1 corosync[2678781]: notice [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 12 16:04:30 vm-1 corosync[2678781]: [QB ] withdrawing server sockets
Mar 12 16:04:30 vm-1 corosync[2678781]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 12 16:04:31 vm-1 corosync[2678781]: notice [SERV ] Service engine unloaded: corosync profile loading service
Mar 12 16:04:31 vm-1 corosync[2678781]: [SERV ] Service engine unloaded: corosync profile loading service
Mar 12 16:04:31 vm-1 corosync[2678781]: notice [SERV ] Service engine unloaded: corosync resource monitoring service
Mar 12 16:04:31 vm-1 corosync[2678781]: [SERV ] Service engine unloaded: corosync resource monitoring service
Mar 12 16:04:31 vm-1 corosync[2678781]: notice [SERV ] Service engine unloaded: corosync watchdog service
Mar 12 16:04:31 vm-1 corosync[2678781]: [SERV ] Service engine unloaded: corosync watchdog service
Mar 12 16:04:31 vm-1 corosync[2678781]: notice [MAIN ] Corosync Cluster Engine exiting normally
Mar 12 16:04:31 vm-1 corosync[2678781]: [MAIN ] Corosync Cluster Engine exiting normally
Mar 12 16:04:31 vm-1 systemd[1]: Stopped Corosync Cluster Engine.
Mar 12 16:04:31 vm-1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Mar 12 16:04:31 vm-1 pmxcfs[2702775]: fuse: failed to access mountpoint /etc/pve: Transport endpoint is not connected
Mar 12 16:04:31 vm-1 pmxcfs[2702775]: [main] crit: fuse_mount error: Transport endpoint is not connected
Mar 12 16:04:31 vm-1 pmxcfs[2702775]: [main] crit: fuse_mount error: Transport endpoint is not connected
Mar 12 16:04:31 vm-1 pmxcfs[2702775]: [main] notice: exit proxmox configuration filesystem (-1)
Mar 12 16:04:31 vm-1 pmxcfs[2702775]: [main] notice: exit proxmox configuration filesystem (-1)
Mar 12 16:04:31 vm-1 systemd[1]: pve-cluster.service: Control process exited, code=exited status=255
Mar 12 16:04:31 vm-1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Mar 12 16:04:31 vm-1 systemd[1]: pve-cluster.service: Unit entered failed state.
Mar 12 16:04:31 vm-1 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Mar 12 16:04:31 vm-1 systemd[1]: Starting Corosync Cluster Engine...
Mar 12 16:04:31 vm-1 corosync[2702786]: [MAIN ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
Mar 12 16:04:31 vm-1 corosync[2702786]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [MAIN ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
Mar 12 16:04:31 vm-1 corosync[2702786]: info [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
Mar 12 16:04:31 vm-1 corosync[2702786]: warning [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Mar 12 16:04:31 vm-1 corosync[2702786]: warning [MAIN ] Please migrate config file to nodelist.
Mar 12 16:04:31 vm-1 corosync[2702786]: [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Mar 12 16:04:31 vm-1 corosync[2702786]: [MAIN ] Please migrate config file to nodelist.
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [TOTEM ] Initializing transport (UDP/IP Multicast).
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Mar 12 16:04:31 vm-1 corosync[2702786]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Mar 12 16:04:31 vm-1 corosync[2702786]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [TOTEM ] The network interface [192.168.16.1] is now up.
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [SERV ] Service engine loaded: corosync configuration map access [0]
Mar 12 16:04:31 vm-1 corosync[2702786]: [TOTEM ] The network interface [192.168.16.1] is now up.
Mar 12 16:04:31 vm-1 corosync[2702786]: info [QB ] server name: cmap
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [SERV ] Service engine loaded: corosync configuration service [1]
Mar 12 16:04:31 vm-1 corosync[2702786]: info [QB ] server name: cfg
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Mar 12 16:04:31 vm-1 corosync[2702786]: info [QB ] server name: cpg
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [SERV ] Service engine loaded: corosync profile loading service [4]
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [SERV ] Service engine loaded: corosync resource monitoring service [6]
Mar 12 16:04:31 vm-1 corosync[2702786]: [SERV ] Service engine loaded: corosync configuration map access [0]
Mar 12 16:04:31 vm-1 corosync[2702786]: warning [WD ] Watchdog not enabled by configuration
Mar 12 16:04:31 vm-1 corosync[2702786]: warning [WD ] resource load_15min missing a recovery key.
Mar 12 16:04:31 vm-1 corosync[2702786]: warning [WD ] resource memory_used missing a recovery key.
Mar 12 16:04:31 vm-1 corosync[2702786]: info [WD ] no resources configured.
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [SERV ] Service engine loaded: corosync watchdog service [7]
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [QUORUM] Using quorum provider corosync_votequorum
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Mar 12 16:04:31 vm-1 corosync[2702786]: info [QB ] server name: votequorum
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Mar 12 16:04:31 vm-1 corosync[2702786]: info [QB ] server name: quorum
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [TOTEM ] A new membership (192.168.16.1:475140) was formed. Members joined: 1
Mar 12 16:04:31 vm-1 corosync[2702786]: [QB ] server name: cmap
Mar 12 16:04:31 vm-1 systemd[1]: Started Corosync Cluster Engine.
Mar 12 16:04:31 vm-1 corosync[2702786]: warning [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [QUORUM] Members[1]: 1
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [MAIN ] Completed service synchronization, ready to provide service.
Mar 12 16:04:31 vm-1 corosync[2702786]: [SERV ] Service engine loaded: corosync configuration service [1]
Mar 12 16:04:31 vm-1 corosync[2702786]: [QB ] server name: cfg
Mar 12 16:04:31 vm-1 corosync[2702786]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Mar 12 16:04:31 vm-1 corosync[2702786]: [QB ] server name: cpg
Mar 12 16:04:31 vm-1 corosync[2702786]: [SERV ] Service engine loaded: corosync profile loading service [4]
Mar 12 16:04:31 vm-1 corosync[2702786]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
Mar 12 16:04:31 vm-1 corosync[2702786]: [WD ] Watchdog not enabled by configuration
Mar 12 16:04:31 vm-1 corosync[2702786]: [WD ] resource load_15min missing a recovery key.
Mar 12 16:04:31 vm-1 corosync[2702786]: [WD ] resource memory_used missing a recovery key.
Mar 12 16:04:31 vm-1 corosync[2702786]: [WD ] no resources configured.
Mar 12 16:04:31 vm-1 corosync[2702786]: [SERV ] Service engine loaded: corosync watchdog service [7]
Mar 12 16:04:31 vm-1 corosync[2702786]: [QUORUM] Using quorum provider corosync_votequorum
Mar 12 16:04:31 vm-1 corosync[2702786]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Mar 12 16:04:31 vm-1 corosync[2702786]: [QB ] server name: votequorum
Mar 12 16:04:31 vm-1 corosync[2702786]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Mar 12 16:04:31 vm-1 corosync[2702786]: [QB ] server name: quorum
Mar 12 16:04:31 vm-1 corosync[2702786]: [TOTEM ] A new membership (192.168.16.1:475140) was formed. Members joined: 1
Mar 12 16:04:31 vm-1 corosync[2702786]: [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: [QUORUM] Members[1]: 1
Mar 12 16:04:31 vm-1 corosync[2702786]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [TOTEM ] A new membership (192.168.16.1:475144) was formed. Members joined: 2 3 4 5 6 8
Mar 12 16:04:31 vm-1 corosync[2702786]: [TOTEM ] A new membership (192.168.16.1:475144) was formed. Members joined: 2 3 4 5 6 8
Mar 12 16:04:31 vm-1 corosync[2702786]: warning [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: warning [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: warning [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: warning [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: warning [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: [CPG ] downlist left_list: 0 received
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [QUORUM] This node is within the primary component and will provide service.
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [QUORUM] Members[7]: 1 2 3 4 5 6 8
Mar 12 16:04:31 vm-1 corosync[2702786]: notice [MAIN ] Completed service synchronization, ready to provide service.
Mar 12 16:04:31 vm-1 corosync[2702786]: [QUORUM] This node is within the primary component and will provide service.
Mar 12 16:04:31 vm-1 corosync[2702786]: [QUORUM] Members[7]: 1 2 3 4 5 6 8
Mar 12 16:04:31 vm-1 corosync[2702786]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 12 16:04:34 vm-1 pvestatd[2736]: ipcc_send_rec[1] failed: Connection refused
Mar 12 16:04:34 vm-1 pvestatd[2736]: ipcc_send_rec[2] failed: Connection refused
Mar 12 16:04:34 vm-1 pvestatd[2736]: ipcc_send_rec[3] failed: Connection refused
Mar 12 16:04:34 vm-1 pvestatd[2736]: ipcc_send_rec[4] failed: Connection refused
Mar 12 16:04:34 vm-1 pvestatd[2736]: status update error: Connection refused
 
Mar 12 16:04:31 vm-1 pmxcfs[2702775]: fuse: failed to access mountpoint /etc/pve: Transport endpoint is not connected

The pmxcfs process was probably killed at some point before this:
* check with `mount` whether '/etc/pve' is still mounted
* if so -> `fusermount -u /etc/pve`
* if that fails, run `lsof -n | grep '/etc/pve'` and see which processes still have something open on it
* `systemctl restart pve-cluster`
* if that fails, check the journal whether perhaps a stale lockfile/pidfile is still lying around, remove it, and try again
* if even that does not work: `systemctl stop pve-cluster`, `pkill pmxcfs`, `pmxcfs -l` (starts pmxcfs in local mode; note that any changes made in local mode are lost once it is started in cluster mode again) - a sketch of the whole sequence follows below
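A minimal sketch of that recovery sequence on a single node (the commands are the ones listed above; check each step's output before moving on, and keep in mind that `pmxcfs -l` is the last resort):
Code:
# 1. check whether /etc/pve is still (stale-)mounted
mount | grep /etc/pve
# 2. if it is, unmount the dead FUSE mount
fusermount -u /etc/pve
# 3. if unmounting fails, find processes that still hold files open on it
lsof -n | grep '/etc/pve'
# 4. try a normal restart of the cluster filesystem
systemctl restart pve-cluster
# 5. last resort: stop the unit, kill leftovers, start pmxcfs in local mode
#    (changes made in local mode are lost once the node rejoins the cluster)
systemctl stop pve-cluster
pkill pmxcfs
pmxcfs -l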

Hope that helps!
 
If it helps, I would be happy to let one of you onto the nodes. I would send you SSL-VPN access for that. We could sort that out via PM.

If your cluster has the appropriate subscription we are of course happy to do that, but please open a ticket in the Enterprise Support Portal https://my.proxmox.com/ for it.
Possibly link to this thread.

In one of the buildings I replaced a switch at exactly the time in question

Hmm, that sounds suspicious; if the cluster had always worked until then, this will have caused the problem directly or indirectly.

Nothing happened on the cluster at that point, though. The moment I re-enabled the uplink, the cluster did recover for a few seconds and a few nodes were shown as OK. That behaviour could not be reproduced later on, which is why I would classify it as "coincidence".
The switches the cluster is connected to were neither reconfigured nor updated in any way at that time.

You write that 10,000 packets could be sent without problems in the omping test, but could a longer test also be run, e.g. the last omping command from https://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements ?
Code:
omping -c 600 -i 1 -q NODE1-IP NODE2-IP ...
(takes about 10 minutes)

Because if that fails, it would mean that IGMP snooping is enabled on the new switch but no IGMP multicast querier is active, so the switch drops multicast packets of new multicast groups after about 5 minutes (since no querier announces the information about the multicast group members). In that case either turn snooping off or turn a querier on; that can also be a PVE host.
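If a PVE host is to act as the querier, a minimal sketch could look like this (assuming the corosync traffic runs over a Linux bridge named vmbr0 on that host; the bridge name is an assumption, adapt it to your setup):
Code:
# enable the kernel's IGMP querier on the bridge (takes effect immediately, not persistent)
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier
# to make it persistent, add a post-up line to the bridge stanza in /etc/network/interfaces:
#   post-up echo 1 > /sys/class/net/$IFACE/bridge/multicast_querier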

For that I would only have to "manipulate" pmxcfs in such a way that I get write access again, so I can make a backup. Or I would somehow have to get the vHDDs out of Ceph to move the machines manually to the emergency node.
That is doable, and if it really comes to that we can also assist, but I would first try the above and/or take a look remotely.
 
The pmxcfs process was probably killed at some point before this:

That was probably only temporary, because then a touch/ls on it would not hang but would print "... cannot access '/etc/pve': Transport endpoint is not connected"...
The brief moment where the cluster "recovered" when the switch settings were changed does not really fit with that either.
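A quick way to tell the two states apart, as a small sketch (the 5-second timeout is an arbitrary value): a stale FUSE mount left by a killed pmxcfs fails immediately with 'Transport endpoint is not connected', while a hung pmxcfs makes the command block until the timeout cuts it off.
Code:
timeout 5 ls /etc/pve; echo "exit: $?"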
 
The pmxcfs process was probably killed at some point before this:
* check with `mount` whether '/etc/pve' is still mounted
* if so -> `fusermount -u /etc/pve`
* if that fails, run `lsof -n | grep '/etc/pve'` and see which processes still have something open on it
There was still a bash open on it; after closing that I was able to unmount.

* `systemctl restart pve-cluster`
* if that fails, check the journal whether perhaps a stale lockfile/pidfile is still lying around, remove it, and try again
* if even that does not work: `systemctl stop pve-cluster`, `pkill pmxcfs`, `pmxcfs -l` (starts pmxcfs in local mode; note that any changes made in local mode are lost once it is started in cluster mode again)
I also found a lockfile under /var/lib/pve-cluster/
I deleted it and tried to start pmxcfs, unfortunately without success:
Code:
Mar 13 08:20:50 vm-1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Mar 13 08:20:50 vm-1 pmxcfs[2746755]: [status] notice: update cluster info (cluster name  Langeoog, version = 11)
Mar 13 08:20:50 vm-1 pmxcfs[2746755]: [status] notice: node has quorum
Mar 13 08:20:50 vm-1 pmxcfs[2746755]: [dcdb] notice: members: 1/2746755, 2/2545, 3/270012, 4/258432, 5/260810, 6/2033, 8/249898
Mar 13 08:20:50 vm-1 pmxcfs[2746755]: [dcdb] notice: starting data syncronisation
Mar 13 08:20:50 vm-1 pmxcfs[2746755]: [status] notice: members: 1/2746755, 2/2545, 3/270012, 4/258432, 5/260810, 6/2033, 8/249898
Mar 13 08:20:50 vm-1 pmxcfs[2746755]: [status] notice: starting data syncronisation
Mar 13 08:20:50 vm-1 pmxcfs[2746755]: [dcdb] notice: received sync request (epoch 1/2746755/00000001)
Mar 13 08:20:50 vm-1 pmxcfs[2746755]: [status] notice: received sync request (epoch 1/2746755/00000001)
Mar 13 08:21:01 vm-1 cron[2448]: (*system*vzdump) RELOAD (/etc/cron.d/vzdump)
Mar 13 08:21:21 vm-1 pvecm[2746764]: got timeout
Mar 13 08:22:21 vm-1 systemd[1]: pve-cluster.service: Start-post operation timed out. Stopping.
Mar 13 08:22:31 vm-1 systemd[1]: pve-cluster.service: State 'stop-sigterm' timed out. Killing.
Mar 13 08:22:31 vm-1 systemd[1]: pve-cluster.service: Killing process 2746755 (pmxcfs) with signal SIGKILL.
Mar 13 08:22:31 vm-1 systemd[1]: pve-cluster.service: Killing process 2746764 (pvecm) with signal SIGKILL.
Mar 13 08:22:31 vm-1 systemd[1]: pve-cluster.service: Killing process 2746767 (pvecm) with signal SIGKILL.
Mar 13 08:22:31 vm-1 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=9/KILL
Mar 13 08:22:31 vm-1 pve-ha-lrm[5875]: unable to write lrm status file - unable to delete old temp file: Software caused connection abort
Mar 13 08:22:31 vm-1 systemd[1]: pve-cluster.service: Killing process 2746767 (pvecm) with signal SIGKILL.
Mar 13 08:22:31 vm-1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Mar 13 08:22:31 vm-1 systemd[1]: pve-cluster.service: Unit entered failed state.
Mar 13 08:22:31 vm-1 systemd[1]: pve-cluster.service: Failed with result 'timeout'.
Mar 13 08:22:36 vm-1 pve-ha-lrm[5875]: loop take too long (103 seconds)

I do not want to rule out that it has something to do with multicast. However, nobody has fiddled with the multicast settings on the switches the cluster is attached to. It might be, though, that the new switch did something with the querier, because VLAN116 also reaches the new switch over the trunk. But then the problem should go away when I disable the uplink to the new switch.
I will test multicast again now and report back.
 
The multicast test has finished and was successful, also over the full 10 minutes. Unfortunately I cannot copy text from cssh, so here is a screenshot as an example from VM-2:

(screenshot: upload_2019-3-13_8-57-51.png)

As an additional precaution I will isolate the corosync VLAN on the PMX cluster switches.
I have not yet started pmxcfs in local mode, but it might be worth a test to see whether it can be started at all.
 
The multicast test has finished and was successful, also over the full 10 minutes. Unfortunately I cannot copy text from cssh, so here is a screenshot as an example from VM-2:

OK, then IGMP is fine after all.

Mar 13 08:22:21 vm-1 systemd[1]: pve-cluster.service: Start-post operation timed out. Stopping.

Hmm, that means that
Code:
ExecStartPost=-/usr/bin/pvecm updatecerts --silent
fails, most likely because it wants to access /etc/pve and then hangs.
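To confirm that, one could look at the unit and run the post step by hand, roughly like this (the unit and binary paths are the standard ones shipped with pve-cluster; verify them locally):
Code:
# show the generated unit including the ExecStartPost line
systemctl cat pve-cluster.service
# run the post step manually; if it blocks on /etc/pve, the timeout will cut it off
timeout 30 /usr/bin/pvecm updatecerts --silent; echo "exit: $?"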

The strange thing is that pmxcfs otherwise comes up cleanly... Is it not possible to do this double restart of the corosync and cluster services on all nodes?
I think some, or all, of the pmxcfs instances are stuck in a state that also blocks the rest - filesystem updates have to be exchanged with all of them, and if a large part does not play along, that can certainly lead to such effects. Unfortunately, that still would not explain the exact cause of the problem. Even if the switch swap quite likely played a part, things should recover again as soon as multicast works again, which it now does.

Were node interfaces ever taken down with ip or ifdown?
 
You can also just stop the "pve-cluster" service and then run "pmxcfs"; that way the "ExecStartPost" is not executed and systemd does not kill pmxcfs.
"pmxcfs" forks itself into the background as a daemon unless "-f" is passed.
 
Is it not possible to do this double restart of the corosync and cluster services on all nodes?
It is possible. It is just that all the important machines run on the nodes other than VM-1, and they have to be available again by tomorrow morning at the latest, when people start working here again. However, I can at least get the machines started again with a locally running pmxcfs, so I would simply try that now.

Were node interfaces ever taken down with ip or ifdown?
Nobody had touched the switches for the nodes at that point. In the meantime I have also not tried switching the ports off and back on.


You can also just stop the "pve-cluster" service and then run "pmxcfs"; that way the "ExecStartPost" is not executed and systemd does not kill pmxcfs.
"pmxcfs" forks itself into the background as a daemon unless "-f" is passed.
I will test that on VM-1 first. Once pmxcfs is running there, I could do the same node by node on all nodes and see whether that helps. If not, I would try the simultaneous "double restart" of corosync and the cluster services on all nodes.
I will report back shortly...
 
As it turns out, the pmxcfs on VM-6 was blocking the entire FS. I had stopped pve-cluster.service node by node on all nodes, made sure the pmxcfs processes were gone, and then simply started them with pmxcfs. It was then running on all nodes, but the cluster did not budge.
The moment I terminated the pmxcfs process on VM-6, however, VM-1 through VM-5 came up together as OK in the web UI. I then started pmxcfs on VM-6 and VM-8 in the same way, and since then the cluster has been OK again.

Since I cannot imagine a mechanism by which a switch, on which no port has anything to do with VLAN 116, could throw the corosync LAN off track like that, I dare to assume it was coincidence that the pmxcfs process on VM-6 ended up in an undefined state. That process then blocked the FS for write access and put the whole cluster into this state, coincidentally right at the moment I installed that switch.

If you want to go bug hunting on this in any way, I am happy to offer my support. The situation was pretty scary and I would like to help figure out what was behind it.

But I am very glad that it is OK again now. Now I can take care of all the tasks that have necessarily been left undone since then.
Many thanks for the support!
 
Good that you got out of this unpleasant situation without a reboot and without VM downtime!

The moment I terminated the pmxcfs process on VM-6, however, VM-1 through VM-5 came up together as OK in the web UI. I then started pmxcfs on VM-6 and VM-8 in the same way, and since then the cluster has been OK again.
If you want to go bug hunting on this in any way, I am happy to offer my support. The situation was pretty scary and I would like to help figure out what was behind it.

Yes, scary indeed... What does the journal of VM-6 look like, anything conspicuous, especially anything that is not present on the other nodes?

Since I cannot imagine a mechanism by which a switch, on which no port has anything to do with VLAN 116, could throw the corosync LAN off track like that, I dare to assume it was coincidence that the pmxcfs process on VM-6 ended up in an undefined state. That process then blocked the FS for write access and put the whole cluster into this state, coincidentally right at the moment I installed that switch.

I understand you and do not want to say that it could not possibly be the case, but it would be quite a coincidence. And even if it was the direct or indirect trigger, the reason why it was is unknown, and that is actually the more relevant part... There would not happen to be a second Proxmox VE cluster in the same network?
 
Nothing special in the log of VM-6 has caught my eye. It actually looks the same as on the other hosts.
-- Logs begin at Sun 2019-03-03 13:13:00 CET, end at Thu 2019-03-14 07:54:01 CET. --
Mar 07 15:04:00 vm-6 pmxcfs[2033]: [status] notice: received log
Mar 07 15:05:20 vm-6 pmxcfs[2033]: [status] notice: received log
Mar 07 15:09:19 vm-6 corosync[2304]: notice [TOTEM ] A processor failed, forming new configuration.
Mar 07 15:09:19 vm-6 corosync[2304]: [TOTEM ] A processor failed, forming new configuration.
Mar 07 15:09:25 vm-6 corosync[2304]: notice [TOTEM ] A new membership (192.168.16.3:475060) was formed. Members left: 1 2 8
Mar 07 15:09:25 vm-6 corosync[2304]: notice [TOTEM ] Failed to receive the leave message. failed: 1 2 8
Mar 07 15:09:25 vm-6 corosync[2304]: [TOTEM ] A new membership (192.168.16.3:475060) was formed. Members left: 1 2 8
Mar 07 15:09:25 vm-6 corosync[2304]: [TOTEM ] Failed to receive the leave message. failed: 1 2 8
Mar 07 15:09:25 vm-6 corosync[2304]: warning [CPG ] downlist left_list: 3 received
Mar 07 15:09:25 vm-6 corosync[2304]: [CPG ] downlist left_list: 3 received
Mar 07 15:09:25 vm-6 corosync[2304]: [CPG ] downlist left_list: 3 received
Mar 07 15:09:25 vm-6 corosync[2304]: warning [CPG ] downlist left_list: 3 received
Mar 07 15:09:25 vm-6 corosync[2304]: [CPG ] downlist left_list: 3 received
Mar 07 15:09:25 vm-6 corosync[2304]: warning [CPG ] downlist left_list: 3 received
Mar 07 15:09:25 vm-6 corosync[2304]: [CPG ] downlist left_list: 3 received
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 4/258432, 5/260810, 6/2033
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [dcdb] notice: starting data syncronisation
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [status] notice: members: 3/270012, 4/258432, 5/260810, 6/2033
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [status] notice: starting data syncronisation
Mar 07 15:09:25 vm-6 corosync[2304]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar 07 15:09:25 vm-6 corosync[2304]: notice [QUORUM] Members[4]: 3 4 5 6
Mar 07 15:09:25 vm-6 corosync[2304]: notice [MAIN ] Completed service synchronization, ready to provide service.
Mar 07 15:09:25 vm-6 corosync[2304]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar 07 15:09:25 vm-6 corosync[2304]: [QUORUM] Members[4]: 3 4 5 6
Mar 07 15:09:25 vm-6 corosync[2304]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [status] notice: node lost quorum
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/0000002E)
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [status] notice: received sync request (epoch 3/270012/0000002E)
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [dcdb] notice: received all states
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [dcdb] notice: leader is 3/270012
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [dcdb] notice: synced members: 3/270012, 4/258432, 5/260810, 6/2033
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [dcdb] notice: all data is up to date
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [dcdb] notice: dfsm_deliver_queue: queue length 4
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [dcdb] crit: received write while not quorate - trigger resync
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [dcdb] crit: leaving CPG group
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [status] notice: received all states
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [status] notice: all data is up to date
Mar 07 15:09:25 vm-6 pmxcfs[2033]: [status] notice: dfsm_deliver_queue: queue length 35
Mar 07 15:09:26 vm-6 pmxcfs[2033]: [dcdb] notice: start cluster connection
Mar 07 15:09:26 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 6/2033
Mar 07 15:09:26 vm-6 pmxcfs[2033]: [dcdb] notice: starting data syncronisation
Mar 07 15:09:26 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/00000036)
Mar 07 15:09:26 vm-6 pmxcfs[2033]: [dcdb] notice: received all states
Mar 07 15:09:26 vm-6 pmxcfs[2033]: [dcdb] notice: leader is 3/270012
Mar 07 15:09:26 vm-6 pmxcfs[2033]: [dcdb] notice: synced members: 3/270012, 6/2033
Mar 07 15:09:26 vm-6 pmxcfs[2033]: [dcdb] notice: all data is up to date
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 5/260810, 6/2033
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: starting data syncronisation
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/00000037)
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 6/2033
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/00000038)
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: received all states
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: leader is 3/270012
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: synced members: 3/270012, 6/2033
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: all data is up to date
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 4/258432, 6/2033
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: starting data syncronisation
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/00000039)
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 6/2033
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/0000003A)
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: received all states
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: leader is 3/270012
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: synced members: 3/270012, 6/2033
Mar 07 15:09:32 vm-6 pmxcfs[2033]: [dcdb] notice: all data is up to date
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 5/260810, 6/2033
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: starting data syncronisation
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/0000003B)
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 6/2033
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/0000003C)
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: received all states
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: leader is 3/270012
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: synced members: 3/270012, 6/2033
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: all data is up to date
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 4/258432, 6/2033
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: starting data syncronisation
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/0000003D)
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 6/2033
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/0000003E)
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: received all states
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: leader is 3/270012
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: synced members: 3/270012, 6/2033
Mar 07 15:09:38 vm-6 pmxcfs[2033]: [dcdb] notice: all data is up to date
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 5/260810, 6/2033
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: starting data syncronisation
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 6/2033
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/0000003F)
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/00000040)
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: received all states
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: leader is 3/270012
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: synced members: 3/270012, 6/2033
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: all data is up to date
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 4/258432, 6/2033
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: starting data syncronisation
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/00000041)
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: members: 3/270012, 6/2033
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 3/270012/00000042)
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: received all states
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: leader is 3/270012
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: synced members: 3/270012, 6/2033
Mar 07 15:09:44 vm-6 pmxcfs[2033]: [dcdb] notice: all data is up to date
Mar 07 15:09:45 vm-6 corosync[2304]: notice [TOTEM ] A new membership (192.168.16.1:475064) was formed. Members joined: 1 2 8
Mar 07 15:09:45 vm-6 corosync[2304]: [TOTEM ] A new membership (192.168.16.1:475064) was formed. Members joined: 1 2 8
Mar 07 15:09:45 vm-6 corosync[2304]: warning [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 corosync[2304]: [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 corosync[2304]: [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 corosync[2304]: warning [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 corosync[2304]: warning [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 corosync[2304]: [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 corosync[2304]: [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 corosync[2304]: warning [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 corosync[2304]: warning [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 corosync[2304]: [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 corosync[2304]: [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 corosync[2304]: warning [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 corosync[2304]: [CPG ] downlist left_list: 0 received
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: members: 2/2545, 3/270012, 6/2033
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: starting data syncronisation
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: members: 1/2141, 2/2545, 3/270012, 6/2033
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: members: 2/2545, 3/270012, 4/258432, 5/260810, 6/2033
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: starting data syncronisation
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: members: 1/2141, 2/2545, 3/270012, 6/2033, 8/249898
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: members: 1/2141, 2/2545, 3/270012, 4/258432, 5/260810, 6/2033
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: members: 1/2141, 2/2545, 3/270012, 4/258432, 5/260810, 6/2033, 8/249898
Mar 07 15:09:45 vm-6 corosync[2304]: notice [QUORUM] This node is within the primary component and will provide service.
Mar 07 15:09:45 vm-6 corosync[2304]: notice [QUORUM] Members[7]: 1 2 3 4 5 6 8
Mar 07 15:09:45 vm-6 corosync[2304]: notice [MAIN ] Completed service synchronization, ready to provide service.
Mar 07 15:09:45 vm-6 corosync[2304]: [QUORUM] This node is within the primary component and will provide service.
Mar 07 15:09:45 vm-6 corosync[2304]: [QUORUM] Members[7]: 1 2 3 4 5 6 8
Mar 07 15:09:45 vm-6 corosync[2304]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: node has quorum
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] crit: ignore sync request from wrong member 2/2545
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 2/2545/00000017)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] crit: ignore sync request from wrong member 2/2545
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 2/2545/00000018)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] crit: ignore sync request from wrong member 2/2545
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: received sync request (epoch 2/2545/00000015)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 1/2141/0000000B)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 1/2141/0000000C)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] crit: ignore sync request from wrong member 2/2545
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: received sync request (epoch 2/2545/00000016)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 1/2141/0000000D)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 1/2141/0000000E)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] crit: ignore sync request from wrong member 2/2545
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: received sync request (epoch 2/2545/00000017)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] crit: ignore sync request from wrong member 2/2545
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: received sync request (epoch 2/2545/00000018)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: received sync request (epoch 1/2141/00000009)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: received sync request (epoch 1/2141/0000000A)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: received sync request (epoch 1/2141/0000000B)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: received sync request (epoch 1/2141/0000000C)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: received sync request (epoch 1/2141/0000000D)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: received sync request (epoch 1/2141/0000000E)
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: received all states
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: leader is 1/2141
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: synced members: 1/2141, 2/2545, 3/270012, 6/2033, 8/249898
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [dcdb] notice: all data is up to date
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: received all states
Mar 07 15:09:45 vm-6 pmxcfs[2033]: [status] notice: all data is up to date
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: members: 1/2141, 2/2545, 3/270012, 5/260810, 6/2033, 8/249898
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: starting data syncronisation
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 1/2141/0000000F)
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: members: 1/2141, 2/2545, 3/270012, 6/2033, 8/249898
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 1/2141/00000010)
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: received all states
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: leader is 1/2141
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: synced members: 1/2141, 2/2545, 3/270012, 6/2033, 8/249898
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: all data is up to date
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: dfsm_deliver_queue: queue length 1
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: remove message from non-member 5/260810
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: members: 1/2141, 2/2545, 3/270012, 4/258432, 6/2033, 8/249898
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: starting data syncronisation
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 1/2141/00000011)
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: members: 1/2141, 2/2545, 3/270012, 6/2033, 8/249898
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 1/2141/00000012)
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: received all states
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: leader is 1/2141
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: synced members: 1/2141, 2/2545, 3/270012, 6/2033, 8/249898
Mar 07 15:09:50 vm-6 pmxcfs[2033]: [dcdb] notice: all data is up to date
Mar 07 15:09:56 vm-6 pmxcfs[2033]: [dcdb] notice: members: 1/2141, 2/2545, 3/270012, 5/260810, 6/2033, 8/249898
Mar 07 15:09:56 vm-6 pmxcfs[2033]: [dcdb] notice: starting data syncronisation
Mar 07 15:09:56 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 1/2141/00000013)
Mar 07 15:09:56 vm-6 pmxcfs[2033]: [dcdb] notice: members: 1/2141, 2/2545, 3/270012, 6/2033, 8/249898
Mar 07 15:09:56 vm-6 pmxcfs[2033]: [dcdb] notice: received sync request (epoch 1/2141/00000014)
The kernel log contains the following:
[Mon Feb 4 11:40:17 2019] systemd[1]: pve-daily-update.timer: Adding 3h 51min 19.313928s random time.
[Mon Feb 4 11:40:24 2019] FS-Cache: Loaded
[Mon Feb 4 11:40:24 2019] FS-Cache: Netfs 'nfs' registered for caching
[Mon Feb 4 20:56:17 2019] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
[Tue Feb 5 04:55:47 2019] perf: interrupt took too long (3127 > 3126), lowering kernel.perf_event_max_sample_rate to 63750
[Wed Feb 6 11:22:11 2019] perf: interrupt took too long (3934 > 3908), lowering kernel.perf_event_max_sample_rate to 50750
[Thu Mar 7 15:15:46 2019] INFO: task pvesr:2590902 blocked for more than 120 seconds.
[Thu Mar 7 15:15:46 2019] Tainted: P O 4.15.18-9-pve #1
[Thu Mar 7 15:15:46 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Mar 7 15:15:46 2019] pvesr D 0 2590902 1 0x00000000
[Thu Mar 7 15:15:46 2019] Call Trace:
[Thu Mar 7 15:15:46 2019] __schedule+0x3e0/0x870
[Thu Mar 7 15:15:46 2019] ? path_parentat+0x3e/0x80
[Thu Mar 7 15:15:46 2019] schedule+0x36/0x80
[Thu Mar 7 15:15:46 2019] rwsem_down_write_failed+0x208/0x390
[Thu Mar 7 15:15:46 2019] call_rwsem_down_write_failed+0x17/0x30
[Thu Mar 7 15:15:46 2019] ? call_rwsem_down_write_failed+0x17/0x30
[Thu Mar 7 15:15:46 2019] down_write+0x2d/0x40
[Thu Mar 7 15:15:46 2019] filename_create+0x7e/0x160
[Thu Mar 7 15:15:46 2019] SyS_mkdir+0x51/0x100
[Thu Mar 7 15:15:46 2019] do_syscall_64+0x73/0x130
[Thu Mar 7 15:15:46 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Thu Mar 7 15:15:46 2019] RIP: 0033:0x7f17ff6d9447
[Thu Mar 7 15:15:46 2019] RSP: 002b:00007ffdaca1fdc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[Thu Mar 7 15:15:46 2019] RAX: ffffffffffffffda RBX: 00005569f2289010 RCX: 00007f17ff6d9447
[Thu Mar 7 15:15:46 2019] RDX: 00005569f12c5e84 RSI: 00000000000001ff RDI: 00005569f5c71c60
[Thu Mar 7 15:15:46 2019] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000030
[Thu Mar 7 15:15:46 2019] R10: 0000000000000000 R11: 0000000000000246 R12: 00005569f389fe48
[Thu Mar 7 15:15:46 2019] R13: 00005569f5ac16b0 R14: 00005569f5c71c60 R15: 00000000000001ff
[Thu Mar 7 15:17:46 2019] INFO: task pvesr:2590902 blocked for more than 120 seconds.
[Thu Mar 7 15:17:46 2019] Tainted: P O 4.15.18-9-pve #1
[Thu Mar 7 15:17:46 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Mar 7 15:17:46 2019] pvesr D 0 2590902 1 0x00000000
[Thu Mar 7 15:17:46 2019] Call Trace:
[Thu Mar 7 15:17:46 2019] __schedule+0x3e0/0x870
[Thu Mar 7 15:17:46 2019] ? path_parentat+0x3e/0x80
[Thu Mar 7 15:17:46 2019] schedule+0x36/0x80
[Thu Mar 7 15:17:46 2019] rwsem_down_write_failed+0x208/0x390
[Thu Mar 7 15:17:46 2019] call_rwsem_down_write_failed+0x17/0x30
[Thu Mar 7 15:17:46 2019] ? call_rwsem_down_write_failed+0x17/0x30
[Thu Mar 7 15:17:46 2019] down_write+0x2d/0x40
[Thu Mar 7 15:17:46 2019] filename_create+0x7e/0x160
[Thu Mar 7 15:17:46 2019] SyS_mkdir+0x51/0x100
[Thu Mar 7 15:17:46 2019] do_syscall_64+0x73/0x130
[Thu Mar 7 15:17:46 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Thu Mar 7 15:17:46 2019] RIP: 0033:0x7f17ff6d9447
[Thu Mar 7 15:17:46 2019] RSP: 002b:00007ffdaca1fdc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[Thu Mar 7 15:17:46 2019] RAX: ffffffffffffffda RBX: 00005569f2289010 RCX: 00007f17ff6d9447
[Thu Mar 7 15:17:46 2019] RDX: 00005569f12c5e84 RSI: 00000000000001ff RDI: 00005569f5c71c60
[Thu Mar 7 15:17:46 2019] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000030
[Thu Mar 7 15:17:46 2019] R10: 0000000000000000 R11: 0000000000000246 R12: 00005569f389fe48
[Thu Mar 7 15:17:46 2019] R13: 00005569f5ac16b0 R14: 00005569f5c71c60 R15: 00000000000001ff

Edit:
There is no other physical PVE cluster. I did once build a virtual test cluster, but its network lives exclusively on a bridge vmbr1 that does not use any physical network hardware.
 
