Whole cluster reboot when a node comes back online.

Jun 22, 2018
Hi,
I have a cluster of 9 nodes across 3 datacenters that looks like this:

Code:
   DC 3                              DC 1
+--------+      LINK 1->3     +---------------+           
| P-ARB  |--------------------|  P1 P2 P3 P10 |
+--------+                    +--------+------+
         |                             |
         |                             |
         | LINK 2->3                   | LINK 1->2
         |                             |
         |                             |
 +-------+--------+                    |
 |  P7 P8 P9 P11  +--------------------+
 +----------------+
       DC 2

The network config for each node (except P-ARB) is the following:
- 2x 10 Gbps for storage (external Ceph cluster), LACP bond
- 2x 1 Gbps for corosync ring0 (and only corosync, nothing else on those NICs), LACP bond
- 2x 10 Gbps for the VM network, the admin network, and corosync ring1, Open vSwitch controlled

Network config for the P-ARB node:
- 2x 1 Gbps for corosync ring0, LACP bond
- 2x 1 Gbps for the admin network, corosync ring1, and one Ceph monitor of the external cluster (no VMs run on this node), LACP bond
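For context, the two rings are declared in corosync.conf roughly as below. This is only a sketch: the cluster name matches the one visible in the logs, and the ring0 subnet is inferred from the 10.0.136.x addresses corosync reports, but the ring1 subnet and the exact option set are placeholders, not my actual config.

```
totem {
  version: 2
  cluster_name: tera-cluster
  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: 10.0.136.0    # dedicated 2x1G LACP bond (inferred from logs)
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.0.137.0    # OVS-controlled 10G network (placeholder subnet)
  }
}
```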

The latencies of the links are as follows:
LINK 1->3: 3.76 ms
LINK 1->2: 1.57 ms
LINK 2->3: 1.56 ms

For a maintenance operation, I stopped two nodes (P8 and P9).
When P8 came back online, all the other nodes self-fenced (except P-ARB).
The logs on the fenced nodes show no quorum failure, only the watchdog expiring.
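For reference, the vote arithmetic that corosync's votequorum applies is a plain strict majority; a minimal sketch (my own illustration, not corosync code) of why the cluster should have stayed quorate with only two nodes down:

```python
def votes_needed_for_quorum(expected_votes: int) -> int:
    # votequorum grants quorum to a partition holding a strict
    # majority of the expected votes: floor(n/2) + 1
    return expected_votes // 2 + 1

# 9 nodes, one vote each
print(votes_needed_for_quorum(9))        # 5 votes needed
# with P8 and P9 stopped, 7 of 9 votes remain -> still quorate
print(7 >= votes_needed_for_quorum(9))   # True
# the 2-node partition (P8 + P-ARB) seen later in the logs is not
print(2 >= votes_needed_for_quorum(9))   # False
```

So even with both nodes down, the remaining 7 nodes held quorum comfortably, which makes the mass self-fencing on P8's return surprising.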

P1:
Code:
Jun 21 11:14:25 proxmox1 pmxcfs[3045]: [status] notice: received log
Jun 21 11:14:25 proxmox1 pmxcfs[3045]: [status] notice: received log
Jun 21 11:14:25 proxmox1 pmxcfs[3045]: [status] notice: received log
Jun 21 11:14:25 proxmox1 pmxcfs[3045]: [status] notice: received log
Jun 21 11:14:26 proxmox1 pmxcfs[3045]: [status] notice: received log
Jun 21 11:14:55 proxmox1 pveproxy[18977]: worker exit
Jun 21 11:14:55 proxmox1 pveproxy[3195]: worker 18977 finished
Jun 21 11:14:55 proxmox1 pveproxy[3195]: starting 1 worker(s)
Jun 21 11:14:55 proxmox1 pveproxy[3195]: worker 2279 started
Jun 21 11:15:00 proxmox1 systemd[1]: Starting Proxmox VE replication runner...
Jun 21 11:15:01 proxmox1 CRON[2395]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 21 11:15:18 proxmox1 pvedaemon[35087]: <root@pam> successful auth for user 'brachere@e-tera.com'
Jun 21 11:15:24 proxmox1 watchdog-mux[1548]: client watchdog expired - disable watchdog updates
[... run of NUL bytes — journal truncated by the hard reset ...]
Jun 21 11:18:20 proxmox1 systemd-modules-load[659]: Inserted module 'iscsi_tcp'
Jun 21 11:18:20 proxmox1 systemd-modules-load[659]: Inserted module 'ib_iser'

P2:
Code:
Jun 21 11:14:24 proxmox2 pmxcfs[3047]: [status] notice: received log
Jun 21 11:14:24 proxmox2 pmxcfs[3047]: [status] notice: received log
Jun 21 11:14:24 proxmox2 pmxcfs[3047]: [status] notice: received log
Jun 21 11:14:24 proxmox2 pmxcfs[3047]: [status] notice: received log
Jun 21 11:14:24 proxmox2 pmxcfs[3047]: [status] notice: received log
Jun 21 11:14:24 proxmox2 pmxcfs[3047]: [status] notice: received log
Jun 21 11:14:25 proxmox2 pmxcfs[3047]: [status] notice: received log
Jun 21 11:14:25 proxmox2 pmxcfs[3047]: [status] notice: received log
Jun 21 11:14:25 proxmox2 pmxcfs[3047]: [status] notice: received log
Jun 21 11:14:25 proxmox2 pmxcfs[3047]: [status] notice: received log
Jun 21 11:14:26 proxmox2 pmxcfs[3047]: [status] notice: received log
[... run of NUL bytes — journal truncated by the hard reset ...]
Jun 21 11:17:33 proxmox2 systemd-modules-load[670]: Inserted module 'iscsi_tcp'
Jun 21 11:17:33 proxmox2 systemd-modules-load[670]: Inserted module 'ib_iser'
Jun 21 11:17:33 proxmox2 systemd-modules-load[670]: Inserted module 'vhost_net'
Jun 21 11:17:33 proxmox2 keyboard-setup.sh[665]: cannot open file /tmp/tmpkbd.0yra79

P3:
Code:
Jun 21 11:14:24 proxmox3 pmxcfs[3011]: [status] notice: received log
Jun 21 11:14:24 proxmox3 pmxcfs[3011]: [status] notice: received log
Jun 21 11:14:24 proxmox3 pmxcfs[3011]: [status] notice: received log
Jun 21 11:14:24 proxmox3 pmxcfs[3011]: [status] notice: received log
Jun 21 11:14:24 proxmox3 pmxcfs[3011]: [status] notice: received log
Jun 21 11:14:24 proxmox3 pmxcfs[3011]: [status] notice: received log
Jun 21 11:14:25 proxmox3 pmxcfs[3011]: [status] notice: received log
Jun 21 11:14:25 proxmox3 pmxcfs[3011]: [status] notice: received log
Jun 21 11:14:25 proxmox3 pmxcfs[3011]: [status] notice: received log
Jun 21 11:14:25 proxmox3 pmxcfs[3011]: [status] notice: received log
Jun 21 11:14:26 proxmox3 pmxcfs[3011]: [status] notice: received log
Jun 21 11:15:00 proxmox3 systemd[1]: Starting Proxmox VE replication runner...
Jun 21 11:15:01 proxmox3 CRON[100686]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 21 11:15:23 proxmox3 watchdog-mux[1561]: client watchdog expired - disable watchdog updates
Jun 21 11:18:18 proxmox3 systemd-modules-load[679]: Inserted module 'iscsi_tcp'
Jun 21 11:18:18 proxmox3 kernel: [    0.000000] Linux version 4.15.17-2-pve (tlamprecht@evita) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.17-10 (Tue, 22 May 2018 11:15:44 +0200) ()
Jun 21 11:18:18 proxmox3 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.17-2-pve root=/dev/mapper/pve-root ro quiet

Same type of logs for P7 and P10.

P8:
Code:
Jun 21 11:07:53 proxmox8 pmxcfs[2673]: [status] notice: received log
Jun 21 11:07:53 proxmox8 pmxcfs[2673]: [status] notice: received log
Jun 21 11:07:54 proxmox8 pmxcfs[2673]: [status] notice: received log
Jun 21 11:07:54 proxmox8 pmxcfs[2673]: [status] notice: received log
Jun 21 11:07:54 proxmox8 systemd[1]: Stopped PVE Cluster Ressource Manager Daemon.
Jun 21 11:14:18 proxmox8 systemd-modules-load[641]: Inserted module 'iscsi_tcp'
Jun 21 11:14:18 proxmox8 kernel: [    0.000000] Linux version 4.15.17-2-pve (tlamprecht@evita) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.17-10 (Tue, 22 May 2018 11:15:44 +0200) ()
Jun 21 11:14:18 proxmox8 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.17-2-pve root=/dev/mapper/pve-root ro quiet
Jun 21 11:14:18 proxmox8 kernel: [    0.000000] KERNEL supported cpus:
Jun 21 11:14:18 proxmox8 kernel: [    0.000000]   Intel GenuineIntel
.
.
.
Jun 21 11:14:26 proxmox8 systemd[1]: Started The Proxmox VE cluster filesystem.
Jun 21 11:14:26 proxmox8 systemd[1]: Starting PVE Status Daemon...
Jun 21 11:14:26 proxmox8 systemd[1]: Starting Corosync Cluster Engine...
Jun 21 11:14:26 proxmox8 systemd[1]: Started Regular background program processing daemon.
Jun 21 11:14:26 proxmox8 systemd[1]: Starting Proxmox VE firewall...
Jun 21 11:14:26 proxmox8 cron[2707]: (CRON) INFO (pidfile fd = 3)
Jun 21 11:14:26 proxmox8 cron[2707]: (CRON) INFO (Running @reboot jobs)
Jun 21 11:14:26 proxmox8 corosync[2703]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Jun 21 11:14:26 proxmox8 corosync[2703]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Jun 21 11:14:26 proxmox8 corosync[2703]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jun 21 11:14:26 proxmox8 corosync[2703]:  [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun 21 11:14:26 proxmox8 corosync[2703]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jun 21 11:14:26 proxmox8 corosync[2703]:  [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun 21 11:14:26 proxmox8 corosync[2703]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [TOTEM ] The network interface [10.0.136.52] is now up.
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Jun 21 11:14:26 proxmox8 corosync[2703]: info    [QB    ] server name: cmap
Jun 21 11:14:26 proxmox8 corosync[2703]:  [TOTEM ] The network interface [10.0.136.52] is now up.
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Jun 21 11:14:26 proxmox8 corosync[2703]: info    [QB    ] server name: cfg
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun 21 11:14:26 proxmox8 corosync[2703]: info    [QB    ] server name: cpg
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jun 21 11:14:26 proxmox8 corosync[2703]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
Jun 21 11:14:26 proxmox8 corosync[2703]: warning [WD    ] resource load_15min missing a recovery key.
Jun 21 11:14:26 proxmox8 corosync[2703]: warning [WD    ] resource memory_used missing a recovery key.
Jun 21 11:14:26 proxmox8 corosync[2703]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
Jun 21 11:14:26 proxmox8 corosync[2703]: info    [WD    ] no resources configured.
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [QUORUM] Using quorum provider corosync_votequorum
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun 21 11:14:26 proxmox8 corosync[2703]: info    [QB    ] server name: votequorum
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun 21 11:14:26 proxmox8 corosync[2703]: info    [QB    ] server name: quorum
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [TOTEM ] The network interface [169.254.5.5] is now up.
Jun 21 11:14:26 proxmox8 corosync[2703]: warning [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Jun 21 11:14:26 proxmox8 corosync[2703]:  [QB    ] server name: cmap
Jun 21 11:14:26 proxmox8 systemd[1]: Started Corosync Cluster Engine.
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [TOTEM ] A new membership (10.0.136.52:74932) was formed. Members joined: 9
Jun 21 11:14:26 proxmox8 corosync[2703]: warning [CPG   ] downlist left_list: 0 received
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [QUORUM] Members[1]: 9
Jun 21 11:14:26 proxmox8 corosync[2703]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 21 11:14:26 proxmox8 corosync[2703]:  [SERV  ] Service engine loaded: corosync configuration service [1]
Jun 21 11:14:26 proxmox8 systemd[1]: Starting PVE API Daemon...
Jun 21 11:14:26 proxmox8 corosync[2703]:  [QB    ] server name: cfg
Jun 21 11:14:26 proxmox8 corosync[2703]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun 21 11:14:26 proxmox8 corosync[2703]:  [QB    ] server name: cpg
Jun 21 11:14:26 proxmox8 corosync[2703]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jun 21 11:14:26 proxmox8 corosync[2703]:  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jun 21 11:14:26 proxmox8 corosync[2703]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
Jun 21 11:14:26 proxmox8 corosync[2703]:  [WD    ] resource load_15min missing a recovery key.
Jun 21 11:14:26 proxmox8 corosync[2703]:  [WD    ] resource memory_used missing a recovery key.
Jun 21 11:14:26 proxmox8 corosync[2703]:  [WD    ] no resources configured.
Jun 21 11:14:26 proxmox8 corosync[2703]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jun 21 11:14:26 proxmox8 corosync[2703]:  [QUORUM] Using quorum provider corosync_votequorum
Jun 21 11:14:26 proxmox8 corosync[2703]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun 21 11:14:26 proxmox8 corosync[2703]:  [QB    ] server name: votequorum
Jun 21 11:14:26 proxmox8 corosync[2703]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun 21 11:14:26 proxmox8 corosync[2703]:  [QB    ] server name: quorum
Jun 21 11:14:26 proxmox8 corosync[2703]:  [TOTEM ] The network interface [169.254.5.5] is now up.
Jun 21 11:14:26 proxmox8 corosync[2703]:  [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Jun 21 11:14:26 proxmox8 corosync[2703]:  [TOTEM ] A new membership (10.0.136.52:74932) was formed. Members joined: 9
Jun 21 11:14:26 proxmox8 corosync[2703]:  [CPG   ] downlist left_list: 0 received
Jun 21 11:14:26 proxmox8 corosync[2703]:  [QUORUM] Members[1]: 9
Jun 21 11:14:26 proxmox8 corosync[2703]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 21 11:14:26 proxmox8 pve-firewall[2754]: starting server
Jun 21 11:14:26 proxmox8 pvestatd[2755]: starting server
Jun 21 11:14:26 proxmox8 systemd[1]: Started Proxmox VE firewall.
Jun 21 11:14:26 proxmox8 systemd[1]: Started PVE Status Daemon.
Jun 21 11:14:26 proxmox8 kernel: [   13.729489] ip6_tables: (C) 2000-2006 Netfilter Core Team
Jun 21 11:14:27 proxmox8 kernel: [   13.823759] ip_set: protocol 6
Jun 21 11:14:27 proxmox8 pvedaemon[2777]: starting server
Jun 21 11:14:27 proxmox8 pvedaemon[2777]: starting 3 worker(s)
Jun 21 11:14:27 proxmox8 pvedaemon[2777]: worker 2780 started
Jun 21 11:14:27 proxmox8 pvedaemon[2777]: worker 2781 started
Jun 21 11:14:27 proxmox8 pvedaemon[2777]: worker 2782 started
Jun 21 11:14:27 proxmox8 systemd[1]: Started PVE API Daemon.
Jun 21 11:14:27 proxmox8 systemd[1]: Starting PVE API Proxy Server...
Jun 21 11:14:27 proxmox8 systemd[1]: Starting PVE Cluster Ressource Manager Daemon...
Jun 21 11:14:27 proxmox8 pve-ha-crm[2803]: starting server
Jun 21 11:14:27 proxmox8 pve-ha-crm[2803]: status change startup => wait_for_quorum
Jun 21 11:14:27 proxmox8 systemd[1]: Started PVE Cluster Ressource Manager Daemon.
Jun 21 11:14:27 proxmox8 systemd[1]: Starting PVE Local HA Ressource Manager Daemon...
Jun 21 11:14:28 proxmox8 pveproxy[2821]: starting server
Jun 21 11:14:28 proxmox8 pveproxy[2821]: starting 3 worker(s)
Jun 21 11:14:28 proxmox8 pveproxy[2821]: worker 2824 started
Jun 21 11:14:28 proxmox8 pveproxy[2821]: worker 2825 started
Jun 21 11:14:28 proxmox8 pveproxy[2821]: worker 2826 started
Jun 21 11:14:28 proxmox8 systemd[1]: Started PVE API Proxy Server.
Jun 21 11:14:28 proxmox8 systemd[1]: Starting PVE SPICE Proxy Server...
Jun 21 11:14:28 proxmox8 pve-ha-lrm[2845]: starting server
Jun 21 11:14:28 proxmox8 pve-ha-lrm[2845]: status change startup => wait_for_agent_lock
Jun 21 11:14:28 proxmox8 systemd[1]: Started PVE Local HA Ressource Manager Daemon.
Jun 21 11:14:28 proxmox8 spiceproxy[2850]: starting server
Jun 21 11:14:28 proxmox8 spiceproxy[2850]: starting 1 worker(s)
Jun 21 11:14:28 proxmox8 spiceproxy[2850]: worker 2853 started
Jun 21 11:14:28 proxmox8 systemd[1]: Started PVE SPICE Proxy Server.
Jun 21 11:14:28 proxmox8 systemd[1]: Starting PVE guests...
Jun 21 11:14:29 proxmox8 pve-guests[2856]: <root@pam> starting task UPID:proxmox8:00000B38:00000645:5B2B6C75:startall::root@pam:
Jun 21 11:14:29 proxmox8 pvesh[2856]: waiting for quorum ...
Jun 21 11:14:31 proxmox8 pmxcfs[2678]: [status] notice: update cluster info (cluster name  tera-cluster, version = 25)
Jun 21 11:14:48 proxmox8 systemd[1]: Created slice User Slice of root.
Jun 21 11:14:48 proxmox8 systemd[1]: Starting User Manager for UID 0...
Jun 21 11:14:48 proxmox8 systemd[1]: Started Session 1 of user root.
Jun 21 11:14:48 proxmox8 systemd[3089]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Jun 21 11:14:48 proxmox8 systemd[3089]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Jun 21 11:14:48 proxmox8 systemd[3089]: Reached target Paths.
Jun 21 11:14:48 proxmox8 systemd[3089]: Listening on GnuPG cryptographic agent (access for web browsers).
Jun 21 11:14:48 proxmox8 systemd[3089]: Listening on GnuPG network certificate management daemon.
Jun 21 11:14:48 proxmox8 systemd[3089]: Reached target Timers.
Jun 21 11:14:48 proxmox8 systemd[3089]: Listening on GnuPG cryptographic agent and passphrase cache.
Jun 21 11:14:48 proxmox8 systemd[3089]: Reached target Sockets.
Jun 21 11:14:48 proxmox8 systemd[3089]: Reached target Basic System.
Jun 21 11:14:48 proxmox8 systemd[3089]: Reached target Default.
Jun 21 11:14:48 proxmox8 systemd[3089]: Startup finished in 14ms.
Jun 21 11:14:48 proxmox8 systemd[1]: Started User Manager for UID 0.
Jun 21 11:14:49 proxmox8 systemd-timesyncd[1105]: Synchronized to time server 195.154.189.15:123 (2.debian.pool.ntp.org).
Jun 21 11:15:00 proxmox8 systemd[1]: Starting Proxmox VE replication runner...
Jun 21 11:15:00 proxmox8 pvesr[3234]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:01 proxmox8 CRON[3253]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 21 11:15:01 proxmox8 pvesr[3234]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:02 proxmox8 pvesr[3234]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:03 proxmox8 pvesr[3234]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:04 proxmox8 pvesr[3234]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:05 proxmox8 pvesr[3234]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:06 proxmox8 pvesr[3234]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:07 proxmox8 pvesr[3234]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:08 proxmox8 pvesr[3234]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:09 proxmox8 systemd[1]: Stopping User Manager for UID 0...
Jun 21 11:15:09 proxmox8 systemd[3089]: Stopped target Default.
Jun 21 11:15:09 proxmox8 systemd[3089]: Stopped target Basic System.
Jun 21 11:15:09 proxmox8 systemd[3089]: Stopped target Sockets.
Jun 21 11:15:09 proxmox8 systemd[3089]: Closed GnuPG network certificate management daemon.
Jun 21 11:15:09 proxmox8 systemd[3089]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Jun 21 11:15:09 proxmox8 systemd[3089]: Closed GnuPG cryptographic agent (access for web browsers).
Jun 21 11:15:09 proxmox8 systemd[3089]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Jun 21 11:15:09 proxmox8 systemd[3089]: Stopped target Timers.
Jun 21 11:15:09 proxmox8 systemd[3089]: Stopped target Paths.
Jun 21 11:15:09 proxmox8 systemd[3089]: Closed GnuPG cryptographic agent and passphrase cache.
Jun 21 11:15:09 proxmox8 systemd[3089]: Reached target Shutdown.
Jun 21 11:15:09 proxmox8 systemd[3089]: Starting Exit the Session...
Jun 21 11:15:09 proxmox8 systemd[3089]: Received SIGRTMIN+24 from PID 3354 (kill).
Jun 21 11:15:09 proxmox8 systemd[1]: Stopped User Manager for UID 0.
Jun 21 11:15:09 proxmox8 systemd[1]: Removed slice User Slice of root.
Jun 21 11:15:09 proxmox8 pvesr[3234]: error with cfs lock 'file-replication_cfg': no quorum!
Jun 21 11:15:09 proxmox8 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jun 21 11:15:09 proxmox8 systemd[1]: Failed to start Proxmox VE replication runner.
Jun 21 11:15:09 proxmox8 systemd[1]: pvesr.service: Unit entered failed state.
Jun 21 11:15:09 proxmox8 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jun 21 11:15:10 proxmox8 systemd[1]: Created slice User Slice of root.
Jun 21 11:15:10 proxmox8 systemd[1]: Starting User Manager for UID 0...
Jun 21 11:15:10 proxmox8 systemd[1]: Started Session 4 of user root.
Jun 21 11:15:10 proxmox8 systemd[3379]: Listening on GnuPG cryptographic agent and passphrase cache.
Jun 21 11:15:10 proxmox8 systemd[3379]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Jun 21 11:15:10 proxmox8 systemd[3379]: Reached target Timers.
Jun 21 11:15:10 proxmox8 systemd[3379]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Jun 21 11:15:10 proxmox8 systemd[3379]: Reached target Paths.
Jun 21 11:15:10 proxmox8 systemd[3379]: Listening on GnuPG network certificate management daemon.
Jun 21 11:15:10 proxmox8 systemd[3379]: Listening on GnuPG cryptographic agent (access for web browsers).
Jun 21 11:15:10 proxmox8 systemd[3379]: Reached target Sockets.
Jun 21 11:15:10 proxmox8 systemd[3379]: Reached target Basic System.
Jun 21 11:15:10 proxmox8 systemd[3379]: Reached target Default.
Jun 21 11:15:10 proxmox8 systemd[3379]: Startup finished in 15ms.
Jun 21 11:15:10 proxmox8 systemd[1]: Started User Manager for UID 0.
Jun 21 11:15:44 proxmox8 corosync[2703]: notice  [TOTEM ] A new membership (10.0.136.20:75036) was formed. Members joined: 8
Jun 21 11:15:44 proxmox8 corosync[2703]:  [TOTEM ] A new membership (10.0.136.20:75036) was formed. Members joined: 8
Jun 21 11:15:44 proxmox8 corosync[2703]: warning [CPG   ] downlist left_list: 0 received
Jun 21 11:15:44 proxmox8 corosync[2703]:  [CPG   ] downlist left_list: 0 received
Jun 21 11:15:44 proxmox8 corosync[2703]: warning [CPG   ] downlist left_list: 6 received
Jun 21 11:15:44 proxmox8 corosync[2703]:  [CPG   ] downlist left_list: 6 received
Jun 21 11:15:44 proxmox8 corosync[2703]: notice  [QUORUM] Members[2]: 8 9
Jun 21 11:15:44 proxmox8 corosync[2703]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 21 11:15:44 proxmox8 corosync[2703]:  [QUORUM] Members[2]: 8 9
Jun 21 11:15:44 proxmox8 corosync[2703]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [dcdb] notice: members: 8/525829, 9/2678
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [dcdb] notice: starting data syncronisation
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [status] notice: members: 8/525829, 9/2678
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [status] notice: starting data syncronisation
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [dcdb] notice: members: 9/2678
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [dcdb] notice: all data is up to date
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [status] notice: received sync request (epoch 8/525829/00000067)
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [status] notice: received all states
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [status] notice: all data is up to date
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [status] notice: dfsm_deliver_queue: queue length 482
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [status] notice: received log
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [main] notice: ignore duplicate
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [status] notice: received log
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [main] notice: ignore duplicate
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [status] notice: received log
Jun 21 11:15:44 proxmox8 pmxcfs[2678]: [main] notice: ignore duplicate

P9:
no logs — the node was still offline.

P11:
Code:
Jun 21 11:14:24 proxmox11 pmxcfs[2521]: [status] notice: received log
Jun 21 11:14:25 proxmox11 pmxcfs[2521]: [status] notice: received log
Jun 21 11:14:25 proxmox11 pmxcfs[2521]: [status] notice: received log
Jun 21 11:14:25 proxmox11 pmxcfs[2521]: [status] notice: received log
Jun 21 11:14:25 proxmox11 pmxcfs[2521]: [status] notice: received log
Jun 21 11:14:26 proxmox11 pmxcfs[2521]: [status] notice: received log
[... run of NUL bytes — journal truncated by the hard reset ...]
Jun 21 11:17:13 proxmox11 systemd-modules-load[509]: Inserted module 'iscsi_tcp'
Jun 21 11:17:13 proxmox11 systemd-modules-load[509]: Inserted module 'ib_iser'
Jun 21 11:17:13 proxmox11 systemd-modules-load[509]: Inserted module 'vhost_net'
Jun 21 11:17:13 proxmox11 keyboard-setup.sh[519]: cannot open file /tmp/tmpkbd.fljNd0
Jun 21 11:17:13 proxmox11 systemd[1]: Starting Flush Journal to Persistent Storage...
Jun 21 11:17:13 proxmox11 systemd[1]: Starting udev Wait for Complete Device Initialization...
Jun 21 11:17:13 proxmox11 systemd[1]: Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch

P-ARB:
Code:
Jun 21 11:15:29 proxmox-arb ceph-mgr[3780771]: ::ffff:195.49.132.87 - - [21/Jun/2018:11:15:29] "OPTIONS / HTTP/1.0" 302 131 "" ""
Jun 21 11:15:29 proxmox-arb pveproxy[2274070]: proxy detected vanished client connection
Jun 21 11:15:29 proxmox-arb pveproxy[2276100]: proxy detected vanished client connection
Jun 21 11:15:31 proxmox-arb pvedaemon[2246776]: worker exit
Jun 21 11:15:31 proxmox-arb pvedaemon[2095]: worker 2246776 finished
Jun 21 11:15:31 proxmox-arb pvedaemon[2095]: starting 1 worker(s)
Jun 21 11:15:31 proxmox-arb pvedaemon[2095]: worker 2290483 started
Jun 21 11:15:31 proxmox-arb ceph-mgr[3780771]: ::ffff:195.49.132.87 - - [21/Jun/2018:11:15:31] "OPTIONS / HTTP/1.0" 302 131 "" ""
Jun 21 11:15:33 proxmox-arb ceph-mgr[3780771]: ::ffff:195.49.132.87 - - [21/Jun/2018:11:15:33] "OPTIONS / HTTP/1.0" 302 131 "" ""
Jun 21 11:15:39 proxmox-arb snmpd[1539]: error on subcontainer 'ia_addr' insert (-1)
Jun 21 11:15:39 proxmox-arb snmpd[1539]: error on subcontainer 'ia_addr' insert (-1)
Jun 21 11:15:44 proxmox-arb corosync[3975603]: notice  [TOTEM ] A new membership (10.0.136.20:75036) was formed. Members joined: 9 left: 6 5 2 7 1 3
Jun 21 11:15:44 proxmox-arb corosync[3975603]: notice  [TOTEM ] Failed to receive the leave message. failed: 6 5 2 7 1 3
Jun 21 11:15:44 proxmox-arb corosync[3975603]:  [TOTEM ] A new membership (10.0.136.20:75036) was formed. Members joined: 9 left: 6 5 2 7 1 3
Jun 21 11:15:44 proxmox-arb corosync[3975603]:  [TOTEM ] Failed to receive the leave message. failed: 6 5 2 7 1 3
Jun 21 11:15:44 proxmox-arb corosync[3975603]: warning [CPG   ] downlist left_list: 0 received
Jun 21 11:15:44 proxmox-arb corosync[3975603]: warning [CPG   ] downlist left_list: 6 received
Jun 21 11:15:44 proxmox-arb corosync[3975603]:  [CPG   ] downlist left_list: 0 received
Jun 21 11:15:44 proxmox-arb corosync[3975603]:  [CPG   ] downlist left_list: 6 received
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] notice: members: 8/525829
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [status] notice: members: 8/525829
Jun 21 11:15:44 proxmox-arb corosync[3975603]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 21 11:15:44 proxmox-arb corosync[3975603]: notice  [QUORUM] Members[2]: 8 9
Jun 21 11:15:44 proxmox-arb corosync[3975603]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 21 11:15:44 proxmox-arb corosync[3975603]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 21 11:15:44 proxmox-arb corosync[3975603]:  [QUORUM] Members[2]: 8 9
Jun 21 11:15:44 proxmox-arb corosync[3975603]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [status] notice: node lost quorum
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] crit: received write while not quorate - trigger resync
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] crit: leaving CPG group
Jun 21 11:15:44 proxmox-arb pve-ha-lrm[536042]: unable to write lrm status file - unable to open file '/etc/pve/nodes/proxmox-arb/lrm_status.tmp.536042' - Permission denied
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [status] notice: members: 8/525829, 9/2678
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [status] notice: starting data syncronisation
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [status] notice: received sync request (epoch 8/525829/00000067)
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [status] notice: received all states
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [status] notice: all data is up to date
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [status] notice: dfsm_deliver_queue: queue length 482
Jun 21 11:15:44 proxmox-arb pvesr[2286104]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] notice: start cluster connection
Jun 21 11:15:44 proxmox-arb pve-ha-crm[536352]: loop take too long (79 seconds)
Jun 21 11:15:44 proxmox-arb pve-ha-crm[536352]: status change slave => wait_for_quorum
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] notice: members: 8/525829, 9/2678
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] notice: starting data syncronisation
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] notice: received sync request (epoch 8/525829/00000070)
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] notice: received all states
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] notice: leader is 8/525829
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] notice: synced members: 8/525829
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] notice: start sending inode updates
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] notice: sent all (17) updates
Jun 21 11:15:44 proxmox-arb pmxcfs[525829]: [dcdb] notice: all data is up to date
Jun 21 11:15:45 proxmox-arb pvesr[2286104]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:46 proxmox-arb pvesr[2286104]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:47 proxmox-arb pvesr[2286104]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:48 proxmox-arb pvesr[2286104]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:49 proxmox-arb pve-ha-lrm[536042]: loop take too long (83 seconds)
Jun 21 11:15:49 proxmox-arb pvesr[2286104]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:50 proxmox-arb pvesr[2286104]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:51 proxmox-arb pvesr[2286104]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:52 proxmox-arb pvesr[2286104]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:15:53 proxmox-arb pvesr[2286104]: error with cfs lock 'file-replication_cfg': no quorum!
Jun 21 11:15:53 proxmox-arb systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jun 21 11:15:53 proxmox-arb systemd[1]: Failed to start Proxmox VE replication runner.
Jun 21 11:15:53 proxmox-arb systemd[1]: pvesr.service: Unit entered failed state.
Jun 21 11:15:53 proxmox-arb systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jun 21 11:16:00 proxmox-arb systemd[1]: Starting Proxmox VE replication runner...
Jun 21 11:16:00 proxmox-arb pvesr[2291730]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:16:01 proxmox-arb pveproxy[2274070]: proxy detected vanished client connection
Jun 21 11:16:01 proxmox-arb pveproxy[2274070]: proxy detected vanished client connection
Jun 21 11:16:01 proxmox-arb pvesr[2291730]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:16:02 proxmox-arb pveproxy[2276100]: proxy detected vanished client connection
Jun 21 11:16:02 proxmox-arb pvesr[2291730]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:16:03 proxmox-arb pvesr[2291730]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:16:04 proxmox-arb pveproxy[2288635]: proxy detected vanished client connection
Jun 21 11:16:04 proxmox-arb pvesr[2291730]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:16:05 proxmox-arb pveproxy[2274070]: proxy detected vanished client connection
Jun 21 11:16:05 proxmox-arb pveproxy[2276100]: proxy detected vanished client connection
Jun 21 11:16:05 proxmox-arb pvesr[2291730]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:16:06 proxmox-arb pvesr[2291730]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:16:07 proxmox-arb pveproxy[2276100]: proxy detected vanished client connection
Jun 21 11:16:07 proxmox-arb pveproxy[2274070]: proxy detected vanished client connection
Jun 21 11:16:07 proxmox-arb pveproxy[2274070]: proxy detected vanished client connection
Jun 21 11:16:07 proxmox-arb pvesr[2291730]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:16:08 proxmox-arb pveproxy[2288635]: proxy detected vanished client connection
Jun 21 11:16:08 proxmox-arb pveproxy[2288635]: proxy detected vanished client connection
Jun 21 11:16:08 proxmox-arb pveproxy[2288635]: proxy detected vanished client connection
Jun 21 11:16:08 proxmox-arb pveproxy[2274070]: proxy detected vanished client connection
Jun 21 11:16:08 proxmox-arb pveproxy[2274070]: proxy detected vanished client connection
Jun 21 11:16:08 proxmox-arb pveproxy[2274070]: proxy detected vanished client connection
Jun 21 11:16:08 proxmox-arb pvesr[2291730]: trying to aquire cfs lock 'file-replication_cfg' ...
Jun 21 11:16:09 proxmox-arb snmpd[1539]: error on subcontainer 'ia_addr' insert (-1)
Jun 21 11:16:09 proxmox-arb snmpd[1539]: error on subcontainer 'ia_addr' insert (-1)
Jun 21 11:16:09 proxmox-arb pvesr[2291730]: error with cfs lock 'file-replication_cfg': no quorum!
Jun 21 11:16:09 proxmox-arb systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jun 21 11:16:09 proxmox-arb systemd[1]: Failed to start Proxmox VE replication runner.
Jun 21 11:16:09 proxmox-arb systemd[1]: pvesr.service: Unit entered failed state.
Jun 21 11:16:09 proxmox-arb systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jun 21 11:16:10 proxmox-arb pveproxy[2274070]: proxy detected vanished client connection
Jun 21 11:16:10 proxmox-arb pveproxy[2274070]: proxy detected vanished client connection

I don't understand why all but one node rebooted without any loss of quorum and without any corosync log entries.
Any idea?
The pve version is:
pve-manager/5.2-1/0fcd7879 (running kernel: 4.15.17-2-pve)
Thank you.
 
How are the nodes between the datacenters connected? And what does your corosync.conf look like?

A first guess would be that the latency between one of the data centers was too high, so quorum was lost. With HA configured, nodes do reset in that case.
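For reference, with nine one-vote nodes the cluster needs five votes, so stopping two nodes by itself should not cost quorum. A trivial sketch of the vote math (`corosync-quorumtool -s` on a live node reports the authoritative values):

```shell
# Majority quorum for an N-vote cluster is floor(N/2) + 1.
# With 9 one-vote nodes, the cluster stays quorate down to 5 live votes,
# so stopping P8 and P9 (7 votes left) should not break quorum on its own.
for n in 9 8 7; do
  echo "$n votes -> quorum $(( n / 2 + 1 ))"
done
```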
 
We use multicast, no IGMP snooping.
The link between DC1 and DC2 is dedicated fiber, 10Gbps.
The 2 other links (DC1 -> DC3 and DC2 -> DC3) are dedicated 1Gbps but with low traffic (only corosync, admin and ceph monitor traffic); there is no VM on DC3.

Here is my corosync config:
Code:
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: proxmox-arb
    nodeid: 8
    quorum_votes: 1
    ring0_addr: proxmox-arb
    ring1_addr: 169.254.5.9
  }
  node {
    name: proxmox1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: proxmox1
    ring1_addr: 169.254.5.1
  }
  node {
    name: proxmox10
    nodeid: 1
    quorum_votes: 1
    ring0_addr: proxmox10
    ring1_addr: 169.254.5.7
  }
  node {
    name: proxmox11
    nodeid: 3
    quorum_votes: 1
    ring0_addr: proxmox11
    ring1_addr: 169.254.5.8
  }
  node {
    name: proxmox2
    nodeid: 5
    quorum_votes: 1
    ring0_addr: proxmox2
    ring1_addr: 169.254.5.2
  }
  node {
    name: proxmox3
    nodeid: 6
    quorum_votes: 1
    ring0_addr: proxmox3
    ring1_addr: 169.254.5.3
  }
  node {
    name: proxmox7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: proxmox7
    ring1_addr: 169.254.5.4
  }
  node {
    name: proxmox8
    nodeid: 9
    quorum_votes: 1
    ring0_addr: proxmox8
    ring1_addr: 169.254.5.5
  }
  node {
    name: proxmox9
    nodeid: 10
    quorum_votes: 1
    ring0_addr: proxmox9
    ring1_addr: 169.254.5.6
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: tera-cluster
  config_version: 25
  interface {
    bindnetaddr: 10.0.136.32
    ringnumber: 0
  }
  interface {
    bindnetaddr: 169.254.5.1
    ringnumber: 1
  }
  rrp_mode: passive
  ip_version: ipv4
  secauth: on
  version: 2
}
omping test on ring0:
Code:
# omping -c 10000 -i 0.001 -F -q 10.0.136.32 10.0.136.28 10.0.136.26 10.0.136.54 10.0.136.51 10.0.136.52 10.0.136.53 10.0.136.55 10.0.136.20
10.0.136.32 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.742/0.766/1.226/0.021
10.0.136.32 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.750/0.780/1.588/0.025
10.0.136.28 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.743/0.774/2.908/0.041
10.0.136.28 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.753/0.791/2.920/0.045
10.0.136.26 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.745/0.777/3.061/0.045
10.0.136.26 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.757/0.795/3.113/0.049
10.0.136.54 :   unicast, xmt/rcv/%loss = 10000/9994/0%, min/avg/max/std-dev = 0.748/0.783/12.546/0.250
10.0.136.54 : multicast, xmt/rcv/%loss = 10000/9993/0%, min/avg/max/std-dev = 1.490/1.534/13.293/0.250
10.0.136.51 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.027/0.039/0.173/0.011
10.0.136.51 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.032/0.047/0.180/0.012
10.0.136.52 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.027/0.042/2.650/0.034
10.0.136.52 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.033/0.053/2.657/0.035
10.0.136.53 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.026/0.044/0.151/0.017
10.0.136.53 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.032/0.057/0.158/0.019
10.0.136.20 :   unicast, xmt/rcv/%loss = 10000/9999/0%, min/avg/max/std-dev = 1.504/1.521/2.354/0.019
10.0.136.20 : multicast, xmt/rcv/%loss = 10000/9999/0%, min/avg/max/std-dev = 1.505/1.525/2.355/0.018

omping test on ring1:
Code:
# omping -c 10000 -i 0.001 -F -q 169.254.5.1 169.254.5.2 169.254.5.3 169.254.5.4 169.254.5.5 169.254.5.6 169.254.5.7 169.254.5.8 169.254.5.9
169.254.5.1 :   unicast, xmt/rcv/%loss = 10000/9997/0%, min/avg/max/std-dev = 1.566/1.591/4.744/0.060
169.254.5.1 : multicast, xmt/rcv/%loss = 10000/9997/0%, min/avg/max/std-dev = 1.569/1.593/4.746/0.059
169.254.5.2 :   unicast, xmt/rcv/%loss = 10000/9998/0%, min/avg/max/std-dev = 2.287/2.315/2.644/0.014
169.254.5.2 : multicast, xmt/rcv/%loss = 10000/9999/0%, min/avg/max/std-dev = 1.558/1.590/1.823/0.014
169.254.5.3 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.781/0.816/1.638/0.019
169.254.5.3 : multicast, xmt/rcv/%loss = 10000/9999/0%, min/avg/max/std-dev = 1.524/1.559/2.166/0.015
169.254.5.4 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.067/0.093/0.164/0.011
169.254.5.4 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.072/0.098/0.171/0.012
169.254.5.5 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.070/0.095/0.167/0.008
169.254.5.5 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.076/0.102/0.174/0.008
169.254.5.6 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.069/0.099/0.171/0.013
169.254.5.6 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.073/0.102/0.175/0.011
169.254.5.7 :   unicast, xmt/rcv/%loss = 10000/9995/0%, min/avg/max/std-dev = 0.790/0.811/8.352/0.124
169.254.5.7 : multicast, xmt/rcv/%loss = 10000/9994/0%, min/avg/max/std-dev = 1.523/1.554/9.093/0.124
169.254.5.9 :   unicast, xmt/rcv/%loss = 10000/9999/0%, min/avg/max/std-dev = 1.523/1.556/5.642/0.070
169.254.5.9 : multicast, xmt/rcv/%loss = 10000/9999/0%, min/avg/max/std-dev = 1.524/1.558/5.643/0.070
 
10.0.136.54 :   unicast, xmt/rcv/%loss = 10000/9994/0%, min/avg/max/std-dev = 0.748/0.783/12.546/0.250
10.0.136.54 : multicast, xmt/rcv/%loss = 10000/9993/0%, min/avg/max/std-dev = 1.490/1.534/13.293/0.250
The output shows 'max >12ms' on some of the pings; enough of these and your cluster likely gets out of order. On failure (reboots or maintenance too) there will be way more traffic than in normal operation (e.g. ceph recovery).
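Those spikes are easy to pick out mechanically. A rough sketch (sample lines copied from the omping output above; the 10 ms cut-off is an arbitrary illustration, not a corosync limit):

```shell
# Print omping summary lines whose max latency exceeds 10 ms.
# When split on '/', the max value is the second-to-last field
# of the min/avg/max/std-dev group.
awk -F'/' '$(NF-1) + 0 > 10' <<'EOF'
10.0.136.54 :   unicast, xmt/rcv/%loss = 10000/9994/0%, min/avg/max/std-dev = 0.748/0.783/12.546/0.250
10.0.136.54 : multicast, xmt/rcv/%loss = 10000/9993/0%, min/avg/max/std-dev = 1.490/1.534/13.293/0.250
10.0.136.51 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.027/0.039/0.173/0.011
EOF
```

Run against the full output, only the two 10.0.136.54 lines survive the filter here; in production you would pipe the real omping output through the same awk.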
 
The output shows 'max >12ms' on some of the pings; enough of these and your cluster likely gets out of order. On failure (reboots or maintenance too) there will be way more traffic than in normal operation (e.g. ceph recovery).
Maybe, but I have two rings configured, including one on its own network (precisely to survive a fault on one ring), and the logs didn't show any loss of quorum on the nodes that rebooted.
They just ... rebooted. Why don't I see any corosync message, only a message from the watchdog: "watchdog-mux[1548]: client watchdog expired - disable watchdog updates"?
If the watchdog expired, it's the client that stopped updating it (pmxcfs or pve-ha-lrm, I guess), and I wonder why.
And above all, why the hell don't I have any logs in the interval between the moment the watchdog stopped being updated and the watchdog timeout (10 seconds)? There are only logs on the remaining node (P-ARB), and they show that when the P8 node rebooted, a new membership was formed with only two nodes, and of course no quorum until all the other nodes finished starting.
 
Maybe, but I have two rings configured, including one on its own network (precisely to survive a fault on one ring), and the logs didn't show any loss of quorum on the nodes that rebooted.
Well...
The link between DC1 and DC2 is dedicated fiber, 10Gbps.
The 2 other links (DC1 -> DC3 and DC2 -> DC3) are dedicated 1Gbps but with low traffic (only corosync, admin and ceph monitor traffic); there is no VM on DC3.
Your setup may have separate networks on the host side, but what is between the data centers? From your description there is one link, and it could well be that the traffic was not switched over the 10GbE (fiber) link (this is an assumption).

They just ... rebooted. Why don't I see any corosync message, only a message from the watchdog: "watchdog-mux[1548]: client watchdog expired - disable watchdog updates"?
If the watchdog expired, it's the client that stopped updating it (pmxcfs or pve-ha-lrm, I guess), and I wonder why.
The messages might not have been written to disk before the reset, most likely because of the lost quorum.
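To improve the odds of catching those last messages next time, you could make the journal persistent and flush it more aggressively. A sketch for journald.conf (`Storage` and `SyncIntervalSec` are standard journald options; check your distribution's defaults):

```ini
# /etc/systemd/journald.conf (excerpt)
[Journal]
# Write to /var/log/journal instead of the volatile /run tmpfs.
Storage=persistent
# Flush to disk more often than the default (5 minutes).
SyncIntervalSec=10s
```

Shipping logs to a remote syslog host outside the cluster is even more robust, since a node that hard-resets can lose its last disk writes anyway.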

And above all, why the hell don't I have any logs in the interval between the moment the watchdog stopped being updated and the watchdog timeout (10 seconds)?
Uncertain, but probably as written above.

There are only logs on the remaining node (P-ARB), and they show that when the P8 node rebooted, a new membership was formed with only two nodes, and of course no quorum until all the other nodes finished starting.
If the second member of the new membership didn't reset itself immediately after forming the membership, then there might be a second server with log entries around.

To gain more information on this, remove the HA config on all nodes, then they shouldn't reset and you should find entries in the logs.
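If you want to script that removal, the HA resources live in /etc/pve/ha/resources.cfg. A hedged sketch that turns each resource section into a `ha-manager remove` call (the sample config in the heredoc is made up for illustration; review the printed commands before running any of them):

```shell
# Emit one 'ha-manager remove <sid>' per HA resource (vm:/ct: section headers).
# In production, read /etc/pve/ha/resources.cfg instead of this sample heredoc.
awk '/^(vm|ct):/ { print "ha-manager remove " $1 $2 }' <<'EOF'
vm: 100
    state started
ct: 201
    state started
EOF
```

Piping the output through `sh` would execute the removals, but printing them first keeps the step reviewable.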
 
Guys, this is an old post, please open a new thread. That improves the chances of getting answers.
 
