PVE Cluster Crash

Stone

Hi,
We are running a PVE cluster with 11 nodes. Today, all nodes rebooted almost simultaneously, within a window of 1-2 minutes.
Information about the cluster:
All members are running the following PVE version:
pve-manager/5.2-1/0fcd7879 (running kernel: 4.15.17-2-pve)

root@*******:~# pvecm status
Quorum information
------------------
Date: Thu Jul 19 13:01:51 2018
Quorum provider: corosync_votequorum
Nodes: 11
Node ID: 0x00000001
Ring ID: 1/16344
Quorate: Yes

Votequorum information
----------------------
Expected votes: 11
Highest expected: 11
Total votes: 11
Quorum: 6
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.28.2.20 (local)
0x00000002 1 172.28.2.21
0x00000003 1 172.28.2.22
0x00000004 1 172.28.2.23
0x00000005 1 172.28.2.24
0x00000006 1 172.28.2.25
0x00000007 1 172.28.2.26
0x00000008 1 172.28.2.27
0x00000009 1 172.28.2.28
0x0000000a 1 172.28.2.29
0x0000000b 1 172.28.2.30
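
For reference, the "Quorum: 6" value above is consistent with a simple majority of the 11 expected votes. The sketch below only redoes that arithmetic and checks what is left once nodes 7 to 11 drop out, as the syslog excerpts in this post report. The function name `quorum_threshold` is made up for illustration and is not part of pvecm or corosync.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Simple-majority threshold: strictly more than half of the expected votes."""
    return expected_votes // 2 + 1


expected_votes = 11              # "Expected votes: 11" from pvecm status
left = {7, 8, 9, 10, 11}         # "Members left: 7 8 9 10 11" from the syslog
remaining_votes = expected_votes - len(left)  # every node carries 1 vote

threshold = quorum_threshold(expected_votes)
print(threshold)                     # 6, matching "Quorum: 6"
print(remaining_votes >= threshold)  # True: 6 remaining votes, threshold 6
```

So on paper the six remaining nodes sit exactly on the threshold, with no margin for losing one more.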


The following was found in the syslog (examples from 2 nodes; the remaining nodes look similar):
Node 172.28.2.20:
Jul 19 10:05:56 tatooine corosync[11153]: notice [TOTEM ] A new membership (172.28.2.20:16116) was formed. Members left: 7 8 9 10 11
Jul 19 10:05:56 tatooine corosync[11153]: notice [TOTEM ] Failed to receive the leave message. failed: 7 8 9 10 11
Jul 19 10:05:56 tatooine corosync[11153]: [TOTEM ] A new membership (172.28.2.20:16116) was formed. Members left: 7 8 9 10 11
Jul 19 10:05:56 tatooine corosync[11153]: [TOTEM ] Failed to receive the leave message. failed: 7 8 9 10 11
Jul 19 10:06:00 tatooine systemd[1]: Starting Proxmox VE replication runner...
Jul 19 10:06:04 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 10
Jul 19 10:06:05 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 20
Jul 19 10:06:06 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 30
Jul 19 10:06:07 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 40
Jul 19 10:06:08 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 50
Jul 19 10:06:09 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 60
Jul 19 10:06:10 tatooine corosync[11153]: notice [TOTEM ] Retransmit List: 8 9 d e 10 6
Jul 19 10:06:10 tatooine corosync[11153]: [TOTEM ] Retransmit List: 8 9 d e 10 6
Jul 19 10:06:10 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 70
Jul 19 10:06:11 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 80
Jul 19 10:06:12 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 90
Jul 19 10:06:13 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 100
Jul 19 10:06:13 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retried 100 times
Jul 19 10:06:13 tatooine pmxcfs[11040]: [status] crit: cpg_send_message failed: 6
Jul 19 10:06:13 tatooine corosync[11153]: notice [TOTEM ] Retransmit List: 9 d
Jul 19 10:06:13 tatooine corosync[11153]: [TOTEM ] Retransmit List: 9 d
Jul 19 10:06:14 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 10
Jul 19 10:06:15 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 20
Jul 19 10:06:16 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 30
Jul 19 10:06:17 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 40
Jul 19 10:06:17 tatooine corosync[11153]: notice [TOTEM ] A new membership (172.28.2.20:16128) was formed. Members
Jul 19 10:06:17 tatooine corosync[11153]: [TOTEM ] A new membership (172.28.2.20:16128) was formed. Members
Jul 19 10:06:18 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 50
Jul 19 10:06:19 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 60
Jul 19 10:06:20 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 70
Jul 19 10:06:21 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 80
Jul 19 10:06:22 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 90
Jul 19 10:06:23 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 100
Jul 19 10:06:23 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retried 100 times
Jul 19 10:06:23 tatooine pmxcfs[11040]: [status] crit: cpg_send_message failed: 6
Jul 19 10:06:24 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 10
Jul 19 10:06:25 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 20
Jul 19 10:06:26 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 30
Jul 19 10:06:27 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 40
Jul 19 10:06:28 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 50
Jul 19 10:06:29 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 60
Jul 19 10:06:30 tatooine pmxcfs[11040]: [status] notice: cpg_send_message retry 70
^@^@^@^@ [long run of NUL bytes in the syslog, truncated here]



Node 172.28.2.25:
Jul 19 10:05:56 mustafar corosync[9578]: notice [TOTEM ] A new membership (172.28.2.20:16116) was formed. Members left: 7 8 9 10 11
Jul 19 10:05:56 mustafar corosync[9578]: notice [TOTEM ] Failed to receive the leave message. failed: 7 8 9 10 11
Jul 19 10:05:56 mustafar corosync[9578]: [TOTEM ] A new membership (172.28.2.20:16116) was formed. Members left: 7 8 9 10 11
Jul 19 10:05:56 mustafar corosync[9578]: [TOTEM ] Failed to receive the leave message. failed: 7 8 9 10 11
Jul 19 10:06:00 mustafar systemd[1]: Starting Proxmox VE replication runner...
Jul 19 10:06:01 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 10
Jul 19 10:06:02 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 20
Jul 19 10:06:03 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 30
Jul 19 10:06:04 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 40
Jul 19 10:06:05 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 50
Jul 19 10:06:06 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 60
Jul 19 10:06:07 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 70
Jul 19 10:06:08 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 80
Jul 19 10:06:08 mustafar corosync[9578]: notice [TOTEM ] Retransmit List: 8 9 7 d e 10
Jul 19 10:06:08 mustafar corosync[9578]: [TOTEM ] Retransmit List: 8 9 7 d e 10
Jul 19 10:06:09 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 90
Jul 19 10:06:10 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 100
Jul 19 10:06:10 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retried 100 times
Jul 19 10:06:10 mustafar pmxcfs[9477]: [status] crit: cpg_send_message failed: 6
Jul 19 10:06:10 mustafar pve-firewall[9626]: firewall update time (6.499 seconds)
Jul 19 10:06:11 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 10
Jul 19 10:06:12 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 20
Jul 19 10:06:13 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 30
Jul 19 10:06:13 mustafar corosync[9578]: notice [TOTEM ] Retransmit List: 17 18 1a 9
Jul 19 10:06:13 mustafar corosync[9578]: [TOTEM ] Retransmit List: 17 18 1a 9
Jul 19 10:06:14 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 40
Jul 19 10:06:14 mustafar corosync[9578]: notice [TOTEM ] Retransmit List: 9
Jul 19 10:06:14 mustafar corosync[9578]: [TOTEM ] Retransmit List: 9
Jul 19 10:06:15 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 50
Jul 19 10:06:16 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 60
Jul 19 10:06:17 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 70
Jul 19 10:06:17 mustafar corosync[9578]: notice [TOTEM ] A new membership (172.28.2.20:16128) was formed. Members
Jul 19 10:06:17 mustafar corosync[9578]: [TOTEM ] A new membership (172.28.2.20:16128) was formed. Members
Jul 19 10:06:18 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 80
Jul 19 10:06:19 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 90
Jul 19 10:06:20 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 100
Jul 19 10:06:20 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retried 100 times
Jul 19 10:06:20 mustafar pmxcfs[9477]: [status] crit: cpg_send_message failed: 6
Jul 19 10:06:20 mustafar pve-firewall[9626]: firewall update time (7.007 seconds)
Jul 19 10:06:21 mustafar pmxcfs[9477]: [dcdb] notice: cpg_send_message retry 10
Jul 19 10:06:21 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 10
Jul 19 10:06:22 mustafar pmxcfs[9477]: [dcdb] notice: cpg_send_message retry 20
Jul 19 10:06:22 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 20
Jul 19 10:06:23 mustafar pmxcfs[9477]: [dcdb] notice: cpg_send_message retry 30
Jul 19 10:06:23 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 30
Jul 19 10:06:24 mustafar pmxcfs[9477]: [dcdb] notice: cpg_send_message retry 40
Jul 19 10:06:24 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 40
Jul 19 10:06:25 mustafar pmxcfs[9477]: [dcdb] notice: cpg_send_message retry 50
Jul 19 10:06:25 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 50
Jul 19 10:06:26 mustafar pmxcfs[9477]: [dcdb] notice: cpg_send_message retry 60
Jul 19 10:06:26 mustafar pmxcfs[9477]: [status] notice: cpg_send_message retry 60
^@^@^@^@ [long run of NUL bytes in the syslog, truncated here]


Does anyone have an idea what might have caused all nodes to reboot?
 
Additionally, we found some more logs on the hosts:

Jul 19 10:06:26 corellia pmxcfs[9592]: [status] notice: cpg_send_message retry 20
Jul 19 10:06:27 corellia pmxcfs[9592]: [status] notice: cpg_send_message retry 30
Jul 19 10:06:28 corellia pmxcfs[9592]: [status] notice: cpg_send_message retry 40
Jul 19 10:06:28 corellia corosync[9697]: warning [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Jul 19 10:06:28 corellia corosync[9697]: notice [TOTEM ] Retransmit List: 7 2 3 4 5
Jul 19 10:06:28 corellia corosync[9697]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Jul 19 10:06:28 corellia corosync[9697]: [TOTEM ] Retransmit List: 7 2 3 4 5
Jul 19 10:06:29 corellia pmxcfs[9592]: [status] notice: cpg_send_message retry 50
Jul 19 10:06:29 corellia corosync[9697]: notice [TOTEM ] Retransmit List: 2 3
Jul 19 10:06:29 corellia corosync[9697]: [TOTEM ] Retransmit List: 2 3
Jul 19 10:06:30 corellia pmxcfs[9592]: [status] notice: cpg_send_message retry 60
Jul 19 10:06:31 corellia pmxcfs[9592]: [status] notice: cpg_send_message retry 70
^@^@^@^@ [run of NUL bytes in the syslog, truncated here]


Jul 19 10:06:33 dagobah pmxcfs[9561]: [status] notice: cpg_send_message retry 90
Jul 19 10:06:33 dagobah corosync[9679]: notice [TOTEM ] Process pause detected for 2783 ms, flushing membership messages.
Jul 19 10:06:33 dagobah corosync[9679]: notice [TOTEM ] Process pause detected for 2783 ms, flushing membership messages.
Jul 19 10:06:33 dagobah corosync[9679]: [TOTEM ] Process pause detected for 2783 ms, flushing membership messages.
Jul 19 10:06:33 dagobah corosync[9679]: [TOTEM ] Process pause detected for 2783 ms, flushing membership messages.
Jul 19 10:06:33 dagobah corosync[9679]: notice [TOTEM ] Process pause detected for 2785 ms, flushing membership messages.
Jul 19 10:06:33 dagobah corosync[9679]: notice [TOTEM ] Process pause detected for 2785 ms, flushing membership messages.
Jul 19 10:06:33 dagobah corosync[9679]: [TOTEM ] Process pause detected for 2785 ms, flushing membership messages.
Jul 19 10:06:33 dagobah corosync[9679]: notice [TOTEM ] Process pause detected for 2785 ms, flushing membership messages.


Jul 19 10:06:26 endor pmxcfs[9471]: [dcdb] notice: cpg_send_message retry 100
Jul 19 10:06:26 endor pmxcfs[9471]: [dcdb] notice: cpg_send_message retried 100 times
Jul 19 10:06:26 endor pvesr[4003]: error with cfs lock 'file-replication_cfg': got lock request timeout
Jul 19 10:06:26 endor pmxcfs[9471]: [dcdb] crit: cpg_send_message failed: 6
Jul 19 10:06:26 endor systemd[1]: pvesr.service: Main process exited, code=exited, status=16/n/a
Jul 19 10:06:26 endor systemd[1]: Failed to start Proxmox VE replication runner.
Jul 19 10:06:26 endor systemd[1]: pvesr.service: Unit entered failed state.
Jul 19 10:06:26 endor systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 19 10:06:26 endor pmxcfs[9471]: [status] notice: cpg_send_message retry 100
Jul 19 10:06:26 endor pmxcfs[9471]: [status] notice: cpg_send_message retried 100 times
Jul 19 10:06:26 endor pmxcfs[9471]: [status] crit: cpg_send_message failed: 6
Jul 19 10:06:26 endor pve-firewall[9671]: firewall update time (8.043 seconds)


Jul 19 10:06:44 endor watchdog-mux[1756]: client watchdog expired - disable watchdog updates
Jul 19 10:06:44 endor pmxcfs[9471]: [status] notice: cpg_send_message retry 80
Jul 19 10:06:45 endor rrdcached[9410]: flushing old values
Jul 19 10:06:45 endor rrdcached[9410]: rotating journals
Jul 19 10:06:45 endor rrdcached[9410]: started new journal /var/lib/rrdcached/journal/rrd.journal.1531987605.600073
Jul 19 10:06:45 endor rrdcached[9410]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1531980405.600002
Jul 19 10:06:45 endor pmxcfs[9471]: [status] notice: cpg_send_message retry 90
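
To line up these events across hosts, something like the following could pull the interesting lines out of each node's syslog and sort them into one timeline. It is only a rough sketch: the regex assumes the line shape seen in the excerpts above, the keyword list is an arbitrary choice, and the year is taken from the pvecm status date because syslog timestamps do not carry one. `key_events` is a hypothetical helper, not a Proxmox tool.

```python
import re
from datetime import datetime

# Line shape seen in the excerpts: "Jul 19 10:06:44 endor watchdog-mux[1756]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<proc>[\w-]+)\[\d+\]: (?P<msg>.*)$"
)

# Substrings picked for illustration only; adjust as needed.
KEYWORDS = (
    "A new membership",
    "cpg_send_message failed",
    "Process pause detected",
    "watchdog expired",
)


def key_events(lines, year=2018):
    """Return (timestamp, host, process, message) tuples sorted by time."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # blank lines and NUL-byte runs do not match
        if not any(k in m["msg"] for k in KEYWORDS):
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        events.append((ts, m["host"], m["proc"], m["msg"]))
    return sorted(events)


# A few lines copied from the excerpts above.
sample = [
    "Jul 19 10:06:44 endor watchdog-mux[1756]: client watchdog expired - disable watchdog updates",
    "Jul 19 10:06:26 endor pmxcfs[9471]: [status] crit: cpg_send_message failed: 6",
    "Jul 19 10:06:33 dagobah corosync[9679]: notice [TOTEM ] Process pause detected for 2783 ms, flushing membership messages.",
    "Jul 19 10:06:45 endor rrdcached[9410]: flushing old values",
]

for ts, host, proc, msg in key_events(sample):
    print(ts.time(), host, proc, msg)
```

Run over the concatenated syslogs of all nodes, this would give a single ordered view of membership changes, failed sends, process pauses and watchdog expiry.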
 