Lost a node in cluster under "server view"

huky

Jul 1, 2016
Sorry for my poor English.

I have a cluster with 6 nodes; it has been running fine for over six months.
Today I found that the icon of node6 is a red cross, and all VMs on the node are grey and shown with only their VMID (without a name).
Then I ran "pvecm nodes" in the console; the result is normal, all 6 nodes are online.

I have restarted the pveproxy service on all nodes. What can I do to return the cluster to normal?
 
Make sure 'pvestatd' is running on all nodes. Also make sure the system time is correct on all nodes. What is the output of:

# pvecm status
 
Thanks for your reply.
'pvestatd' is active on all nodes:

* pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled)
Active: active (running) since Tue 2016-06-28 23:05:19 CST; 2 days ago
Process: 2752 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
Main PID: 2857 (pvestatd)
CGroup: /system.slice/pvestatd.service
`-2857 pvestat


and 'pvecm status' returns an error (rc=1) on all nodes:

172.31.254.6 | FAILED | rc=1 >>
Quorum information
------------------
Date: Fri Jul 1 14:22:26 2016
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000006
Ring ID: 1016
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 6
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.31.254.1
0x00000002 1 172.31.254.2
0x00000003 1 172.31.254.3
0x00000004 1 172.31.254.4
0x00000005 1 172.31.254.5
0x00000006 1 172.31.254.6 (local)
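When the same output has to be compared across six nodes, it can help to reduce it to one line per node. The following is only a sketch: `quorum_summary` is a hypothetical helper (not part of Proxmox) that reads `pvecm status` text on stdin and assumes the field labels look like the ones pasted above.

```shell
# Hypothetical helper: summarize `pvecm status` text read from stdin.
# Assumes the labels "Quorate:", "Expected votes:", "Total votes:" and
# "Quorum:" appear at the start of a line, as in the output above.
quorum_summary() {
    awk -F': *' '
        # Strip trailing whitespace so padded columns compare cleanly.
        function trim(s) { sub(/[ \t\r]+$/, "", s); return s }
        /^Quorate:/        { quorate  = trim($2) }
        /^Expected votes:/ { expected = trim($2) }
        /^Total votes:/    { total    = trim($2) }
        /^Quorum:/         { quorum   = trim($2) }
        END {
            printf "quorate=%s total=%s expected=%s quorum=%s\n",
                   quorate, total, expected, quorum
        }
    '
}

# Possible usage on a node:
#   pvecm status | quorum_summary
```

For the output above this gives one line saying the node is quorate with 6 of 6 expected votes and a quorum threshold of 4, which makes it easy to spot a node that disagrees.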


and 'corosync.service' is running on all nodes:

172.31.254.6 | success | rc=0 >>
* corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: active (running) since Tue 2016-06-28 23:05:19 CST; 2 days ago
Process: 2757 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
Main PID: 2826 (corosync)
CGroup: /system.slice/corosync.service
`-2826 corosync
 
Is the system time correct and in sync with the other nodes?
Yes, the time is in sync.
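One way to put a number on "in sync" is to collect the epoch time from every node and look at the spread. This is a rough sketch, not an official procedure: `max_skew` is a hypothetical helper, and the SSH loop in the comment assumes root SSH access between nodes. Polling nodes one after another adds some delay to the measurement, so a spread of a second or two from this method is not conclusive.

```shell
# Hypothetical helper: read "<host> <epoch-seconds>" lines on stdin and
# print the difference between the largest and smallest timestamp.
max_skew() {
    awk '
        { t = $2 + 0 }                  # force numeric comparison
        NR == 1 { min = max = t }
        { if (t < min) min = t; if (t > max) max = t }
        END { if (NR) print max - min }
    '
}

# Possible usage (node addresses taken from the membership list above):
#   for n in 172.31.254.1 172.31.254.2 172.31.254.3 \
#            172.31.254.4 172.31.254.5 172.31.254.6; do
#       printf '%s %s\n' "$n" "$(ssh "root@$n" date +%s)"
#   done | max_skew
```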

I tried to restart corosync on node6; now the corosync service has failed and cannot be started.

# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: failed (Result: timeout) since Fri 2016-07-01 15:15:49 CST; 4min 27s ago
Process: 110071 ExecStop=/usr/share/corosync/corosync stop (code=killed, signal=TERM)
Process: 113945 ExecStart=/usr/share/corosync/corosync start (code=killed, signal=TERM)
Main PID: 2826 (code=exited, status=0/SUCCESS)

Jul 01 15:15:49 node006 systemd[1]: corosync.service start operation timed out. Terminating.
Jul 01 15:15:49 node006 corosync[113945]: Starting Corosync Cluster Engine (corosync):
Jul 01 15:15:49 node006 systemd[1]: Failed to start Corosync Cluster Engine.
Jul 01 15:15:49 node006 systemd[1]: Unit corosync.service entered failed state.
 
I have hit the problem again.
Now I have stopped corosync and cannot start it (timeout).
There are a lot of errors in /var/log/daemon.log:
Code:
Oct  9 09:25:32 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:32 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:32 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:32 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
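Since the same message repeats many times, a per-daemon count can be easier to read than the raw log. The snippet below is a sketch under the assumption that the log uses the syslog layout shown above (month, day, time, host, then `daemon[pid]:`); `count_ipcc_failures` is a hypothetical helper name.

```shell
# Hypothetical helper: count "ipcc_send_rec failed" lines per daemon.
# $1 is the path to a syslog-style log file such as /var/log/daemon.log.
count_ipcc_failures() {
    grep -F 'ipcc_send_rec failed' "$1" \
        | awk '{
              d = $5                 # e.g. "pve-ha-lrm[3040]:"
              sub(/\[.*$/, "", d)    # drop "[pid]:" to keep the daemon name
              n[d]++
          }
          END { for (d in n) print d, n[d] }' \
        | sort
}

# Possible usage:
#   count_ipcc_failures /var/log/daemon.log
```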