Lost a node in cluster under "server view"

huky

Renowned Member
Jul 1, 2016
Chongqing, China
Sorry for my poor English.

I have a cluster with 6 nodes; it has been running fine for over six months.
Today I found that the icon of node6 shows a red cross, and all VMs on the node are grey and listed with only their VMID (without a name).
I then ran "pvecm nodes" in the console, and the result is normal: all 6 nodes are online.

I have restarted the pveproxy service on all nodes. What can I do to return the cluster to normal?
 
Make sure 'pvestatd' is running on all nodes. Also make sure you have correct system time on all nodes. What is the output of

# pvecm status
 

Thanks for your reply.
'pvestatd' is active on all nodes:

* pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled)
Active: active (running) since Tue 2016-06-28 23:05:19 CST; 2 days ago
Process: 2752 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
Main PID: 2857 (pvestatd)
CGroup: /system.slice/pvestatd.service
`-2857 pvestat


and 'pvecm status' returns an error (rc=1) on all nodes, although the output itself looks normal:

172.31.254.6 | FAILED | rc=1 >>
Quorum information
------------------
Date: Fri Jul 1 14:22:26 2016
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000006
Ring ID: 1016
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 6
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.31.254.1
0x00000002 1 172.31.254.2
0x00000003 1 172.31.254.3
0x00000004 1 172.31.254.4
0x00000005 1 172.31.254.5
0x00000006 1 172.31.254.6 (local)
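
For anyone who wants to compare this output across nodes without reading it by eye, here is a minimal Python sketch that parses the text above and checks that the vote counts and member list agree. The function names `parse_pvecm_status` and `looks_consistent` are made up for this example, and the parsing assumes the layout shown in this post. It does not explain why the command exited with rc=1.

```python
import re

# Sample copied from the `pvecm status` output quoted in this post.
SAMPLE = """\
Quorum information
------------------
Date: Fri Jul 1 14:22:26 2016
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000006
Ring ID: 1016
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 6
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.31.254.1
0x00000002 1 172.31.254.2
0x00000003 1 172.31.254.3
0x00000004 1 172.31.254.4
0x00000005 1 172.31.254.5
0x00000006 1 172.31.254.6 (local)
"""

# Member rows start with a hex node id, e.g. "0x00000001 1 172.31.254.1".
MEMBER_RE = re.compile(r"^(0x[0-9a-fA-F]+)\s+(\d+)\s+(\S+)")


def parse_pvecm_status(text):
    """Return (fields, members) parsed from `pvecm status` style text."""
    fields = {}
    members = []
    for raw in text.splitlines():
        line = raw.strip()
        m = MEMBER_RE.match(line)
        if m:
            members.append(
                {"nodeid": m.group(1), "votes": int(m.group(2)), "name": m.group(3)}
            )
            continue
        # Key/value rows such as "Expected votes: 6"; split on the first colon only.
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields, members


def looks_consistent(fields, members):
    """True if the node is quorate and all expected votes and members are present."""
    expected = int(fields["Expected votes"])
    total = int(fields["Total votes"])
    return (
        fields.get("Quorate") == "Yes"
        and total == expected
        and len(members) == int(fields["Nodes"])
    )


fields, members = parse_pvecm_status(SAMPLE)
print(fields["Quorate"], len(members), looks_consistent(fields, members))
```

In practice the text would come from running the command on each node; the sample string here only stands in for that.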


and 'corosync.service' is running on all nodes:

172.31.254.6 | success | rc=0 >>
* corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: active (running) since Tue 2016-06-28 23:05:19 CST; 2 days ago
Process: 2757 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
Main PID: 2826 (corosync)
CGroup: /system.slice/corosync.service
`-2826 corosync
 
Is the system time correct and in sync with the other nodes?
Yes, the time is in sync.

I tried to restart corosync on node6. Now the corosync service has failed and cannot be started.

# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: failed (Result: timeout) since Fri 2016-07-01 15:15:49 CST; 4min 27s ago
Process: 110071 ExecStop=/usr/share/corosync/corosync stop (code=killed, signal=TERM)
Process: 113945 ExecStart=/usr/share/corosync/corosync start (code=killed, signal=TERM)
Main PID: 2826 (code=exited, status=0/SUCCESS)

Jul 01 15:15:49 node006 systemd[1]: corosync.service start operation timed out. Terminating.
Jul 01 15:15:49 node006 corosync[113945]: Starting Corosync Cluster Engine (corosync):
Jul 01 15:15:49 node006 systemd[1]: Failed to start Corosync Cluster Engine.
Jul 01 15:15:49 node006 systemd[1]: Unit corosync.service entered failed state.
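
To spot a unit in this state on several nodes at once, the `Active:` line of the status output can be pulled out with a few lines of Python. This is only a sketch: `parse_active_line` is a hypothetical helper, and it assumes the `Active: <state> (<detail>) since ...` layout seen in the outputs in this thread.

```python
import re

# "Active: failed (Result: timeout) since ..." -> state "failed", detail "Result: timeout"
ACTIVE_RE = re.compile(r"^\s*Active:\s+(\S+)\s+\(([^)]*)\)")


def parse_active_line(status_text):
    """Return (state, detail) from `systemctl status` text, or None if not found."""
    for line in status_text.splitlines():
        m = ACTIVE_RE.match(line)
        if m:
            return m.group(1), m.group(2)
    return None


# The two states quoted in this thread for corosync.service.
FAILED = """\
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: failed (Result: timeout) since Fri 2016-07-01 15:15:49 CST; 4min 27s ago
"""
RUNNING = """\
* corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: active (running) since Tue 2016-06-28 23:05:19 CST; 2 days ago
"""

failed_state = parse_active_line(FAILED)
running_state = parse_active_line(RUNNING)
print(failed_state, running_state)
```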
 
I have hit the problem again.
This time I stopped corosync and cannot start it again (timeout).
There are a lot of errors in /var/log/daemon.log:
Code:
Oct  9 09:25:32 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:32 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:32 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:32 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
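
Since the log is flooded with the same message, a quick tally per daemon shows which services are affected. Below is a rough Python sketch; `count_ipcc_failures` is a name invented for this example, and the regular expression assumes the syslog line format shown in the excerpt above.

```python
import re
from collections import Counter

# Matches lines like:
# "Oct  9 09:25:32 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]+)\s+(\S+)\s+([\w.-]+)\[(\d+)\]:\s+(.*)$"
)


def count_ipcc_failures(lines):
    """Count 'ipcc_send_rec failed' messages per daemon name."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and "ipcc_send_rec failed" in m.group(5):
            counts[m.group(3)] += 1
    return counts


# The ten lines quoted above.
SAMPLE = """\
Oct  9 09:25:32 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:32 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:32 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:32 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-crm[3037]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
Oct  9 09:25:37 node006 pve-ha-lrm[3040]: ipcc_send_rec failed: Connection refused
"""

counts = count_ipcc_failures(SAMPLE.splitlines())
for daemon, n in sorted(counts.items()):
    print(daemon, n)
```

To run it against the real log, the lines could be read from /var/log/daemon.log instead of the sample string.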
 
