We have a 2-node cluster, and one of the nodes consistently loses its web GUI. That is, we lose access to the web GUI on one node but can still reach the GUI on the other.
Also, when this happens, the cluster between the two becomes degraded. By that I mean the working node can only partially see information from the "down" node. We see green checks by each container, but the summary page dies. Clicking on a container on the "down" node shows an error: Connection refused (595)
Yet we can still issue basic commands to that container from the still-working node in the cluster.
Rebooting the "down" node seems to put everything back on track. The web GUI is restored, and we get access to the individual nodes and summary pages again.
Give it a couple of days and that same node goes down again, with the same degraded behavior described above.
Even while degraded, the cluster status shows a quorum.
Code:
pvecm status
Quorum information
------------------
Date:             Thu Jun 15 09:55:22 2017
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          2/12164
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.3.0.10 (local)
0x00000001          1 10.3.0.11
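
Since the quorum fields are what back up the "degraded but still quorate" observation, it may help to log just those fields each time the GUI drops. The following is a minimal sketch, not a Proxmox tool: `quorum_summary` is a hypothetical helper name, and it only assumes the `Quorate:`, `Expected votes:` and `Total votes:` labels shown in the output above.

```shell
# Hypothetical helper: condense `pvecm status` output (read from stdin)
# into one line with the quorate flag and the vote counts.
quorum_summary() {
    awk -F': *' '
        /^Quorate:/        { quorate  = $2 }
        /^Expected votes:/ { expected = $2 }
        /^Total votes:/    { total    = $2 }
        END { printf "quorate=%s votes=%s/%s\n", quorate, total, expected }
    '
}

# Intended use on a node (assumption: run as root):
#   pvecm status | quorum_summary
```

Appending that one line to a log with a timestamp on both nodes would show whether quorum ever actually drops around the time the GUI disappears.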
Code:
root@prox6:~# cat /etc/pve/.members
{
  "nodename": "prox6",
  "version": 42,
  "cluster": { "name": "florida", "version": 2, "nodes": 2, "quorate": 1 },
  "nodelist": {
    "prox6": { "id": 2, "online": 1, "ip": "10.3.0.10"},
    "prox5": { "id": 1, "online": 1, "ip": "10.3.0.11"}
  }
}
The nodes are slightly behind on updates, so I will patch them.
We make extensive use of containers, and I can't shake the feeling that one of them is the culprit. Every few days one of our Ubuntu containers dies on the problem Proxmox node, so the two seem related.
Can a container bring down a host?
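
The web GUI is served by the `pveproxy` service, so a "Connection refused" while corosync still reports quorum suggests looking at the Proxmox services on the "down" node before rebooting it. Below is a minimal sketch of such a check. The service list is an assumption about which units are relevant here, and the `SYSTEMCTL` variable is a hypothetical override added only so the loop can be exercised without systemd.

```shell
# Sketch: print the state of the Proxmox VE services on the "down" node.
# SYSTEMCTL is a hypothetical override; by default the real systemctl is used.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

check_pve_services() {
    for svc in pveproxy pvedaemon pvestatd pve-cluster corosync; do
        # `systemctl is-active` prints the unit state (active, inactive, failed, ...)
        state=$("$SYSTEMCTL" is-active "$svc" 2>/dev/null)
        printf '%-12s %s\n' "$svc" "${state:-unknown}"
    done
}

# Intended use over SSH on the affected node (assumption: run as root):
#   check_pve_services
#   journalctl -u pveproxy -u pvedaemon -u pvestatd --since "2 days ago"
```

If one of the services shows as failed or inactive, its journal entries from around the time the Ubuntu container died would be the first place to look for a link between the two.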