Web GUI dies

Jim J · Jun 15, 2017

We have a 2 node cluster and it seems that one of the nodes consistently loses the web gui. That is, we lose access the web gui on one of the nodes but can access the gui on the other node.

Also, when this happens, the cluster between the 2 becomes degraded. By that I mean, one node can only sorta see information from the "down" node. We see green checks by each container, but the summary page dies. Clicking on a node in the "down" node shows an error: Connection refused (595)

Yet we can still issue basic commands to that container from the still working node in the cluster.

Rebooting the "down" node seems to put everything back on track. The Web gui is restored and we get access to the individual nodes and summary pages again.

Give it a couple days and that same node goes down again with all the degraded performance issues previously explained.

Even while degraded, the cluster status shows a quorum.

Code:

pvecm status
Quorum information
------------------
Date:             Thu Jun 15 09:55:22 2017
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          2/12164
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.3.0.10 (local)
0x00000001          1 10.3.0.11

Code:

root@prox6:~# cat /etc/pve/.members
{
"nodename": "prox6",
"version": 42,
"cluster": { "name": "florida", "version": 2, "nodes": 2, "quorate": 1 },
"nodelist": {
  "prox6": { "id": 2, "online": 1, "ip": "10.3.0.10"},
  "prox5": { "id": 1, "online": 1, "ip": "10.3.0.11"}
  }

The nodes are slightly behind so I will patch them.
We make extensive use of containers and I can't help shake the feeling that one of the containers is the culprit. Every few days one of our ubuntu containers dies on the problem Proxmox Node so they seem related.

Can a container bring down a host?

Jim J · Jun 16, 2017

I've noticed that corosync usage spikes to 99% usage so I attempted to restart the cluster and corosync service which at least reined in the cpu usage but the web gui remained broken and I could not control the degraded node from the other node in the cluster.

dcsapak · Jun 19, 2017

do you have maybe a hanging storage? does multicast work reliably? is your /etc/hosts correctly configured?
do you have upgraded to the latest stable version and do the problem persist?

Jim J · Jun 28, 2017

It seems I can narrow my problem down to a pair of miss behaving LXC nodes. Both machines mount CIFS shares across an IPSEC vpn AND mount a local LVM volume from a third LXC node.

CIFS mounts
The nodes were set to auto mount via fstab rules, but when starting either container, the Prox Host would throw some CIFS related errors on it's console. To avoid those errors, I do not auto mount, but instead start the container and manually mount.

Still, after 24 hours or so, I start to see occaisional errors on our prox host console. This one occurs when I attempt to restart one of the problem nodes:

Code:

 kernel:[246361.080793] unregister_netdevice: waiting for lo to become free. Usage count = 1

Around this time, the GUI stops responding until a 1-2 minutes time out occurs and then the containers is closed and guid service is restored.

In order to setup CIFS mounts on a LXC container I had to make some app armour changes, so maybe something is incorrect in that config?

You can see this isn't my first run-in with Proxmox/LXC/CIFS mounts:
https://forum.proxmox.com/threads/lxc-container-mount-cifs-sporadically-hangs.31249/

I also see a few apparmor errors related to mounts, not sure if this is related:

Code:

audit: type=1400 audit(1498653415.497:1122): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=87404 comm="mount" flags="ro, remount, relatime"

Jun 25 09:09:06 prox5 kernel: [177183.979139] audit: type=1400 audit(1498396146.612:806): apparmor="DENIED" operation="file_lock" profile="lxc-container-default-cgns" pid=89240 comm="(ionclean)" family="unix" sock_type="dgram" protocol=0 addr=none

Search

Search

Web GUI dies

Jim J

New Member

Jim J

New Member

dcsapak

Proxmox Staff Member

Jim J

New Member