Web GUI dies

Jim J

New Member
Oct 26, 2016
We have a 2-node cluster, and one of the nodes consistently loses its web GUI. That is, we lose access to the web GUI on one node but can still access the GUI on the other node.

Also, when this happens, the cluster between the two nodes becomes degraded. By that I mean the working node can only partially see information from the "down" node: we see green checks next to each container, but the summary page dies, and clicking on a container on the "down" node shows an error: Connection refused (595)

Yet we can still issue basic commands to those containers from the still-working node in the cluster.

Rebooting the "down" node seems to put everything back on track: the web GUI is restored and we get access to the individual nodes and summary pages again.

Give it a couple of days and that same node goes down again, with all the degraded behavior described above.

Even while degraded, the cluster status shows a quorum.

Code:
pvecm status
Quorum information
------------------
Date:             Thu Jun 15 09:55:22 2017
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          2/12164
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.3.0.10 (local)
0x00000001          1 10.3.0.11


Code:
root@prox6:~# cat /etc/pve/.members
{
"nodename": "prox6",
"version": 42,
"cluster": { "name": "florida", "version": 2, "nodes": 2, "quorate": 1 },
"nodelist": {
  "prox6": { "id": 2, "online": 1, "ip": "10.3.0.10"},
  "prox5": { "id": 1, "online": 1, "ip": "10.3.0.11"}
  }
}



The nodes are slightly behind on updates, so I will patch them.
We make extensive use of containers, and I can't shake the feeling that one of the containers is the culprit. Every few days one of our Ubuntu containers dies on the problem Proxmox node, so the two seem related.

Can a container bring down a host?
 
I've noticed that corosync CPU usage spikes to 99%, so I attempted to restart the cluster and corosync services. That at least reined in the CPU usage, but the web GUI remained broken and I still could not control the degraded node from the other node in the cluster.
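For what it's worth, the restart I attempted was along these lines (service names as on a stock PVE install; adjust to your version):
Code:
```shell
# Restart the cluster filesystem and corosync first,
# then the daemons that back the web GUI.
systemctl restart pve-cluster corosync
systemctl restart pveproxy pvedaemon pvestatd

# Verify they came back up
systemctl status pveproxy pvedaemon --no-pager
```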
 
Do you maybe have hanging storage? Does multicast work reliably? Is your /etc/hosts correctly configured?
Have you upgraded to the latest stable version, and does the problem persist?
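(A common way to check multicast reliability between the two nodes is omping, run on both nodes at the same time; the addresses below are the ones from the pvecm output above:)
Code:
```shell
# Run simultaneously on both nodes; sustained multicast packet loss
# here usually explains flapping corosync membership.
omping -c 600 -i 1 -q 10.3.0.10 10.3.0.11
```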
 
It seems I can narrow my problem down to a pair of misbehaving LXC containers. Both mount CIFS shares across an IPsec VPN AND mount a local LVM volume from a third LXC container.

CIFS mounts
The containers were set to auto-mount via fstab entries, but when starting either container, the Proxmox host would throw some CIFS-related errors on its console. To avoid those errors, I no longer auto-mount; instead I start the container and then mount manually.
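For context, the entry inside the container is of roughly this shape (server name, share, and credentials path are placeholders, not our real values):
Code:
```
# /etc/fstab inside the container -- illustrative values only
//fileserver/share  /mnt/share  cifs  credentials=/root/.smbcredentials,_netdev,noauto  0  0
```
With noauto set, I start the container and then run `mount /mnt/share` by hand.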

Still, after 24 hours or so, I start to see occasional errors on our Proxmox host console. This one occurs when I attempt to restart one of the problem containers:
Code:
 kernel:[246361.080793] unregister_netdevice: waiting for lo to become free. Usage count = 1
Around this time, the GUI stops responding until a 1-2 minute timeout occurs; then the container is closed and GUI service is restored.

In order to set up CIFS mounts in an LXC container I had to make some AppArmor changes, so maybe something is incorrect in that config?
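The change was roughly this: a custom profile that extends the default and permits cifs mounts (an illustrative sketch, not my exact file; names and keys as on PVE 4.x / LXC 2.x):
Code:
```
# e.g. /etc/apparmor.d/lxc/lxc-default-with-cifs -- sketch, not verbatim
profile lxc-container-default-with-cifs flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/lxc/container-base>
  mount fstype=cifs,
}
```
It gets reloaded with `systemctl reload apparmor` and referenced from the container config via `lxc.aa_profile: lxc-container-default-with-cifs`. If I got a rule wrong there, that could explain the DENIED lines I see.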

You can see this isn't my first run-in with Proxmox/LXC/CIFS mounts:
https://forum.proxmox.com/threads/lxc-container-mount-cifs-sporadically-hangs.31249/

I also see a few AppArmor errors related to mounts; not sure if this is related:

Code:
audit: type=1400 audit(1498653415.497:1122): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=87404 comm="mount" flags="ro, remount, relatime"

Jun 25 09:09:06 prox5 kernel: [177183.979139] audit: type=1400 audit(1498396146.612:806): apparmor="DENIED" operation="file_lock" profile="lxc-container-default-cgns" pid=89240 comm="(ionclean)" family="unix" sock_type="dgram" protocol=0 addr=none
 
