I have a total of four Proxmox 4.4 servers running in a production environment.
They are of differing ages; the fourth, vhost4, was added recently, and when it was, the other three were rebuilt with Proxmox 4.4 in a way that minimized service interruptions.
vhost4 was installed with clustering in mind and, before any VMs were added to it, was made the first node of a new cluster. The cluster was set up to use a separate network, per https://pve.proxmox.com/wiki/Separate_Cluster_Network, and no problems arose at that stage.
I am now ready to add the other nodes to the cluster, and in preparation I moved all VMs from vhost2 to another vhost for temporary hosting. On vhost2 I then ran 'pvecm add vhost4 -ring0_addr network2IP'. The command stalled at the "waiting for quorum..." stage and would not continue. Having started at the end of the day, I left it overnight and found it in the same state the next morning, so I hit CTRL-C to stop it and started looking into possible fixes.
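For reference, the exact invocation looked like this (the address is only a placeholder for vhost2's IP on the separate cluster network, not the real value):

# run on vhost2, the node being joined to the cluster
pvecm add vhost4 -ring0_addr 10.10.10.2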
After removing vhost2 from the cluster, trying a fix, and adding it back in a few times, I finally found a partial fix.
First, I enabled the multicast querier on the bridge interface on both nodes by adding post-up ( echo 1 > /sys/devices/virtual/net/$IFACE/bridge/multicast_querier ) to the bridge stanza in /etc/network/interfaces.
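The bridge stanza on each node now looks roughly like this (bridge name, port, and addresses here are placeholders rather than the real values):

auto vmbr0
iface vmbr0 inet static
        address 10.10.10.2
        netmask 255.255.255.0
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
        post-up ( echo 1 > /sys/devices/virtual/net/$IFACE/bridge/multicast_querier )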
I also made certain that the SSH public key of the opposite node was present in /etc/pve/priv/authorized_keys on each node before running the pvecm add command. I'm not certain this was necessary, but it seemed to help at the time.
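In case it matters, the keys were put in place along these lines (hostname and key path are illustrative, not necessarily what your setup uses):

# on vhost2, pull root's public key from vhost4 and append it to the cluster key file
ssh root@vhost4 cat /root/.ssh/id_rsa.pub >> /etc/pve/priv/authorized_keys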
I now have both nodes showing up in the pvecm status output; so far so good.
Now I am at another impasse for which my Google searches turn up no direct references.
When accessing the web interface on vhost2 I am asked to log in and can do so with root and the Linux PAM authentication realm (yes, I know that's not good practice). I then see both nodes listed with the green checkmark.
Within a few moments I get the message "Connection error 401: 403 Permission check failed (permission denied - invalid PVE ticket)" and am asked to log in again, in a never-ending loop. This does not occur on vhost4.
Another possibly related issue shows up when accessing the web interface on vhost4: any screen involving vhost2 gives the message "ssl3_get_server_certificate: certificate verify failed (596)". Note that the SSL certificates are self-created and self-signed per machine, so this may be part of the issue; however, these machines are not Internet-facing, so no SSL certificate will ever be purchased for them.
I have tried restarting vhost2, the newly added cluster node, to no effect. I have also tried running pvecm updatecerts, again to no effect.
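Concretely, the only extra command attempted so far was this, run on vhost2 after it rejoined:

pvecm updatecerts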
What I cannot do at the moment is modify vhost4, as it is in production with a number of VMs running on it. This means I cannot restart the host, any VM, or the network, or run an update. I know this narrows my options, but there is no way around it.
Any ideas would be helpful.