Issues in cluster on Proxmox v4

anigwei

Hi!

In our labs we have had 5 Proxmox nodes since this summer. Version 3.5 ran fine (without any kind of problem).

This last month we have been migrating the cluster to Proxmox v4, and issues have appeared :(

Suddenly, "Permission denied - invalid ticket 401" appears while browsing the UI and throws me back to the login.

All hosts resolve mutually through /etc/hosts.
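(For reference, the relevant entries look roughly like this on every node; the names and IPs match the membership output below:)

Code:
172.26.1.60  eimtvm0
172.26.1.61  eimtvm1
172.26.1.62  eimtvm2
172.26.1.63  eimtvm3
172.26.1.64  eimtvm4
172.26.1.65  eimtvm5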

How can I debug this issue more deeply?

The cluster appears to be OK, and time is synced on all hosts (I checked it).

Code:
Membership information
----------------------
    Nodeid      Votes Name
         3          1 eimtvm0
         5          1 eimtvm1 (local)
         4          1 eimtvm2
         6          1 eimtvm3
         2          1 eimtvm4
         1          1 eimtvm5

Code:
root@eimtvm1:~# pvecm status
Quorum information
------------------
Date:             Fri Nov 27 17:21:28 2015
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000005
Ring ID:          616
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 172.26.1.60
0x00000005          1 172.26.1.61 (local)
0x00000004          1 172.26.1.62
0x00000006          1 172.26.1.63
0x00000002          1 172.26.1.64
0x00000001          1 172.26.1.65


Thanks!!
 
Maybe the host time is out of sync? Make sure NTP is active; check with:

# timedatectl

set with:

# timedatectl set-ntp on
 
Hi dietmar!

Thanks.

I made a little script to run a command on every node of the cluster. I ran timedatectl and every node has the correct, synchronized date/time.
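(The script is basically a loop like this sketch; it assumes passwordless root SSH between the nodes, and the script name is made up:)

Code:
#!/bin/bash
# cluster-run.sh (hypothetical name): run the given command on every
# node of the cluster and label each node's output.
for node in eimtvm0 eimtvm1 eimtvm2 eimtvm3 eimtvm4 eimtvm5; do
    echo "=== $node ==="
    ssh root@"$node" "$@"
done

# usage: ./cluster-run.sh timedatectl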

BTW: Now I can't access one of the nodes via UI at all.

The error it throws is:

Secure Connection Failed

The connection to the server was reset while the page was loading.

The page you are trying to view cannot be shown because the authenticity of the received data could not be verified.
Please contact the website owners to inform them of this problem.



Maybe there is a node that is not syncing correctly? Is there any command to see whether a node is perfectly synced with the other nodes?

Thanks again,
 
The connection to the server was reset while the page was loading.

looks more like a network problem (firewall?).

Maybe there is a node that is not syncing correctly? Is there any command to see whether a node is perfectly synced with the other nodes?

There is no need for a 'perfect' time sync.
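If you want a rough comparison anyway, something like this sketch (reusing the node names from your membership list) prints each clock to sub-second precision:

Code:
# Rough cross-node clock comparison; run from any node with SSH access:
for node in eimtvm0 eimtvm1 eimtvm2 eimtvm3 eimtvm4 eimtvm5; do
    printf '%s: ' "$node"; ssh root@"$node" date +%s.%N
done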
 
Hi!

I discard a network problem, since the cluster was running smoothly with 3.5 (and with 4.0 at the beginning).

The weirdest thing is that via the mobile UI the behavior is the same: it disconnects me and throws me back to the login a few seconds after logging in.

VMs are running without problem.

Maybe corosync issues?

Thanks again!!!
 

I had similar problems this week; I haven't found the cause yet, possibly multicast issues. It got so bad I had to reformat 3 nodes and completely remove all signs of the cluster. Not the ideal situation, as I was looking at implementing live migration, but without a cluster that's hard to do (not impossible, though, if you do things at the command line instead of the GUI).
 

Hi erk!

This is Windows-style and I prefer to avoid it :( (if it's not working and you don't know why: REINSTALL).

I'll review multicast issues (maybe the switch is blocking it?), but if the final solution is to reinstall all nodes, I think we will end up with a clean pure-KVM scheme...

Thank you!
 

Unfortunately I didn't know multicast was essential for corosync. The gigabit switch I used for the cluster was unmanaged, so there was no chance of accidentally blocking multicast, and two of the three Proxmox nodes did cluster, but I kept getting the "Permission denied - invalid ticket 401" stuff randomly, and when I clicked on anything in the Server View panel on the left, it would pop up a dialog asking me to log in again. So all was not happy. When I did omping tests, none of the nodes responded, which made me think some sort of multicast router was missing from the setup. I couldn't find much info on the requirements, so I abandoned the idea of multicast-dependent clustering.
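(The omping tests were along these lines; the node names are placeholders, and the same command has to run on all nodes at the same time:)

Code:
# Short burst test (~10 seconds of rapid packets between all listed nodes):
omping -c 10000 -i 0.001 -F -q node1 node2 node3

# Longer test (~10 minutes) that also catches IGMP snooping/querier timeouts:
omping -c 600 -i 1 -q node1 node2 node3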
 
Hi,

I have exactly the same issues as erk :(

I checked the switch (D-Link 1510): IGMP snooping was disabled. I enabled it, and also enabled the querier. But the results are the same :(

BTW, the Proxmox multicast documentation (http://pve.proxmox.com/wiki/Multicast_notes#Testing_multicast) says that the command "pvecm status" will show the multicast address.

In my Proxmox v4 cluster, "pvecm status" doesn't show the multicast address :( Was this removed in v4?

Could someone confirm this? Or do I definitely have a multicast problem?

Thanks!!

PROXMOX MULTICAST WIKI:
#pvecm status|grep "Multicast addresses"
Multicast addresses: 239.192.221.35


But... I have no multicast address in the "pvecm status" output:

Code:
root@eimtvm2:~# pvecm status
Quorum information
------------------
Date:             Tue Dec  1 10:04:06 2015
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000004
Ring ID:          736
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 172.26.1.60
0x00000005          1 172.26.1.61
0x00000004          1 172.26.1.62 (local)
0x00000006          1 172.26.1.63
0x00000002          1 172.26.1.64
0x00000001          1 172.26.1.65
root@eimtvm2:~#
 
Hi,

Thanks dietmar. I discard switch issues (I've grouped the Proxmox nodes into a snooping group).

But the problem persists!!

I've been debugging a bit more...

* I connect to node 1 via the UI
* I browse node 3 via the UI (still connected through node 1)
* Then it throws me out of the UI... as usual... :(
* In node 1's pveproxy access.log, the accesses become 401!!!

Everything is going fine...

Code:
172.26.0.19 - abassolsa@pve [01/Dec/2015:11:24:58 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=mem,maxmem&timeframe=hour&cf=AVERAGE&_dc=0 HTTP/1.1" 200 8403
172.26.0.19 - abassolsa@pve [01/Dec/2015:11:24:58 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=diskread,diskwrite&timeframe=hour&cf=AVERAGE&_dc=0 HTTP/1.1" 200 12615
172.26.0.19 - - [01/Dec/2015:11:24:58 +0100] "GET /pve2/ext4/resources/themes/images/default/tree/elbow-minus.gif HTTP/1.1" 304 -
172.26.0.19 - - [01/Dec/2015:11:24:58 +0100] "GET /pve2/images/drive-harddisk.png HTTP/1.1" 304 -
172.26.0.19 - abassolsa@pve [01/Dec/2015:11:24:58 +0100] "GET /api2/json/cluster/resources HTTP/1.1" 200 2997
172.26.0.19 - - [01/Dec/2015:11:24:58 +0100] "GET /pve2/ext4/resources/themes/images/default/tree/elbow-line.gif HTTP/1.1" 304 -
172.26.0.19 - - [01/Dec/2015:11:24:58 +0100] "GET /pve2/images/computer-on.png HTTP/1.1" 304 -
172.26.0.19 - abassolsa@pve [01/Dec/2015:11:24:58 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=netin,netout&timeframe=hour&cf=AVERAGE&_dc=0 HTTP/1.1" 200 12111
172.26.0.19 - - [01/Dec/2015:11:24:58 +0100] "GET /pve2/images/network-server-on.png HTTP/1.1" 304 -
172.26.0.19 - abassolsa@pve [01/Dec/2015:11:24:58 +0100] "GET /api2/json/cluster/tasks HTTP/1.1" 200 5882

Until here:

Code:
172.26.0.19 - abassolsa@pve [01/Dec/2015:11:25:00 +0100] "GET /api2/json/nodes/eimtvm3/storage?content=backup HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:25:00 +0100] "GET /api2/json/access/domains HTTP/1.1" 200 155
172.26.0.19 - abassolsa@pve [01/Dec/2015:11:25:01 +0100] "GET /api2/json/nodes/eimtvm3/qemu/104/status/current HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:25:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=netin,netout&timeframe=hour&cf=AVERAGE&_dc=1 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:25:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=cpu&timeframe=hour&cf=AVERAGE&_dc=1 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:25:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=diskread,diskwrite&timeframe=hour&cf=AVERAGE&_dc=1 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:25:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=mem,maxmem&timeframe=hour&cf=AVERAGE&_dc=1 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:25:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=netin,netout&timeframe=hour&cf=AVERAGE&_dc=2 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:25:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=cpu&timeframe=hour&cf=AVERAGE&_dc=2 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:25:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=mem,maxmem&timeframe=hour&cf=AVERAGE&_dc=2 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:25:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=diskread,diskwrite&timeframe=hour&cf=AVERAGE&_dc=2 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:26:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=cpu&timeframe=hour&cf=AVERAGE&_dc=3 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:26:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=netin,netout&timeframe=hour&cf=AVERAGE&_dc=3 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:26:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=mem,maxmem&timeframe=hour&cf=AVERAGE&_dc=3 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:26:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=diskread,diskwrite&timeframe=hour&cf=AVERAGE&_dc=3 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:26:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=cpu&timeframe=hour&cf=AVERAGE&_dc=4 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:26:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=diskread,diskwrite&timeframe=hour&cf=AVERAGE&_dc=4 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:26:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=netin,netout&timeframe=hour&cf=AVERAGE&_dc=4 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:26:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=mem,maxmem&timeframe=hour&cf=AVERAGE&_dc=4 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:27:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=cpu&timeframe=hour&cf=AVERAGE&_dc=5 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:27:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=diskread,diskwrite&timeframe=hour&cf=AVERAGE&_dc=5 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:27:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=mem,maxmem&timeframe=hour&cf=AVERAGE&_dc=5 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:27:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=netin,netout&timeframe=hour&cf=AVERAGE&_dc=5 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:27:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=netin,netout&timeframe=hour&cf=AVERAGE&_dc=6 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:27:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=cpu&timeframe=hour&cf=AVERAGE&_dc=6 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:27:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=mem,maxmem&timeframe=hour&cf=AVERAGE&_dc=6 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:27:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=diskread,diskwrite&timeframe=hour&cf=AVERAGE&_dc=6 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:28:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=cpu&timeframe=hour&cf=AVERAGE&_dc=7 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:28:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=mem,maxmem&timeframe=hour&cf=AVERAGE&_dc=7 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:28:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=diskread,diskwrite&timeframe=hour&cf=AVERAGE&_dc=7 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:28:27 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=netin,netout&timeframe=hour&cf=AVERAGE&_dc=7 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:28:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=netin,netout&timeframe=hour&cf=AVERAGE&_dc=8 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:28:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=cpu&timeframe=hour&cf=AVERAGE&_dc=8 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:28:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=mem,maxmem&timeframe=hour&cf=AVERAGE&_dc=8 HTTP/1.1" 401 -
172.26.0.19 - - [01/Dec/2015:11:28:57 +0100] "GET /api2/png/nodes/eimtvm3/qemu/104/rrd?ds=diskread,diskwrite&timeframe=hour&cf=AVERAGE&_dc=8 HTTP/1.1" 401 -

What could cause node 3 to suddenly reject those requests? Why?

(This scenario repeats when browsing the cluster from other nodes; the 401s appear there too.)

Thanks!!!
 
Hi again! An update... tailing the nodes' logs I've found this:
Code:
Dec  1 11:41:42 eimtvm4 pveproxy[111354]: ipcc_send_rec failed: Transport endpoint is not connected
Could this be the cause?
 
Hello!

You have two problems, I think.

The first is the authkey (the 401 error). I tested this: the problem started when I deleted the file /etc/corosync/authkey. Check that file on your nodes.

The second, "ipcc_send_rec failed: Transport endpoint is not connected", is a cluster problem: the node can't join the cluster. You can check this with service corosync status and service pve-cluster status (the latter will report "Cannot initialize CMAP service").
Good luck! :D
 
I still do not get what you want. The above command just prints the corosync multicast address.
I had tried that:
Code:
#  corosync-cmapctl -g totem.interface.0.mcastaddr
Can't get key totem.interface.0.mcastaddr. Error CS_ERR_NOT_EXIST
So I thought that 'totem.interface.0.mcastaddr' needed to be specified.
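(One way to keep looking is to dump the whole CMAP database and filter for multicast-related keys; a sketch, and the key may simply be absent when no multicast address was ever set explicitly in corosync.conf:)

Code:
# Dump all CMAP keys and filter for anything multicast-related:
corosync-cmapctl | grep -i mcast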
 
