pve 4.4: Cluster nodes flapping in GUI from green to red and vice versa

Hello Wolfgang,
attached is the output you requested. Let us pause here, because I have to check whether this is interference from the IPMI controller on the mainboard. To me it looks like a duplicate IP on the network, but that is not the case (I checked the ARP table twice). Next, since the IPMI may be taking over the server's network interface, I want to disable it. I will come back to you.

# corosync-quorumtool
Quorum information
------------------
Date: Wed Sep 27 08:33:07 2017
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 5
Ring ID: 1/552
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
1 1 node1708vm-1
2 1 node1708vm-2
3 1 node1708vm-3
4 1 node1708vm-4
5 1 node1708vm-5 (local)
 


I have set the mode from "failover" to "dedicated" in the IPMI NIC interface settings, hoping that this might be the cause of the phenomenon... but it was not. The nodes still keep flapping.

On the console I saw the following message on all nodes, about 20-30 times:
...systemd-sysv-generator ignoring creation of an alias umountiscsi.service for itself
iSCSI is not in use.

I am running out of ideas, because I can't find any error message.
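For completeness, this is how I am watching the cluster service logs while the GUI flaps (a minimal sketch, assuming the standard PVE 4.4 systemd units):

Code:
# follow the cluster-related services on one node while the GUI flaps
journalctl -f -u corosync -u pve-cluster -u pvestatd -u pveproxy

# check whether any of these services have been restarting
systemctl status corosync pve-cluster pvestatd pveproxy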
 
Please send me your storage.cfg.
Your ceph config shows you are using cephx auth, but your rados command shows auth none.

Code:
cat /etc/pve/storage.cfg 
ls -hal /etc/pve/priv/ceph
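For comparison, a quick way to see which auth mode your cluster actually expects (assuming the default /etc/ceph/ceph.conf path) would be:

Code:
# list the cephx auth requirements from the ceph config
grep -E 'auth.*(required|supported)' /etc/ceph/ceph.conf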
 
I have to mention that at an early stage of the installation (after creating HA) I needed to switch the corosync network from the previous admin net 192.168.0.0/24 to 10.0.21.0/24 (a dedicated network). This changed the hostnames from node1708-X to node1708vm-X.
It would be nice to be able to define the networks according to their purposes (admin, corosync, storage) in the installation wizard.
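For reference, the nodelist in /etc/pve/corosync.conf now looks roughly like this (a sketch; only the first node is shown, and the hostnames resolve to addresses in 10.0.21.0/24):

Code:
nodelist {
  node {
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node1708vm-1   # resolves into the dedicated 10.0.21.0/24 corosync net
  }
  # nodes 2-5 are analogous
}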

# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content vztmpl,iso,backup

lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images

rbd: VM-OS
        monhost 10.0.21.10,10.0.21.20,10.0.21.30
        content images
        krbd 0
        pool TIER0
        username admin

rbd: VM-DATA
        monhost 10.0.21.10,10.0.21.20,10.0.21.30
        content images
        krbd 0
        pool TIER1
        username admin

#ls -hal /etc/pve/priv/ceph
-rw------- 1 root www-data 137 Sep 19 17:09 TIER0.keyring
-rw------- 1 root www-data 137 Sep 19 17:09 TIER1.keyring
 
With the move of the corosync network, the "old" hostnames were left behind in /etc/pve/nodes:
drwxr-xr-x 2 root www-data 0 Sep 13 19:01 node1708-1
drwxr-xr-x 2 root www-data 0 Sep 17 17:40 node1708-2
drwxr-xr-x 2 root www-data 0 Sep 17 17:41 node1708-3
drwxr-xr-x 2 root www-data 0 Sep 17 17:41 node1708-4
drwxr-xr-x 2 root www-data 0 Sep 17 17:42 node1708-5
drwxr-xr-x 2 root www-data 0 Sep 20 10:52 node1708vm-1
drwxr-xr-x 2 root www-data 0 Sep 20 10:52 node1708vm-2
drwxr-xr-x 2 root www-data 0 Sep 20 10:52 node1708vm-3
drwxr-xr-x 2 root www-data 0 Sep 20 10:52 node1708vm-4
drwxr-xr-x 2 root www-data 0 Sep 20 10:52 node1708vm-5
Should I delete the directories with the old hostnames?
 
Please rename your keyrings.
You have to use the storage name, not the pool name:

VM-OS.keyring
VM-DATA.keyring

Then restart the pvestatd.service.
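A minimal sketch of those two steps, assuming the keyrings stay in /etc/pve/priv/ceph/:

Code:
cd /etc/pve/priv/ceph
# the keyring file name must match the storage name from storage.cfg
mv TIER0.keyring VM-OS.keyring
mv TIER1.keyring VM-DATA.keyring
systemctl restart pvestatd.service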

Should I delete the directories with the old hostnames?
If you have no settings in them, you can erase the old node directories.
But I would back them up first.
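For example something like this (the backup location is just a suggestion):

Code:
# keep a copy of the old node directories instead of deleting them
mkdir -p /root/nodes
mv /etc/pve/nodes/node1708-{1,2,3,4,5} /root/nodes/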
 
I moved the old directories to /root/nodes/ and renamed the two keyring files according to the storage names.
Afterwards I rebooted each node. Nothing changed.
 

Attachments

  • flapping_nodes_20170928.PNG
The monitor address in your storage.cfg is not correct.
 
I do not quite understand.
10.0.21.0/24 = corosync network
10.0.20.0/24 = ceph network
So I added the three nodes which are quorum monitors: 10.0.21.10, .20 and .30.
Do I have to define three ceph nodes (randomly?) as monitors?
 
You also have to update the storage.cfg with the correct mon addresses.
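A sketch of how to look up the monitor addresses and what the storage entries should then contain (the 10.0.20.x addresses below are only an assumption based on your ceph network):

Code:
# show where the monitors actually listen
ceph mon dump
grep -E 'mon[ _]host' /etc/ceph/ceph.conf

# then use those addresses in /etc/pve/storage.cfg, e.g.:
# rbd: VM-OS
#         monhost 10.0.20.10,10.0.20.20,10.0.20.30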