[SOLVED] ceph nodes alternate red/green on pve screen

RobFantini

I am trying to set up a 7 node test ceph cluster.

The first 5 work great.

When I add a 6th, it keeps going red - then back to green, then red.

I tried reinstalling using a different address and hostname - same issue.

I thought there might be hardware issues with the 6th node.

However, I deleted the node, then added a 6th node on a different motherboard with the same disks. Same issue.

Is there a limitation on the number of nodes? Or a configuration item to change? It seems like a software issue, not a hardware one.
 
Code:
# pveversion -v
proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-107
pve-firmware: 1.1-10
libpve-common-perl: 4.0-90
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-93
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-1
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve14~bpo80
 
Is there a limitation on the number of nodes?

32 nodes is the current limit, and we've got 10-16 node clusters in the wild, although with 16 nodes the cluster communication could need to be tuned a little.
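For reference, the tuning meant here usually happens in the totem section of /etc/pve/corosync.conf. A minimal sketch below; the values are placeholders to show where the knobs live, not recommendations for your cluster.
Code:
# /etc/pve/corosync.conf - hedged sketch with placeholder values;
# remember that config_version must be increased on every edit
totem {
  version: 2
  secauth: on
  # placeholder cluster name
  cluster_name: testcluster
  config_version: 8
  # example of a raised token timeout in milliseconds
  token: 10000
  interface {
    ringnumber: 0
    # placeholder cluster network address
    bindnetaddr: 10.1.10.0
  }
}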

When I add a 6th, it keeps going red - then back to green, then red.

"Just" the 6th node keeps doing that? Red can mean a) not in quorate partition or b) the pvestatd daemon which collects this is slow, or has other problems?

Or a configuration item to change? It seems like a software issue, not a hardware one.

Anything from corosync / the cluster FS in the logs?

Code:
journalctl -u corosync -u pve-cluster
 
Thanks for the response.
Here is journalctl -u corosync -u pve-cluster:
Code:
sys8  ~ # journalctl -u corosync -u pve-cluster
-- Logs begin at Sun 2017-02-05 08:42:35 EST, end at Sun 2017-02-05 09:42:16 EST. --
Feb 05 08:42:46 sys8 systemd[1]: Starting The Proxmox VE cluster filesystem...
Feb 05 08:42:46 sys8 pmxcfs[2281]: [quorum] crit: quorum_initialize failed: 2
Feb 05 08:42:46 sys8 pmxcfs[2281]: [quorum] crit: can't initialize service
Feb 05 08:42:46 sys8 pmxcfs[2281]: [confdb] crit: cmap_initialize failed: 2
Feb 05 08:42:46 sys8 pmxcfs[2281]: [confdb] crit: can't initialize service
Feb 05 08:42:46 sys8 pmxcfs[2281]: [dcdb] crit: cpg_initialize failed: 2
Feb 05 08:42:46 sys8 pmxcfs[2281]: [dcdb] crit: can't initialize service
Feb 05 08:42:46 sys8 pmxcfs[2281]: [status] crit: cpg_initialize failed: 2
Feb 05 08:42:46 sys8 pmxcfs[2281]: [status] crit: can't initialize service
Feb 05 08:42:47 sys8 systemd[1]: Started The Proxmox VE cluster filesystem.
Feb 05 08:42:47 sys8 systemd[1]: Starting Corosync Cluster Engine...
Feb 05 08:42:47 sys8 corosync[2334]: [MAIN  ] Corosync Cluster Engine ('2.4.0'): started and ready to provide service.
Feb 05 08:42:47 sys8 corosync[2334]: [MAIN  ] Corosync built-in features: augeas systemd pie relro bindnow
Feb 05 08:42:47 sys8 corosync[2338]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Feb 05 08:42:47 sys8 corosync[2338]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Feb 05 08:42:47 sys8 corosync[2338]: [TOTEM ] The network interface [10.1.10.8] is now up.
Feb 05 08:42:47 sys8 corosync[2338]: [SERV  ] Service engine loaded: corosync configuration map access [0]
Feb 05 08:42:47 sys8 corosync[2338]: [QB  ] server name: cmap
Feb 05 08:42:47 sys8 corosync[2338]: [SERV  ] Service engine loaded: corosync configuration service [1]
Feb 05 08:42:47 sys8 corosync[2338]: [QB  ] server name: cfg
Feb 05 08:42:47 sys8 corosync[2338]: [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 05 08:42:47 sys8 corosync[2338]: [QB  ] server name: cpg
Feb 05 08:42:47 sys8 corosync[2338]: [SERV  ] Service engine loaded: corosync profile loading service [4]
Feb 05 08:42:47 sys8 corosync[2338]: [QUORUM] Using quorum provider corosync_votequorum
Feb 05 08:42:47 sys8 corosync[2338]: [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 05 08:42:47 sys8 corosync[2338]: [QB  ] server name: votequorum
Feb 05 08:42:47 sys8 corosync[2338]: [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 05 08:42:47 sys8 corosync[2338]: [QB  ] server name: quorum
Feb 05 08:42:47 sys8 corosync[2338]: [TOTEM ] A new membership (10.1.10.8:29908) was formed. Members joined: 6
Feb 05 08:42:47 sys8 corosync[2338]: [QUORUM] Members[1]: 6
Feb 05 08:42:47 sys8 corosync[2338]: [MAIN  ] Completed service synchronization, ready to provide service.
Feb 05 08:42:47 sys8 corosync[2338]: [TOTEM ] A new membership (10.1.10.6:29912) was formed. Members joined: 3 1 5 4 2
Feb 05 08:42:47 sys8 corosync[2338]: [QUORUM] This node is within the primary component and will provide service.
Feb 05 08:42:47 sys8 corosync[2338]: [QUORUM] Members[6]: 3 6 1 5 4 2
 
The next node I checked has the same:
Code:
Feb 03 12:06:05 s020 systemd[1]: Starting The Proxmox VE cluster filesystem...
Feb 03 12:06:06 s020 pmxcfs[2816]: [quorum] crit: quorum_initialize failed: 2
Feb 03 12:06:06 s020 pmxcfs[2816]: [quorum] crit: can't initialize service
Feb 03 12:06:06 s020 pmxcfs[2816]: [confdb] crit: cmap_initialize failed: 2
Feb 03 12:06:06 s020 pmxcfs[2816]: [confdb] crit: can't initialize service
Feb 03 12:06:06 s020 pmxcfs[2816]: [dcdb] crit: cpg_initialize failed: 2
Feb 03 12:06:06 s020 pmxcfs[2816]: [dcdb] crit: can't initialize service
Feb 03 12:06:06 s020 pmxcfs[2816]: [status] crit: cpg_initialize failed: 2
Feb 03 12:06:06 s020 pmxcfs[2816]: [status] crit: can't initialize service
Feb 03 12:06:07 s020 systemd[1]: Started The Proxmox VE cluster filesystem.
Feb 03 12:06:07 s020 systemd[1]: Starting Corosync Cluster Engine...
 
Corosync and the cluster filesystem look OK, so it's something with pvestatd or its connection (e.g. the network).

Just a guess, but it may be something somewhat cosmetic, where the status report of the 6th node periodically arrives too late, so the web interface shows it red (= unknown); a little bit later it gets a valid update again and the UI shows it green again.

How do you visit the interface: over a slow connection, on a (relatively) slow device, or directly through the LAN of the PVE nodes?

Does the 6th node's status also flicker if you visit the web interface through it?

Does the journal show any warnings/connection drops?

This issue seems really a little weird, hmm...
 
re: How do you visit the interface: over a slow connection, on a (relatively) slow device, or directly through the LAN of the PVE nodes?

All the pve nodes are attached to a Cisco 1G switch. It is a managed switch - some of the settings may not be ideal.
There is not much on the vlan used by pve - mainly just the phone system + pve.
On the Cisco switch the pve ports have:
Code:
interface gigabitethernet8
storm-control broadcast enable
storm-control include-multicast unknown-unicast
qos cos 4
switchport trunk native vlan 10

vlan 10 is also used tagged on the nics used for VMs. Each pve machine has 4 nics:
1 - pve, 2 - ceph and backups, 3 - most VMs, 4 - video VMs.
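For reference, roughly how that looks in /etc/network/interfaces on each node - a hedged sketch; the interface names and addresses are illustrative, not our real config.
Code:
# /etc/network/interfaces - sketch of the 4-nic split described above

# nic 1: pve management
auto vmbr0
iface vmbr0 inet static
    address 10.1.10.8
    netmask 255.255.255.0
    gateway 10.1.10.1
    bridge_ports eth0
    bridge_stp off
    bridge_fd 0

# nic 2: ceph and backups (plain interface, no bridge needed)
auto eth1
iface eth1 inet static
    address 10.2.2.8
    netmask 255.255.255.0

# nic 3: most VMs, vlan 10 tagged on a vlan-aware bridge
auto vmbr1
iface vmbr1 inet manual
    bridge_ports eth2
    bridge_stp off
    bridge_fd 0
    bridge_vlan_aware yes

# nic 4: video VMs
auto vmbr2
iface vmbr2 inet manual
    bridge_ports eth3
    bridge_stp off
    bridge_fd 0
    bridge_vlan_aware yes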

re: This issue seems really a little weird, hmm...
I agree. This could be a network issue - still it's strange that only node 6 has the issue.

I'll add node 7 tomorrow and report back results.
 
So I added another node.

The last two nodes have the same issue - red/green. Sometimes one, sometimes both.

I have checked DNS; each node has all the other nodes in /etc/hosts.

If anyone has a suggestion to help debug this, please do so.
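One thing I plan to check, since corosync uses UDP multicast and our switch applies storm-control to multicast: whether multicast flows cleanly between all nodes. A rough sketch with omping (it needs to be installed on every node first; the hostnames are placeholders for the full node list):
Code:
# start on all nodes at roughly the same time
omping -c 10000 -i 0.001 -F -q nodeA nodeB nodeC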
 
Found this at https://forum.proxmox.com/threads/pveceph-configuration-not-enabled-500.20369/

"
You need to run:

# pveceph init

on each node you want to enable the ceph management GUI."

And from the wiki:
"
Note:

If you add a node where you do not want to run a Ceph monitor, e.g. another node for OSDs, you need to install the Ceph packages with 'pveceph install' and you need to initialize Ceph with 'pveceph init'"
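So for the two new nodes that do not run a monitor, that should be (per the quotes above):
Code:
# on each ceph node that will not run a monitor
pveceph install
pveceph init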


I'll do that and report back.
 
'pveceph init' fixed the issue.

As of this date, a ceph cluster node must be a mon, or the above must be done.

I updated the wiki.
 
When Ceph > Log is clicked:
Code:
unable to open file - No such file or directory
is what shows. Logs do show on the 5 original nodes, but not on the last 2.

Note: all the other ceph screens work; I was able to add an osd.
 
'pveceph init' fixed the issue.

As of this date, a ceph cluster node must be a mon, or the above must be done.

I updated the wiki.

Thanks for the wiki update!

Hmm, but I could not reproduce the red/green alternation of a node. I added a new one to a ceph cluster without any pveceph changes there and waited a bit, then installed ceph and waited a bit more; that didn't trigger the red/green alternation... :/ But it was just 4 nodes, so I'll try again with another one now.

But yes, if you want to do anything with Ceph on a node (i.e. add OSDs, add a Monitor, or both) you need
Code:
pveceph install
pveceph init

Otherwise you will not be able to manage Ceph when you click on this node in the UI (though you naturally can through the others, regardless of which node you connect to).

is what shows. Logs do show on the 5 original nodes, but not on the last 2.

Note: all the other ceph screens work; I was able to add an osd.

Logs are node-specific and come from the monitors AFAIK, so in your case that'd explain it. It's a bit confusing; maybe a better name (Monitor Log) would be good here.
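For what it's worth, my guess (not verified) is that the panel reads the cluster log that only the monitors write locally, so a quick check on one of the new nodes would be:
Code:
# hedged guess at the file behind the Ceph > Log panel;
# it normally only exists on monitor nodes
ls -l /var/log/ceph/ceph.log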
 
The red/green thing could have something to do with our managed switch settings.

In your lab, what kind of network is used for ceph?
 
Completely unmanaged switch; some tests also happen in virtual clusters (nested Proxmox VE instances on PVE), so Linux bridges act as the "switch".

For ceph our switch is an unmanaged 10G switch.

For pve and the vlan nics we use a managed switch. We've got hundreds of devices attached, so we limit broadcast traffic to vlans.
The main managed switches - we've used Netgear and Cisco - have so many possible settings that it is hard to figure out the ideal setup.

If anyone has suggestions on simplifying a large LAN with multiple rooms and switches, I could use the advice - for instance a forum or a company to help out.
 
