Ceph installation on new cluster node fails - ceph-mon not running

mbstein

Member
Jun 11, 2020
Installed a new Ceph node to replace a failed node (no reused addresses, names, etc.). After joining the Corosync cluster, IP networking is OK and pvecm status is OK.
Installation of Ceph failed, first via the GUI and then via the pveceph CLI:
- Timeout on ceph -s
- Can't start a monitor, error: Could not connect to ceph cluster despite configured monitors (500)
- Ceph monitor mon.0 cannot be removed
- ceph-mon is not running on the new node, TCP ports 3300 and 6789 not in Listening state

Probably a trivial task for seasoned professionals, but not for me.

Any suggestions welcome, especially on how to start ceph-mon.

Best regards Martin
 
Ceph MONs are listed for each node with status "unknown"; no improvement after restarts.

Syslog entries for the new node:

Jul 07 14:24:46 pve-guest ceph-mon[934462]: 2023-07-07T14:24:46.332+0200 7ffb8e978a00 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve-guest' does not exist: have you run 'mkfs'?
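That log line means the monitor's data store on the new node was never initialized. A minimal check, sketched from the path in the log above (hostname pve-guest); the pveceph commands mentioned in the comment are the usual Proxmox way to tear down and recreate a MON, so run them only after verifying:

```shell
# Path taken from the syslog line above: ceph-mon expects its store here
MON_DIR=/var/lib/ceph/mon/ceph-pve-guest

if [ -d "$MON_DIR" ]; then
    echo "mon data directory exists: $MON_DIR"
else
    # Recreating via the Proxmox tooling normally rebuilds this directory:
    #   pveceph mon destroy pve-guest   # clean up the half-created MON
    #   pveceph mon create              # create it again
    echo "mon data directory missing: $MON_DIR"
fi
```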

Syslog for old node:

Jul 07 14:24:25 pve54 ceph-mon[521983]: 2023-07-07T14:24:25.134+0200 7f32bceb6700 -1 mon.pve54@1(probing) e11 get_health_metrics reporting 4 slow ops, oldest is auth(proto 0 30 bytes epoch 0)
Jul 07 14:24:25 pve54 systemd[1]: Stopping Ceph cluster monitor daemon...
Jul 07 14:24:25 pve54 ceph-mon[521983]: 2023-07-07T14:24:25.994+0200 7f32bfebc700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
Jul 07 14:24:25 pve54 ceph-mon[521983]: 2023-07-07T14:24:25.994+0200 7f32bfebc700 -1 mon.pve54@1(probing) e11 *** Got Signal Terminated ***
Jul 07 14:24:26 pve54 systemd[1]: ceph-mon@pve54.service: Succeeded.
Jul 07 14:24:26 pve54 systemd[1]: Stopped Ceph cluster monitor daemon.
Jul 07 14:24:26 pve54 systemd[1]: ceph-mon@pve54.service: Consumed 1min 6.665s CPU time.
Jul 07 14:24:26 pve54 systemd[1]: Started Ceph cluster monitor daemon.
Jul 07 14:25:01 pve54 ceph-mon[1077489]: 2023-07-07T14:25:01.218+0200 7f80d3523700 -1 mon.pve54@1(probing) e11 get_health_metrics reporting 5 slow ops, oldest is auth(proto 0 34 bytes epoch 0)
 
Set a time server for each node. If the clocks are off, some things in Ceph won't work as usual; sometimes you can't even create a MON. So maybe the new node does not have the correct time? Can you share your ceph.conf and your CRUSH map?

Please also provide /etc/network/interfaces, and ping the other nodes from your new node to check whether the network is OK.

Configs to provide:
  • cat /etc/network/interfaces
  • cat /etc/pve/ceph.conf
  • copy of CRUSH-MAP
ToDos:
  • Check timeserver configuration
  • Check network in ceph (ping between nodes)
  • if using jumbo frames, ping with: ping -M do -s 8972 IP-of-other-node
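The checks above could be scripted roughly like this; the peer IP is assumed from the old node's Ceph-network address elsewhere in this thread, so adjust to your setup:

```shell
# 1. Time sync state (systemd-timesyncd/chrony; timedatectl may be unavailable)
SYNC=$(timedatectl show -p NTPSynchronized 2>/dev/null || echo "NTPSynchronized=unknown")
echo "$SYNC"

# 2. Plain reachability on the Ceph network (192.168.80.6 = old node, assumed)
ping -c 3 -W 2 192.168.80.6 || echo "ceph network ping failed"

# 3. Jumbo-frame test: payload 8972 = MTU 9000 minus 28 bytes of IP+ICMP headers;
#    -M do forbids fragmentation, so an undersized MTU anywhere in the path shows up
ping -M do -s 8972 -c 3 -W 2 192.168.80.6 || echo "jumbo-frame ping failed"
```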
 
Network interfaces:

Old node:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto enp4s0f1
iface enp4s0f1 inet manual
#Intel 350T2

auto enp4s0f0
iface enp4s0f0 inet manual
#Intel 350T2

iface enp0s31f6 inet manual
#iLO

auto enp1s0f0
iface enp1s0f0 inet static
address 192.168.79.6/24
#Intel X520-DA2 VL79

auto enp1s0f1
iface enp1s0f1 inet static
address 192.168.80.6/24
#Intel X520-DA2 VL80

auto bond0
iface bond0 inet manual
bond-slaves enp4s0f0 enp4s0f1
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
address 192.168.77.6/24
hwaddress a0:36:9f:1d:f3:66
gateway 192.168.77.251
bridge-ports bond0
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 2-4094
#Management untagged

New Node:
root@pve-guest:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual
#Single Ethernet

auto vmbr0
iface vmbr0 inet static
address 192.168.77.25/24
gateway 192.168.77.251
bridge-ports eno1
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 2-4094
#anything else

iface wlp0s20f3 inet manual

auto vlan79
iface vlan79 inet static
address 192.168.79.25/24
vlan-raw-device vmbr0
#Corosync

auto vlan80
iface vlan80 inet static
address 192.168.80.25/24
vlan-raw-device vmbr0
#Ceph

ping between nodes successful for all LANs/VLANs
no jumbo frames used (not supported by the 10GbE switch)

---------------
root@pve-guest:~# cat /etc/pve/ceph.conf
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster network = 192.168.80.0/24
cluster_network = 192.168.80.0/24
fsid = 933cbdfb-ecef-4df2-bf79-f2ae3b50c181
mon_allow_pool_delete = true
mon_initial_members = pve54
ms_bind_ipv4 = true
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public network = 192.168.80.0/24
public_network = 192.168.77.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pve54]
host = pve54
mds_standby_for_name = pve


---
CRUSH map unavailable (timeout in the GUI)
----

NTP has been configured now, no hints on clock skew

Thanx
 
Yes, the initial mon address is needed, but there seems to be more that's out of order.
Comparing the old node to the new node, some directories and files were not installed by the Ceph GUI.

I will go through the Ceph documentation for a manual installation.
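Following up on the hint about the initial mon address: a hedged sketch of what the missing entry in /etc/pve/ceph.conf might look like. The IP below is assumed from the old node's Ceph-network address (enp1s0f1) earlier in this thread; adjust it to the actual MON addresses of your cluster:

```ini
[global]
    # hypothetical example - list the addresses of your existing MONs here
    mon_host = 192.168.80.6
```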
 
