Ceph installation on new cluster node fails - ceph-mon not running

mbstein

Installed a new Ceph node to replace a failed node (no reused addresses, names, etc.). After joining the corosync cluster, IP networking is OK and pvecm status is OK.
Installation of Ceph failed, both via the GUI and then via the pveceph CLI:
- Timeout on ceph -s
- Can't start a monitor, error: Could not connect to ceph cluster despite configured monitors (500)
- Ceph monitor mon.0 cannot be removed
- ceph-mon is not running on the new node, TCP ports 3300 and 6789 not in Listening state

Probably a trivial task for seasoned professionals, but not for me.

Any suggestions welcome, especially on how to start ceph-mon.
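
For reference, the monitor and port state from the last point can be checked with something like this (a sketch, assuming the standard ceph-mon@<node name> systemd unit; pve-guest is the new node):

systemctl status ceph-mon@pve-guest.service   # is the monitor unit running or failed?
journalctl -b -u ceph-mon@pve-guest.service   # recent monitor log output from this boot
ss -tlnp | grep -E ':3300|:6789'              # are the MON ports in Listening state?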

Best regards, Martin
 
There are Ceph MONs listed for each node with status "unknown"; no improvement after restarts.

Syslog entries for the new node:

Jul 07 14:24:46 pve-guest ceph-mon[934462]: 2023-07-07T14:24:46.332+0200 7ffb8e978a00 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve-guest' does not exist: have you run 'mkfs'?

Syslog for old node:

Jul 07 14:24:25 pve54 ceph-mon[521983]: 2023-07-07T14:24:25.134+0200 7f32bceb6700 -1 mon.pve54@1(probing) e11 get_health_metrics reporting 4 slow ops, oldest is auth(proto 0 30 bytes epoch 0)
Jul 07 14:24:25 pve54 systemd[1]: Stopping Ceph cluster monitor daemon...
Jul 07 14:24:25 pve54 ceph-mon[521983]: 2023-07-07T14:24:25.994+0200 7f32bfebc700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
Jul 07 14:24:25 pve54 ceph-mon[521983]: 2023-07-07T14:24:25.994+0200 7f32bfebc700 -1 mon.pve54@1(probing) e11 *** Got Signal Terminated ***
Jul 07 14:24:26 pve54 systemd[1]: ceph-mon@pve54.service: Succeeded.
Jul 07 14:24:26 pve54 systemd[1]: Stopped Ceph cluster monitor daemon.
Jul 07 14:24:26 pve54 systemd[1]: ceph-mon@pve54.service: Consumed 1min 6.665s CPU time.
Jul 07 14:24:26 pve54 systemd[1]: Started Ceph cluster monitor daemon.
Jul 07 14:25:01 pve54 ceph-mon[1077489]: 2023-07-07T14:25:01.218+0200 7f80d3523700 -1 mon.pve54@1(probing) e11 get_health_metrics reporting 5 slow ops, oldest is auth(proto 0 34 bytes epoch 0)
 
Set a timeserver on each node. If the time is off, some things in Ceph won't work as usual; sometimes you can't even create a MON. So maybe the new node does not have the correct time? Can you share your ceph.conf and your CRUSH map?

Please also provide /etc/network/interfaces and ping the other nodes from your new node to check whether the network is OK.

Configs to provide:
  • cat /etc/network/interfaces
  • cat /etc/pve/ceph.conf
  • copy of CRUSH-MAP
ToDos:
  • Check timeserver configuration
  • Check network in Ceph (ping between nodes; see the example commands below)
  • If using jumbo frames, ping with: ping -M do -s 8972 IP-of-other-node
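
A quick sketch of the corresponding commands (chrony is just an example; use whichever NTP client is installed, and replace IP-of-other-node with the Ceph address of another node):

timedatectl                              # expect "System clock synchronized: yes" on every node
chronyc sources -v                       # if chrony is the NTP client: are time sources reachable?
ping -c 3 IP-of-other-node               # plain reachability on the Ceph network
ping -M do -s 8972 IP-of-other-node      # only relevant when jumbo frames (MTU 9000) are used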
 
Network interfaces:

Old node:
~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto enp4s0f1
iface enp4s0f1 inet manual
#Intel 350T2

auto enp4s0f0
iface enp4s0f0 inet manual
#Intel 350T2

iface enp0s31f6 inet manual
#iLO

auto enp1s0f0
iface enp1s0f0 inet static
address 192.168.79.6/24
#Intel X520-DA2 VL79

auto enp1s0f1
iface enp1s0f1 inet static
address 192.168.80.6/24
#Intel X520-DA2 VL80

auto bond0
iface bond0 inet manual
bond-slaves enp4s0f0 enp4s0f1
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
address 192.168.77.6/24
hwaddress a0:36:9f:1d:f3:66
gateway 192.168.77.251
bridge-ports bond0
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 2-4094
#Management untagged

New Node:
root@pve-guest:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual
#Single Ethernet

auto vmbr0
iface vmbr0 inet static
address 192.168.77.25/24
gateway 192.168.77.251
bridge-ports eno1
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 2-4094
#anything else

iface wlp0s20f3 inet manual

auto vlan79
iface vlan79 inet static
address 192.168.79.25/24
vlan-raw-device vmbr0
#Corosync

auto vlan80
iface vlan80 inet static
address 192.168.80.25/24
vlan-raw-device vmbr0
#Ceph

Ping between nodes successful for all LANs/VLANs.
No jumbo frames used (not supported by the 10 GbE switch).

---------------
root@pve-guest:~# cat /etc/pve/ceph.conf
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster network = 192.168.80.0/24
cluster_network = 192.168.80.0/24
fsid = 933cbdfb-ecef-4df2-bf79-f2ae3b50c181
mon_allow_pool_delete = true
mon_initial_members = pve54
ms_bind_ipv4 = true
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public network = 192.168.80.0/24
public_network = 192.168.77.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pve54]
host = pve54
mds_standby_for_name = pve


---
CRUSH map unavailable (timeout in the GUI)
----
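
For reference, once ceph -s responds again, the CRUSH map can also be dumped on the CLI; as long as the MONs are unreachable these commands will hang just like the GUI:

ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
cat /tmp/crushmap.txt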

NTP has now been configured; no hints of clock skew.

Thanx
 
Yes, the initial mon address is needed, but there seems to be more that's out of order.
Comparing the old node to the new node, some directories and files were not installed by the Ceph GUI.

I will go through the Ceph documentation for a manual installation.
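
As a rough sketch of what I plan to try with the pveceph tooling (pve-guest is the new node; the destroy step may complain if the MON never made it into the monmap):

systemctl stop ceph-mon@pve-guest.service    # stop the half-created monitor unit, if present
pveceph mon destroy pve-guest                # remove any leftovers of the failed MON
rm -rf /var/lib/ceph/mon/ceph-pve-guest      # clear the (empty or missing) data directory
pveceph mon create                           # recreate the MON on this node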
 
