The Ceph monitor and manager on my PVE node cannot start. Can anyone tell me what the problem might be? Everything had been working normally and then this problem suddenly appeared.

This is my error output:
Ceph version: 17.2.6
Proxmox VE version: 8.0.3

Code:
root@zmc-pve10:~# systemctl status ceph-mgr@zmc-pve10.service
× ceph-mgr@zmc-pve10.service - Ceph cluster manager daemon
     Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mgr@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: exit-code) since Sun 2023-10-22 17:26:24 CST; 21s ago
   Duration: 59ms
    Process: 548320 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id zmc-pve10 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
   Main PID: 548320 (code=exited, status=1/FAILURE)
        CPU: 60ms

Oct 22 17:26:24 zmc-pve10 systemd[1]: ceph-mgr@zmc-pve10.service: Scheduled restart job, restart counter is at 3.
Oct 22 17:26:24 zmc-pve10 systemd[1]: Stopped ceph-mgr@zmc-pve10.service - Ceph cluster manager daemon.
Oct 22 17:26:24 zmc-pve10 systemd[1]: ceph-mgr@zmc-pve10.service: Start request repeated too quickly.
Oct 22 17:26:24 zmc-pve10 systemd[1]: ceph-mgr@zmc-pve10.service: Failed with result 'exit-code'.
Oct 22 17:26:24 zmc-pve10 systemd[1]: Failed to start ceph-mgr@zmc-pve10.service - Ceph cluster manager daemon.
root@zmc-pve10:~# /usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id zmc-pve10 --setuser ceph --setgroup ceph
did not load config file, using default settings.
2023-10-22T17:34:29.445+0800 7fc4c100c000 -1 Errors while parsing config file!
2023-10-22T17:34:29.445+0800 7fc4c100c000 -1 can't open --id.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2023-10-22T17:34:29.777+0800 7fc4c100c000 -1 failed for service _ceph-mon._tcp
2023-10-22T17:34:29.777+0800 7fc4c100c000 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
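A side note on the manual run: the "can't open --id.conf" line and the follow-up "cannot identify monitors to contact" messages are most likely just an artifact of ${CLUSTER} being empty in an interactive shell (the systemd unit normally sets CLUSTER=ceph), so --cluster swallows the next argument and no config file gets loaded. To reproduce what systemd actually runs, the cluster name can be given explicitly, for example:
Code:
# assuming the default cluster name "ceph"; this mirrors what the unit executes
/usr/bin/ceph-mgr -f --cluster ceph --id zmc-pve10 --setuser ceph --setgroup ceph
The journal of the failing service (journalctl -u ceph-mgr@zmc-pve10) is usually more telling than a manual run.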
 
Can you please post the /etc/pve/ceph.conf file? Also verify that /etc/ceph/ceph.conf is a symlink to the one in /etc/pve:
Code:
root@cephtest1:~# ls -la /etc/ceph/ceph.conf
lrwxrwxrwx 1 root root 18 Apr 19  2023 /etc/ceph/ceph.conf -> /etc/pve/ceph.conf
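If /etc/ceph/ceph.conf is missing or is a plain file instead of a symlink, it can usually be recreated like this (a sketch assuming the standard Proxmox VE layout):
Code:
# point /etc/ceph/ceph.conf at the cluster-wide config in /etc/pve
ln -sf /etc/pve/ceph.conf /etc/ceph/ceph.conf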

How is the Proxmox VE cluster doing? Is it healthy? (pvecm status)
 
Hello, here are my cluster status and my ceph.conf configuration:

Code:
root@zmc-pve8-master:~# ls -la /etc/ceph/ceph.conf
lrwxrwxrwx 1 root root 18 Aug 22 15:30 /etc/ceph/ceph.conf -> /etc/pve/ceph.conf

root@zmc-pve8-master:~# cat /etc/ceph/ceph.conf
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 172.16.2.8/24
     fsid = e3c95601-767e-4b0e-ad9e-6cb44ad69db1
     mon_allow_pool_delete = true
     mon_host = 172.16.2.8
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 172.16.2.8/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.zmc-pve10]
     host = zmc-pve10
     mds standby for name = pve

[mds.zmc-pve7]
     host = zmc-pve7
     mds_standby_for_name = pve

[mds.zmc-pve8-master]
     host = zmc-pve8-master
     mds_standby_for_name = pve

[mon.zmc-pve8-master]
     public_addr = 172.16.2.8


root@zmc-pve8-master:~# pvecm status
Cluster information
-------------------
Name:             zmc
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Oct 24 11:35:26 2023
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.40
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.2.8 (local)
0x00000002          1 192.168.2.7
0x00000003          1 192.168.2.10


root@zmc-pve8-master:~# ceph -s
  cluster:
    id:     e3c95601-767e-4b0e-ad9e-6cb44ad69db1
    health: HEALTH_OK
 
  services:
    mon: 1 daemons, quorum zmc-pve8-master (age 8w)
    mgr: zmc-pve8-master(active, since 8w)
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 3w), 3 in (since 8w)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 117.52k objects, 455 GiB
    usage:   1.6 TiB used, 1.3 TiB / 2.9 TiB avail
    pgs:     97 active+clean
 
  io:
    client:   0 B/s rd, 74 KiB/s wr, 0 op/s rd, 15 op/s wr
 
hmm, looks like the cluster is only aware of the MON & MGR on node 8.

Did you remove them on the other two nodes? Why they are still shown in the GUI is another question; maybe there are still some remnants. You could check whether you have subdirectories in the following places:
Code:
/var/lib/ceph/mon
/var/lib/ceph/mgr
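A quick way to check both on each node, for example:
Code:
# lists any leftover MON/MGR data directories on this node
ls -la /var/lib/ceph/mon /var/lib/ceph/mgr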

And if the systemd units still exist:
Code:
systemctl status ceph-mon@$(hostname)
systemctl status ceph-mgr@$(hostname)
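If you want to see every MON/MGR unit instance systemd still knows about on a node, something like this should work as well:
Code:
systemctl list-units --all 'ceph-mon@*' 'ceph-mgr@*'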

Other than that, you can also just try to create new MONs and MGRs on nodes 7 and 10.
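On Proxmox VE that is done either via the GUI or on the CLI of the node in question, for example:
Code:
# run on the node that should get the new daemons (e.g. zmc-pve7 / zmc-pve10)
pveceph mon create
pveceph mgr create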
 
I created the MON and the creation itself did not report an error, but it still would not start. If I try to create it again, it reports an error.

 
Then there might be something that wasn't cleaned up properly.
Please verify that in the ceph.conf file and in the Ceph status you only see mentions of the MON on node 8 and nothing from the other nodes.

If that is the case, run the following on the two nodes without the MON to manually clean up anything that might have been left behind:
Code:
systemctl disable ceph-mon@$(hostname).service
rm -rf /var/lib/ceph/mon/ceph-$(hostname)
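If the unit already hit the restart limit, it can also help to clear its failed state before trying again, for example:
Code:
systemctl reset-failed ceph-mon@$(hostname).service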
 
Hello, I have already resolved this issue with the help of another post. Thank you!
Link: https://forum.proxmox.com/threads/i-managed-to-create-a-ghost-ceph-monitor.58435/
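For anyone who finds this thread later: the usual cleanup for such a "ghost" MON entry looks roughly like the following (a sketch, not necessarily the exact steps from the linked post; adjust the MON id to your node and only remove a MON that is not part of the quorum):
Code:
# remove the stale MON from the monmap and clean up its leftovers on the affected node
ceph mon remove zmc-pve10
systemctl disable --now ceph-mon@zmc-pve10.service
rm -rf /var/lib/ceph/mon/ceph-zmc-pve10
Afterwards make sure /etc/pve/ceph.conf no longer lists the old MON in mon_host or in a [mon.<name>] section.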
 