Ceph mgr becomes unresponsive and switches to standby very frequently

Sarandyna

Hello All,

We have a Proxmox cluster of 11 nodes, 9 of which run Ceph (the remaining 2 are compute-only). We have 4 Ceph MONs and 4 MGRs, and the active Ceph mgr keeps failing over to a standby, mostly flipping between the same two mgr nodes.

PVE version: pve-manager/8.3.3/f157a38b211595d6 (running kernel: 6.8.12-6-pve)
Ceph version: 19.2.0
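
For reference, this is how we check which mgr is currently active (standard Ceph CLI, run from any node with the admin keyring); the active name keeps flipping between the same two nodes:
Code:
# which mgr is currently active and how many standbys exist
ceph mgr stat
# full mgr map, including active and standby names
ceph mgr dump | grep -E 'active_name|"name"'
# overall cluster health
ceph -s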

Ceph conf:
Code:
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 172.17.2.96/27
        fsid = abc3a412-6180-4u06-8sa6-71a06366f927
        mon_allow_pool_delete = true
        mon_host = 172.17.2.110 172.17.2.111 172.17.2.103 172.17.2.104
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 2
        public_network = 172.17.2.96/27

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
        keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.testfedprx03]
        host = testfedprx03
        mds_standby_for_name = pve

[mds.testfedprx04]
        host = testfedprx04
        mds_standby_for_name = pve

[mds.test1fedprx04]
        host = test1fedprx04
        mds_standby_for_name = pve

[mon.testfedprx03]
        public_addr = 172.17.2.110

[mon.testfedprx04]
        public_addr = 172.17.2.111

[mon.test1fedprx03]
        public_addr = 172.17.2.103

[mon.test1fedprx04]
        public_addr = 172.17.2.104

Ceph log:
Code:
Feb 18 14:21:36 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:21:36.046+0100 7d512be006c0  0 [balancer INFO root] prepared 0/10 upmap changes
Feb 18 14:22:50 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:50.680+0100 7d51070006c0  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
Feb 18 14:22:50 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:50.687+0100 7d51264006c0  0 log_channel(cluster) log [DBG] : pgmap v407: 6177 pgs: 1 active+clean+scrubbing+deep, 6176 active+clean; 73 TiB data, 217 TiB used, 272 >
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0 -1 mgr handle_mgr_map I was active but no longer am
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn  e: '/usr/bin/ceph-mgr'
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn  0: '/usr/bin/ceph-mgr'
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn  1: '-f'
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn  2: '--cluster'
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn  3: 'ceph'
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn  4: '--id'
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn  5: 'testfedprx04'
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn  6: '--setuser'
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn  7: 'ceph'
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn  8: '--setgroup'
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn  9: 'ceph'
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn respawning with exe /usr/bin/ceph-mgr
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.341+0100 7d5141e006c0  1 mgr respawn  exe_path /proc/self/exe
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: did not load config file, using default settings.
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: ignoring --setuser ceph since I am not root
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: ignoring --setgroup ceph since I am not root
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.450+0100 72fc8dc7e280 -1 Errors while parsing config file!
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.450+0100 72fc8dc7e280 -1 can't open ceph.conf: (2) No such file or directory
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: unable to get monitor info from DNS SRV with service name: ceph-mon
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.455+0100 72fc8dc7e280 -1 failed for service _ceph-mon._tcp
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: 2025-02-18T14:22:51.455+0100 72fc8dc7e280 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Feb 18 14:22:51 testfedprx04 ceph-mgr[345148]: failed to fetch mon config (--no-mon-config to skip)
Feb 18 14:22:51 testfedprx04 systemd[1]: ceph-mgr@testfedprx04.service: Main process exited, code=exited, status=1/FAILURE
Feb 18 14:22:51 testfedprx04 systemd[1]: ceph-mgr@testfedprx04.service: Failed with result 'exit-code'.

Even the journalctl logs don't have much information.
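
If it would help, we can raise the mgr debug level before the next failover and capture more detail (standard Ceph debug options, quite verbose at these levels):
Code:
# increase mgr verbosity at runtime; revert afterwards, the output is large
ceph config set mgr debug_mgr 10
ceph config set mgr debug_ms 1
# then follow the active mgr's journal on its node
journalctl -fu ceph-mgr@testfedprx04.service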


Thanks
Saran
 
/etc/ceph/ceph.conf is a symlink to /etc/pve/ceph.conf on Proxmox nodes.
/etc/pve is the FUSE-mounted clustered Proxmox config database.

You should check why ceph.conf is not available.
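
For example, on the affected node, right after a failover, check the symlink and whether the clustered config filesystem is still mounted:
Code:
ls -l /etc/ceph/ceph.conf        # should be a symlink to /etc/pve/ceph.conf
stat /etc/pve/ceph.conf          # fails if pmxcfs is not mounted
systemctl status pve-cluster     # pve-cluster provides the /etc/pve mount
journalctl -u pve-cluster --since "1 hour ago"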

BTW: do not run an even number of MONs. Either add a fifth (tolerates more failures) or remove one (three MONs tolerate one failure, the same as four).
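
For example, on Proxmox the monitor count can be changed with the pveceph tooling (run on the node that should gain or lose the MON):
Code:
# add a fifth monitor on an additional node
pveceph mon create
# or remove one of the existing four (monid is the mon's name, e.g. test1fedprx04)
pveceph mon destroy <monid>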
 
We also thought the missing ceph.conf was the problem at first, but it looks like a red herring. ceph.conf is accessible the whole time, even during the issue, and each time the Ceph monitors initiate the switchover to the standby mgr about a minute before this ceph.conf event appears.
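
What we plan to check next is what the MONs see just before the switchover, i.e. whether the active mgr's beacon times out (mon_mgr_beacon_grace defaults to 30 s). Something along these lines, where the mon id is just one of ours as an example:
Code:
# find the current mon leader
ceph quorum_status | grep leader
# on the leader node, mgr-related mon log entries around the failover time
journalctl -u ceph-mon@testfedprx03.service --since "14:20" --until "14:25" | grep -i mgr
# effective beacon grace on the mons
ceph config get mon mon_mgr_beacon_grace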