Help: No Working Ceph Monitors...

Jospeh Huber

Hi,

Today one of my monitors got into a bad state and did not come up again after a node reboot.
So I tried to recover it with the solution from this thread, which has worked in the past:
https://forum.proxmox.com/threads/i-managed-to-create-a-ghost-ceph-monitor.58435/#post-389798
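
(For reference, the usual Proxmox commands for dropping and re-creating a monitor look roughly like this; the monitor name is just an example, and the thread above covers the extra manual cleanup needed for a ghost monitor.)

Code:
# remove the broken monitor (name is an example)
pveceph mon destroy vmhost3
# re-create it on the same node once the leftovers are cleaned up
pveceph mon create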

I have three nodes; two of them have a monitor running, and on the third node the monitor was deleted.
But both remaining monitors are stuck in the "probing" state.
So at the moment I have no working monitor at all.
When I select Ceph in the UI, it does not respond.
pveceph status and ceph -s are not responding either...

Code:
pveceph status
command 'ceph -s' failed: got timeout

The VMs on the RBD storage are all still up and running.
What can I do?

Code:
ceph --admin-daemon /var/run/ceph/ceph-mon.vmhost3.asok mon_status
{
    "name": "vmhost3",
    "rank": -1,
    "state": "probing",
    "election_epoch": 0,
    "quorum": [],
    "features": {
        "required_con": "2449958197560098820",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus"
        ],
        "quorum_con": "0",
        "quorum_mon": []
    },
    "outside_quorum": [
        "vmhost2"
    ],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 20,
        "fsid": "42ca65c7-716e-4357-802e-44178a1a0c03",
        "modified": "2021-06-22 14:09:26.898732",
        "created": "2017-01-30 17:14:40.940356",
        "min_mon_release": 14,
        "min_mon_release_name": "nautilus",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "vmhost2",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.0.99.82:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.0.99.82:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.0.99.82:6789/0",
                "public_addr": "10.0.99.82:6789/0"
            },
            {
                "rank": 1,
                "name": "vmhost5",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.0.99.83:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.0.99.83:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.0.99.83:6789/0",
                "public_addr": "10.0.99.83:6789/0"
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3ffddff8ffecffff",
                "release": "luminous",
                "num": 1
            }
        ],
        "client": [
            {
                "features": "0x3ffddff8ffecffff",
                "release": "luminous",
                "num": 8
            }
        ]
    }
}
Log file from vmhost3:
Code:
2021-06-22 16:30:20.480 7f0c8cf1e700 -1 mon.vmhost3@-1(probing) e20 get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 at 2021-06-22 16:10:39.766768)
2021-06-22 16:30:20.744 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:20.924 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:20.928 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:20.948 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:21.044 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:21.108 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:21.188 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:21.188 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:21.280 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:21.280 7f0c8e721700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:21.348 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:21.852 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:22.056 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:30:22.148 7f0c8ef22700  1 mon.vmhost3@-1(probing) e20 handle_auth_request failed to assign global_id

Code:
ceph --admin-daemon /var/run/ceph/ceph-mon.vmhost2.asok mon_status
{
    "name": "vmhost2",
    "rank": 0,
    "state": "probing",
    "election_epoch": 0,
    "quorum": [],
    "features": {
        "required_con": "2449958747315912708",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus"
        ],
        "quorum_con": "0",
        "quorum_mon": []
    },
    "outside_quorum": [
        "vmhost2"
    ],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 20,
        "fsid": "42ca65c7-716e-4357-802e-44178a1a0c03",
        "modified": "2021-06-22 14:09:26.898732",
        "created": "2017-01-30 17:14:40.940356",
        "min_mon_release": 14,
        "min_mon_release_name": "nautilus",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "vmhost2",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.0.99.82:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.0.99.82:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.0.99.82:6789/0",
                "public_addr": "10.0.99.82:6789/0"
            },
            {
                "rank": 1,
                "name": "vmhost5",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.0.99.83:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.0.99.83:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.0.99.83:6789/0",
                "public_addr": "10.0.99.83:6789/0"
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3ffddff8ffecffff",
                "release": "luminous",
                "num": 1
            }
        ],
        "osd": [
            {
                "features": "0x3ffddff8ffecffff",
                "release": "luminous",
                "num": 3
            }
        ],
        "client": [
            {
                "features": "0x3ffddff8ffecffff",
                "release": "luminous",
                "num": 12
            }
        ],
        "mgr": [
            {
                "features": "0x3ffddff8ffecffff",
                "release": "luminous",
                "num": 3
            }
        ]
    }
}
Log file from vmhost2:
Code:
2021-06-22 16:26:57.463 7f709f652700 -1 mon.vmhost2@0(probing) e20 get_health_metrics reporting 6149 slow ops, oldest is auth(proto 2 2 bytes epoch 0)
2021-06-22 16:26:57.471 7f709ce4d700  0 mon.vmhost2@0(probing) e20 handle_command mon_command({"prefix": "mon metadata", "id": "vmhost2"} v 0) v1
2021-06-22 16:26:57.471 7f709ce4d700  0 log_channel(audit) log [DBG] : from='mgr.122329882 10.0.99.82:0/1290' entity='mgr.vmhost2' cmd=[{"prefix": "mon metadata", "id": "vmhost2"}]: dispatch
2021-06-22 16:26:57.471 7f709ce4d700  0 mon.vmhost2@0(probing) e20 handle_command mon_command({"prefix": "mon metadata", "id": "vmhost5"} v 0) v1
2021-06-22 16:26:57.471 7f709ce4d700  0 log_channel(audit) log [DBG] : from='mgr.122329882 10.0.99.82:0/1290' entity='mgr.vmhost2' cmd=[{"prefix": "mon metadata", "id": "vmhost5"}]: dispatch
2021-06-22 16:26:57.483 7f70a1656700  1 mon.vmhost2@0(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:26:57.483 7f70a1656700  1 mon.vmhost2@0(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:26:57.515 7f70a1656700  1 mon.vmhost2@0(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:26:57.531 7f70a1656700  1 mon.vmhost2@0(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:26:57.531 7f70a0e55700  1 mon.vmhost2@0(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:26:57.539 7f70a1656700  1 mon.vmhost2@0(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:26:57.571 7f709ce4d700  1 mon.vmhost2@0(probing) e20  adding peer [v2:10.0.99.84:3300/0,v1:10.0.99.84:6789/0] to list of hints
2021-06-22 16:26:57.575 7f70a1656700  1 mon.vmhost2@0(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:26:57.591 7f70a1656700  1 mon.vmhost2@0(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:26:57.607 7f70a1656700  1 mon.vmhost2@0(probing) e20 handle_auth_request failed to assign global_id
2021-06-22 16:26:57.623 7f70a1656700  1 mon.vmhost2@0(probing) e20 handle_auth_request failed to assign global_id
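
What I also notice in the vmhost2 log is the "adding peer [v2:10.0.99.84:3300/0,...] to list of hints" line, which points at the deleted third monitor. If it helps for diagnosis, the on-disk monmap of a stopped monitor can be dumped and edited with monmaptool (a sketch with example names and paths, not something I have run here yet):

Code:
# stop the monitor before touching its store (name is an example)
systemctl stop ceph-mon@vmhost2
# dump the on-disk monmap to a temporary file and print it
ceph-mon -i vmhost2 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
# if a stale/deleted monitor is still listed, remove it and inject the map back
monmaptool --rm vmhost3 /tmp/monmap
ceph-mon -i vmhost2 --inject-monmap /tmp/monmap
systemctl start ceph-mon@vmhost2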
 
Hi,

I found this: it is possible to recover the monitors from the OSDs, but for that the OSDs must be shut down (see the sketch below).
That seems like too much risk to me; if a system is shut down in this state, it never comes up again.
Backup and restore with Proxmox vzdump is not possible anymore, either.
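
For reference, what I found is the "recovery using OSDs" procedure from the Ceph docs. Roughly sketched (paths and keyring location are examples), it rebuilds the monitor store from the maps held by the stopped OSDs:

Code:
# on each OSD node, with all local OSDs stopped:
ms=/root/mon-store
mkdir -p $ms
# pull the cluster maps out of every local OSD into a temporary mon store
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path "$osd" --op update-mon-db --mon-store-path "$ms"
done
# (with OSDs on several nodes, the mon-store directory has to be copied from
#  node to node so it accumulates the maps of all OSDs)
# finally, on one node, rebuild the monitor store; the keyring must hold the admin/mon keys
ceph-monstore-tool /root/mon-store rebuild -- --keyring /etc/pve/priv/ceph.client.admin.keyring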

We did a complete backup and restore, one by one. The whole Ceph storage is lost! ... after 4 years of operation :-((
Since the systems were still running, we did a live backup of the nodes that generate data (like database nodes), and for the others we used the existing backups to restore them to another storage.

The corrupt VMs cannot be deleted via the GUI; they must be deleted on the CLI by removing their config files in /etc/pve/lxc or /etc/pve/qemu-server.
Several operations run into timeouts when communicating with the underlying RBD storage.
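
(For anyone finding this later: removing such a leftover guest on the CLI just means deleting its config file; the VMIDs below are placeholders.)

Code:
# VMIDs are placeholders
rm /etc/pve/qemu-server/101.conf   # orphaned QEMU VM
rm /etc/pve/lxc/202.conf           # orphaned LXC container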

My question now is: how do I remove the corrupt Ceph installation and reinstall it again?
pveceph purge --force
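
What I have in mind is roughly the following (a sketch, not yet verified on this cluster; the network and disk values are examples):

Code:
# on every node: stop the Ceph daemons and purge the old installation
systemctl stop ceph.target
pveceph purge
# then set Ceph up again from scratch
pveceph install
pveceph init --network 10.0.99.0/24   # example cluster/public network
pveceph mon create
pveceph osd create /dev/sdb           # example disk, repeat per OSD disk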

Thanks.
 
Did you ever fix this? I'm having a similar issue and I'm looking for instructions on how to stop everything (none of the monitors are in quorum), stop Ceph and the OSDs, and then do the recovery from the OSDs.
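
From what I have found so far, stopping all Ceph daemons on a node goes through the systemd targets (a sketch, to be run on every node before attempting the recovery-from-OSDs procedure sketched above):

Code:
# stop monitors, managers and OSDs on this node
systemctl stop ceph-mon.target ceph-mgr.target ceph-osd.target
# or simply stop everything Ceph-related at once
systemctl stop ceph.target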
 
