[SOLVED] Help: CEPH pool non responsive/inactive after moving to a new house/new connection

jaykavathe

At first I thought it was a Proxmox upgrade issue, since I turned the server on after a few months (house move and such). I upgraded Proxmox to the latest version, but Ceph is still not responding.

How do I troubleshoot and fix this?

pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-10-pve)
ceph version 17.2.6 (810db68029296377607028a6c6da1ec06f5a2b27) quincy (stable)

If I run a command-line health check (I assume that's ceph -s), the command just hangs and doesn't do anything.

 
What do your syslog (journalctl) and Ceph logs show?

Normally, all my issues like this were caused by one of two things:

1. Ceph can't find the other nodes or itself on the network (i.e. some sort of comms issue)

2. A problem with the underlying disks being used for the OSDs
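
For the logs, a rough sketch of where to look (assuming the default log locations):

Code:
# Ceph daemon logs via the journal (systemd accepts glob patterns for -u)
journalctl -u 'ceph-mon@*' -u 'ceph-mgr@*' -u 'ceph-osd@*' --since "1 hour ago"

# Ceph also writes plain log files under /var/log/ceph/
ls /var/log/ceph/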
 
Thank you for responding.
A couple of relevant entries from "ceph-mon.localhost.log"

Code:
2023-09-04T22:08:42.301-0400 7f59cd4776c0 -1 mon.localhost@0(probing) e3 get_health_metrics reporting 2 slow ops, oldest is auth(proto 0 30 bytes epoch 0)
2023-09-04T22:08:45.349-0400 7f59cec7a6c0 1 mon.localhost@0(probing) e3 handle_auth_request failed to assign global_id

Though I am not sure which log file will point to the issue.
I don't see any errors with journalctl. Again, what should I be looking for?

The system was running perfectly with the Ceph setup for more than a month before the shutdown and physical move.
 
Well, the first place to start is: what does pveceph status think is going on?

Secondly, the logs would imply the monitors are started on one machine - do you know if you have a full set of monitors running and one manager running?

Are you sure the underlying disks are OK?
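
Even if the ceph commands themselves hang, you can still check locally on each node whether the daemons are running at all - a minimal sketch:

Code:
# list the Ceph daemon units systemd knows about on this node
systemctl list-units 'ceph-mon@*' 'ceph-mgr@*' 'ceph-osd@*'

# or look at the overall Ceph target
systemctl status ceph.target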
 
1) Most commands starting with ceph are not working, including pveceph status (timeout).
2) 3 systems, 6 disks (2 on each). I doubt all 6 failed at once.
3) I don't see any manager active. Should I create one manager? I followed apard's YouTube video to set it all up.
4) None of the monitors are running and no managers exist. Check the attached image.
 
Weird.

When you shell into each node, can you ping all 3 of those IPs?
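Something like this on each node (substituting your monitor IPs):

Code:
for ip in 192.168.1.150 192.168.1.151 192.168.1.152; do
    # one ping per IP, 2-second timeout
    ping -c 1 -W 2 "$ip" >/dev/null && echo "$ip OK" || echo "$ip FAILED"
done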
Yes, try creating a manager and see if it fails telling you that you already have one... I have no idea what apard's YT video is or who they are (and tbh I am an old fart and hate consuming videos... so I won't be watching it).
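
If there really is no manager, creating one on the current node should just be this (though it may also time out if the monitors have no quorum):

Code:
# create a manager daemon on this node
pveceph mgr create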

Here are my notes https://gist.github.com/scyto/8c652f3eab61ed1fa2f980d02a484c35 from when I did mine (I write as much down as I can for my own use).

Are you sure that at a network level the IP addresses and the subnet are right - you haven't implemented VLANs or changed your network equipment in some way?

To me your issues look like when I tried and failed to do an IPv6 install - basic network issues... the timeout is very indicative of that to me... (but I am still a relative noob at Ceph)
 
Hello,

You can check the contents of `/etc/pve/corosync.conf` (symlinked to `/etc/corosync/corosync.conf`) and `/etc/pve/ceph.conf` (symlinked to `/etc/ceph/ceph.conf`). The first thing to check is whether all hosts can ping each other on all the listed interfaces, and whether `public_network` equals `cluster_network` in ceph.conf.
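
A quick sketch of how to compare the relevant settings across the nodes:

Code:
# the networks Ceph is configured to use
grep -E 'public_network|cluster_network|mon_host' /etc/pve/ceph.conf

# the corosync ring address of each node
grep ring0_addr /etc/pve/corosync.conf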
 
When you shell into each node can you ping all 3 of those IPs?
Not even that - does it work if a larger MTU is used? ping -M do -s 8972 {target host}. That size corresponds to an MTU of 9000 minus the IP and ICMP headers. If you use VLANs or a smaller MTU, then set the size smaller.
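
For reference, the payload size has to match whatever MTU you actually run; the -M do flag forbids fragmentation, so an oversized packet fails instead of being silently split:

Code:
# jumbo frames (MTU 9000): 9000 - 28 bytes of IP/ICMP headers
ping -M do -s 8972 {target host}

# standard MTU 1500: 1500 - 28 = 1472
ping -M do -s 1472 {target host}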
 
1) ceph status returns a timeout as well.
2) Checked corosync.conf and ceph.conf and all looks OK.
3) All hosts can ping each other.
My new setup: Modem - Firewall - Managed switch (VLAN config) - Unmanaged switch - Nodes
My old setup: Modem - Firewall - Managed switch (VLAN config) - Managed switch - Nodes

So I did change one switch in between, but now I am wondering: does the VLAN traffic pass through, or is that new unmanaged switch the issue?

For the Proxmox nodes, the gateway is 192.168.1.1.
For VMs/LXCs, the gateway is 192.168.10.1, which makes me assume the VLAN setup is alright despite that unmanaged switch in between, right?

All nodes can ping both gateways, though none of the VMs/LXCs are up because they are on Ceph storage.
 
My new setup : Modem - Firewall - Managed switch (VLAN config) - Unamanaged switch - Nodes
The PVE nodes are in VLANs? I am not sure if an unmanaged switch will be happy with VLAN tags.

Did you test ping on the VLANs as well?
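
One way to test that would be to bring up a temporary tagged interface on a node and ping across it - a sketch, using the NIC name and VLAN from the config posted below and a hypothetical unused test address in the VLAN 10 subnet:

Code:
# create a temporary interface tagged with VLAN 10 on top of the NIC
ip link add link eno3 name eno3.10 type vlan id 10
ip addr add 192.168.10.201/24 dev eno3.10   # pick an unused address
ip link set eno3.10 up

# ping the VLAN 10 gateway through the tagged interface
ping -c 3 -I eno3.10 192.168.10.1

# clean up afterwards
ip link del eno3.10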

Or, to reduce our questions, can you post the /etc/network/interfaces from one node? Assuming that they are all configured very similarly. Please put it within [CODE][/CODE] tags for better readability :)
 
am not sure if an unmanaged switch will be happy with VLAN tags
In my experience all my unmanaged switches transparently passed VLAN traffic, but apparently some strip the VLAN tag!

That said, they can ping, so there is basic ICMP connectivity - which suggests theirs doesn't strip tags. I agree with you that there is some piece of info we are missing here. In addition to a cat of /etc/network/interfaces, it might be useful to see the output of ip a.
 
The PVE nodes are in VLANs? I am not sure if an unmanaged switch will be happy with VLAN tags.

Did you test ping on the VLANs as well?

Or, to reduce our questions, can you post the /etc/network/interfaces from one node? Assuming that they are all configured very similarly. Please put it within [CODE][/CODE] tags for better readability :)

1) The PVE nodes are NOT on a VLAN (192.168.1.1), though they can ping 192.168.10.1.

Code:
auto lo
iface lo inet loopback
iface eno3 inet manual
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.150/24
        gateway 192.168.1.1
        bridge-ports eno3
        bridge-stp off
        bridge-fd 0
iface eno4 inet manual
iface eno1 inet manual
iface eno2 inet manual
Code:
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
    link/ether 24:6e:96:ac:d5:1c brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0
3: eno4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 24:6e:96:ac:d5:1d brd ff:ff:ff:ff:ff:ff
    altname enp6s0f1
4: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 24:6e:96:ac:d5:18 brd ff:ff:ff:ff:ff:ff
    altname enp1s0f0
5: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 24:6e:96:ac:d5:1a brd ff:ff:ff:ff:ff:ff
    altname enp1s0f1
6: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 24:6e:96:ac:d5:1c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.150/24 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::266e:96ff:feac:d51c/64 scope link
       valid_lft forever preferred_lft forever

2) The LXCs and VMs are all on VLAN 10, but I can't access them since they are not starting (because the Ceph pool is down).
 
Another question would be: can I just set up Ceph again and import these disks somehow without losing data?
It's not like the data is super important, but now that things have failed, I should learn recovery rather than do a fresh install. I still have about 20 LXCs on there.
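
For what it's worth, a monitor problem doesn't normally touch the data on the OSDs - the usual path is to get the monitors back into quorum first, after which the existing OSDs rejoin by themselves. If the OSD services got lost but the disks are intact, something like this re-detects and starts them (a sketch, not a guaranteed recovery procedure):

Code:
# scan the LVM-based OSDs on this node and start them
ceph-volume lvm activate --all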
 
I don't know about the import.

Out of interest, what does your iptables look like? It might be worth turning the firewall off and changing the IN and OUT default policies to ACCEPT.

Also, this is so weird (given you say nothing else changed) that I have to ask: have you tried a different switch...
 
It's your firewall on the Proxmox host - issue the command iptables -L to list the rules.

Set Firewall to off and the input policy to ACCEPT here. Note this could be set at the datacenter level OR the node level.
 

Attachment: IMG_0658.jpeg
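
Spelled out, those checks might look like this:

Code:
# list all rules with counters, without name resolution
iptables -L -n -v

# temporarily stop the Proxmox firewall service
pve-firewall stop

# relax the default policies while testing
iptables -P INPUT ACCEPT
iptables -P OUTPUT ACCEPT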
Hmm, have you tried restarting the monitors yet? If they cannot find each other and form a quorum, then nothing will work.
https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-mon/
The second line should work on each node.
Code:
ceph --admin-daemon <full_path_to_asok_file> <command>
ceph --admin-daemon /run/ceph/ceph-mon.$(hostname).asok mon_status
It will print quite a bit of information. The first part is about the state, whether there is quorum, and with which MONs. A bit further down you'll find the monmap. You should see the info for all the MONs there.

Verify the state and monmap of all MONs. Maybe something is wrong there?
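
Restarting and watching a monitor from the shell would look roughly like this - note the systemd unit name uses the monitor ID, which on the first node here appears to be "localhost" rather than the hostname:

Code:
# restart the local monitor (substitute the actual mon ID)
systemctl restart ceph-mon@localhost.service
systemctl status ceph-mon@localhost.service

# follow its log while it tries to probe the other monitors
journalctl -u ceph-mon@localhost -f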
 
Below is what I see with "ceph --admin-daemon /run/ceph/ceph-mon.localhost.asok mon_status".
No quorum... I suspect the monitor is not running on node 2/3, maybe?

The command runs on node 1 (mystic1, even though the monitor name is localhost). The command doesn't run on nodes 2 and 3; the error says "admin_socket: exception getting command descriptions: [Errno 111] Connection refused".

I have tried to restart the monitors from the node 2/3 web page. Doesn't help.
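
(Side note: "Connection refused" from the admin socket usually just means the mon process on that node isn't running at all, so the interesting question is why it dies. The journal on nodes 2/3 should show that - roughly, assuming the mon IDs match the node names from the monmap below:)

Code:
# on node 2/3: is the monitor running, and why did it stop?
systemctl status ceph-mon@mystic2.service
journalctl -u ceph-mon@mystic2 --since "1 hour ago"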

Code:
{
    "name": "localhost",
    "rank": 0,
    "state": "probing",
    "election_epoch": 0,
    "quorum": [],
    "features": {
        "required_con": "2449958755906961412",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus",
            "pacific",
            "elector-pinging",
            "quincy"
        ],
        "quorum_con": "0",
        "quorum_mon": []
    },
    "outside_quorum": [
        "localhost"
    ],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 3,
        "fsid": "84a7eb78-7460-4ab6-94f0-efb4fe9dc5f0",
        "modified": "2023-05-09T21:35:33.357841Z",
        "created": "2023-05-09T21:12:09.807347Z",
        "min_mon_release": 17,
        "min_mon_release_name": "quincy",
        "election_strategy": 1,
        "disallowed_leaders: ": "",
        "stretch_mode": false,
        "tiebreaker_mon": "",
        "removed_ranks: ": "",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus",
                "octopus",
                "pacific",
                "elector-pinging",
                "quincy"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "localhost",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "192.168.1.150:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "192.168.1.150:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "192.168.1.150:6789/0",
                "public_addr": "192.168.1.150:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 1,
                "name": "mystic2",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "192.168.1.151:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "192.168.1.151:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "192.168.1.151:6789/0",
                "public_addr": "192.168.1.151:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 2,
                "name": "mystic3",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "192.168.1.152:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "192.168.1.152:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "192.168.1.152:6789/0",
                "public_addr": "192.168.1.152:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 1
            }
        ],
        "client": [
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 2
            }
        ]
    },
    "stretch_mode": false
}
 
