[SOLVED] Help: CEPH pool non responsive/inactive after moving to a new house/new connection

jaykavathe

Member
Feb 25, 2021
At first I thought it was a Proxmox upgrade issue, since I turned the server on after a few months off (house move and such). I upgraded Proxmox to the latest version, but Ceph is still not responding.

How can I troubleshoot and fix this?

pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-10-pve)
ceph version 17.2.6 (810db68029296377607028a6c6da1ec06f5a2b27) quincy (stable)

If I run a command-line health check (assuming that's ceph -s), the command just hangs and doesn't do anything.

 
What do your syslog (journalctl) and Ceph logs show?

Normally, all my issues like this were caused by one of two things (a quick way to pull the relevant logs is sketched below).

1. Ceph can't find the other nodes or itself on the network (i.e. some sort of comms issue)

2. A problem with the underlying disks being used for the OSDs
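
For example, as a starting point (this assumes the default log location and that the mon ID matches the hostname - both worth double-checking):

Code:
# recent Ceph-related journal entries from this boot
journalctl -b -u 'ceph*' --no-pager | tail -n 50
# the monitor's own log (on PVE the mon ID is usually the hostname)
tail -n 100 /var/log/ceph/ceph-mon.$(hostname).log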
 
Thank you for responding.
A couple of relevant entries from "ceph-mon.localhost.log":

2023-09-04T22:08:42.301-0400 7f59cd4776c0 -1 mon.localhost@0(probing) e3 get_health_metrics reporting 2 slow ops, oldest is auth(proto 0 30 bytes epoch 0)
2023-09-04T22:08:45.349-0400 7f59cec7a6c0 1 mon.localhost@0(probing) e3 handle_auth_request failed to assign global_id

Though I am not sure which log file will point to the issue.
I don't see any errors with the journalctl command. Again, what should I be looking for?

The system was running perfectly with the Ceph setup for more than a month before the shutdown and physical move.
 
Well, the first place to start is: what does pveceph status think is going on?

Secondly, the logs would imply the monitor is only started on one machine - do you know if you have a full set of monitors running, and one manager running?

Are you sure the underlying disks are OK?
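
A quick spot check would be something like this (the device names are placeholders - point it at your actual OSD disks):

Code:
# SMART overall health verdict per physical disk
smartctl -H /dev/sda
# list block devices, sizes and mountpoints
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT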
 
1) Most commands starting with ceph are not working, including pveceph status (timeout).
2) 3 systems, 6 disks (2 in each). I doubt all 6 failed at once.
3) I don't see any manager active. Should I create one? I followed apard's YouTube video to set it all up.
4) None of the monitors are running and no managers exist. Check the image.
 
Weird.

When you shell into each node can you ping all 3 of those IPs?
Yes, try creating a manager and see if it fails telling you that you already have one... I have no idea what apard's YT video is or who they are (and tbh I am an old fart and hate consuming videos... so won't be watching it).
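
Something like this, off the top of my head (the IPs are placeholders for your three nodes; pveceph mgr create is the stock PVE command):

Code:
# can this node reach the other two? (substitute your node IPs)
for ip in 192.168.1.151 192.168.1.152; do ping -c 2 $ip; done
# create a manager on this node
pveceph mgr create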

Here are my notes https://gist.github.com/scyto/8c652f3eab61ed1fa2f980d02a484c35 from when I did mine (I write as much down as I can for my own use).

Are you sure that at a network level the IP addresses and the subnet are right - you haven't implemented VLANs or changed your network equipment in some way?

To me, your issues look like when I tried and failed to do an IPv6 install - basic network issues... the timeout is very indicative of that to me... (but I am still a relative noob at Ceph)
 
Hello,

Check the contents of `/etc/pve/corosync.conf` (mirrored at `/etc/corosync/corosync.conf`) and `/etc/pve/ceph.conf` (symlinked as `/etc/ceph/ceph.conf`). The first thing to check is whether all hosts can ping each other on all the listed interfaces. Is `public_network` equal to `cluster_network` in ceph.conf?
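
A quick way to check:

Code:
# both networks should be listed, along with mon_host
grep -E 'mon_host|public_network|cluster_network' /etc/ceph/ceph.conf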
 
When you shell into each node can you ping all 3 of those IPs?
Beyond that: if a larger MTU is in use, does it still work? ping -M do -s 8972 {target host}. The size 8972 is an MTU of 9000 minus the IP and ICMP headers (28 bytes). If you use VLANs or a smaller MTU, set the size smaller accordingly.
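
For a standard 1500 MTU, the equivalent test is 1472 bytes (the target IP here is just an example):

Code:
# jumbo frames: 9000 - 28 bytes of IP+ICMP headers, DF bit set
ping -M do -s 8972 192.168.1.151
# standard MTU: 1500 - 28 bytes
ping -M do -s 1472 192.168.1.151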
 
1) ceph status returns a timeout as well.
2) Checked corosync.conf and ceph.conf and all looks OK.
3) All hosts can ping each other.
My new setup: Modem - Firewall - Managed switch (VLAN config) - Unmanaged switch - Nodes
My old setup: Modem - Firewall - Managed switch (VLAN config) - Managed switch - Nodes

So I did change one switch in between, but now I am wondering: does VLAN traffic pass through, or is that new unmanaged switch the issue?

For the Proxmox nodes, the gateway is 192.168.1.1.
For VMs/LXCs, the gateway is 192.168.10.1, which makes me assume that the VLAN setup is alright despite that unmanaged switch in between, right?

All nodes can ping both gateways, though none of the VMs/LXCs are up because they are on Ceph storage.
 
My new setup : Modem - Firewall - Managed switch (VLAN config) - Unamanaged switch - Nodes
The PVE nodes are in VLANs? I am not sure if an unmanaged switch will be happy with VLAN tags.

Did you test ping on the VLANs as well?

Or, to reduce our questions, can you post the /etc/network/interfaces from one node? Assuming that they are all configured very similarly. Please put it within [CODE][/CODE] tags for better readability :)
 
I am not sure if an unmanaged switch will be happy with VLAN tags
In my experience all my unmanaged switches transparently passed VLAN traffic, but apparently some strip the VLAN tag!

That said, they can ping, so they have basic ICMP connectivity - so it seems like theirs doesn't strip tags. I agree with you that there is some piece of info we are missing here. In addition to a cat of /etc/network/interfaces, it might be useful to see the output of ip a.
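
It might also be worth watching whether tagged frames actually arrive on the wire, something like this (the interface name is a guess - adjust to yours):

Code:
# print link-level headers; show only 802.1Q-tagged ICMP frames
tcpdump -eni eno3 vlan and icmp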
 
The PVE nodes are in VLANs? I am not sure if an unmanaged switch will be happy with VLAN tags.

1) The PVE nodes are NOT on a VLAN (gateway 192.168.1.1), though they can ping 192.168.10.1.

Code:
auto lo
iface lo inet loopback
iface eno3 inet manual
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.150/24
        gateway 192.168.1.1
        bridge-ports eno3
        bridge-stp off
        bridge-fd 0
iface eno4 inet manual
iface eno1 inet manual
iface eno2 inet manual
Code:
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
    link/ether 24:6e:96:ac:d5:1c brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0
3: eno4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 24:6e:96:ac:d5:1d brd ff:ff:ff:ff:ff:ff
    altname enp6s0f1
4: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 24:6e:96:ac:d5:18 brd ff:ff:ff:ff:ff:ff
    altname enp1s0f0
5: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 24:6e:96:ac:d5:1a brd ff:ff:ff:ff:ff:ff
    altname enp1s0f1
6: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 24:6e:96:ac:d5:1c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.150/24 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::266e:96ff:feac:d51c/64 scope link
       valid_lft forever preferred_lft forever

2) The LXCs and VMs are all on VLAN 10, but I can't access them since they are not starting (because the Ceph pool is down).
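
For comparison, a VLAN-aware bridge stanza in /etc/network/interfaces would look roughly like this (a sketch, not my running config):

Code:
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.150/24
        gateway 192.168.1.1
        bridge-ports eno3
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094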
 
Another question would be: can I just set up Ceph again and import these disks somehow without losing data?
It's not like the data is super important, but now that things have failed I should learn recovery rather than doing a fresh install. I still have about 20 LXCs on there.
 
I don't know about the import.

Out of interest, what does your iptables look like? It might be worth turning the firewall off and changing the in and out default policies to ACCEPT.

Also, this is so weird (given you say nothing else changed) that I have to ask: have you tried a different switch...
 
It's your firewall on the Proxmox host - issue the command iptables -L to list the rules.

Set Firewall to off and the input policy to ACCEPT here. Note this could be set at the datacenter level OR the node level.
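
Something like this (temporary, for testing only; pve-firewall is the stock PVE service):

Code:
# list the current rules
iptables -L -n -v
# stop the PVE firewall for testing
pve-firewall stop
# or set the default policies to ACCEPT
iptables -P INPUT ACCEPT
iptables -P OUTPUT ACCEPT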
 

Attachment: IMG_0658.jpeg
Hmm, have you tried restarting the monitors yet? If they cannot find each other and form a quorum, then nothing will work.
https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-mon/
The second line in the code block below should work as-is on each node.
Code:
ceph --admin-daemon <full_path_to_asok_file> <command>
ceph --admin-daemon /run/ceph/ceph-mon.$(hostname).asok mon_status
It will print quite a bit of information. The first part is about the state, whether there is quorum, and with which MONs. A bit further down you will find the monmap. You should see the info of all the MONs there.

Verify the state and monmap of all MONs. Maybe something is wrong there?
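
Restarting a MON from the CLI would look like this (on PVE the mon ID normally matches the node name - yours on node 1 is apparently "localhost", so adjust accordingly):

Code:
# restart and then inspect the monitor service
systemctl restart ceph-mon@$(hostname).service
systemctl status ceph-mon@$(hostname).service --no-pager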
 
Below is what I see with "ceph --admin-daemon /run/ceph/ceph-mon.localhost.asok mon_status".
No quorum... I suspect the monitor is not running on nodes 2/3, maybe?

The command runs on node 1 (mystic1, even though the monitor name is localhost). The command doesn't run on nodes 2 and 3. The error says "admin_socket: exception getting command descriptions: [Errno 111] Connection refused".

I have tried to restart the monitors from the node 2/3 web UI. Doesn't help.

Code:
{
    "name": "localhost",
    "rank": 0,
    "state": "probing",
    "election_epoch": 0,
    "quorum": [],
    "features": {
        "required_con": "2449958755906961412",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus",
            "pacific",
            "elector-pinging",
            "quincy"
        ],
        "quorum_con": "0",
        "quorum_mon": []
    },
    "outside_quorum": [
        "localhost"
    ],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 3,
        "fsid": "84a7eb78-7460-4ab6-94f0-efb4fe9dc5f0",
        "modified": "2023-05-09T21:35:33.357841Z",
        "created": "2023-05-09T21:12:09.807347Z",
        "min_mon_release": 17,
        "min_mon_release_name": "quincy",
        "election_strategy": 1,
        "disallowed_leaders: ": "",
        "stretch_mode": false,
        "tiebreaker_mon": "",
        "removed_ranks: ": "",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus",
                "octopus",
                "pacific",
                "elector-pinging",
                "quincy"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "localhost",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "192.168.1.150:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "192.168.1.150:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "192.168.1.150:6789/0",
                "public_addr": "192.168.1.150:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 1,
                "name": "mystic2",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "192.168.1.151:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "192.168.1.151:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "192.168.1.151:6789/0",
                "public_addr": "192.168.1.151:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 2,
                "name": "mystic3",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "192.168.1.152:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "192.168.1.152:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "192.168.1.152:6789/0",
                "public_addr": "192.168.1.152:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 1
            }
        ],
        "client": [
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 2
            }
        ]
    },
    "stretch_mode": false
}
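
I guess the next step is to check whether the mon daemon is even running on nodes 2/3 (mon IDs mystic2/mystic3, per the monmap above):

Code:
# is the monitor unit running at all?
systemctl status ceph-mon@mystic2.service --no-pager
# and why did it stop/fail?
journalctl -u ceph-mon@mystic2.service -b --no-pager | tail -n 50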
 
