Ceph cluster timeout

kenneth_vkd

Well-Known Member
Hi
We recently deployed a small 3-node PVE cluster on-premises. All three nodes run Ceph OSDs, a Ceph Monitor, a Ceph Manager and a Ceph MDS.
The setup uses three switches, and each server has four NICs.
The first two NICs are configured as a Linux bond in failover (active-backup) mode, with one cable going to each of the first two switches. The PVE cluster network runs on this bond as well.
The last two NICs form a Linux bond with LACP to the third switch, so that we can do maintenance on the primary network without Ceph losing connectivity. Ceph is the only service running on this bond.
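For reference, the relevant part of /etc/network/interfaces on the first node looks roughly like this (the NIC names and the management subnet are illustrative, not our exact values; the Ceph network matches the 192.168.50.0/24 addresses that show up in the logs further down):
Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-mode active-backup
        bond-miimon 100
# failover bond to switches 1 and 2

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.21/24
        gateway 192.168.10.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
# management / PVE cluster network on top of the failover bond

auto bond1
iface bond1 inet static
        address 192.168.50.21/24
        bond-slaves eno3 eno4
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-miimon 100
# Ceph public network, LACP to the third switch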
It had been working fine, but a technician misconfigured the switch used for Ceph so that it could no longer be managed and had to be reset.
To fix that, we planned to power off all three nodes, reset and reconfigure the switch, and then power the nodes back on. Since we also had to move things around in the rack, we powered everything off, but the technician forgot to reset the switch before powering the servers back on. Because the switch still had its old LACP port configuration, everything still worked.
The technician then remembered the reset and performed it while the cluster was online with virtual machines running. With LACP now broken, we shut the servers down again, finished reconfiguring the switch, and powered the servers back on, only to find that the switch ports had not been configured for LACP after all. We then reconfigured those ports while the servers were running.
From the point where the switch was reset, "pveceph status" reports "got timeout" and so does the web UI. Running "ceph -s" seems to hang forever, but after some time it outputs this:
Code:
2022-07-15T08:29:46.358+0200 7f7367611700  0 monclient(hunting): authenticate timed out after 300
We have tried rebooting all servers at the same time, and rebooting them one by one, waiting for each node to come back up before rebooting the next. We have also tried "systemctl restart ceph.target", also with no luck.
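In case it matters, we also looked at the individual daemons rather than only the whole ceph.target; roughly like this, where the unit instance name is the daemon ID, which on a PVE-created cluster is normally the hostname:
Code:
# check the state of the local monitor and manager units
systemctl status ceph-mon@x1-pve-srv1.service
systemctl status ceph-mgr@x1-pve-srv1.service
# restart only the monitor and then inspect its recent journal output
systemctl restart ceph-mon@x1-pve-srv1.service
journalctl -u ceph-mon@x1-pve-srv1.service --since "10 minutes ago"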
Looking at the syslog, it seems there is a connectivity issue where all nodes report that they cannot connect to the socket on port 6789 on every other node. However, the nodes can ping each other on the IP addresses assigned to the Ceph public network, and we can open a telnet connection to each node on port 6789, so connectivity should be working.
Each node also reports syslog errors for its own IP on port 6789.
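The checks were along these lines (192.168.50.23 is the address that appears in the logs below as mon2; the .21 and .22 addresses for the other two nodes are assumed here):
Code:
# from x1-pve-srv1, towards the other two nodes on the Ceph public network
ping -c 3 192.168.50.22
ping -c 3 192.168.50.23
# verify the monitor port is reachable (nc -vz instead of telnet)
nc -vz 192.168.50.22 6789
nc -vz 192.168.50.23 6789
# verify the local monitor is listening on both messenger ports (v1: 6789, v2: 3300)
ss -tlnp | grep -E ':6789|:3300'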
Has the above mistake broken Ceph, and is there a way to have PVE repair the cluster, given that it is a PVE-managed cluster?

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.39-1-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-6
pve-kernel-helper: 7.2-6
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-3
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-5
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-11
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
Hi,
did you take a look into /var/log/ceph? Take a look at ceph.log and check whether there are any errors in there.
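For example (these paths are the defaults for a PVE-managed Ceph installation; replace <hostname> with the node's name):
Code:
# cluster-wide log, written by the monitors
less /var/log/ceph/ceph.log
# per-daemon logs; on PVE the daemon ID is usually the hostname
less /var/log/ceph/ceph-mon.<hostname>.log
less /var/log/ceph/ceph-mgr.<hostname>.log
ls -l /var/log/ceph/ceph-osd.*.log
# the systemd journal of the monitor can also contain hints
journalctl -u ceph-mon@<hostname> --since today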
 
ceph.log has no entries since the described failure.
The ceph-mon.xxx.log files contain only these types of lines:
Code:
2022-07-15T11:19:58.832+0200 7fbf7e16d700  1 mon.x1-pve-srv1@0(electing) e3 handle_auth_request failed to assign global_id

ceph-mgr.xxx.log is empty

The OSD logs and volume logs are also empty since the last reboot we did while trying to fix the issue.

Other than that, the only sign of life from Ceph is this in the syslog:
Code:
Jul 15 11:25:08 x1-pve-srv1 kernel: [65620.320269] libceph: mon2 (1)192.168.50.23:6789 socket closed (con state OPEN)
Jul 15 11:25:09 x1-pve-srv1 ceph-mon[869965]: 2022-07-15T11:25:09.232+0200 7fbf81974700 -1 mon.x1-pve-srv1@0(electing) e3 get_health_metrics reporting 6 slow ops, oldest is auth(proto 0 36 bytes epoch 0)
Jul 15 11:25:12 x1-pve-srv1 corosync[3490]:   [TOTEM ] Token has not been received in 2737 ms
Jul 15 11:25:13 x1-pve-srv1 kernel: [65625.320355] libceph: mon2 (1)192.168.50.23:6789 socket closed (con state OPEN)
Jul 15 11:25:14 x1-pve-srv1 ceph-mon[869965]: 2022-07-15T11:25:14.256+0200 7fbf81974700 -1 mon.x1-pve-srv1@0(electing) e3 get_health_metrics reporting 6 slow ops, oldest is auth(proto 0 36 bytes epoch 0)
Jul 15 11:25:14 x1-pve-srv1 corosync[3490]:   [QUORUM] Sync members[3]: 1 2 3
Jul 15 11:25:14 x1-pve-srv1 corosync[3490]:   [TOTEM ] A new membership (1.1c6) was formed. Members
Jul 15 11:25:14 x1-pve-srv1 corosync[3490]:   [QUORUM] Members[3]: 1 2 3
Jul 15 11:25:14 x1-pve-srv1 corosync[3490]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 15 11:25:18 x1-pve-srv1 kernel: [65630.320407] libceph: mon2 (1)192.168.50.23:6789 socket closed (con state OPEN)
Jul 15 11:25:19 x1-pve-srv1 ceph-mon[869965]: 2022-07-15T11:25:19.284+0200 7fbf81974700 -1 mon.x1-pve-srv1@0(electing) e3 get_health_metrics reporting 6 slow ops, oldest is auth(proto 0 36 bytes epoch 0)
Jul 15 11:25:24 x1-pve-srv1 ceph-mon[869965]: 2022-07-15T11:25:24.312+0200 7fbf81974700 -1 mon.x1-pve-srv1@0(electing) e3 get_health_metrics reporting 6 slow ops, oldest is auth(proto 0 36 bytes epoch 0)
Jul 15 11:25:28 x1-pve-srv1 kernel: [65640.320421] libceph: mon2 (1)192.168.50.23:6789 socket closed (con state OPEN)
Jul 15 11:25:29 x1-pve-srv1 ceph-mon[869965]: 2022-07-15T11:25:29.340+0200 7fbf81974700 -1 mon.x1-pve-srv1@0(electing) e3 get_health_metrics reporting 6 slow ops, oldest is auth(proto 0 36 bytes epoch 0)
 
Maybe I missed it, but did you check that the network works again and that the nodes can ping each other on all configured networks?
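If the network does check out, each monitor can also be asked directly what state it thinks it is in via its local admin socket; this works even when there is no quorum (the socket name below assumes the default path and that the mon ID is the hostname, as on a PVE-created cluster):
Code:
# query the local monitor through its admin socket (run on each node)
ceph daemon mon.x1-pve-srv1 mon_status
# equivalent, addressing the socket file directly
ceph --admin-daemon /var/run/ceph/ceph-mon.x1-pve-srv1.asok mon_status
# look at "state" (leader/peon is healthy, probing/electing is not)
# and at "quorum" / "outside_quorum" in the JSON output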
 
I’m having the same problem—started today after a reboot of one node. Did you fix it yet?
 
