Ceph cluster timeout

kenneth_vkd

Well-Known Member
Hi
We recently deployed a small 3-node PVE cluster on-premises. All three nodes run Ceph OSDs, a Ceph Monitor, a Ceph Manager and a Ceph MDS.
The setup uses three switches, and each server has four NICs.
The first two NICs are configured as a Linux bond in failover (active-backup) mode, with one cable going to each of the first two switches. The PVE cluster network runs on this bond as well.
The last two NICs form a Linux bond with LACP to the third switch, so that we can do maintenance on the primary network without Ceph losing connectivity. Ceph is the only service running on this bond.
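For reference, the relevant part of /etc/network/interfaces on the first node looks roughly like this (the NIC names and the management subnet are illustrative, not our exact values; the Ceph network matches the 192.168.50.0/24 addresses that show up in the logs further down):
Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-mode active-backup
        bond-miimon 100
# failover bond to switches 1 and 2

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.21/24
        gateway 192.168.10.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
# management / PVE cluster network on top of the failover bond

auto bond1
iface bond1 inet static
        address 192.168.50.21/24
        bond-slaves eno3 eno4
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-miimon 100
# Ceph public network, LACP to the third switch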
It had been working fine, but a technician misconfigured the switch used for Ceph so that it could no longer be managed and had to be reset.
To fix that, we planned to power off all three nodes, reset and reconfigure the switch, and then power the nodes back on. Since we also had to move things around in the rack, we powered everything off, but the technician forgot to reset the switch before powering the servers back on. Because the switch still had its old LACP port configuration, everything still worked.
The technician then remembered the reset and performed it while the cluster was online with virtual machines running. With LACP now broken, we shut the servers down again, finished reconfiguring the switch, and powered the servers back on, only to find that the switch ports had not been configured for LACP after all. We then reconfigured those ports while the servers were running.
From the point where the switch was reset, "pveceph status" reports "got timeout" and so does the web UI. Running "ceph -s" seems to hang forever, but after some time it outputs this:
Code:
2022-07-15T08:29:46.358+0200 7f7367611700  0 monclient(hunting): authenticate timed out after 300
We have tried rebooting all servers at the same time, and rebooting them one by one, waiting for each node to come back up before rebooting the next. We have also tried "systemctl restart ceph.target", also with no luck.
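In case it matters, we also looked at the individual daemons rather than only the whole ceph.target; roughly like this, where the unit instance name is the daemon ID, which on a PVE-created cluster is normally the hostname:
Code:
# check the state of the local monitor and manager units
systemctl status ceph-mon@x1-pve-srv1.service
systemctl status ceph-mgr@x1-pve-srv1.service
# restart only the monitor and then inspect its recent journal output
systemctl restart ceph-mon@x1-pve-srv1.service
journalctl -u ceph-mon@x1-pve-srv1.service --since "10 minutes ago"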
Looking at the syslog, it seems there is a connectivity issue where all nodes report that they cannot connect to the socket on port 6789 on every other node. However, the nodes can ping each other on the IP addresses assigned to the Ceph public network, and we can open a telnet connection to each node on port 6789, so connectivity should be working.
Each node also reports syslog errors for its own IP on port 6789.
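The checks were along these lines (192.168.50.23 is the address that appears in the logs below as mon2; the .21 and .22 addresses for the other two nodes are assumed here):
Code:
# from x1-pve-srv1, towards the other two nodes on the Ceph public network
ping -c 3 192.168.50.22
ping -c 3 192.168.50.23
# verify the monitor port is reachable (nc -vz instead of telnet)
nc -vz 192.168.50.22 6789
nc -vz 192.168.50.23 6789
# verify the local monitor is listening on both messenger ports (v1: 6789, v2: 3300)
ss -tlnp | grep -E ':6789|:3300'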
Has the above mistake broken Ceph, and is there a way to have PVE repair the cluster, given that it is a PVE-managed cluster?

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.39-1-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-6
pve-kernel-helper: 7.2-6
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-3
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-5
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-11
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
Hi,
did you take a look into /var/log/ceph? Take a look at ceph.log and check whether there are any errors in there.
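For example (these paths are the defaults for a PVE-managed Ceph installation; replace <hostname> with the node's name):
Code:
# cluster-wide log, written by the monitors
less /var/log/ceph/ceph.log
# per-daemon logs; on PVE the daemon ID is usually the hostname
less /var/log/ceph/ceph-mon.<hostname>.log
less /var/log/ceph/ceph-mgr.<hostname>.log
ls -l /var/log/ceph/ceph-osd.*.log
# the systemd journal of the monitor can also contain hints
journalctl -u ceph-mon@<hostname> --since today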
 
ceph.log has no entries since the described failure.
The ceph-mon.xxx.log files contain only these types of lines:
Code:
2022-07-15T11:19:58.832+0200 7fbf7e16d700  1 mon.x1-pve-srv1@0(electing) e3 handle_auth_request failed to assign global_id

ceph-mgr.xxx.log is empty

The OSD logs and volume logs are also empty since the last reboot we did while trying to fix the issue.

Other than that, the only sign of life from Ceph is this in the syslog:
Code:
Jul 15 11:25:08 x1-pve-srv1 kernel: [65620.320269] libceph: mon2 (1)192.168.50.23:6789 socket closed (con state OPEN)
Jul 15 11:25:09 x1-pve-srv1 ceph-mon[869965]: 2022-07-15T11:25:09.232+0200 7fbf81974700 -1 mon.x1-pve-srv1@0(electing) e3 get_health_metrics reporting 6 slow ops, oldest is auth(proto 0 36 bytes epoch 0)
Jul 15 11:25:12 x1-pve-srv1 corosync[3490]:   [TOTEM ] Token has not been received in 2737 ms
Jul 15 11:25:13 x1-pve-srv1 kernel: [65625.320355] libceph: mon2 (1)192.168.50.23:6789 socket closed (con state OPEN)
Jul 15 11:25:14 x1-pve-srv1 ceph-mon[869965]: 2022-07-15T11:25:14.256+0200 7fbf81974700 -1 mon.x1-pve-srv1@0(electing) e3 get_health_metrics reporting 6 slow ops, oldest is auth(proto 0 36 bytes epoch 0)
Jul 15 11:25:14 x1-pve-srv1 corosync[3490]:   [QUORUM] Sync members[3]: 1 2 3
Jul 15 11:25:14 x1-pve-srv1 corosync[3490]:   [TOTEM ] A new membership (1.1c6) was formed. Members
Jul 15 11:25:14 x1-pve-srv1 corosync[3490]:   [QUORUM] Members[3]: 1 2 3
Jul 15 11:25:14 x1-pve-srv1 corosync[3490]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 15 11:25:18 x1-pve-srv1 kernel: [65630.320407] libceph: mon2 (1)192.168.50.23:6789 socket closed (con state OPEN)
Jul 15 11:25:19 x1-pve-srv1 ceph-mon[869965]: 2022-07-15T11:25:19.284+0200 7fbf81974700 -1 mon.x1-pve-srv1@0(electing) e3 get_health_metrics reporting 6 slow ops, oldest is auth(proto 0 36 bytes epoch 0)
Jul 15 11:25:24 x1-pve-srv1 ceph-mon[869965]: 2022-07-15T11:25:24.312+0200 7fbf81974700 -1 mon.x1-pve-srv1@0(electing) e3 get_health_metrics reporting 6 slow ops, oldest is auth(proto 0 36 bytes epoch 0)
Jul 15 11:25:28 x1-pve-srv1 kernel: [65640.320421] libceph: mon2 (1)192.168.50.23:6789 socket closed (con state OPEN)
Jul 15 11:25:29 x1-pve-srv1 ceph-mon[869965]: 2022-07-15T11:25:29.340+0200 7fbf81974700 -1 mon.x1-pve-srv1@0(electing) e3 get_health_metrics reporting 6 slow ops, oldest is auth(proto 0 36 bytes epoch 0)
 
Maybe I missed it, but did you check that the network works again and that the nodes can ping each other on all configured networks?
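If the network does check out, each monitor can also be asked directly what state it thinks it is in via its local admin socket; this works even when there is no quorum (the socket name below assumes the default path and that the mon ID is the hostname, as on a PVE-created cluster):
Code:
# query the local monitor through its admin socket (run on each node)
ceph daemon mon.x1-pve-srv1 mon_status
# equivalent, addressing the socket file directly
ceph --admin-daemon /var/run/ceph/ceph-mon.x1-pve-srv1.asok mon_status
# look at "state" (leader/peon is healthy, probing/electing is not)
# and at "quorum" / "outside_quorum" in the JSON output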
 
I’m having the same problem—started today after a reboot of one node. Did you fix it yet?
 
