Ceph monclient(hunting): authenticate timed out after 300

Oct 17, 2023
After rebooting the cluster on the current node, I encountered an issue where I couldn't execute Ceph commands. The commands would hang for a while and eventually result in the error mentioned above.

Apart from this, we face another error:
[170887.228156] libceph: mon2 (1)192.168.xx.xx:6789 socket closed (con state V1_BANNER)
[170890.012151] libceph: mon4 (1)192.168.xx.xx:6789 socket closed (con state V1_BANNER)
[170892.185571] libceph: mon0 (1)192.168.xx.xx:6789 socket closed (con state OPEN)
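The "socket closed (con state V1_BANNER)" messages mean the kernel client opened a TCP connection to a monitor but the handshake never completed. As a quick reachability sketch, each monitor's port 6789 can be probed directly; the MONS list below is a placeholder (the real addresses are masked as 192.168.xx.xx in the logs above):

```shell
# Sketch: probe each Ceph monitor's port 6789 from the affected node.
# MONS is a placeholder list -- substitute the mon addresses from
# /etc/pve/ceph.conf.
MONS="192.168.80.31 192.168.80.39"

# Prints "open" if a TCP connection to $1:$2 succeeds, "closed" otherwise.
check_port() {
    if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo open
    else
        echo closed
    fi
}

for m in $MONS; do
    echo "mon $m:6789 -> $(check_port "$m" 6789)"
done
```

If a port shows "closed", the ceph-mon daemon on that node is not running or not reachable, which matches the hung Ceph commands.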

We have six nodes in this cluster and currently 2 are running, but I changed the value in the ceph.conf file from 6 to 3.
root@vms-pmx-i:~# pvecm status
Cluster information
-------------------
Name: cluster-pmx-i
Config Version: 6
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Tue Oct 17 13:32:20 2023
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.1b9b
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.xx.xx (local)
0x00000003 1 192.168.xx.xx



Can anyone suggest how to resolve this error?



Thanks in advance
 
Hi,
Can you please post the output of pveceph status? Please make sure all the nodes required for the Ceph cluster to run are online. The same is true for the Proxmox cluster, but according to the output this should be a 3-node cluster, 2 of which are online (which contradicts your statement that this is a 6-node cluster).

Also check the systemd journal for errors. By running journalctl -r you get a paginated view of the logs in reverse order.
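To narrow the journal down to the relevant entries, filtering a saved copy with grep is one option; a small self-contained sketch (the heredoc sample just stands in for the real file written by journalctl -r > journal.txt):

```shell
# Sketch: pull only the ceph/pmxcfs error lines out of a saved journal.
# The sample lines below stand in for the real journal.txt so the
# example is self-contained.
cat > journal.txt <<'EOF'
Oct 17 18:17:58 vms-pmx-i pmxcfs[2266]: [dcdb] crit: cpg_join failed: 14
Oct 17 18:17:54 vms-pmx-i kernel: libceph: mon0 (1)192.168.80.31:6789 socket closed (con state OPEN)
Oct 17 18:17:54 vms-pmx-i pmxcfs[2266]: [dcdb] notice: all data is up to date
EOF

# Keep critical pmxcfs entries and kernel libceph messages, drop notices.
grep -E 'crit:|libceph:' journal.txt
```

journalctl can also filter by priority directly (journalctl -r -p err) if you prefer not to save a copy first.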
 
Thanks for responding, Chris.

pvecm status:
Screenshot from 2023-10-17 21-57-20.png

I'm new to this environment; if I'm doing anything wrong, let me know and I'll try to correct it.

And the journalctl -r output is (I pasted only the error logs here):

Oct 17 18:18:05 vms-pmx-i pmxcfs[2266]: [dcdb] crit: leaving CPG group
Oct 17 18:18:05 vms-pmx-i pmxcfs[2266]: [dcdb] crit: received write while not quorate - trigger resync
vms-pmx-i systemd[1]: Reloading.
Oct 17 18:17:59 vms-pmx-i pvestatd[3562]: status update time (67.610 seconds)
Oct 17 18:17:59 vms-pmx-i pvestatd[3562]: mount error: Job failed. See "journalctl -xe" for details.
Oct 17 18:17:59 vms-pmx-i systemd[1]: Failed to mount mnt-pve-cluster\x2dpmx\x2di\x2dcephfs.mount - /mnt/pve/cluster-pmx-i-cephfs.
Oct 17 18:17:59 vms-pmx-i systemd[1]: mnt-pve-cluster\x2dpmx\x2di\x2dcephfs.mount: Failed with result 'exit-code'.
Oct 17 18:17:59 vms-pmx-i systemd[1]: mnt-pve-cluster\x2dpmx\x2di\x2dcephfs.mount: Mount process exited, code=exited, status=32/n/a
Oct 17 18:17:59 vms-pmx-i kernel: ceph: No mds server is up or the cluster is laggy
Oct 17 18:17:59 vms-pmx-i mount[12892]: mount error: no mds server is up or the cluster is laggy
Oct 17 18:17:59 vms-pmx-i ceph-mon[2362]: 2023-10-17T18:17:59.003+0200 7fdd1d6196c0 -1 mon.vms-pmx-i@0(probing) e10 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 34 bytes epoch 0)
Oct 17 18:17:58 vms-pmx-i pmxcfs[2266]: [dcdb] crit: can't initialize service
Oct 17 18:17:58 vms-pmx-i pmxcfs[2266]: [dcdb] crit: cpg_join failed: 14
Oct 17 18:17:58 vms-pmx-i pmxcfs[2266]: [dcdb] notice: start cluster connection
Oct 17 18:17:55 vms-pmx-i pmxcfs[2266]: [dcdb] crit: leaving CPG group
Oct 17 18:17:55 vms-pmx-i pmxcfs[2266]: [dcdb] crit: received write while not quorate - trigger resync
Oct 17 18:17:54 vms-pmx-i pmxcfs[2266]: [dcdb] notice: all data is up to date
Oct 17 18:17:54 vms-pmx-i pmxcfs[2266]: [dcdb] notice: members: 1/2266, 2/2361, 3/2062
Oct 17 18:17:54 vms-pmx-i kernel: libceph: mon4 (1)192.168.80.39:6789 socket closed (con state V1_BANNER)
Oct 17 18:17:54 vms-pmx-i kernel: libceph: mon0 (1)192.168.80.31:6789 socket closed (con state OPEN)
Oct 17 18:17:54 vms-pmx-i ceph-mon[2362]: 2023-10-17T18:17:54.003+0200 7fdd1d6196c0 -1 mon.vms-pmx-i@0(probing) e10 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 34 bytes epoch 0)
Oct 17 18:17:52 vms-pmx-i pmxcfs[2266]: [dcdb] notice: all data is up to date
 
Okay, the output of your pvecm status is not the same as above; I assume the lower one is the current/correct one?

Anyway, as you can see from that, of the 6 nodes in the cluster only 3 are online and/or reachable by each other. Since this is lower than the 4 nodes required for quorum in your cluster, the cluster is switched into a read-only state.

So you will have to bring up at least one additional node in the cluster to reach quorum again. Once that is the case, make sure all nodes can reach each other. Regarding Ceph, please provide the pveceph status output as requested, although I assume the issue is directly connected.
 
I executed the command "pvecm expect 3" in order to resolve the "activity blocked" error indicated by the output of the "pvecm status" command above.
After a reboot, without any further changes, pvecm status again shows the activity blocked state.

So, is it possible to use the "pvecm expect 3" command to resolve the quorum-blocked error without bringing all the nodes in the cluster online (as I lack the necessary permissions to do so)?
 
pvecm expect 3
This will only be active until you reboot; it is not persistent. As you can see from your initial output, in that case 2 of the Proxmox VE nodes are in quorum. But this is independent of Ceph.

as I lacked the necessary permissions to do so
Then you should get the permission to do so, as you cannot run the cluster as it is. For a 6-node Proxmox VE cluster, at least 4 nodes have to be online and able to communicate for the cluster to work. Ceph likewise needs a majority of its monitors.
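The majority requirement follows from the votequorum rule: quorum is a strict majority of the configured votes, i.e. floor(n/2) + 1. A minimal sketch of that arithmetic:

```shell
# Quorum is a strict majority of the configured votes: floor(n/2) + 1.
nodes=6
majority=$(( nodes / 2 + 1 ))
echo "$nodes-node cluster: $majority votes needed"   # 4 for six nodes
```

This is also why "pvecm expect 3" changes the behaviour: it lowers the expected-votes value so that 2 of 3 is a majority, but only until the next reboot.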
 