[SOLVED] PVE 6.4: pmxcfs fails to initialize and Ceph fails on a one-node cluster

flotho

Renowned Member
Sep 3, 2012
Hi everyone,

I have a single-node server that used to be part of a 4-node cluster.
The server has 2 disks used as OSDs, and VMs + CTs using Ceph storage.
A few days ago Ceph became unresponsive, showing a question mark and timeout (500) errors in the web UI.
We updated PVE 6.4 to the latest release; all packages are up to date.
We faced a network issue because the NIC name changed with the update. That part has been solved, and now PVE is reachable on the network and everything seems to be up (screenshot attached).

Checking some info from the system, everything seems to be OK (screenshot attached).
Yet Ceph is still not running (screenshot attached).

The filesystem seems OK and the OSDs are mounted (screenshot attached).
I restarted the server many times, checked the logs, and found this:

Code:
Starting The Proxmox VE cluster filesystem...
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [quorum] crit: quorum_initialize failed: 2
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [quorum] crit: can't initialize service
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [confdb] crit: cmap_initialize failed: 2
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [confdb] crit: can't initialize service
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [dcdb] crit: cpg_initialize failed: 2
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [dcdb] crit: can't initialize service
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [status] crit: cpg_initialize failed: 2
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [status] crit: can't initialize service
Dec 11 14:21:47 ovh7 iscsid[1504]: iSCSI daemon with pid=1505 started!
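From what I understand, these messages indicate that pmxcfs cannot reach corosync at startup. For reference, something along these lines should show whether the cluster stack is actually running (generic commands, nothing specific to this box):
Code:
# are the cluster filesystem and corosync services up?
systemctl status pve-cluster corosync

# corosync messages from the current boot
journalctl -b -u corosync

# quorum / membership as seen by PVE
pvecm status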

I'm not sure of the origin of this issue.
Any advice would be appreciated.
Regards
 
So maybe the issue is not corosync but Ceph.
The Ceph logs are showing:
Code:
e13 handle_auth_request failed to assign global_id
2024-12-11T14:55:53.720+0100 7f815b4c5700 -1 mon.server@1(probing) e13 get_health_metrics reporting 4 slow ops, oldest is auth(proto 0 29 bytes epoch 0)
2024-12-11T14:55:56.188+0100 7f815d4c9700  1 mon.server@1(probing) e13 handle_auth_request failed to assign global_id
2024-12-11T14:55:56.388+0100 7f815d4c9700  1 mon.server@1(probing) e13 handle_auth_request failed to assign global_id
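For reference, the "probing" state means the monitor is still trying to find enough peers to form a quorum. Even when the usual cluster commands hang, the monitor's admin socket still answers, so something like this should show which monitors it still expects (the mon name "server" is taken from the log above):
Code:
# query the monitor directly over its admin socket (works without quorum)
ceph daemon mon.server mon_status
# equivalent form, pointing at the socket explicitly
ceph --admin-daemon /var/run/ceph/ceph-mon.server.asok mon_status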
 
Ceph is also quorum-based and doesn't work if you just remove most of the nodes.. also, PVE 6 has been EOL for quite a while already (>2 years!)
 
Hi @fabian
Thanks for the clarification, you're right.
So, to sum up: Ceph can't run on a single node? Or do we also need to adapt the quorum for Ceph?
Thanks for taking the time to read and answer.
 
The question behind this is: if I upgrade the server, will that allow Ceph to run on this single node?
Regards
 
Hmm, actually it won't be possible to upgrade without making Ceph work first: pve6to7 fails.
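For reference, the upgrade checklist script can be re-run at any time; with --full it runs every check, including the Ceph ones:
Code:
pve6to7 --full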
 
you'd need to manually adapt the ceph setup to work with a single node, and it doesn't really make sense to use ceph in such a setup. if you have (tested!) backups of your guests, you might be better off simply reinstalling the system using PVE 8.3, without Ceph, and restoring from backups
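For context, "adapting to a single node" would roughly mean a monmap containing only this one monitor, plus pools that no longer expect replicas on other hosts. Something along these lines (the pool name "rbd" is only a placeholder, and with 2 OSDs two copies is the most that can be kept):
Code:
# CRUSH rule that replicates across OSDs instead of hosts
ceph osd crush rule create-replicated replicated-osd default osd
ceph osd pool set rbd crush_rule replicated-osd
ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1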
 
Understood.
Sadly, I have some VMs on this server that haven't been backed up for a long time. Not that critical, but they have a long setup process.
If I reinstall everything, will I be able to recover the OSDs?
 
Additional info:
  • the one-node Ceph was working perfectly; it only failed recently and I'm looking for the reason.
  • a recent update changed the NIC name and maybe that's the reason, but I can't find any clue supporting this hypothesis.
 
no, you need the monitor(s) as well.. you are running a very outdated, very non-standard setup (and seemingly don't have backups). if you don't know how ceph works, it might be best to get professional assistance..
 
you'd need to manually adapt the ceph setup to work with a single node, and it doesn't really make sense to use ceph in such a setup. if you have (tested!) backups of your guests, you might be better off simply reinstalling the system using PVE 8.3, without Ceph, and restoring from backups
Thanks for your advice, I'll try to restore the behaviour that was working and upgrade ASAP.
 
no, you need the monitor(s) as well.. you are running a very outdated, very non-standard setup (and seemingly don't have backups). if you don't know how ceph works, it might be best to get professional assistance..
Hi @fabian, just for my information, do you think official Proxmox support could cover this case?
I mean, changing the Ceph configuration so it works as it did until recently.

Regards
 
So, for the record, I've succeeded in getting Ceph working again.
First, there were some ghost monitors that I managed to delete from the monmap.
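(Roughly the standard extract / edit / inject monmap procedure; "host" is the mon id matching the store directory under /var/lib/ceph/mon/, and "ghostmon" is a placeholder for the stale entry:)
Code:
systemctl stop ceph-mon@host
# dump the monmap from the local store and inspect it
ceph-mon -i host --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --print
# drop the stale monitor entry and write the map back
monmaptool /tmp/monmap --rm ghostmon
ceph-mon -i host --inject-monmap /tmp/monmap
systemctl start ceph-mon@host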
Then I had some permission issues in the directory structure:
Code:
rocksdb: IO error: While opening a file for sequentially reading:
/var/lib/ceph/mon/ceph-host/store.db/CURRENT: Permission denied
Some files and folders had root as owner.
I ran a chown on those files and Ceph became available again once the service was restarted.
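In case it helps someone else: Ceph daemons run as the "ceph" user, so root-owned files under the mon store block it. The fix was essentially along these lines (path taken from the error above; the exact directories needing a chown may differ):
Code:
chown -R ceph:ceph /var/lib/ceph/mon/ceph-host
systemctl restart ceph-mon@host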

Thanks for your support. This forum is so awesome.
 
