[SOLVED] PVE 6.4: pmxcfs fails to initialize and Ceph fails on a one-node cluster

flotho

Renowned Member
Sep 3, 2012
Hi everyone,

I have a single-node server that used to be part of a 4-node cluster.
The server has 2 disks used as OSDs, and VMs + CTs using Ceph storage.
A few days ago Ceph became unresponsive, showing a question mark and timeout (500) errors in the web UI.
We updated PVE 6.4 to the latest release; all packages are up to date.
We faced a network issue because the NIC name changed with the update. That part has been solved, and now PVE is reachable on the network and everything seems to be up (screenshot attached).

Checking some info from the system, everything seems to be OK (screenshot attached).
Yet Ceph is still not running (screenshot attached).

The filesystem seems OK and the OSDs are mounted (screenshot attached).
I restarted the server many times, checked the logs, and found this:

Code:
Starting The Proxmox VE cluster filesystem...
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [quorum] crit: quorum_initialize failed: 2
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [quorum] crit: can't initialize service
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [confdb] crit: cmap_initialize failed: 2
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [confdb] crit: can't initialize service
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [dcdb] crit: cpg_initialize failed: 2
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [dcdb] crit: can't initialize service
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [status] crit: cpg_initialize failed: 2
Dec 11 14:21:46 ovh7 pmxcfs[1653]: [status] crit: can't initialize service
Dec 11 14:21:47 ovh7 iscsid[1504]: iSCSI daemon with pid=1505 started!
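From what I understand, these messages indicate that pmxcfs cannot reach corosync at startup. For reference, something along these lines should show whether the cluster stack is actually running (generic commands, nothing specific to this box):
Code:
# are the cluster filesystem and corosync services up?
systemctl status pve-cluster corosync

# corosync messages from the current boot
journalctl -b -u corosync

# quorum / membership as seen by PVE
pvecm status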

I'm not sure of the origin of this issue.
Any advice would be appreciated.
Regards
 
So maybe the issue is not corosync but Ceph.
The Ceph logs are showing:
Code:
e13 handle_auth_request failed to assign global_id
2024-12-11T14:55:53.720+0100 7f815b4c5700 -1 mon.server@1(probing) e13 get_health_metrics reporting 4 slow ops, oldest is auth(proto 0 29 bytes epoch 0)
2024-12-11T14:55:56.188+0100 7f815d4c9700  1 mon.server@1(probing) e13 handle_auth_request failed to assign global_id
2024-12-11T14:55:56.388+0100 7f815d4c9700  1 mon.server@1(probing) e13 handle_auth_request failed to assign global_id
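For reference, the "probing" state means the monitor is still trying to find enough peers to form a quorum. Even when the usual cluster commands hang, the monitor's admin socket still answers, so something like this should show which monitors it still expects (the mon name "server" is taken from the log above):
Code:
# query the monitor directly over its admin socket (works without quorum)
ceph daemon mon.server mon_status
# equivalent form, pointing at the socket explicitly
ceph --admin-daemon /var/run/ceph/ceph-mon.server.asok mon_status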
 
Ceph is also quorum-based and doesn't work if you just remove most of the nodes.. also, PVE 6 has been EOL for quite a while already (>2 years!)
 
Hi @fabian
Thanks for the clarification, you're right.
So, to sum up: Ceph can't run on a single node? Or do we also need to adapt the quorum for Ceph?
Thanks for taking the time to read and answer.
 
The question behind this is: if I upgrade the server, will that allow Ceph to run on this single node?
Regards
 
Hmm, actually it won't be possible to upgrade without making Ceph work first: pve6to7 fails.
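For reference, the upgrade checklist script can be re-run at any time; with --full it runs every check, including the Ceph ones:
Code:
pve6to7 --full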
 
you'd need to manually adapt the ceph setup to work with a single node, and it doesn't really make sense to use ceph in such a setup. if you have (tested!) backups of your guests, you might be better off simply reinstalling the system using PVE 8.3, without Ceph, and restoring from backups
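For context, "adapting to a single node" would roughly mean a monmap containing only this one monitor, plus pools that no longer expect replicas on other hosts. Something along these lines (the pool name "rbd" is only a placeholder, and with 2 OSDs two copies is the most that can be kept):
Code:
# CRUSH rule that replicates across OSDs instead of hosts
ceph osd crush rule create-replicated replicated-osd default osd
ceph osd pool set rbd crush_rule replicated-osd
ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1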
 
Understood.
Sadly, I have some VMs on this server that haven't been backed up for a long time. Not that critical, but they have a long setup process.
If I reinstall everything, will I be able to recover the OSDs?
 
Additional info:
  • the one-node Ceph was working perfectly; it only failed recently and I'm looking for the reason.
  • a recent update changed the NIC name and maybe that's the reason, but I can't find any clue supporting this hypothesis.
 
no, you need the monitor(s) as well.. you are running a very outdated, very non-standard setup (and seemingly don't have backups). if you don't know how ceph works, it might be best to get professional assistance..
 
you'd need to manually adapt the ceph setup to work with a single node, and it doesn't really make sense to use ceph in such a setup. if you have (tested!) backups of your guests, you might be better off simply reinstalling the system using PVE 8.3, without Ceph, and restoring from backups
Thanks for your advice, I'll try to restore the behaviour that was working and upgrade ASAP.
 
no, you need the monitor(s) as well.. you are running a very outdated, very non-standard setup (and seemingly don't have backups). if you don't know how ceph works, it might be best to get professional assistance..
Hi @fabian, just for my information, do you think official Proxmox support could cover this case?
I mean, changing the Ceph configuration so it works as it did until recently.

Regards
 
So, for the record, I've succeeded in getting Ceph working again.
First, there were some ghost monitors that I managed to delete from the monmap.
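(Roughly the standard extract / edit / inject monmap procedure; "host" is the mon id matching the store directory under /var/lib/ceph/mon/, and "ghostmon" is a placeholder for the stale entry:)
Code:
systemctl stop ceph-mon@host
# dump the monmap from the local store and inspect it
ceph-mon -i host --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --print
# drop the stale monitor entry and write the map back
monmaptool /tmp/monmap --rm ghostmon
ceph-mon -i host --inject-monmap /tmp/monmap
systemctl start ceph-mon@host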
Then I had some permission issues in the directory structure:
Code:
rocksdb: IO error: While opening a file for sequentially reading:
/var/lib/ceph/mon/ceph-host/store.db/CURRENT: Permission denied
Some files and folders had root as owner.
I ran a chown on those files and Ceph became available again once the service was restarted.
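In case it helps someone else: Ceph daemons run as the "ceph" user, so root-owned files under the mon store block it. The fix was essentially along these lines (path taken from the error above; the exact directories needing a chown may differ):
Code:
chown -R ceph:ceph /var/lib/ceph/mon/ceph-host
systemctl restart ceph-mon@host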

Thanks for your support. This forum is so awesome.
 
