Fresh install 8.1.4 + Ceph 18.4 = Broken install w. multiple issues

jorel83 · Feb 23, 2024

Hi,

I tried yet again to reinstall PMX 8.1.4 and Ceph 18.4 and got identical results.

The issues I see are shown in the screenshots below; it also seems that symlinks are missing and cannot be created.

This seems to be a very buggy release and I can't find a way forward.

file '/etc/ceph/ceph.conf' already exists and is not a symlink to /etc/pve/ceph.conf (500)

It is not possible to add a new symlink (testing on the 3rd host); the issue is identical across all hosts.

rados_connect failed - No such file or directory (500)

file '/etc/ceph/ceph.conf' already exists and is not a symlink to /etc/pve/ceph.conf (500)

Monitor cannot start
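
The state behind the "not a symlink" error can be inspected without changing anything. The following is a minimal read-only sketch, assuming the two default paths named in the error message; the function name ceph_conf_link_state is made up for illustration and is not part of any Proxmox tool.

```shell
# Report how a local ceph.conf relates to the cluster-wide one.
# Read-only: nothing is created, removed or rewritten.
# Usage: ceph_conf_link_state [local_conf] [cluster_conf]
ceph_conf_link_state() {
    conf="${1:-/etc/ceph/ceph.conf}"   # default path from the error message
    want="${2:-/etc/pve/ceph.conf}"    # default path from the error message
    if [ -L "$conf" ]; then
        # readlink prints the target stored in the symlink
        if [ "$(readlink "$conf")" = "$want" ]; then
            echo "ok"
        else
            echo "wrong-target"
        fi
    elif [ -e "$conf" ]; then
        echo "regular-file"   # the situation the (500) error describes
    else
        echo "missing"
    fi
}
```

Calling ceph_conf_link_state with no arguments on a node checks the default paths. A result of regular-file would match the error text above.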

alexskysilk · Feb 23, 2024

jorel83 said:
It is not possible to add a new symlink (testing on the 3rd host); the issue is identical across all hosts.

correct. don't do that. there's also no reason to; pve clustering will handle it for you.

jorel83 · Feb 23, 2024

alexskysilk said:
correct. don't do that. there's also no reason to; pve clustering will handle it for you.

Ok, that was some advice from similar cases on the forum.

I found one more issue I have never seen before either; it seems DNS related for ceph-mon?

root@pmx0:/etc/pve# ceph fs ls
unable to get monitor info from DNS SRV with service name: ceph-mon
2024-02-23T20:00:08.528+0100 7fb5aac716c0 -1 failed for service _ceph-mon._tcp
2024-02-23T20:00:08.528+0100 7fb5aac716c0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
[errno 2] RADOS object not found (error connecting to the cluster)

I added it to the hosts file, but it made no difference on this part. I'm just grasping at straws; it's no good to have the entire cluster down for this long...

alexskysilk · Feb 23, 2024

not dns; please post your ceph.conf from the machine you are running that on.

jorel83 · Feb 23, 2024

alexskysilk said:
not dns; please post your ceph.conf from the machine you are running that on.

This is the only thing generated during the setup (on 1 of the 4 hosts). Both the CRUSH map and the configuration database time out with: Error got timeout (500)

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 172.16.102.102/24
fsid = 17f038ec-b463-4770-8af4-504b56c1e4b7
mon_allow_pool_delete = true
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.254.10.12/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

Nothing beyond this is generated. The other hosts have an identical config, but instead of a timeout they throw: Error rados_connect failed - No such file or directory (500)

I have now decided to remove 1 host from the equation and use it as a single node until this is resolved. I cannot have more downtime, even for a hobby. I'm happy this is not a production environment.
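
One thing that stands out in the posted file is that it has no mon_host line. When a Ceph client finds no monitor addresses in ceph.conf, it falls back to a DNS SRV lookup for _ceph-mon._tcp, which would match the error text earlier in the thread. That is a reading of the posted config, not a confirmed diagnosis. A small read-only sketch to test for the key follows; the function name conf_has_mon_host is hypothetical.

```shell
# Exit 0 if the given ceph.conf sets mon_host, non-zero otherwise.
# Ceph treats "mon_host", "mon host" and "mon-host" as the same key,
# so the pattern accepts all three spellings.
# Usage: conf_has_mon_host [conf_file]
conf_has_mon_host() {
    grep -Eq '^[[:space:]]*mon[ _-]host[[:space:]]*=' "${1:-/etc/pve/ceph.conf}"
}
```

For example, `conf_has_mon_host /etc/pve/ceph.conf || echo "no mon_host set"` prints the message when the key is absent. The function only reads the file, in line with the advice not to edit it by hand.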

alexskysilk · Feb 23, 2024

there are no mds servers defined in your configuration file. ceph has no idea what you're asking with ceph fs ls.

jorel83 · Feb 23, 2024

alexskysilk said:
there are no mds servers defined in your configuration file. ceph has no idea what you're asking with ceph fs ls.

I know, and I cannot add any: the web UI config times out and the CLI just refuses to let me edit it manually.

root@pmx0:~# ceph fs ls.
unable to get monitor info from DNS SRV with service name: ceph-mon
2024-02-23T23:40:00.428+0100 7f78a78106c0 -1 failed for service _ceph-mon._tcp
2024-02-23T23:40:00.428+0100 7f78a78106c0 -1 monclient: get_monmap_and_config cannot identify monitors to contact
[errno 2] RADOS object not found (error connecting to the cluster)

I added ceph-mon to /etc/hosts, but it makes no difference.

Keep in mind this is a pure fresh install; only the network for the bonds was configured before this.

I was running 7.2.x before the reinstall and disk swap that led up to this, and even there the web UI Ceph config worked. Before that I did it manually in the CLI.

alexskysilk · Feb 23, 2024

ok, here is what I would recommend.

I assume you have not used your file system meaningfully yet.

run the following commands on all nodes:
pveceph purge
pveceph install

at this point, you can either use the gui or the cli to continue.
on the first node, pveceph init (see options here: https://pve.proxmox.com/pve-docs/pveceph.1.html)
then pveceph createmon on three nodes
then create the osds; since the disks will probably have an existing signature, you can use ceph-volume lvm zap to clear them.

if you have a sane configuration at this point, congratulations. if not... reinstalling everything is probably the quickest way forward.

--edit in case it's not clear, DO NOT HAND MODIFY ANYTHING
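
The steps above can be written out as a checklist. This is a sketch that only prints the commands and runs none of them, since purge and zap are destructive. It uses the subcommand spellings from the current pveceph(1) and ceph-volume man pages (pveceph init, pveceph mon create, pveceph osd create, ceph-volume lvm zap). The default network is an assumption derived from the public_network posted earlier in the thread, /dev/sdX is a placeholder, and the function name is made up for illustration.

```shell
# Print (do not execute) the rebuild sequence described above.
# Usage: print_ceph_rebuild_plan [ceph_network_cidr] [osd_disk]
print_ceph_rebuild_plan() {
    net="${1:-10.254.10.0/24}"   # assumption: taken from the posted public_network
    disk="${2:-/dev/sdX}"        # placeholder: replace with a real device
    cat <<EOF
# on every node
pveceph purge
pveceph install
# on the first node only
pveceph init --network $net
# on each of three nodes
pveceph mon create
# per OSD disk; zap only if the disk carries an old signature (destroys data)
ceph-volume lvm zap $disk --destroy
pveceph osd create $disk
EOF
}
```

Printing the plan first makes it easy to review the order on each node before typing anything. Every line it prints is a stock command, so nothing is edited by hand.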

jorel83 · Feb 24, 2024

alexskysilk said:
ok, here is what I would recommend.

I assume you have not used your file system meaningfully yet.

run the following commands on all nodes:
pveceph purge
pveceph install

at this point, you can either use the gui or the cli to continue.
on the first node, pveceph init (see options here: https://pve.proxmox.com/pve-docs/pveceph.1.html)
then pveceph createmon on three nodes
then create the osds; since the disks will probably have an existing signature, you can use ceph-volume lvm zap to clear them.

if you have a sane configuration at this point, congratulations. if not... reinstalling everything is probably the quickest way forward.

--edit in case it's not clear, DO NOT HAND MODIFY ANYTHING

At this stage you're probably right. The only thing left to try is the CLI instead of the UI. I have already reinstalled the servers 3 times now, and doing it remotely takes quite some time.

I hope to be able to try it tomorrow.

Thanks for your help so far.

Cheers

alexskysilk · Feb 24, 2024

from scratch, order of operations:
1. build your networks. make sure you have your service (internet) network, corosync (at least one) and ceph (private and public; can be one for both) defined on all nodes before you do anything.
2. make sure you have hosts files on all nodes. each hosts file should contain every node's short name (without any domain) pointing to that node's primary corosync ip.
3. create your cluster. see https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_create_cluster
4. make sure all nodes are present in your gui, with green status.
5. pveceph install on all nodes.
6. make sure every node can ping the ceph ips of every other node.
7. continue as in https://forum.proxmox.com/threads/f...-install-w-multiple-issues.142160/post-637635

at no point should you be editing anything by hand; if you have to, you messed up something above.

once you get the whole cluster up, there are things you could do to tune it, but not until the cluster is up to begin with.
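
Step 2 of the list above is easy to verify mechanically. The following is a minimal read-only sketch that lists which short names are missing from a hosts file. The function name missing_from_hosts is hypothetical, and of the node names used below only pmx0 appears in the thread; pmx1 to pmx3 are assumed for illustration.

```shell
# Print every short hostname from the argument list that does NOT appear
# as a whole field (after the IP column) in the given hosts file.
# No output means every name was found.
# Usage: missing_from_hosts <hosts_file> <name>...
missing_from_hosts() {
    hosts_file="$1"; shift
    for name in "$@"; do
        awk -v n="$name" '
            { sub(/#.*/, "") }   # strip comments; awk re-splits the fields
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit found ? 0 : 1 }
        ' "$hosts_file" || echo "$name"
    done
}
```

On a node this could be run as `missing_from_hosts /etc/hosts pmx0 pmx1 pmx2 pmx3`. Step 6 can be checked in the same spirit with a loop around `ping -c 1` over each node's ceph address.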
