Reinstall CEPH on Proxmox 6

Metz

Active Member
May 5, 2018
Hello,

After the upgrade to release 6, I tried to reinstall Ceph instead of upgrading it. I followed a page that said to delete several directories:
( rm -Rf /etc/ceph /etc/pve/ceph.conf /etc/pve/priv/ceph* /var/lib/ceph )

pveceph init --network 10.1.1.0/24 worked, but afterwards I get the following error:
pveceph createmon
unable to get monitor info from DNS SRV with service name: ceph-mon
Could not connect to ceph cluster despite configured monitors

Installation via GUI also fails.

Is there a way to reinstall CEPH so that I can fix the issue?

Thanks, Metz
 
Can confirm: after upgrading to PVE 6 from 5.4 (which was successful), I tried to upgrade Ceph, which was not successful. I purged the Ceph config and tried to reinstall with Nautilus, and made sure it is installed. It fails with the same message. I even put all the nodes into the hosts file, but it did not help.
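Roughly, the purge and reinstall attempt described above would have looked something like this (a sketch only; the exact commands are not in the post, the network is taken from the first post for illustration, and the --version flag is an assumption):
Code:
pveceph purge                        # drop the Ceph config on this node
pveceph install --version nautilus   # make sure the Nautilus packages are present
pveceph init --network 10.1.1.0/24
pveceph createmon                    # fails with the same monitor error as above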
 
pveceph init --network 10.1.1.0/24 worked, but afterwards I get the following error:
Remove /etc/pve/ceph.conf, as it will not be re-initialized once it has been created.
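In other words, roughly (a sketch; network taken from the first post):
Code:
rm /etc/pve/ceph.conf                 # a stale config is not re-initialized
pveceph init --network 10.1.1.0/24    # writes a fresh ceph.conf
pveceph createmon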

EDIT:
After posting I saw the part in brackets. :/

Are you on the latest packages (pveversion -v)? Can you please post your ceph.conf?
 
pveversion -v
Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.18-1-pve)
pve-manager: 6.0-5 (running version: 6.0-5/f8a710d7)
pve-kernel-5.0: 6.0-6
pve-kernel-helper: 6.0-6
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 14.2.2-pve1
ceph-fuse: 14.2.2-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve2
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-3
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

cat /etc/pve/ceph.conf
Code:
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 10.1.1.0/24
         fsid = 59cf47e3-19c8-4b4c-bea7-983c62ebbcdf
         mon_allow_pool_delete = true
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public network = 10.1.1.0/24

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring
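Side note: this ceph.conf contains no mon_host line and no [mon.X] sections yet, which is why the client falls back to a DNS SRV lookup for 'ceph-mon' and prints the 'unable to get monitor info from DNS SRV' message. Once a monitor has been created successfully, pveceph normally adds an entry roughly like this (the addresses below are placeholders, not from this cluster):
Code:
[global]
         mon_host = 10.1.1.11 10.1.1.12 10.1.1.13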
 
Installed the latest updates. Same result.

pveceph init --network 10.1.1.0/24 worked, but afterwards I still get the following error:
pveceph createmon
Code:
unable to get monitor info from DNS SRV with service name: ceph-mon
Could not connect to ceph cluster despite configured monitors

cat /etc/pve/ceph.conf
Code:
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = 10.1.1.0/24
         fsid = 631a6e28-8e2d-4563-89d8-ac0043790a6f
         mon allow pool delete = true
         osd pool default min size = 2
         osd pool default size = 3
         public network = 10.1.1.0/24

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

pveversion -v
Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.18-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 14.2.2-pve1
ceph-fuse: 14.2.2-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2
 
I found the following log entries after rebooting the node:
cat /var/log/ceph/ceph-mon.pve-node3.log
Code:
2019-09-06 20:55:17.263 7fd6ce7fa3c0  0 set uid:gid to 64045:64045 (ceph:ceph)
2019-09-06 20:55:17.263 7fd6ce7fa3c0  0 ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable), process ceph-mon, pid 1695
2019-09-06 20:55:17.263 7fd6ce7fa3c0 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve-node3' does not exist: have you run 'mkfs'?
2019-09-06 20:55:27.555 7f856c6883c0  0 set uid:gid to 64045:64045 (ceph:ceph)
2019-09-06 20:55:27.555 7f856c6883c0  0 ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable), process ceph-mon, pid 1876
2019-09-06 20:55:27.555 7f856c6883c0 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve-node3' does not exist: have you run 'mkfs'?

After the creation of the directory:
Code:
2019-09-06 20:57:53.206 7fd067a7f3c0  0 set uid:gid to 64045:64045 (ceph:ceph)
2019-09-06 20:57:53.206 7fd067a7f3c0  0 ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable), process ceph-mon, pid 2068
2019-09-06 20:57:53.206 7fd067a7f3c0 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve-node3' is empty: have you run 'mkfs'?

The log files on the other nodes are empty, but I did not reboot them.
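For reference, the "have you run 'mkfs'?" message refers to the monitor store initialization that pveceph createmon normally performs itself; done by hand it would look roughly like this (a sketch only, using the node name from the log; this was not actually run here):
Code:
ceph-mon --mkfs -i pve-node3 --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-pve-node3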
 
I did the following steps:
Code:
pveceph purge            # on all nodes
rm -r /var/lib/ceph      # on all nodes
rm /etc/pve/ceph.conf
# then rebooted one node

Why does the log file still get new entries, and why does it look like ceph-mon is still being started?

cat /var/log/ceph/ceph-mon.pve-node3.log

Code:
2019-09-06 23:24:06.667 7f69a14cc3c0  0 set uid:gid to 64045:64045 (ceph:ceph)
2019-09-06 23:24:06.667 7f69a14cc3c0 -1 Errors while parsing config file!
2019-09-06 23:24:06.667 7f69a14cc3c0 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2019-09-06 23:24:06.667 7f69a14cc3c0 -1 parse_file: cannot open /.ceph/ceph.conf: (2) No such file or directory
2019-09-06 23:24:06.667 7f69a14cc3c0 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
2019-09-06 23:24:06.667 7f69a14cc3c0  0 ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable), process ceph-mon, pid 1864
2019-09-06 23:24:06.667 7f69a14cc3c0 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve-node3' does not exist: have you run 'mkfs'?
2019-09-06 23:24:16.563 7f41e7cc93c0  0 set uid:gid to 64045:64045 (ceph:ceph)
2019-09-06 23:24:16.563 7f41e7cc93c0 -1 Errors while parsing config file!
2019-09-06 23:24:16.563 7f41e7cc93c0 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2019-09-06 23:24:16.563 7f41e7cc93c0 -1 parse_file: cannot open /.ceph/ceph.conf: (2) No such file or directory
2019-09-06 23:24:16.563 7f41e7cc93c0 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
2019-09-06 23:24:16.563 7f41e7cc93c0  0 ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable), process ceph-mon, pid 1909
2019-09-06 23:24:16.563 7f41e7cc93c0 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve-node3' does not exist: have you run 'mkfs'?
2019-09-06 23:24:26.747 7f683159d3c0  0 set uid:gid to 64045:64045 (ceph:ceph)
2019-09-06 23:24:26.747 7f683159d3c0 -1 Errors while parsing config file!
2019-09-06 23:24:26.747 7f683159d3c0 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2019-09-06 23:24:26.747 7f683159d3c0 -1 parse_file: cannot open /.ceph/ceph.conf: (2) No such file or directory
2019-09-06 23:24:26.747 7f683159d3c0 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
2019-09-06 23:24:26.747 7f683159d3c0  0 ceph version 14.2.2 (a887fe9a5d3d97fe349065d3c1c9dbd7b8870855) nautilus (stable), process ceph-mon, pid 1972
2019-09-06 23:24:26.747 7f683159d3c0 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve-node3' does not exist: have you run 'mkfs'?

Code:
ps -ef | grep ceph
root         986       1  0 23:23 ?        00:00:00 /usr/bin/python2.7 /usr/bin/ceph-crash
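Only the ceph-crash helper shows up in that process list. One possible explanation for the monitor log still growing (a guess, consistent with the reply further down about leftover services) is that the ceph-mon systemd unit link survived the purge, so systemd keeps trying to start the monitor at boot. A quick check:
Code:
ls /etc/systemd/system/ceph-mon.target.wants/   # leftover ceph-mon@<node>.service link?
systemctl status ceph-mon@pve-node3.service     # is systemd still retrying the monitor?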
 
Hi,
I seem to be having the exact same problem as metz and brucexx.
I have tried purging and deleting /var/lib/ceph without success.

When I do pveceph init, either via the CLI or the GUI, it errors out with "Could not connect to ceph cluster despite configured monitors (500)".
And it is left with "ghosts" of the monitors, visible in the web GUI but not in any CLI, i.e. they cannot be removed, stopped or started.

Is there a way to delete all these references and start completely from scratch with the Ceph install? (remove all config/services and put them back)
 
Is there a way to delete all these references and start completely from scratch with the Ceph install? (remove all config/services and put them back)
Please check all nodes for any leftover directories (/var/lib/ceph/mon) and leftover services in
/etc/systemd/system/ceph-mon.target.wants/

Remove all of those and restart pvestatd with 'systemctl restart pvestatd'.
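On each node, that check and cleanup could look roughly like this (a sketch; the unit instance name follows the ceph-mon@<nodename>.service pattern shown further down and may differ on your nodes):
Code:
ls -l /var/lib/ceph/mon                            # any leftover monitor data directories?
ls /etc/systemd/system/ceph-mon.target.wants/      # any leftover monitor unit links?
systemctl disable ceph-mon@$(hostname).service     # remove a leftover unit link
rm -rf /var/lib/ceph/mon                           # remove leftover monitor data
systemctl restart pvestatd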
 
Thanks. That helped me one step further. Now 3 monitors are running and 2 managers are configured and running, but I'm not able to start the managers and not able to configure the 3rd manager.

The error message mentions /var/lib/ceph/mgr/ceph-pve-node3/keyring.tmp.2235958 (full output below).

Now I have two weeks of holiday. I think I will reinstall Proxmox to get a clean setup, because I'm also losing corosync from time to time, which never happened on version 5, and the GUI sometimes shows a question mark on some nodes even though the CLI shows the node is up.

Code:
ls -l /var/lib/ceph/mon
ls: cannot access '/var/lib/ceph/mon': No such file or directory

ls /etc/systemd/system/ceph-mon.target.wants/
ceph-mon@pve-node3.service

systemctl disable ceph-mon@pve-node3.service
Removed /etc/systemd/system/ceph-mon.target.wants/ceph-mon@pve-node3.service.

ls /etc/systemd/system/ceph-mon.target.wants/
ls: cannot access '/etc/systemd/system/ceph-mon.target.wants/': No such file or directory

pveceph init --network 10.1.1.0/24
creating /etc/pve/priv/ceph.client.admin.keyring

pveceph createmon
unable to get monitor info from DNS SRV with service name: ceph-mon
creating /etc/pve/priv/ceph.mon.keyring
importing contents of /etc/pve/priv/ceph.client.admin.keyring into /etc/pve/priv/ceph.mon.keyring
chown: cannot access '/var/lib/ceph/mon/ceph-pve-node3': No such file or directory
error with cfs lock 'file-ceph_conf': command 'chown ceph:ceph /var/lib/ceph/mon/ceph-pve-node3' failed: exit code 1

mkdir -p /var/lib/ceph/mon

pveceph createmon
unable to get monitor info from DNS SRV with service name: ceph-mon
monmaptool: monmap file /tmp/monmap
monmaptool: generated fsid 4d72e5e9-4e59-4875-a27f-ff273a9c007f
epoch 0
fsid 4d72e5e9-4e59-4875-a27f-ff273a9c007f
last_changed 2019-09-14 11:36:10.961149
created 2019-09-14 11:36:10.961149
min_mon_release 0 (unknown)
0: [v2:10.9.9.13:3300/0,v1:10.9.9.13:6789/0] mon.pve-node3
monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
Created symlink /etc/systemd/system/ceph-mon.target.wants/ceph-mon@pve-node3.service -> /lib/systemd/system/ceph-mon@.service.
creating manager directory '/var/lib/ceph/mgr/ceph-pve-node3'
creating keys for 'mgr.pve-node3'
unable to open file '/var/lib/ceph/mgr/ceph-pve-node3/keyring.tmp.2235636' - No such file or directory

pveceph createmgr
creating manager directory '/var/lib/ceph/mgr/ceph-pve-node3'
creating keys for 'mgr.pve-node3'
unable to open file '/var/lib/ceph/mgr/ceph-pve-node3/keyring.tmp.2235958' - No such file or directory
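One guess at this point (not confirmed anywhere in the thread): after the manual rm -r /var/lib/ceph, the usual subdirectories and their ceph:ceph ownership are gone, so the keyring.tmp file cannot be created. Recreating them before retrying might help, roughly like this (the directory list is an assumption):
Code:
install -d -o ceph -g ceph /var/lib/ceph/mon /var/lib/ceph/mgr /var/lib/ceph/osd \
        /var/lib/ceph/bootstrap-osd /var/lib/ceph/crash
pveceph createmgr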
 
With a freshly installed, fully up-to-date PVE 6, this problem remains.

First I set up Ceph with:
pveceph init --network 10.0.115.0/24 -disable_cephx 1
pveceph mon create
...
That works normally.

Then I deleted all of Ceph to continue testing and set up a new Ceph cluster again with:

pveceph init --network 10.0.115.0/24 -disable_cephx 1
pveceph mon create
unable to get monitor info from DNS SRV with service name: ceph-mon
...
It won't work!

Now I don't use this parameter any more; everything is OK, and I won't have to reinstall PVE 6 again!
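So the working sequence appears to be the same init and monitor creation, just without the extra flag (assuming '-disable_cephx 1' is the parameter meant):
Code:
pveceph init --network 10.0.115.0/24
pveceph mon create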
 
@lynn_yudi, what is in the ceph config when the 'unable to get monitor info' shows up?
 
@lynn_yudi, what is in the ceph config when the 'unable to get monitor info' shows up?

pveceph init --network 10.0.115.0/24 -disable_cephx 1

The ceph.conf:
Bash:
# cat ceph.conf
[global]
         auth_client_required = none
         auth_cluster_required = none
         auth_service_required = none
         cluster_network = 10.0.115.0/24
         fsid = 554ee1fe-8a40-44bf-9c54-2db323aa89ea
         mon_allow_pool_delete = true
         mon_host = 10.0.115.15 10.0.115.11 10.0.115.13
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         osd_crush_update_on_start = false
         public_network = 10.0.115.0/24

and
# pveceph mon create
unable to get monitor info from DNS SRV with service name: ceph-mon

Sorry, I don't remember whether it was exactly this message (above), but it does not work.
 
There seem to be some leftovers. After you purged Ceph, is /var/lib/ceph/ empty? And is there no ceph.conf anymore?
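A quick way to verify that on each node (just a sketch):
Code:
ls -A /var/lib/ceph/                            # should print nothing after a complete purge
ls -l /etc/pve/ceph.conf /etc/ceph/ceph.conf    # should report "No such file or directory"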