ceph install broken on new node

Stefano Giunchi

I'm replacing two nodes in my PVE 5.4 cluster; I will upgrade to 6.x after that.
I installed the first of the new nodes, joined it to the cluster, reloaded the web GUI, and everything was OK.

Then, from another node's web GUI, I clicked on the new node's "Ceph" section.
It proposed to install the Ceph packages; I accepted, and then it got a timeout.
Now the packages are installed, but when I go to the new node's Ceph section, it times out:
[screenshot of the timeout error in the Ceph panel]

I tried "pveceph install", which says everything is already installed, while "pveceph init" produces no output.
How can I reinstall ceph on the new node?
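In case the GUI timeout interrupted the package installation halfway, a minimal sanity check (plain apt/dpkg tooling, nothing Ceph-specific is changed) would be:
Code:
# confirm the interrupted install left the Ceph packages fully configured
dpkg -l | grep -E '^ii.*ceph'   # installed Ceph packages and versions
dpkg --audit                    # lists half-installed/half-configured packages, if any
apt-get -f install              # completes a broken or interrupted install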
 
Well, "pveceph install" does the same thing the GUI did (and probably harmless), but "pveceph init" is for a NEW cluster -- not an existing one.

On the NEW node, as 'root', run ls -lh /etc/pve/ceph.conf . Does the file exist?
If so, run ls -lh /etc/ceph/ceph.conf . Is it a symlink like this:
Code:
lrwxrwxrwx 1 root root <some ISO date> /etc/ceph/ceph.conf -> /etc/pve/ceph.conf

What is the output of systemctl status ceph.target ?
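For convenience, a small sketch that runs these checks in one go on the new node (the keyring line is an extra check, since the ceph CLI on the node needs it to authenticate):
Code:
# run on the NEW node as root
ls -lh /etc/pve/ceph.conf                    # cluster-wide config, synced by pmxcfs
ls -lh /etc/ceph/ceph.conf                   # should be a symlink to /etc/pve/ceph.conf
ls -lh /etc/ceph/ceph.client.admin.keyring   # admin keyring used by the ceph CLI
systemctl status ceph.target --no-pager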
 
Well, "pveceph install" does the same thing the GUI did (and probably harmless), but "pveceph init" is for a NEW cluster -- not an existing one.
Before trying that, I had read this post, which said to use pveceph init on new nodes too. It didn't do any harm, but I should have asked before doing it.
On the NEW node, as 'root', run ls -lh /etc/pve/ceph.conf . Does the file exist?
If so, run ls -lh /etc/ceph/ceph.conf . Is it a symlink like this:
Code:
lrwxrwxrwx 1 root root <some ISO date> /etc/ceph/ceph.conf -> /etc/pve/ceph.conf
It's not a link, and it isn't on the other servers either. The file is correctly synced.
Code:
-rw-r-----  1 root www-data  1038 Jul 15 17:12 ceph.conf
What is the output of systemctl status ceph.target ?
Code:
root@PRIVATE:/etc/pve# systemctl status ceph.target
● ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once
   Loaded: loaded (/lib/systemd/system/ceph.target; enabled; vendor preset: enabled)
   Active: active since Wed 2020-07-15 17:01:38 CEST; 5h 8min ago

/var/log/ceph is empty, and ceph status hangs until I terminate it.
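If it helps, the hanging ceph status can be bounded instead of killed by hand; a sketch assuming the standard ceph CLI options and coreutils timeout:
Code:
# give the client a hard deadline instead of letting it hang
ceph -s --connect-timeout 15
# or watch which monitors it tries (and fails) to reach
timeout 30 ceph -s --debug_monc 10 --debug_ms 1 2>&1 | tail -n 30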
 
I added the second new node (it's the fifth in the cluster) and used the pveceph install command.
The result is the same, "Got Timeout (500)".

The new nodes are slightly more up to date, 5.4-15 versus 5.4-13 on the older ones, but there are no Ceph packages left to upgrade on them.
Also, the new nodes are on the no-subscription repository while the older ones have a subscription, because I still have to transfer the subscriptions to the new ones.
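To rule the repositories out, a quick way to compare what is actually installed on an old node versus a new one (a sketch, standard apt/dpkg only):
Code:
# on each node: installed Ceph packages and their versions
dpkg -l | awk '/^ii/ && /ceph/ {print $2, $3}'
# and which repository apt would pull them from
apt-cache policy ceph ceph-mon ceph-osd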
 
If so, run ls -lh /etc/ceph/ceph.conf . Is it a symlink like this:
Code:
lrwxrwxrwx 1 root root <some ISO date> /etc/ceph/ceph.conf -> /etc/pve/ceph.conf

I didn't read that correctly, sorry.

This is the first of the new servers:
Code:
root@NEWSERVER1:~# ls -al /etc/ceph/
total 12
drwxr-xr-x  2 root root 4096 Jul 15 17:12 .
drwxr-xr-x 92 root root 4096 Jul 15 22:31 ..
lrwxrwxrwx  1 root root   18 Jul 15 17:12 ceph.conf -> /etc/pve/ceph.conf
-rw-r--r--  1 root root   92 Nov 19  2018 rbdmap

This is the second new server:
Code:
root@NEWSERVER2:~# ls -al /etc/ceph/
total 12
drwxr-xr-x  2 root root 4096 Jul 15 22:51 .
drwxr-xr-x 92 root root 4096 Jul 15 22:51 ..
-rw-r--r--  1 root root   92 Nov 19  2018 rbdmap

This is one of the old servers:
Code:
root@OLDSERVER1:~# ls -al /etc/ceph/
total 16
drwxr-xr-x   2 root root 4096 Jun 17  2019 .
drwxr-xr-x 103 root root 4096 Jul  6 05:00 ..
-rw-------   1 ceph ceph  159 Dec 29  2018 ceph.client.admin.keyring
lrwxrwxrwx   1 root root   18 Dec 29  2018 ceph.conf -> /etc/pve/ceph.conf
-rw-r--r--   1 root root   92 Jun  7  2017 rbdmap

The keyring is missing on both new servers, and NEWSERVER2 is also missing the symlink to /etc/pve/ceph.conf.
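If that is the problem, a possible fix sketch; it assumes the cluster admin keyring sits at /etc/pve/priv/ceph.client.admin.keyring, as it normally does on a PVE-managed Ceph cluster (the ceph.conf below points keyrings at /etc/pve/priv/), so please double-check the path before copying:
Code:
# on NEWSERVER2 only: recreate the symlink pveceph normally creates
ln -s /etc/pve/ceph.conf /etc/ceph/ceph.conf

# on both new nodes: put the admin keyring where the ceph CLI expects it,
# with the same owner/permissions the old nodes show
cp /etc/pve/priv/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
chown ceph:ceph /etc/ceph/ceph.client.admin.keyring
chmod 600 /etc/ceph/ceph.client.admin.keyring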
 
Please post your ceph.conf. Also, are all nodes reachable?
 
Please post your ceph.conf. Also, are all nodes reachable?

All nodes are reachable, and the cluster is ok:
Code:
root@OLD1:~# ha-manager status
quorum OK
master NEW1 (active, Thu Jul 16 11:39:37 2020)
lrm OLD2 (active, Thu Jul 16 11:39:44 2020)
lrm OLD3 (active, Thu Jul 16 11:39:43 2020)
lrm NEW1 (idle, Thu Jul 16 11:39:42 2020)
lrm NEW2 (idle, Thu Jul 16 11:39:43 2020)
lrm OLD1 (active, Thu Jul 16 11:39:41 2020)
service ct:124 (OLD2, started)
service ct:200 (OLD2, started)
service vm:100 (OLD2, started)
service vm:102 (OLD1, started)
service vm:104 (OLD1, started)
service vm:105 (OLD1, started)
service vm:106 (OLD3, started)
service vm:111 (OLD3, started)
service vm:114 (OLD2, started)
service vm:115 (OLD3, started)
service vm:117 (OLD3, started)
service vm:118 (OLD2, started)
service vm:121 (OLD1, started)
service vm:122 (OLD3, started)
service vm:123 (OLD1, started)
service vm:126 (OLD1, started)

Code:
root@OLD1:~# cat /etc/pve/ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.10.10.0/24
filestore xattr use omap = true
fsid = e3b8320a-5149-4269-93f4-eeddef3597b2
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
mon osd down out interval = 30
osd journal size = 5120
osd pool default min size = 1
public network = 10.10.10.0/24

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
bluestore cache size = 1G
keyring = /var/lib/ceph/osd/ceph-$id/keyring
osd client op priority = 1
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 63

[mds.OLD1]
host = OLD1
mds standby for name = pve

[mds.OLD2]
host = OLD2
mds standby for name = pve

[mds.OLD3]
host = OLD3
mds standby for name = pve

[mon.OLD2]
host = OLD2
mon addr = 10.10.10.10:6789

[mon.OLD3]
host = OLD3
mon addr = 10.10.10.13:6789

[mon.OLD1]
host = OLD1
mon addr = 10.10.10.3:6789

These are the software versions; I did an apt upgrade of OLD3 yesterday:
Code:
root@OLD1:~# pveversion
pve-manager/5.4-13/aee6f0ec (running kernel: 4.15.18-24-pve)
root@OLD1:~# ceph -v
ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)

root@OLD2:~# pveversion
pve-manager/5.4-13/aee6f0ec (running kernel: 4.15.18-23-pve)
root@OLD2:~# ceph -v
ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)

root@OLD3:~# pveversion
pve-manager/5.4-15/d0ec33c6 (running kernel: 4.15.18-30-pve)
root@OLD3:~# ceph -v
ceph version 12.2.13 (98af9a6b9a46b2d562a0de4b09263d70aeb1c9dd) luminous (stable)

root@NEW1:~# pveversion
pve-manager/5.4-15/d0ec33c6 (running kernel: 4.15.18-29-pve)
root@NEW1:~# ceph -v
ceph version 12.2.13 (98af9a6b9a46b2d562a0de4b09263d70aeb1c9dd) luminous (stable)

root@NEW2:~# pveversion
pve-manager/5.4-15/d0ec33c6 (running kernel: 4.15.18-30-pve)
root@NEW2:~# ceph -v
ceph version 12.2.13 (98af9a6b9a46b2d562a0de4b09263d70aeb1c9dd) luminous (stable)

pveceph status works on the old nodes, while on the new nodes it returns "got timeout".
 
pveceph status works on the old nodes, while on the new nodes it returns "got timeout".
That's what I meant by "can all the nodes connect to each other". And it's about the Ceph public network, 10.10.10.0/24.
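A minimal connectivity check from one of the new nodes, using the monitor addresses from the ceph.conf above (6789 is the default monitor port):
Code:
# does the new node have an address in the Ceph public network?
ip -4 addr | grep '10\.10\.10\.'

# can it reach each monitor on its port?
for mon in 10.10.10.3 10.10.10.10 10.10.10.13; do
    nc -vz -w 3 "$mon" 6789
done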
 
