ceph install broken on new node

Stefano Giunchi

I'm replacing two nodes in my PVE 5.4 cluster; I will upgrade to 6.x after that.
I installed the first of the new nodes, joined it to the cluster, reloaded the web GUI, and everything was OK.

Then, from another node's web GUI, I clicked on the new node's "Ceph" section.
It proposed to install the Ceph packages; I accepted, and then it got a timeout.
Now the packages are installed, but when I go to the new node's Ceph section, it times out:
[screenshot of the timeout error in the Ceph panel]

I tried "pveceph install", which says everything is already installed, while "pveceph init" produces no output.
How can I reinstall ceph on the new node?
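In case the GUI timeout interrupted the package installation halfway, a minimal sanity check (plain apt/dpkg tooling, nothing Ceph-specific is changed) would be:
Code:
# confirm the interrupted install left the Ceph packages fully configured
dpkg -l | grep -E '^ii.*ceph'   # installed Ceph packages and versions
dpkg --audit                    # lists half-installed/half-configured packages, if any
apt-get -f install              # completes a broken or interrupted install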
 
Well, "pveceph install" does the same thing the GUI did (and probably harmless), but "pveceph init" is for a NEW cluster -- not an existing one.

On the NEW node, as 'root', run ls -lh /etc/pve/ceph.conf . Does the file exist?
If so, run ls -lh /etc/ceph/ceph.conf . Is it a symlink like this:
Code:
lrwxrwxrwx 1 root root <some ISO date> /etc/ceph/ceph.conf -> /etc/pve/ceph.conf

What is the output of systemctl status ceph.target ?
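For convenience, a small sketch that runs these checks in one go on the new node (the keyring line is an extra check, since the ceph CLI on the node needs it to authenticate):
Code:
# run on the NEW node as root
ls -lh /etc/pve/ceph.conf                    # cluster-wide config, synced by pmxcfs
ls -lh /etc/ceph/ceph.conf                   # should be a symlink to /etc/pve/ceph.conf
ls -lh /etc/ceph/ceph.client.admin.keyring   # admin keyring used by the ceph CLI
systemctl status ceph.target --no-pager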
 
Well, "pveceph install" does the same thing the GUI did (and probably harmless), but "pveceph init" is for a NEW cluster -- not an existing one.
Before trying that, I had read this post, which said to use pveceph init on new nodes too. It didn't do any harm, but I should have asked before doing it.
On the NEW node, as 'root', run ls -lh /etc/pve/ceph.conf . Does the file exist?
If so, run ls -lh /etc/ceph/ceph.conf . Is it a symlink like this:
Code:
lrwxrwxrwx 1 root root <some ISO date> /etc/ceph/ceph.conf -> /etc/pve/ceph.conf
It's not a link, and it isn't on the other servers either. The file is correctly synced.
Code:
-rw-r-----  1 root www-data  1038 Jul 15 17:12 ceph.conf
What is the output of systemctl status ceph.target ?
Code:
root@PRIVATE:/etc/pve# systemctl status ceph.target
● ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once
   Loaded: loaded (/lib/systemd/system/ceph.target; enabled; vendor preset: enabled)
   Active: active since Wed 2020-07-15 17:01:38 CEST; 5h 8min ago

/var/log/ceph is empty, and ceph status hangs until I terminate it.
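If it helps, the hanging ceph status can be bounded instead of killed by hand; a sketch assuming the standard ceph CLI options and coreutils timeout:
Code:
# give the client a hard deadline instead of letting it hang
ceph -s --connect-timeout 15
# or watch which monitors it tries (and fails) to reach
timeout 30 ceph -s --debug_monc 10 --debug_ms 1 2>&1 | tail -n 30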
 
I added the second new node (it's the fifth in the cluster) and used the pveceph install command.
The result is the same, "Got Timeout (500)".

The new nodes are slightly more up to date, 5.4-15 versus 5.4-13 on the older ones, but there are no Ceph packages left to upgrade on them.
Also, the new nodes are on the no-subscription repository while the older ones have a subscription, because I still have to transfer the subscriptions to the new ones.
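To rule the repositories out, a quick way to compare what is actually installed on an old node versus a new one (a sketch, standard apt/dpkg only):
Code:
# on each node: installed Ceph packages and their versions
dpkg -l | awk '/^ii/ && /ceph/ {print $2, $3}'
# and which repository apt would pull them from
apt-cache policy ceph ceph-mon ceph-osd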
 
If so, run ls -lh /etc/ceph/ceph.conf . Is it a symlink like this:
Code:
lrwxrwxrwx 1 root root <some ISO date> /etc/ceph/ceph.conf -> /etc/pve/ceph.conf

I didn't read that correctly, sorry.

This is the first of the new servers:
Code:
root@NEWSERVER1:~# ls -al /etc/ceph/
total 12
drwxr-xr-x  2 root root 4096 Jul 15 17:12 .
drwxr-xr-x 92 root root 4096 Jul 15 22:31 ..
lrwxrwxrwx  1 root root   18 Jul 15 17:12 ceph.conf -> /etc/pve/ceph.conf
-rw-r--r--  1 root root   92 Nov 19  2018 rbdmap

This is the second new server:
Code:
root@NEWSERVER2:~# ls -al /etc/ceph/
total 12
drwxr-xr-x  2 root root 4096 Jul 15 22:51 .
drwxr-xr-x 92 root root 4096 Jul 15 22:51 ..
-rw-r--r--  1 root root   92 Nov 19  2018 rbdmap

This is one of the old servers:
Code:
root@OLDSERVER1:~# ls -al /etc/ceph/
total 16
drwxr-xr-x   2 root root 4096 Jun 17  2019 .
drwxr-xr-x 103 root root 4096 Jul  6 05:00 ..
-rw-------   1 ceph ceph  159 Dec 29  2018 ceph.client.admin.keyring
lrwxrwxrwx   1 root root   18 Dec 29  2018 ceph.conf -> /etc/pve/ceph.conf
-rw-r--r--   1 root root   92 Jun  7  2017 rbdmap

The keyring is missing on both new servers, and NEWSERVER2 is also missing the symlink to /etc/pve/ceph.conf.
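If that is the problem, a possible fix sketch; it assumes the cluster admin keyring sits at /etc/pve/priv/ceph.client.admin.keyring, as it normally does on a PVE-managed Ceph cluster (the ceph.conf below points keyrings at /etc/pve/priv/), so please double-check the path before copying:
Code:
# on NEWSERVER2 only: recreate the symlink pveceph normally creates
ln -s /etc/pve/ceph.conf /etc/ceph/ceph.conf

# on both new nodes: put the admin keyring where the ceph CLI expects it,
# with the same owner/permissions the old nodes show
cp /etc/pve/priv/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
chown ceph:ceph /etc/ceph/ceph.client.admin.keyring
chmod 600 /etc/ceph/ceph.client.admin.keyring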
 
Please post your ceph.conf. Also, are all nodes reachable?
 
Please post your ceph.conf. Also, are all nodes reachable?

All nodes are reachable, and the cluster is ok:
Code:
root@OLD1:~# ha-manager status
quorum OK
master NEW1 (active, Thu Jul 16 11:39:37 2020)
lrm OLD2 (active, Thu Jul 16 11:39:44 2020)
lrm OLD3 (active, Thu Jul 16 11:39:43 2020)
lrm NEW1 (idle, Thu Jul 16 11:39:42 2020)
lrm NEW2 (idle, Thu Jul 16 11:39:43 2020)
lrm OLD1 (active, Thu Jul 16 11:39:41 2020)
service ct:124 (OLD2, started)
service ct:200 (OLD2, started)
service vm:100 (OLD2, started)
service vm:102 (OLD1, started)
service vm:104 (OLD1, started)
service vm:105 (OLD1, started)
service vm:106 (OLD3, started)
service vm:111 (OLD3, started)
service vm:114 (OLD2, started)
service vm:115 (OLD3, started)
service vm:117 (OLD3, started)
service vm:118 (OLD2, started)
service vm:121 (OLD1, started)
service vm:122 (OLD3, started)
service vm:123 (OLD1, started)
service vm:126 (OLD1, started)

Code:
root@OLD1:~# cat /etc/pve/ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.10.10.0/24
filestore xattr use omap = true
fsid = e3b8320a-5149-4269-93f4-eeddef3597b2
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
mon osd down out interval = 30
osd journal size = 5120
osd pool default min size = 1
public network = 10.10.10.0/24

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
bluestore cache size = 1G
keyring = /var/lib/ceph/osd/ceph-$id/keyring
osd client op priority = 1
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 63

[mds.OLD1]
host = OLD1
mds standby for name = pve

[mds.OLD2]
host = OLD2
mds standby for name = pve

[mds.OLD3]
host = OLD3
mds standby for name = pve

[mon.OLD2]
host = OLD2
mon addr = 10.10.10.10:6789

[mon.OLD3]
host = OLD3
mon addr = 10.10.10.13:6789

[mon.OLD1]
host = OLD1
mon addr = 10.10.10.3:6789

These are the software versions; I did an apt upgrade of OLD3 yesterday:
Code:
root@OLD1:~# pveversion
pve-manager/5.4-13/aee6f0ec (running kernel: 4.15.18-24-pve)
root@OLD1:~# ceph -v
ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)

root@OLD2:~# pveversion
pve-manager/5.4-13/aee6f0ec (running kernel: 4.15.18-23-pve)
root@OLD2:~# ceph -v
ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)

root@OLD3:~# pveversion
pve-manager/5.4-15/d0ec33c6 (running kernel: 4.15.18-30-pve)
root@OLD3:~# ceph -v
ceph version 12.2.13 (98af9a6b9a46b2d562a0de4b09263d70aeb1c9dd) luminous (stable)

root@NEW1:~# pveversion
pve-manager/5.4-15/d0ec33c6 (running kernel: 4.15.18-29-pve)
root@NEW1:~# ceph -v
ceph version 12.2.13 (98af9a6b9a46b2d562a0de4b09263d70aeb1c9dd) luminous (stable)

root@NEW2:~# pveversion
pve-manager/5.4-15/d0ec33c6 (running kernel: 4.15.18-30-pve)
root@NEW2:~# ceph -v
ceph version 12.2.13 (98af9a6b9a46b2d562a0de4b09263d70aeb1c9dd) luminous (stable)

pveceph status works on the old nodes, while on the new nodes it returns "got timeout".
 
pveceph status works on the old nodes, while on the new nodes it returns "got timeout".
That's what I meant by "can all the nodes connect to each other". And it's about the Ceph public network, 10.10.10.0/24.
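A minimal connectivity check from one of the new nodes, using the monitor addresses from the ceph.conf above (6789 is the default monitor port):
Code:
# does the new node have an address in the Ceph public network?
ip -4 addr | grep '10\.10\.10\.'

# can it reach each monitor on its port?
for mon in 10.10.10.3 10.10.10.10 10.10.10.13; do
    nc -vz -w 3 "$mon" 6789
done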
 
