[SOLVED] Help: CEPH pool non responsive/inactive after moving to a new house/new connection

Code:
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 192.168.1.150/24
         fsid = 84a7eb78-7460-4ab6-94f0-efb4fe9dc5f0
         mon_allow_pool_delete = true
         mon_host = 192.168.1.150 192.168.1.151 192.168.1.152
         ms_bind_ipv4 = true
         ms_bind_ipv6 = false
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 192.168.1.150/24

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.localhost]
         public_addr = 192.168.1.150

[mon.mystic2]
         public_addr = 192.168.1.151

[mon.mystic3]
         public_addr = 192.168.1.152
 
Also, I tried to install cephadm and upgrade Proxmox, but now I'm getting this error. Wondering if maybe some packages are damaged? Can I reinstall Ceph without losing the data on my disks?

Setting up proxmox-kernel-6.2 (6.2.16-12) ...
Errors were encountered while processing:
cephadm
E: Sub-process /usr/bin/dpkg returned an error code (1)
 
Hmmm, now I suspect that this whole thing also has something to do with the following error. Should I try to update Ceph to Reef or something?

Code:
root@mystic1:~# ./cephadm install
Installing packages ['cephadm']...
Non-zero exit code 100 from apt-get install -y cephadm
apt-get: stdout Reading package lists...
apt-get: stdout Building dependency tree...
apt-get: stdout Reading state information...
apt-get: stdout cephadm is already the newest version (17.2.6-pve1+3).
apt-get: stdout The following packages were automatically installed and are no longer required:
apt-get: stdout   g++-10 libfmt7 libstdc++-10-dev libthrift-0.13.0 libtiff5 libwebp6
apt-get: stdout   pve-kernel-5.15.107-1-pve pve-kernel-5.15.107-2-pve python-pastedeploy-tpl
apt-get: stdout   telnet
apt-get: stdout Use 'apt autoremove' to remove them.
apt-get: stdout 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
apt-get: stdout 1 not fully installed or removed.
apt-get: stdout After this operation, 0 B of additional disk space will be used.
apt-get: stdout Setting up cephadm (17.2.6-pve1+3) ...
apt-get: stdout usermod: unlocking the user's password would result in a passwordless account.
apt-get: stdout You should set a password with usermod -p to unlock this user's password.
apt-get: stdout mkdir: cannot create directory ‘/home/cephadm/.ssh’: No such file or directory
apt-get: stdout dpkg: error processing package cephadm (--configure):
apt-get: stdout  installed cephadm package post-installation script subprocess returned error exit status 1
apt-get: stdout Errors were encountered while processing:
apt-get: stdout  cephadm
apt-get: stderr E: Sub-process /usr/bin/dpkg returned an error code (1)
Traceback (most recent call last):
  File "/root/./cephadm", line 9653, in <module>
    main()
  File "/root/./cephadm", line 9641, in main
    r = ctx.func(ctx)
        ^^^^^^^^^^^^^
  File "/root/./cephadm", line 8123, in command_install
    pkg.install(ctx.packages)
  File "/root/./cephadm", line 7751, in install
    call_throws(self.ctx, ['apt-get', 'install', '-y'] + ls)
  File "/root/./cephadm", line 1852, in call_throws
    raise RuntimeError(f'Failed command: {" ".join(command)}: {s}')
RuntimeError: Failed command: apt-get install -y cephadm: E: Sub-process /usr/bin/dpkg returned an error code (1)
 
One more piece of info: a friend said RADOS is corrupted because of this message.

Code:
root@mystic1:/var/log/ceph# systemctl status ceph-radosgw.target                                                              
Unit ceph-radosgw.target could not be found.
root@mystic1:/var/log/ceph# systemctl status ceph.target                                                                      
● ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once
     Loaded: loaded (/lib/systemd/system/ceph.target; enabled; preset: enabled)
     Active: active since Thu 2023-09-07 02:45:25 EDT; 27s ago

Sep 07 02:45:25 mystic1 systemd[1]: Reached target ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once
 
Proxmox VE's Ceph does not use cephadm; you should not mess around with external tools like cephadm or ceph-dashboard, as they can hurt your infrastructure - and it seems like that happened here. I would reinstall and reimport the OSDs, though I'm not sure how to do this on 3 completely new nodes.
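
If a reinstall does become necessary, a rough sketch of the recovery path (assuming the OSDs were created with ceph-volume/LVM, which is the Proxmox default) would be to first purge the half-configured cephadm package so dpkg/apt can finish again, and then let ceph-volume re-activate the existing OSD volumes once the Ceph packages are back. Neither step reformats the disks.

Code:
# remove the half-configured cephadm package so dpkg/apt can complete again
apt-get purge -y cephadm

# after the ceph packages are reinstalled on a node, start the existing
# OSDs from their LVM metadata (does not touch the data on the disks)
ceph-volume lvm activate --all

# verify the OSDs rejoined the cluster
ceph osd tree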
 
I definitely didn't mess with external tools until I had exhausted other options. Ceph was non-responsive before that. Do you think reinstalling and reimporting the OSDs is possible? I just don't want to lose the data that I thought is copied across 6 disks :)
 
Okay, let's assume your cephadm install didn't break anything else...

Your problem is that your monitors are out of quorum.
Given that you can ping each node from each other node on the public network, I assume there's no network issue.
Given that ceph --admin-daemon /run/ceph/ceph-mon.$(hostname).asok mon_status only works on one node, I assume that the ceph monitor service does not start on the other two nodes. Hence, no quorum.

Let's move from there... what's the output of these commands on the two hosts where the ceph --admin-daemon command does not work?

Code:
systemctl status ceph-mon@$(hostname).service
journalctl -u ceph-mon@$(hostname).service | tail -n20
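
On the node where the admin socket does work, it is also worth checking whether that monitor sees itself in a quorum. This is just a filter over the mon_status output mentioned above: a healthy monitor reports "leader" or "peon" as its state, while "probing" or "electing" means it cannot form a quorum.

Code:
# look at the "state" and "quorum" fields of the one working monitor
ceph --admin-daemon /run/ceph/ceph-mon.$(hostname).asok mon_status | grep -E '"state"|"quorum"'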
 
I guess I won't be angry if you put my name up as the most stupid person of the month at the top of this forum.

One of the nodes was not fully updated, and the repo was bad on that one. The thing is, when you update Proxmox from the web shell (running 3 nodes) and switch nodes while one node is updating... you don't see the errors when you switch back to it.

That's where I must have missed the upgrade error. Once the node was updated properly, quorum came back, and here I am. I'm still sharing this stupidity so that hopefully someone else doesn't repeat it. I still haven't got my VMs/LXCs up, but now it's a start. I learned a lot about Ceph along the way; huge thanks to everyone sharing insight and guidance.
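
In case it helps anyone hitting the same thing: a quick sanity check with standard Proxmox/Ceph commands (nothing specific to my setup) is to confirm on every node that the upgrade actually finished and that all daemons run the same Ceph release.

Code:
# nothing should be left half-installed or held back
apt-get update && apt-get full-upgrade

# package versions of the Proxmox stack on this node
pveversion -v

# all mon/mgr/osd daemons should report the same Ceph version cluster-wide
ceph versions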


 
FWIW, you're using the same interface for corosync, Ceph public, Ceph private, and VM bridge traffic. All it would take is one Ceph traffic storm to crash your entire cluster.
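
(If you want to see which address corosync is currently bound to before changing anything, the standard tooling will show it.)

Code:
# show the local address and link status corosync is using on this node
corosync-cfgtool -s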

I asked about this during setup and people said it wouldn't help much to use a second interface on the server.
If I am hearing you right... should I do this and use those 4 extra 10G ports on my Netgear managed switch? Does it look right?
Or should I just skip the 1G connection and use the 10G ports only?


[attached screenshot: proposed network layout]
 
Given your switch situation, I would do something like this:

Code:
auto lo
iface lo inet loopback

# corosync
auto eno1
iface eno1 inet static
   address 192.168.100.150/24

#ceph
auto eno3
iface eno3 inet static
   address 192.168.101.150/24
   mtu 9000

# bridge for virtual traffic
iface eno2 inet manual
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.150/24
        gateway 192.168.1.1
        bridge-ports eno2
        bridge-stp off
        bridge-fd 0

I hope it goes without saying you'd need to connect eno1 and eno2 to your switch :)
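
If you go that route, keep in mind that ceph.conf has to follow the new addressing as well. A sketch of the relevant lines, assuming the 192.168.101.0/24 subnet from the example above is dedicated to Ceph; note that moving the monitors' mon_host/public_addr entries to the new subnet effectively means destroying and re-creating the monitors one at a time.

Code:
# /etc/pve/ceph.conf - point Ceph at the dedicated subnet
cluster_network = 192.168.101.0/24
public_network = 192.168.101.0/24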
 
