Proxmox three-node cluster - Ceph - got timeout

vobo70

Active Member
Nov 15, 2017
Warsaw, Poland
Hello,
I have a three-node Proxmox cluster. Each node is:

OptiPlex 7020
Xeon E3-1265L v3
16 GB RAM
120 GB SSD for the OS
512 GB NVMe for Ceph
1 GbE network for "external" access
dual 10 GbE network (for the cluster)

The network is connected as described here:
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
(routed setup). Each node can "talk" to the other two nodes.

Network config:
PVE1
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual
        mtu 9000

auto enp1s0f0
iface enp1s0f0 inet static
        address 192.168.20.10/24
        mtu 9000
        up ip route add 192.168.20.30/32 dev enp1s0f0
        down ip route del 192.168.20.30/32

auto enp1s0f1
iface enp1s0f1 inet static
        address 192.168.20.10/24
        mtu 9000
        up ip route add 192.168.20.20/32 dev enp1s0f1
        down ip route del 192.168.20.20/32

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.11/24
        gateway 192.168.10.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        mtu 9000
PVE2
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual
        mtu 9000

auto enp1s0f0
iface enp1s0f0 inet static
        address 192.168.20.20/24
        mtu 9000
        up ip route add 192.168.20.10/32 dev enp1s0f0
        down ip route del 192.168.20.10/32

auto enp1s0f1
iface enp1s0f1 inet static
        address 192.168.20.20/24
        mtu 9000
        up ip route add 192.168.20.30/32 dev enp1s0f1
        down ip route del 192.168.20.30/32

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.12/24
        gateway 192.168.10.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        mtu 9000
PVE3
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual
        mtu 9000

auto enp1s0f0
iface enp1s0f0 inet static
        address 192.168.20.30/24
        mtu 9000
        up ip route add 192.168.20.20/32 dev enp1s0f0
        down ip route del 192.168.20.20/32

auto enp1s0f1
iface enp1s0f1 inet static
        address 192.168.20.30/24
        mtu 9000
        up ip route add 192.168.20.10/32 dev enp1s0f1
        down ip route del 192.168.20.10/32

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.13/24
        gateway 192.168.10.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        mtu 9000
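With a routed full mesh like the above, a quick sanity check is to confirm that the /32 routes are in place and that the links really pass 9000-byte frames end to end. A minimal sketch, run on PVE1 (addresses taken from the configs above):

```shell
# Show the static /32 mesh routes added by the up/down hooks
ip route show | grep 192.168.20

# Jumbo-frame test: 8972 bytes of payload + 28 bytes of IP/ICMP headers
# = 9000 bytes on the wire; -M do sets the don't-fragment bit so an
# MTU mismatch anywhere on the path fails loudly instead of silently
ping -c 3 -M do -s 8972 192.168.20.20   # PVE2
ping -c 3 -M do -s 8972 192.168.20.30   # PVE3
```

If the plain ping works but the jumbo ping does not, one of the interfaces (or a switch in between, if any) is not actually running at MTU 9000.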
ceph.conf:

Code:
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster network = 192.168.20.0/24
         fsid = 9f47e518-4613-4564-863b-e8d3a923a1f5
         mon_allow_pool_delete = true
         mon_host = 192.168.20.10
         ms_bind_ipv4 = true
         ms_bind_ipv6 = false
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 192.168.20.0/24

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve1]
         public_addr = 192.168.20.10
Any Ceph command times out, for example:
Code:
root@pve1:~# pveceph status
command 'ceph -s' failed: got timeout
Can anyone help me with this?
Thanks in advance.
 

vobo70
The ceph-mon service was dead:
Code:
ceph-mon@pve1.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; disabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: inactive (dead)
I started it, and now it fails like this:

Code:
root@pve1:~# systemctl status ceph-mon@pve1
● ceph-mon@pve1.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; disabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: exit-code) since Wed 2022-01-26 17:29:19 CET; 10s ago
    Process: 1445 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id pve1 --setuser ceph --setgroup ceph (code>
   Main PID: 1445 (code=exited, status=1/FAILURE)
        CPU: 38ms

Jan 26 17:29:19 pve1 systemd[1]: ceph-mon@pve1.service: Scheduled restart job, restart counter is at 5.
Jan 26 17:29:19 pve1 systemd[1]: Stopped Ceph cluster monitor daemon.
Jan 26 17:29:19 pve1 systemd[1]: ceph-mon@pve1.service: Start request repeated too quickly.
Jan 26 17:29:19 pve1 systemd[1]: ceph-mon@pve1.service: Failed with result 'exit-code'.
Jan 26 17:29:19 pve1 systemd[1]: Failed to start Ceph cluster monitor daemon.
 

vobo70
I also found this in /var/log/ceph/ceph-mon.pve1.log (I don't know whether it is the cause of the problem):
Code:
2022-01-26T17:28:38.760+0100 7fd0145f2580 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve1' does not exist: have you run 'mkfs'?
2022-01-26T17:28:49.005+0100 7f7ae6332580  0 set uid:gid to 64045:64045 (ceph:ceph)
2022-01-26T17:28:49.005+0100 7f7ae6332580  0 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable), process ceph-mon, pid 1390
2022-01-26T17:28:49.005+0100 7f7ae6332580 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve1' does not exist: have you run 'mkfs'?
2022-01-26T17:28:59.258+0100 7f1312c7e580  0 set uid:gid to 64045:64045 (ceph:ceph)
2022-01-26T17:28:59.258+0100 7f1312c7e580  0 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable), process ceph-mon, pid 1414
2022-01-26T17:28:59.258+0100 7f1312c7e580 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve1' does not exist: have you run 'mkfs'?
2022-01-26T17:29:09.522+0100 7f7fd56ff580  0 set uid:gid to 64045:64045 (ceph:ceph)
2022-01-26T17:29:09.522+0100 7f7fd56ff580  0 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable), process ceph-mon, pid 1445
2022-01-26T17:29:09.522+0100 7f7fd56ff580 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve1' does not exist: have you run 'mkfs'?
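That "have you run 'mkfs'?" message means the monitor's data store is missing, which also explains the systemd restart loop above: ceph-mon exits immediately every time it is started. A quick way to confirm (paths taken from the log above):

```shell
# The monitor store the log complains about should live here
ls -l /var/lib/ceph/mon/ceph-pve1

# And the unit's own view of the repeated failures
journalctl -u ceph-mon@pve1 -b --no-pager | tail -n 20
```

If the directory is gone, the monitor database is gone with it; restarting the service cannot bring it back.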
 

gurubert

Well-Known Member
It looks like something or somebody wiped the monitor database, which means the Ceph cluster has ceased to exist. You will have to start from scratch.
Unfortunately, the cluster had only one MON, so there was no redundancy. Start your next cluster with at least three MONs, and always use an odd number of MONs.
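On Proxmox VE the rebuild can be driven with pveceph. A rough sketch, assuming the old (already lost) Ceph state has been cleaned off all three nodes first, using the cluster network from the original ceph.conf:

```shell
# Initialise a fresh Ceph configuration on the dedicated cluster subnet
pveceph init --network 192.168.20.0/24

# Create one monitor per node: three MONs give quorum survival of a
# single node failure, and an odd count avoids split votes
pveceph mon create            # run on pve1
ssh pve2 pveceph mon create   # then on pve2
ssh pve3 pveceph mon create   # and on pve3
```

After that, OSDs can be recreated on each node's NVMe device (e.g. `pveceph osd create /dev/nvme0n1`; the device name here is an assumption, check yours with `lsblk`).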
 
