Proxmox three-node cluster - Ceph - got timeout

vobo70

Active Member
Nov 15, 2017
Warsaw, Poland
Hello,
I have a three-node Proxmox cluster. Each node is:

OptiPlex 7020
Xeon E3-1265L v3
16 GB RAM
120 GB SSD for the OS
512 GB NVMe for Ceph
1 GbE network for "external" access
dual 10 GbE network (for the cluster)

The network is connected as described here:
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
(routed setup). Each node can "talk" to the other two nodes.

Network config:
PVE1
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual
        mtu 9000

auto enp1s0f0
iface enp1s0f0 inet static
        address 192.168.20.10/24
        mtu 9000
        up ip route add 192.168.20.30/32 dev enp1s0f0
        down ip route del 192.168.20.30/32

auto enp1s0f1
iface enp1s0f1 inet static
        address 192.168.20.10/24
        mtu 9000
        up ip route add 192.168.20.20/32 dev enp1s0f1
        down ip route del 192.168.20.20/32

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.11/24
        gateway 192.168.10.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        mtu 9000
PVE2
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual
        mtu 9000

auto enp1s0f0
iface enp1s0f0 inet static
        address 192.168.20.20/24
        mtu 9000
        up ip route add 192.168.20.10/32 dev enp1s0f0
        down ip route del 192.168.20.10/32

auto enp1s0f1
iface enp1s0f1 inet static
        address 192.168.20.20/24
        mtu 9000
        up ip route add 192.168.20.30/32 dev enp1s0f1
        down ip route del 192.168.20.30/32

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.12/24
        gateway 192.168.10.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        mtu 9000
PVE3
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual
        mtu 9000

auto enp1s0f0
iface enp1s0f0 inet static
        address 192.168.20.30/24
        mtu 9000
        up ip route add 192.168.20.20/32 dev enp1s0f0
        down ip route del 192.168.20.20/32

auto enp1s0f1
iface enp1s0f1 inet static
        address 192.168.20.30/24
        mtu 9000
        up ip route add 192.168.20.10/32 dev enp1s0f1
        down ip route del 192.168.20.10/32

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.13/24
        gateway 192.168.10.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        mtu 9000
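With a routed full mesh like the above, a quick sanity check is to confirm that the /32 routes are in place and that the links really pass 9000-byte frames end to end. A minimal sketch, run on PVE1 (addresses taken from the configs above):

```shell
# Show the static /32 mesh routes added by the up/down hooks
ip route show | grep 192.168.20

# Jumbo-frame test: 8972 bytes of payload + 28 bytes of IP/ICMP headers
# = 9000 bytes on the wire; -M do sets the don't-fragment bit so an
# MTU mismatch anywhere on the path fails loudly instead of silently
ping -c 3 -M do -s 8972 192.168.20.20   # PVE2
ping -c 3 -M do -s 8972 192.168.20.30   # PVE3
```

If the plain ping works but the jumbo ping does not, one of the interfaces (or a switch in between, if any) is not actually running at MTU 9000.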
ceph.conf:

Code:
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster network = 192.168.20.0/24
         fsid = 9f47e518-4613-4564-863b-e8d3a923a1f5
         mon_allow_pool_delete = true
         mon_host = 192.168.20.10
         ms_bind_ipv4 = true
         ms_bind_ipv6 = false
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 192.168.20.0/24

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve1]
         public_addr = 192.168.20.10
Any Ceph command times out, for example:
Code:
root@pve1:~# pveceph status
command 'ceph -s' failed: got timeout
Can anyone help me with this?
Thanks in advance.
 

vobo70
The ceph-mon service was dead:
Code:
ceph-mon@pve1.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; disabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: inactive (dead)
I started it, and now it fails like this:

Code:
root@pve1:~# systemctl status ceph-mon@pve1
● ceph-mon@pve1.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; disabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: exit-code) since Wed 2022-01-26 17:29:19 CET; 10s ago
    Process: 1445 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id pve1 --setuser ceph --setgroup ceph (code>
   Main PID: 1445 (code=exited, status=1/FAILURE)
        CPU: 38ms

Jan 26 17:29:19 pve1 systemd[1]: ceph-mon@pve1.service: Scheduled restart job, restart counter is at 5.
Jan 26 17:29:19 pve1 systemd[1]: Stopped Ceph cluster monitor daemon.
Jan 26 17:29:19 pve1 systemd[1]: ceph-mon@pve1.service: Start request repeated too quickly.
Jan 26 17:29:19 pve1 systemd[1]: ceph-mon@pve1.service: Failed with result 'exit-code'.
Jan 26 17:29:19 pve1 systemd[1]: Failed to start Ceph cluster monitor daemon.
 

vobo70
I also found this in /var/log/ceph/ceph-mon.pve1.log (I don't know whether it is the cause of the problem):
Code:
2022-01-26T17:28:38.760+0100 7fd0145f2580 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve1' does not exist: have you run 'mkfs'?
2022-01-26T17:28:49.005+0100 7f7ae6332580  0 set uid:gid to 64045:64045 (ceph:ceph)
2022-01-26T17:28:49.005+0100 7f7ae6332580  0 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable), process ceph-mon, pid 1390
2022-01-26T17:28:49.005+0100 7f7ae6332580 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve1' does not exist: have you run 'mkfs'?
2022-01-26T17:28:59.258+0100 7f1312c7e580  0 set uid:gid to 64045:64045 (ceph:ceph)
2022-01-26T17:28:59.258+0100 7f1312c7e580  0 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable), process ceph-mon, pid 1414
2022-01-26T17:28:59.258+0100 7f1312c7e580 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve1' does not exist: have you run 'mkfs'?
2022-01-26T17:29:09.522+0100 7f7fd56ff580  0 set uid:gid to 64045:64045 (ceph:ceph)
2022-01-26T17:29:09.522+0100 7f7fd56ff580  0 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable), process ceph-mon, pid 1445
2022-01-26T17:29:09.522+0100 7f7fd56ff580 -1 monitor data directory at '/var/lib/ceph/mon/ceph-pve1' does not exist: have you run 'mkfs'?
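That "have you run 'mkfs'?" message means the monitor's data store is missing, which also explains the systemd restart loop above: ceph-mon exits immediately every time it is started. A quick way to confirm (paths taken from the log above):

```shell
# The monitor store the log complains about should live here
ls -l /var/lib/ceph/mon/ceph-pve1

# And the unit's own view of the repeated failures
journalctl -u ceph-mon@pve1 -b --no-pager | tail -n 20
```

If the directory is gone, the monitor database is gone with it; restarting the service cannot bring it back.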
 

gurubert

Well-Known Member
It looks like something or somebody wiped the monitor database, which means the Ceph cluster has ceased to exist. You will have to start from scratch.
Unfortunately, the cluster had only one MON, so there was no redundancy. Start your next cluster with at least three MONs, and always use an odd number of MONs.
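On Proxmox VE the rebuild can be driven with pveceph. A rough sketch, assuming the old (already lost) Ceph state has been cleaned off all three nodes first, using the cluster network from the original ceph.conf:

```shell
# Initialise a fresh Ceph configuration on the dedicated cluster subnet
pveceph init --network 192.168.20.0/24

# Create one monitor per node: three MONs give quorum survival of a
# single node failure, and an odd count avoids split votes
pveceph mon create            # run on pve1
ssh pve2 pveceph mon create   # then on pve2
ssh pve3 pveceph mon create   # and on pve3
```

After that, OSDs can be recreated on each node's NVMe device (e.g. `pveceph osd create /dev/nvme0n1`; the device name here is an assumption, check yours with `lsblk`).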
 
