[SOLVED] Tried my first 3Node CEPH Cluster on 6.4-6

mircsicz

Hi all,

been running Proxmox for more or less five years, and now I wanted to try my first real cluster:

I've got three Supermicro 1U Xeon servers with four 3.5" drive bays and dual-port 10G cards for Ceph, all freshly reinstalled, each with an SSD as its install drive and 4 drives in the cage. Setup went smoothly so far; I did the basic setup using Ansible: adding my user, his SSH key and some additional apps like tmux, htop and telegraf.

Then I created the cluster and set up Ceph on top of that, and when I tried to restore my first LXC (an 8 GB Debian CT), it took 8 hours to restore the 8 GB... :-(

Then I had my first look at my Grafana dashboard and noticed IO wait of up to 20%... So today I dove into it:

Code:
root@pve1|2|3:~# pveversion
pve-manager/6.4-6/be2fa32c (running kernel: 5.4.106-1-pve)

So the web interface doesn't really give good feedback, as it tends to run into 500 errors. But it all boils down to this: the mon on PVE2 seems to run, but doesn't make it through to the others:
Code:
root@pve2:~# service ceph-mon@pve2 status
● ceph-mon@pve2.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
           └─ceph-after-pve-cluster.conf
   Active: active (running) since Thu 2021-05-27 11:26:18 CEST; 21min ago
 Main PID: 531557 (ceph-mon)
    Tasks: 26
   Memory: 79.9M
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@pve2.service
           └─531557 /usr/bin/ceph-mon -f --cluster ceph --id pve2 --setuser ceph --setgroup cep

May 27 11:26:18 pve2 systemd[1]: Started Ceph cluster monitor daemon.

Code:
root@pve1:~# ceph -s
  cluster:
    id:     95b97ce6-42d5-47fb-b97d-cc040dd50455
    health: HEALTH_WARN
            2 osds down
            1 host (4 osds) down
            Slow OSD heartbeats on back (longest 16716.332ms)
            Slow OSD heartbeats on front (longest 17096.120ms)
            Reduced data availability: 13 pgs inactive, 12 pgs down
            Degraded data redundancy: 642/2223 objects degraded (28.880%), 112 pgs degraded, 113 pgs undersized
            2 daemons have recently crashed
            1 slow ops, oldest one blocked for 6180 sec, osd.10 has slow ops


  services:
    mon: 2 daemons, quorum pve1,pve3 (age 3m)
    mgr: pve2(active, since 19m), standbys: pve3, pve1
    osd: 12 osds: 6 up (since 97m), 8 in (since 2h); 4 remapped pgs


  data:
    pools:   2 pools, 129 pgs
    objects: 741 objects, 2.8 GiB
    usage:   12 GiB used, 5.4 TiB / 5.5 TiB avail
    pgs:     10.078% pgs not active
             642/2223 objects degraded (28.880%)
             29/2223 objects misplaced (1.305%)
             112 active+undersized+degraded
             12  down
             4   active+clean+remapped
             1   undersized+peered


  progress:
    PG autoscaler decreasing pool 1 PGs from 128 to 32 (17m)
      [............................]

And a lot of the OSDs are down...
Code:
root@pve1:~# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         10.91638  root default
 -3          3.63879      host pve1
  0    ssd   0.90970          osd.0     down   1.00000  1.00000
  1    ssd   0.90970          osd.1     down         0  1.00000
  2    ssd   0.90970          osd.2     down         0  1.00000
  3    ssd   0.90970          osd.3     down   1.00000  1.00000
 -7          3.63879      host pve2
  4    hdd   0.90970          osd.4       up   1.00000  1.00000
  5    hdd   0.90970          osd.5       up   1.00000  1.00000
  6    hdd   0.90970          osd.6       up   1.00000  1.00000
 11    hdd   0.90970          osd.11      up   1.00000  1.00000
-10          3.63879      host pve3
  7    hdd   0.90970          osd.7     down         0  1.00000
  8    hdd   0.90970          osd.8       up   1.00000  1.00000
  9    hdd   0.90970          osd.9     down         0  1.00000
 10    hdd   0.90970          osd.10      up   1.00000  1.00000

So I'm really hoping to get some guidance here, as I don't know where to start with all that...
 

Attachments

  • Bildschirmfoto 2021-05-27 um 11.44.11.png (screenshot, 149.3 KB)
What is your network setup, and how fast is it?
Is a connection on the Ceph network possible between all nodes? (ping test)

You can directly check the state the monitors are in. The Ceph docs have a good section about that.

Anything in the logs of the two OSDs on node 3 that are down? /var/log/ceph/...
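
For reference, a rough sketch of what those checks could look like, assuming the default admin socket and log paths and taking osd.7 and osd.9 (the down ones on pve3 in the osd tree above) as examples:

Code:
# Ask a monitor for its own view of the quorum via its admin socket, on the node running it:
ceph daemon mon.pve2 mon_status

# Or query the quorum state from any node:
ceph quorum_status --format json-pretty

# Then have a look at the logs of the down OSDs on node 3:
tail -n 100 /var/log/ceph/ceph-osd.7.log
tail -n 100 /var/log/ceph/ceph-osd.9.log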
 
Hi Aaron,

THX for your reply :cool:

What is your network setup, and how fast is it?
Is a connection on the Ceph network possible between all nodes? (ping test)
As stated in my first post, it's a 10G connection.

I did a brief ping between the hosts before I set up Ceph; while retesting pings before replying I just realized there's some loss :-(
You can directly check the state the monitors are in. The Ceph docs have a good section about that.

Anything in the logs of the two OSDs on node 3 that are down? /var/log/ceph/...
THX, but I guess pings have to be lossless before I dig into that...
 
Are you using jumbo frames on the Ceph interfaces? If yes, is the frame size correctly set on all interfaces + switch? "did the basic setup using Ansible" - what does that mean? Is it an Ansible playbook of your own?
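
One hedged way to check that from each node: a do-not-fragment ping with a jumbo-sized payload has to make it across the Ceph network in both directions (the target below is just a placeholder for the neighbour's Ceph IP):

Code:
# 8972-byte payload = 9000-byte MTU minus 28 bytes of IP/ICMP headers; -M do forbids fragmentation
ping -c 3 -M do -s 8972 <ceph-ip-of-neighbour-node>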
 
Are you using jumbo frames on the Ceph interfaces? If yes, is the frame size correctly set on all interfaces + switch? "did the basic setup using Ansible" - what does that mean? Is it an Ansible playbook of your own?
My whole network setup looks like this:

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.10.253.61/24
    gateway 10.10.253.1
    bridge_ports eno1
    bridge_stp off
    bridge_fd 0

iface eno2 inet manual

auto vmbr1
iface vmbr1 inet static
    address 10.10.254.61/24
    bridge_ports enp1s0f0 enp1s0f1
    bridge_stp off
    bridge_fd 0

where vmbr0 is the LAN interface and vmbr1 is the 10G network used exclusively for Ceph, which is set up as a ring without a switch ;-)

Ansible only set up some extra packages I prefer to have at hand, Telegraf for metrics, and my personal user account for SSH. Yes, all using my own playbooks ;-)
 
My whole network setup looks like this:

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.10.253.61/24
    gateway 10.10.253.1
    bridge_ports eno1
    bridge_stp off
    bridge_fd 0

iface eno2 inet manual

auto vmbr1
iface vmbr1 inet static
    address 10.10.254.61/24
    bridge_ports enp1s0f0 enp1s0f1
    bridge_stp off
    bridge_fd 0

where vmbr0 is the LAN interface and vmbr1 is the 10G network used exclusively for Ceph, which is set up as a ring without a switch ;-)

Ansible only set up some extra packages I prefer to have at hand, Telegraf for metrics, and my personal user account for SSH. Yes, all using my own playbooks ;-)
I don't know if that might cause the problem, but why are you using a vmbridge for the Ceph network? Can't you take a simple single interface for a meshed setup: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server ?

Can you try this? You only need vmbridges if you want to connect VMs to that network, which you shouldn't on the Ceph network.
 
I don't know if that might cause the problem, but why are you using a vmbridge for the Ceph network? Can't you take a simple single interface for a meshed setup: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server ?

Can you try this? You only need vmbridges if you want to connect VMs to that network, which you shouldn't on the Ceph network.
Simply because for me this was the most logical way (and the only one I came up with) to do it:
[Attached image: Hy3oFt.jpg]


But I'm totally open to suggestions from outside my box :rolleyes:
 
Hmm, wait a second: by adding both NICs to the bridge on all nodes, you got a loop. Which means you need some STP running to cut off the loop at some point. By default, STP is off on the bridge. See https://forum.proxmox.com/threads/linux-bridge-howto-bridge-stp-on.68641/

But even then, the traffic between two of the 3 nodes will go over the third, depending on where STP makes the cut. This might be okay-ish, but definitely not the best performance- and latency-wise. That's why the Full Mesh Ceph page in the PVE wiki prefers a different setup where the traffic will only be sent to the actual target node (routed). The downside is that it is not as fault-tolerant unless you create every connection between the nodes as a bond with 2 cables. The main intention is to be used in a 3-node cluster in close proximity (same rack) with very short cables between the nodes. The probability of problems occurring during regular operation (faulty cable, NIC, ...) should be low enough. Then again, if fault tolerance has a high priority -> bond or going all the way with 2 separate switches.
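
(A small aside on the STP part, not from the original reply: the bridge port states show where the loop actually got cut, for example:)

Code:
# Port state "forwarding" vs. "blocking" reveals which link STP disabled
bridge link show
# More STP detail for the Ceph bridge (root bridge, port roles), if bridge-utils is installed:
brctl showstp vmbr1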
 
@jsterr THX a ton for bringing jumbo frames to my attention!

I changed my network interfaces to this:
Code:
auto vmbr1
iface vmbr1 inet static
        address 10.10.254.61/24
        bridge_ports enp1s0f0 enp1s0f1
        bridge_stp on
        mtu 9000
        bridge_fd 0

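(A quick sanity check for a config like the one above, assuming those interface names: the new MTU should show up on the bridge and on both member ports.)

Code:
ip link show vmbr1     | grep -o 'mtu [0-9]*'
ip link show enp1s0f0  | grep -o 'mtu [0-9]*'
ip link show enp1s0f1  | grep -o 'mtu [0-9]*'
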
And now my ping loss is gone:
[Attached image: bUQMgG.jpg]


UPDATE an hour later:
[Attached image: anjgVw.jpg]
 
Thx for replying again - I thought I had already posted my reply to @jsterr, but failed to click the "Post reply" button :eek:

Hmm, wait a second: by adding both NICs to the bridge on all nodes, you got a loop. Which means you need some STP running to cut off the loop at some point. By default, STP is off on the bridge. See https://forum.proxmox.com/threads/linux-bridge-howto-bridge-stp-on.68641/
Now that I've modified /etc/network/interfaces, it works as expected...

Code:
root@pve2:~# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         10.91638  root default
 -3          3.63879      host pve1
  0    ssd   0.90970          osd.0       up   1.00000  1.00000
  1    ssd   0.90970          osd.1       up   1.00000  1.00000
  2    ssd   0.90970          osd.2       up   1.00000  1.00000
  3    ssd   0.90970          osd.3       up   1.00000  1.00000
 -7          3.63879      host pve2
  4    hdd   0.90970          osd.4       up   1.00000  1.00000
  5    hdd   0.90970          osd.5       up   1.00000  1.00000
  6    hdd   0.90970          osd.6       up   1.00000  1.00000
 11    hdd   0.90970          osd.11      up   1.00000  1.00000
-10          3.63879      host pve3
  7    hdd   0.90970          osd.7       up   1.00000  1.00000
  8    hdd   0.90970          osd.8       up   1.00000  1.00000
  9    hdd   0.90970          osd.9       up   1.00000  1.00000
 10    hdd   0.90970          osd.10      up   1.00000  1.00000

But one msg keeps bothering me:
[Attached image: gwD9d4.jpg]


"HEALTH_WARN" probably won't go away because of "2 daemons have recently crashed"... Is there a way to acknowledge that message?

And is it correct that you'd expect better performance from changing my /etc/network/interfaces to this:
Code:
# Connected to PVE2 (.62)
auto enp1s0f0
iface enp1s0f0 inet static
        address  10.15.15.61
        netmask  255.255.255.0
        up ip route add 10.15.15.62/32 dev enp1s0f0
        down ip route del 10.15.15.62/32

# Connected to PVE3 (.63)
auto enp1s0f1
iface enp1s0f1 inet static
        address  10.15.15.61
        netmask  255.255.255.0
        up ip route add 10.15.15.63/32 dev enp1s0f1
        down ip route del 10.15.15.63/32

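(With a routed config like the one above, a quick check that traffic to each neighbour really leaves through the intended NIC would be:)

Code:
ip route get 10.15.15.62   # should report "dev enp1s0f0"
ip route get 10.15.15.63   # should report "dev enp1s0f1"
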
Besides that I'm ready to mark the topic SOLVED ;-)
 
Now that I've modified /etc/network/interfaces, it works as expected...
Great to hear :)

"HEALTH_WARN" probably won't go away because of "2 daemons have recently crashed"... Is there a way to acknowledge that message?
That should go away after some time. I am not sure after what time Ceph no longer considers it "recent" ;)
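
A hedged addition to that: the crash module (enabled by default since Ceph Nautilus) also lets you clear the warning explicitly:

Code:
# Show the recently crashed daemons behind the HEALTH_WARN
ceph crash ls
# Archive them so the "recently crashed" warning clears right away
ceph crash archive-all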

Besides that I'm ready to mark the topic SOLVED ;-)
Feel free to do so by editing the first post and selecting the prefix from the drop-down menu next to the title.
 
