Ceph Configuration Issues in Blade Environment

Seraf

New Member
Aug 31, 2020
Hello everyone!

I'm not new to Ceph config/deploy, but this setup has a number of new variables compared to my previous experience setting up Ceph.

Here's a lay of the land to get started:

We have 8 M610 blades in an M1000e chassis. The chassis has a 1GbE passthrough module in fabric A1, a 1GbE switch module in A2, an Infiniband switch in B1/C1, and an SFP+ switch module in B2; C2 has a blank. The M610s all have an Infiniband mezzanine card installed. Only a portion of the blades have 10GbE mezzanine cards, so for the purposes of this deployment we are ignoring any and all SFP+ interfaces.

Proxmox is installed on USB keys on each blade's internal USB port (yes, I'm aware this is not preferred due to the possibility of device burnout, and we made this decision regardless), and each blade has at least one 840GiB 10K SAS drive (some have two). Each node has been configured with the appropriate infiniband support, and IPoIB has been configured and confirmed to work.

Public network is sitting on 10.3.2.0/8
Cluster network is sitting on 10.3.5.0/24
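
For anyone replicating this, the IPoIB side boils down to an ifupdown stanza along these lines (interface name, address, and the connected-mode toggle are illustrative and depend on your IB stack):
Code:
# /etc/network/interfaces -- example IPoIB stanza for the cluster network
auto ib0
iface ib0 inet static
        address 10.3.5.13/24
        mtu 65520
        # connected mode is what allows the large MTU; datagram mode is limited to ~2044
        pre-up echo connected > /sys/class/net/ib0/mode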

Core proxmox functions all seem to be healthy. All nodes successfully joined the cluster, and we are able to manage all of the nodes through the unified cluster interface without issue.

Ceph was installed on all of the units through the GUI; however, the initial configuration was not successful because the network detection found multiple possible options for the cluster network, so we had to initialize the Ceph cluster manually on the CLI and configure the monitors with --mon-address.
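
For anyone hitting the same autodetection issue, the manual route boils down to something like this (the CIDRs and address shown are illustrative; check man pveceph for the exact option names on your version):
Code:
# write ceph.conf with the networks spelled out instead of letting the GUI guess
pveceph init --network 10.3.2.0/24 --cluster-network 10.3.5.0/24
# create each monitor with an explicit address on the intended network
pveceph mon create --mon-address 10.3.5.13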

Of the eight nodes, we currently have four monitors configured and two managers. We initially had a manager running on node1 (pxmox-s02), but that node was behaving somewhat poorly, so we created managers on two other nodes (pxmox-s03 and pxmox-s04) and destroyed the monitor on pxmox-s02. Here's our first issue; that monitor continues to show in the GUI and is currently showing as 'active' while the other two are in standby. Attempts to destroy this through the GUI produce an 'entry has no host' error, and no entry exists in /var/lib/ceph/mgr for this manager.
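
For reference, these are the CLI checks I expect would show where the phantom entry is coming from (standard Ceph/PVE commands; node name as in our cluster):
Code:
systemctl status ceph-mgr@pxmox-s02   # any stale systemd unit left on the old node?
ceph auth ls | grep mgr               # any leftover auth key for the destroyed manager?
pveceph mgr destroy pxmox-s02         # CLI equivalent of the GUI destroy action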

Here is the configuration displayed in the ceph panel:
Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.3.5.13/24
     fsid = ae835579-4b40-47e3-a215-9ce2a2edc319
     mon_allow_pool_delete = true
     mon_host = 10.3.5.13 10.3.5.20 10.3.5.21 10.3.5.14
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.3.2.13/8

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pxmox-s05]
     host = pxmox-s05
     mds_standby_for_name = pve

[mds.pxmox-s10]
     host = pxmox-s10
     mds_standby_for_name = pve

[mon.pxmox-s03]
     public_addr = 10.3.5.13

[mon.pxmox-s04]
     public_addr = 10.3.5.14

[mon.pxmox-s10]
     public_addr = 10.3.5.20

[mon.pxmox-s11]
     public_addr = 10.3.5.21

We are frequently seeing timeouts. Commands often have to be rerun, as the first try results in a "get timeout" error.

OSDs have been created via the GUI for each drive on each blade. The operation completes successfully, but once it finishes the OSDs do not show up in the OSD subsection of the Ceph group in the GUI. Additionally, Ceph throws a warning that OSD count 0 < osd_pool_default_size 3. The OSDs do show up in the crushmap, included below (a CLI cross-check follows it).

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 6 osd.6 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host pxmox-s03 {
    id -3        # do not change unnecessarily
    id -4 class hdd        # do not change unnecessarily
    # weight 1.637
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.819
    item osd.1 weight 0.819
}
host pxmox-s04 {
    id -5        # do not change unnecessarily
    id -6 class hdd        # do not change unnecessarily
    # weight 1.637
    alg straw2
    hash 0    # rjenkins1
    item osd.2 weight 0.819
    item osd.3 weight 0.819
}
host pxmox-s02 {
    id -7        # do not change unnecessarily
    id -8 class hdd        # do not change unnecessarily
    # weight 0.819
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 0.819
}
host pxmox-s13 {
    id -9        # do not change unnecessarily
    id -10 class hdd        # do not change unnecessarily
    # weight 0.819
    alg straw2
    hash 0    # rjenkins1
    item osd.6 weight 0.819
}
host pxmox-s10 {
    id -11        # do not change unnecessarily
    id -12 class hdd        # do not change unnecessarily
    # weight 0.819
    alg straw2
    hash 0    # rjenkins1
    item osd.9 weight 0.819
}
host pxmox-s12 {
    id -13        # do not change unnecessarily
    id -14 class hdd        # do not change unnecessarily
    # weight 0.819
    alg straw2
    hash 0    # rjenkins1
    item osd.11 weight 0.819
}
host pxmox-s11 {
    id -15        # do not change unnecessarily
    id -16 class hdd        # do not change unnecessarily
    # weight 0.819
    alg straw2
    hash 0    # rjenkins1
    item osd.10 weight 0.819
}
root default {
    id -1        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    # weight 7.368
    alg straw2
    hash 0    # rjenkins1
    item pxmox-s03 weight 1.637
    item pxmox-s04 weight 1.637
    item pxmox-s02 weight 0.819
    item pxmox-s13 weight 0.819
    item pxmox-s10 weight 0.819
    item pxmox-s12 weight 0.819
    item pxmox-s11 weight 0.819
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
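
For reference, the same information can be cross-checked from the CLI with the stock Ceph commands (OSD 0 as an example):
Code:
ceph osd tree                   # what the cluster map knows about
ceph osd stat                   # overall up/in counts
systemctl status ceph-osd@0     # is the daemon for a given OSD actually running on its node?
journalctl -u ceph-osd@0 -n 50  # recent log lines for that OSD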

At this point we're stuck on how to proceed. We were able to create a Ceph pool and a CephFS MDS, but we cannot create a CephFS pool because the process times out.

Any assistance and guidance would be highly appreciated!
 
Shooting from the hip, I would say that the timeouts might stem from the USB stick, since the MON DB is located there as well. This may also be the reason why the OSDs won't connect. But for more detail, you will need to check your logs.
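
To check, the MON store and its log are in the usual places (paths/units are the defaults; adjust the hostname):
Code:
du -sh /var/lib/ceph/mon/ceph-$(hostname)/store.db    # MON DB size, sitting on the USB stick
journalctl -u ceph-mon@$(hostname) --since "1 hour ago" | grep -iE 'slow|timeout'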

Here's our first issue; that monitor continues to show in the GUI and is currently showing as 'active' while the other two are in standby
MON and MGR are two different services. The MONs are all active and hold the quorum; without them Ceph doesn't work. The MGR provides stats and certain management functions; only one is active and the others wait in standby.
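
You can see the difference directly on the CLI:
Code:
ceph mon stat   # all MONs should show up in the quorum
ceph mgr stat   # exactly one active MGR, the rest are standbys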
 
I've done a check through the logs and they do show indications of latency.

Code:
Sep 01 10:51:42 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:51:42.444 7fcb50baa700 -1 mon.pxmox-s11@2(electing) e6 get_health_metrics reporting 10 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)
Sep 01 10:51:47 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:51:47.444 7fcb50baa700 -1 mon.pxmox-s11@2(electing) e6 get_health_metrics reporting 10 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)
Sep 01 10:51:52 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:51:52.443 7fcb50baa700 -1 mon.pxmox-s11@2(electing) e6 get_health_metrics reporting 8 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)
Sep 01 10:52:00 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:52:00.759 7fcb50baa700 -1 mon.pxmox-s11@2(peon) e6 get_health_metrics reporting 8 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)
Sep 01 10:52:05 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:52:05.763 7fcb50baa700 -1 mon.pxmox-s11@2(peon) e6 get_health_metrics reporting 8 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)
Sep 01 10:52:10 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:52:10.763 7fcb50baa700 -1 mon.pxmox-s11@2(peon) e6 get_health_metrics reporting 8 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)
Sep 01 10:52:15 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:52:15.763 7fcb50baa700 -1 mon.pxmox-s11@2(electing) e6 get_health_metrics reporting 8 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)
Sep 01 10:52:23 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:52:23.903 7fcb50baa700 -1 mon.pxmox-s11@2(electing) e6 get_health_metrics reporting 8 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)
Sep 01 10:52:28 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:52:28.903 7fcb50baa700 -1 mon.pxmox-s11@2(peon) e6 get_health_metrics reporting 8 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)
Sep 01 10:52:33 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:52:33.902 7fcb50baa700 -1 mon.pxmox-s11@2(peon) e6 get_health_metrics reporting 8 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)
Sep 01 10:52:38 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:52:38.902 7fcb50baa700 -1 mon.pxmox-s11@2(electing) e6 get_health_metrics reporting 8 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)
Sep 01 10:52:43 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:52:43.906 7fcb50baa700 -1 mon.pxmox-s11@2(peon) e6 get_health_metrics reporting 8 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)
Sep 01 10:52:48 pxmox-s11 ceph-mon[1379]: 2020-09-01 10:52:48.906 7fcb50baa700 -1 mon.pxmox-s11@2(peon) e6 get_health_metrics reporting 8 slow ops, oldest is mon_command({"prefix":"pg dump","format":"json","dumpcontents":["osds"]} v 0)

This is pretty common across all of them. For the most part they stay online but occasionally they'll go 'unknown' or a couple will completely stop.

As for the managers, I am aware that only one should be active and the others in standby. The intent was to have two configured so that we can ensure availability when performing reboots across the cluster. The issue is that the one that was destroyed is the one showing as active, while the two that are supposed to be the active/standby pair are instead both in standby.
The log for the phantom manager is blank, while the standby nodes show the following:
Code:
Aug 31 14:59:59 pxmox-s04 ceph-mgr[1383]: 2020-08-31 14:59:59.572 7f40b3661700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-08-31 13:59:59.576116)
Aug 31 14:59:59 pxmox-s04 ceph-mgr[1383]: 2020-08-31 14:59:59.572 7f40b4663700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-08-31 13:59:59.576859)
Aug 31 15:00:09 pxmox-s04 ceph-mgr[1383]: 2020-08-31 15:00:09.576 7f40b3661700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-08-31 14:00:09.576304)
Aug 31 15:00:09 pxmox-s04 ceph-mgr[1383]: 2020-08-31 15:00:09.576 7f40b4663700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-08-31 14:00:09.577187)
Aug 31 15:00:19 pxmox-s04 ceph-mgr[1383]: 2020-08-31 15:00:19.571 7f40b3661700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-08-31 14:00:19.576492)
Sep 01 00:00:00 pxmox-s04 ceph-mgr[1383]: 2020-09-01 00:00:00.386 7f40b5665700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror  (PID: 259115) UID: 0
Sep 01 00:00:00 pxmox-s04 ceph-mgr[1383]: 2020-09-01 00:00:00.398 7f40b5665700 -1 received  signal: Hangup from  (PID: 259116) UID: 0

Of course, there's still the issue with the OSDs. They were created, but they do not appear in the OSD section of the UI, Ceph itself seems to think they are not there despite them being in the crushmap, and they show as Bluestore devices in each node's Disks section.
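
If it's useful to anyone, the Bluestore volumes that were prepared should still be visible to ceph-volume, which is also the tool that can try to bring them back up (standard Ceph tooling, run per node):
Code:
ceph-volume lvm list            # Bluestore OSDs prepared on this node
ceph-volume lvm activate --all  # attempt to (re)activate any prepared-but-inactive OSDs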

Originally, our intent was to use our Equallogic SAN to host the Proxmox LUNs, but despite my best efforts following both the guide on the Proxmox wiki and other Debian guides for configuring iSCSI boot, I was unable to get past initramfs in the boot sequence. The blades all have QLogic devices which are capable of booting from an iSCSI target (and worked flawlessly in ESXi), but I couldn't crack that nut with Debian, unfortunately. If it's possible (and I suspect it would improve performance over the USB keys), we'd prefer to either boot entirely off of iSCSI, or leave just enough on USB to start the environment and have the rest be done off iSCSI; but I'm not sure how to do this and would need guidance or a how-to.

Ultimately, our intent is to use the two front-bay drive slots on each blade for the Ceph cluster, giving us high-performance disk and high availability.
 
Originally, our intent was to use our Equallogic SAN to host the Proxmox LUNs, but despite my best efforts following both the guide on the Proxmox wiki and other Debian guides for configuring iSCSI boot, I was unable to get past initramfs in the boot sequence.
Booting Proxmox VE from iSCSI is not a supported setup. But if you still need it, you could try plain Debian first and install Proxmox VE on top of it afterwards. This allows you to customize the disk/partition layout.

Ultimately, our intent is to use the two front-bay drive slots on each blade for the Ceph cluster, giving us high-performance disk and high availability.
Then the iSCSI connection is not a good idea, since it will introduce latency. The main thing is that the blade center has limitations that are not really a fit for Ceph. See the precondition section in our documentation for some guidance.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

Aug 31 14:59:59 pxmox-s04 ceph-mgr[1383]: 2020-08-31 14:59:59.572 7f40b3661700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-08-31 13:59:59.576116)
It seems the time might not be synchronized throughout the cluster. The time needs to be the same on all nodes.
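
A quick way to verify it on each node:
Code:
timedatectl status                     # shows whether the clock is considered synchronized
systemctl status systemd-timesyncd     # or chrony/ntpd, depending on what is installed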
 
Booting Proxmox VE from iSCSI is not a supported setup. But if you still need it, you could try plain Debian first and install Proxmox VE on top of it afterwards. This allows you to customize the disk/partition layout.
This was what we had tried when booting via iSCSI: a customized Debian installation, with the intent to then move to the Proxmox kernel. We couldn't get to that point, regrettably. We were able to get CentOS going, but admittedly I have even less experience with GlusterFS and CentOS than I do with Ceph (which I have at least gotten running on a few clusters in the past).

Then the iSCSI connection is not a good idea, since it will introduce latency. The main thing is that the blade center has limitations that are not really a fit for Ceph. See the precondition section in our documentation for some guidance.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
I have reviewed this prior to designing this environment and I am unsure how this indicates that the needs won't be met...
The cluster network (corosync and Ceph) is configured to sit on the infiniband network, which provides 40Gbps dedicated to those functions. Latency is low as well (as is to be expected of infiniband). Each node has 2x 6-core CPUs @ 2.6GHz with 32GiB of RAM (we are planning to increase this, just pending a few more supplies).
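
For what it's worth, the sort of check I'd use to back up the latency/bandwidth claim looks like this (iperf3 is not installed by default, so treat it as a sketch):
Code:
# latency across the cluster network (IPoIB)
ping -c 20 -i 0.2 10.3.5.14
# raw throughput between two nodes: start the server on the peer first
iperf3 -s                    # on 10.3.5.14
iperf3 -c 10.3.5.14 -t 30    # on this node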

Each OSD is a 900GB (840GiB as displayed in the UI) 10k SAS drive. So while not SSDs, these will outperform a more traditional HDD install. As a comparison, my previous Ceph cluster is a mix of drives ranging from 400GB to 2TB, all 7k SATA, and I was able to saturate the endpoint links (1GbE to client) on file operations. I expect the blade solution as designed to perform as well, if not better.

It seems the time might not be synchronized throughout the cluster. The time needs to be the same on all nodes.
I had thought that as well but I confirmed across each node that they are using the same time settings and are within 1s of each other.

At this point my next step is going to be to take a couple of the nodes that I know are performing better (at least one is admittedly functioning markedly slower than its peers), completely reinstall them from scratch, and configure Ceph via the CLI instead of the UI. I have a suspicion that the attempt to configure via UI caused some failure artifacts that were not able to be purged. I don't have anything to confirm that assumption with, as in my research I have found few examples of Ceph under Proxmox being deployed on IPoIB (this seems to be more common in the CentOS/RHEL world). A shame really, as IB is a very inexpensive option for high performance linkage.
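
The rough plan for the reinstall, for anyone curious (pveceph purge is destructive and removes the node's Ceph configuration, so double-check before copying any of this):
Code:
systemctl stop ceph.target    # stop any remaining Ceph services on the node
pveceph purge                 # remove Ceph config/data from this node (destructive!)
pveceph install               # after the clean reinstall, pull the Ceph packages back in
# then repeat the pveceph init / mon create sequence from the first post, CLI only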
 
I have reviewed this prior to designing this environment and I am unsure how this indicates that the needs won't be met...
I'm not saying that it won't work, just that the MON DB is located on the root partition by default, and that this will introduce latency when the network stack is additionally involved: MON DB -> filesystem (kernel) -> network (kernel) -> NETWORK -> SAN network (kernel) -> filesystem (kernel) -> HBA/RAID -> disk.

Each OSD is a 900GB (840GiB as displayed in the UI) 10k SAS drive. So while not SSDs, these will outperform a more traditional HDD install. As a comparison, my previous Ceph cluster is a mix of drives ranging from 400GB to 2TB, all 7k SATA, and I was able to saturate the endpoint links (1GbE to client) on file operations. I expect the blade solution as designed to perform as well, if not better.
Yes, good SAS disks outperform SATA disks. For comparison see our Ceph benchmark paper.
https://proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
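
Once the OSDs are actually up, you can get a baseline from your own pool with the built-in benchmark (pool name is just an example):
Code:
rados bench -p testpool 60 write --no-cleanup   # 60s of 4MiB writes, keep objects for the read test
rados bench -p testpool 60 seq                  # sequential reads of those objects
rados -p testpool cleanup                       # remove the benchmark objects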

I have a suspicion that the attempt to configure via UI caused some failure artifacts that were not able to be purged.
The CLI & GUI use the same API to manage Ceph.

A shame really, as IB is a very inexpensive option for high performance linkage.
Ceph speaks Ethernet only. So it will be EoIB and that cuts the bandwidth roughly in half.
 
