Ceph Ruined My Christmas

fakebizprez · 2024-12-25T15:14:55+0100

Merry Christmas, everyone, if that's what you're into.

I have been using Ceph for a few months now and it has been a great experience. I have four Dell R740s and one R730 in the cluster, and I plan to add two C240 M4s to deploy a mini-cloud at other locations (but that's a conversation for another day).

I have been running Ceph on one subnet with 10GBe NICs. Most of my LAN is 10GBe, except for PIs and NUCs (for personal use) and PBS (working on a better solution for that). I have done a lot of research on Ceph, including reading documentation, forum posts, and watching countless YouTube videos. However, a lot of it still seems like voodoo to me, as I am a slow learner. One thing that was clear is that Ceph should be on its own subnet or mesh network.

Initially, I had a C3850 with 1GBE and a multi-10GBe module. But I have since added a second switch, a Cisco N9K-C92160YC-X with 48 ports of 1/10/25G SFP and 6 40G QSFP (or 4 100G). I also replaced the NDC of each PowerEdge with a Mellanox Dual Gbe NIC. On Monday night, I reconfigured the network.

The first warning sign was when it took over an hour to restore a small VM from PBS, which usually takes less than a minute. I read something about increasing the MTU to 9000, but that only made things worse. Fast forward to today, and I have OSDs dropping like flies and the entire cluster is essentially inoperable. The more I tinker with it, the worse it gets.

Code:

[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.0.0.0/24
        fsid = e4aa8136-854c-4504-b839-795aaac19cd3
        mon_allow_pool_delete = true
        mon_host = 192.168.128.200 192.168.128.202 192.168.128.201
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 192.168.128.0/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
        keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.scumified]
        host = scumified
        mds_standby_for_name = pve

[mon.creepified]
        public_addr = 192.168.128.202

[mon.scumified]
        public_addr = 192.168.128.200

[mon.vilified]
        public_addr = 192.168.128.201

mds_standby_for_name that is new, never used to say that.

Code:

HEALTH_WARN: 1 filesystem is degraded
fs cloud-images is degraded

Code:

HEALTH_WARN: 1 MDSs report slow metadata IOs
mds.scumified(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 299 secs

Code:

HEALTH_WARN: Reduced data availability: 256 pgs inactive, 17 pgs down, 233 pgs peering, 1 pg incomplete
pg 2.16 is stuck peering for 20h, current state peering, last acting [8,16,20]
pg 2.17 is stuck peering for 14h, current state peering, last acting [23,13]
pg 2.18 is stuck peering for 14h, current state peering, last acting [18,16]
pg 2.19 is stuck peering for 14h, current state peering, last acting [6,11,13]
pg 2.1a is stuck peering for 16h, current state peering, last acting [11,14,2]
pg 2.1b is stuck peering for 14h, current state peering, last acting [22,14,11]
pg 3.16 is stuck peering for 14h, current state peering, last acting [9,1,20]
pg 3.17 is down, acting [20]
pg 3.18 is stuck peering for 14h, current state peering, last acting [18,16,6]
pg 3.19 is stuck peering for 14h, current state peering, last acting [1,22,10]
pg 3.1a is stuck peering for 14h, current state peering, last acting [16,5,20]
pg 3.1b is stuck peering for 13h, current state peering, last acting [15,7,20]
pg 4.10 is stuck peering for 14h, current state peering, last acting [13,5]
pg 4.11 is stuck peering for 14h, current state peering, last acting [7,18]
pg 4.1c is stuck peering for 14h, current state peering, last acting [15,6,23]
pg 4.1d is down, acting [18]
pg 4.1e is stuck peering for 13h, current state peering, last acting [0,7]
pg 4.1f is stuck peering for 14h, current state peering, last acting [17,8,1]
pg 5.10 is stuck peering for 14h, current state peering, last acting [20,16,7]
pg 5.11 is stuck peering for 14h, current state peering, last acting [20,8,16]
pg 5.1c is stuck peering for 32h, current state peering, last acting [15,2]
pg 5.1d is stuck peering for 14h, current state peering, last acting [9,17]
pg 5.1e is stuck peering for 14h, current state peering, last acting [7,15]
pg 5.1f is stuck peering for 14h, current state peering, last acting [11,23]
pg 6.12 is stuck peering for 14h, current state peering, last acting [13,10,20]
pg 6.13 is stuck peering for 14h, current state peering, last acting [20,23,16]
pg 6.1c is stuck peering for 14h, current state peering, last acting [14,23,5]
pg 6.1d is stuck peering for 14h, current state peering, last acting [8,22]
pg 6.1e is down, acting [23,13]
pg 6.1f is stuck peering for 13h, current state peering, last acting [3,20,11]
pg 7.10 is stuck inactive for 14h, current state peering, last acting [22,20,16]
pg 7.12 is stuck peering for 13h, current state peering, last acting [2,22]
pg 7.13 is stuck peering for 14h, current state peering, last acting [18,9,14]
pg 7.1c is down, acting [15]
pg 7.1d is stuck peering for 14h, current state peering, last acting [14,20,0]
pg 7.1e is stuck peering for 14h, current state peering, last acting [1,23,8]
pg 7.1f is stuck peering for 14h, current state peering, last acting [18,15,23]
pg 8.10 is down, acting [17]
pg 8.11 is stuck peering for 32h, current state peering, last acting [10,20,15]
pg 8.12 is stuck peering for 32h, current state peering, last acting [8,22,14]
pg 8.13 is stuck peering for 14h, current state peering, last acting [23,17]
pg 8.1c is stuck peering for 14h, current state peering, last acting [18,16]
pg 8.1d is stuck peering for 14h, current state peering, last acting [6,20,13]
pg 8.1f is stuck peering for 14h, current state peering, last acting [14,23,5]
pg 9.10 is stuck peering for 114s, current state peering, last acting [11,18]
pg 9.11 is stuck peering for 12h, current state peering, last acting [23]
pg 9.12 is stuck peering for 5h, current state peering, last acting [20,13]
pg 9.13 is stuck peering for 3m, current state peering, last acting [16,23]
pg 9.1c is stuck peering for 12m, current state peering, last acting [23,16]
pg 9.1d is stuck peering for 44m, current state peering, last acting [13,20]
pg 9.1e is stuck peering for 11m, current state peering, last acting [6,7]

Code:

HEALTH_WARN: Degraded data redundancy: 2/14449 objects degraded (0.014%), 2 pgs degraded, 3 pgs undersized
pg 6.e is activating+undersized+degraded, acting [8,1]
pg 9.14 is stuck undersized for 4m, current state undersized+peered, last acting [1]
pg 9.19 is stuck undersized for 4m, current state undersized+peered, last acting [3]
pg 9.1f is stuck undersized for 4m, current state undersized+degraded+peered, last acting [6]

Code:

HEALTH_WARN: 337 slow ops, oldest one blocked for 21260 sec, daemons [osd.0,osd.13,osd.14,osd.17,osd.18,osd.20,osd.22,osd.23,osd.3,mon.creepified]... have slow ops.

I have attached photos of my network user interface for each node, as well as the storage user interface. I am most likely neglecting some small, but significant part of the configuration, and I hope I'm not making myself out to be too much of an imbecile. I would greatly appreciate any guidance on setting this up correctly, rather than reverting back to the basic setup. Thank you for taking the time to read this. Merry Christmas.

ness1602 · 2024-12-25T15:20:23+0100

Cluster nodes are different from ceph cluster, can you give us images from node/CEPH, if you can from one node,but all views?

fakebizprez · 2024-12-25T16:04:56+0100

Thank you for the prompt response. I was aware that Proxmox HA was distinct from the Ceph cluster, I am not sure why I captured that screenshot. I apologize for the confusion. I have taken screenshots of each UI from all five nodes and have saved the crush logs in separate .txt files. The files are too large to upload here, so I am sharing them through my company's Google Drive. If you are not comfortable downloading from there, please let me know and I will find an alternative method. I appreciate the help.

https://drive.google.com/file/d/123UC8bYy90RxVEAAxfGxc79IoWO8saf_/view?usp=sharing

By the way, after posting, I deleted the MDS from the first configuration file.
Also, if this seems empty and underutilized for Ceph, it is because I backed everything up and started fresh before setting up the new network. Additionally, none of the data or software being hosted is critical until March when we bring the entire organization on-premises. This has been a 14-month endeavor (learning networking) as we wait for our Cloud contracts to expire.

One more thing worth adding, I did not create a VLan for this isolated Ceph network.

ness1602 · 2024-12-25T17:50:03+0100

You have separate public and cluster ceph networks, are they both available? do you have a need for cluster network? if not use just public_network.
Also, there are a lot of osds down, i would check the disks if they are working as expected(hddsentinel?) .Moreover, why so many pools? Usually one pool for ct/vm ,and cephfs for ISO's?
I would do a deep ssh inspection of the hosts and see if there are some serious networking problems. Ceph is very finnicky with bad networking.

gurubert · 2024-12-25T20:04:00+0100

Check all Interfaces if the MTU has been set to 1500 again.
Do not mix MTUs in the same Ethernet segment.

kellogs · 2024-12-25T23:12:04+0100

Hello

For Proxmox usually I would set it up this way

1. host management network aka corosync network (you could use a dedicated HA switch even 1Gbps is ok)
2. vm network which i use SDN for this
3. ceph public network (where your compute node mount (clients) to communite with Ceph Cluster)
4. ceph cluster network (this is dedicated to internal Ceph communication between OSDs)

for 2, 3 and 4 you could use say a pair of 100G switch and each of your node has 2x 100G VLT/LACP/MLAG and set all MTU to 1500.

Once you know are experienced and wish to tune for example MTU moditication and etc ... setup another lab environment.

Lets keep things simple and make it work and stable first.

fakebizprez · 2024-12-26T00:08:14+0100

gurubert said:
Check all Interfaces if the MTU has been set to 1500 again.
Do not mix MTUs in the same Ethernet segment.

My first concern was the VM restore taking over an hour.
After learning that this could be due to the default MTU of 1500, I changed all MTUs to 9000.
I did not notice a significant difference.
Later, I deployed a fresh LXC and received an error stating that it could not find a host. I then changed the public network to 1500 and kept the private network at 9000.
This is when things got worse, and I started experiencing issues with OSDs falling off.
I believe in keeping things simple, so should all MTUs be set to 1500?

ness1602 said:
You have separate public and cluster ceph networks, are they both available? do you have a need for cluster network? if not use just public_network.
Also, there are a lot of osds down, i would check the disks if they are working as expected(hddsentinel?) .Moreover, why so many pools? Usually one pool for ct/vm ,and cephfs for ISO's?
I would do a deep ssh inspection of the hosts and see if there are some serious networking problems. Ceph is very finnicky with bad networking.

Are they both available, as in publicly? No, the documentation says that the private Ceph network should not be routed to the public. As for why I did this after months of having no issues with Ceph - eventually there will be hundreds of VMs on this network for our own organization, as well as clients and we will have to scale out. My logic was to try this now when there's no harm in doing so, but specifically because this is best practices for higher performance from the official documentation .

As for the pools? I've been figuring out this system as I go along. I did not realize that a Pool and a RBD are essentially the same thing, from what I've gathered. I've looked up best practices, and have had many conversations with the "Ceph Helper" GPT, but am still unsure on what the "go-to" configuration is. To be honest, when I started this journey months ago, I didn't even know about the filesystem, I thought Ceph was just a more complex/superior solution to Minio and S3 storage. The way you laid it out is much more appealing. Regarding the networking, everything was fantastic until I messed with subnets.

kellogs said:
Hello

For Proxmox usually I would set it up this way

1. host management network aka corosync network (you could use a dedicated HA switch even 1Gbps is ok)
2. vm network which i use SDN for this
3. ceph public network (where your compute node mount (clients) to communite with Ceph Cluster)
4. ceph cluster network (this is dedicated to internal Ceph communication between OSDs)

for 2, 3 and 4 you could use say a pair of 100G switch and each of your node has 2x 100G VLT/LACP/MLAG and set all MTU to 1500.

Once you know are experienced and wish to tune for example MTU moditication and etc ... setup another lab environment.

Lets keep things simple and make it work and stable first.

1. I believe I am using corosync.
2. I have never messed with the SDN section of Proxmox yet. I generally get myself into trouble when I mess with networking (case in point today). Could you elaborate more on the SDN setup?
3. All good on this.
4. This is exactly what I was trying to do. How do you go about this? All I did was create a separate Linux Bridge on each node with the subnet of 10.0.0.0/24. I did not create a VLAN on my switch or OPNsense. Later I saw documentation regarding a "mesh" network where the second physical interfaces of each node connect to eachother, and do not go through the switch. This is what I'm most interested in learning.

fakebizprez · 2024-12-26T00:13:54+0100

One other thing I forgot to ask, how many MDS should be setup? One per node? Or one per cluster?

kellogs · 2024-12-26T00:26:33+0100

please read here for SDN https://pve.proxmox.com/pve-docs/chapter-pvesdn.html

I would not do any "mesh networks for ceph" if this is a production and not a lab.

For Ceph with Proxmox please read https://pve.proxmox.com/pve-docs/chapter-pveceph.html

kellogs · 2024-12-26T00:28:44+0100

fakebizprez said:
One other thing I forgot to ask, how many MDS should be setup? One per node? Or one per cluster?

you only need to setup MDS if you use CephFS for storing ISO which i dont because i am using it as a block storage only. A single MDS is good enough for a small cluster but has no redundancy. For production maybe 2.

fakebizprez · 2024-12-26T01:27:47+0100

I do want to utilize Ceph FS as well, but I don't use actual ISOs, because of cloudinit, or is that still considered using an ISO even though I'm not storing the boot image on storage?

fakebizprez · 2024-12-26T01:28:34+0100

kellogs said:
you only need to setup MDS if you use CephFS for storing ISO which i dont because i am using it as a block storage only. A single MDS is good enough for a small cluster but has no redundancy. For production maybe 2.

For curiosities sake, what are you using instead of Ceph FS and why aren't you using Ceph FS?

kellogs · 2024-12-26T01:35:49+0100

fakebizprez said:
For curiosities sake, what are you using instead of Ceph FS and why aren't you using Ceph FS?

my ceph storage is very expensive enterprise NVME so i reserved them for VMs only. My ISOs are stored on a synology mount. (NFS).

fakebizprez · 2024-12-26T01:59:04+0100

Yeah I'm all SSD for OSD with one NVME in each for DB disk. All future builds will be U.2 I imagine. Are you backing up proxmox with NFS or using PBS?

fakebizprez · 2024-12-26T02:36:08+0100

Not that I have any desire to set something like this up in the near future, but wouldn't SR-IOV on the Ceph Private Subnet offer the best performance?

kellogs · 2024-12-26T07:41:26+0100

fakebizprez said:
Yeah I'm all SSD for OSD with one NVME in each for DB disk. All future builds will be U.2 I imagine. Are you backing up proxmox with NFS or using PBS?

I am backing to PBS running on synology as a VM.

kellogs · 2024-12-26T07:42:09+0100

fakebizprez said:
Not that I have any desire to set something like this up in the near future, but wouldn't SR-IOV on the Ceph Private Subnet offer the best performance?

not sure never do it. mine is very straight forward 2x 100G and 10x 3.84 TB OSDs NVME Enterprise disk

ness1602 · 2024-12-26T07:49:17+0100

i would in your case drop GPT and hire and engineer,atleast for a few hours.

fakebizprez · 2024-12-26T08:25:18+0100

ness1602 said:
i would in your case drop GPT and hire and engineer,atleast for a few hours.

Over the last year, I have formed good friendships with various admins and engineers who have taught me a lot, despite my background in software development. However, when I realized the scale of the project, my first step was to study up and then hire some freelance engineers through Fiverr who specialize in Ceph and Proxmox. We initially considered OpenStack, but it became clear that it was overkill and would require a team.

The issue I've had is the trues SMEs on Ceph, that I've been able to find, are in high demand, but I will have paid for plenty of hours of professional support before we host our software that keeps our business running on our networks. By Fall, we aim to have a part-time engineer on staff; even if I knew the documentation word for word, I have many other responsibilities. It has been fun, and very aggravating to learn so much over the last year. I equate networking to baseball, it will humble you...fast and often.

Don't knock the LLMs, though. If I had the time to put all of Ceph and Proxmox's documentation in Markdown or JSON format, and finetune it for a week - it'd be a hell of an assistant. Most of the time now, I'm using a self-hosted version of Perplexica (Perplexity clone) because all of the sources will be links to documentation or posts to these forums.

ness1602 · 2024-12-26T08:43:35+0100

The CEPH is rather specific, because it is a complex software, which can kill your storage pretty fast. That is why i usually recommend bringing 3-node test cluster,and there you can molest it and everything you need, intentionally crash it etc,etc. Then when you see everything ceph can do,transfer that knowledge to prod. And don't give everyone access to CEPH, as i said,it can crash really quickly.

Ceph Ruined My Christmas

New Member

Attachments

Famous Member

New Member

Famous Member

Distinguished Member

Member

New Member

New Member

Member

Member

New Member

New Member

Member

New Member

New Member

Member

Member

Famous Member

New Member

Famous Member