Kubernetes / CephFS: pod liveness probes failing (when cloning disks?)

lknite · New Member · Sep 27, 2024
I have three Proxmox servers, all with 32 cores / 128 GB RAM.

2 have 2 NVMe drives at 1 TB each (making for 2 hosts with 2 TB each)
1 has 2 NVMe drives at 2 TB each (making for 1 host with 4 TB)

I have Ceph enabled on all three hosts, and everything has been fine until I started digging into cluster-api and Proxmox.
Storage has its own 10 Gb NIC, and there is another NIC for everything else.
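
To be explicit about what I mean by the split: the storage NIC carries the Ceph public/cluster network. On a Proxmox node that assignment sits in the Ceph config and can be checked like this (this is just the check, not my actual output):
Code:
# Proxmox keeps the Ceph config at /etc/pve/ceph.conf;
# these two settings show which subnets (and therefore which NICs) Ceph traffic uses
grep -E 'public_network|cluster_network' /etc/pve/ceph.conf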

I've been setting up Kubernetes clusters, and while watching things progress I've noticed that some pods in running clusters experience liveness probe failures, across several different apps, whenever I clone a disk. Here, for example, is the cluster where I'm running cluster-api from:
Code:
NAMESPACE                           NAME                                                            READY   STATUS    RESTARTS        AGE
capi-ipam-in-cluster-system         capi-ipam-in-cluster-controller-manager-556fb8d5dd-54hrc        1/1     Running   343 (42m ago)   4d21h
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-5df59854d-dtbsr       1/1     Running   290 (42m ago)   16d
capi-system                         capi-controller-manager-8574ccdd5b-5gfd8                        1/1     Running   297 (42m ago)   16d
capmox-system                       capmox-controller-manager-5444b4b979-6zn69                      1/1     Running   240 (42m ago)   3d17h
kube-system                         kube-controller-manager-k-clusterapi                            1/1     Running   417 (42m ago)   17d
kube-system                         kube-scheduler-k-clusterapi                                     1/1     Running   414 (42m ago)   17d
tigera-operator                     tigera-operator-77f994b5bb-kngbd                                1/1     Running   494             17d
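
If anyone wants to cross-check, these are the sorts of commands that show whether the restarts really come from failed liveness probes (the pod name is just one of the restarting ones from the output above):
Code:
# Recent probe failures across all namespaces ("Unhealthy" covers liveness/readiness probe failures)
kubectl get events -A --field-selector reason=Unhealthy --sort-by=.lastTimestamp

# Why a specific pod last restarted (see Last State and the Events section at the bottom)
kubectl -n capi-system describe pod capi-controller-manager-8574ccdd5b-5gfd8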

I'm not positive the issue only happens when cloning a disk; that's what I'm looking for some help on. I thought using NVMe drives and a dedicated 10 Gb NIC should be OK. Am I not running an ideal scenario? If this is no good, then I would guess everyone running Kubernetes on Ceph would be seeing this too. When I do a clone I see Ceph performance showing about 40-50 MiB/s read/write until it finishes. (Does that sound like max speed?)
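
For reference, a quick rados bench against a scratch pool, plus watching OSD latency while a clone runs, should show whether the cluster is really topping out (the pool name below is just a placeholder, and deleting a pool may require mon_allow_pool_delete to be enabled):
Code:
# Throwaway pool, 30 s write benchmark, then clean up
ceph osd pool create bench-test 32
rados bench -p bench-test 30 write --no-cleanup
rados -p bench-test cleanup
ceph osd pool delete bench-test bench-test --yes-i-really-really-mean-it

# While a clone is running: overall client I/O plus per-OSD latency
ceph -s
ceph osd perf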

If this setup really isn't enough, I could see myself buying additional hardware so that each server has two 10 Gb NICs dedicated to storage alone. (Each host already has two 10 Gb NICs; I'm just using one for storage and one for everything else.)
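
For scale, 10 Gb/s is roughly 1.2 GB/s, so 40-50 MiB/s is nowhere near line rate; watching the storage NIC on one host during a clone would confirm whether the link is actually the bottleneck (the interface name below is just an example):
Code:
# Per-interface throughput every second; replace ens18 with the storage NIC (sar is from the sysstat package)
sar -n DEV 1 | grep ens18

# Or watch it interactively
iftop -i ens18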

Here are some pods restarting in a cluster stood up via the Proxmox cluster-api provider:
[screenshot attachment: 1729638717408.png]
