Kubernetes / CephFS: pod liveness probes failing (when cloning disks?)

lknite

I have three Proxmox servers, all with plenty of CPU and RAM.

Two of them have two 1 TB NVMe drives each (2 TB per host), and one has two 2 TB NVMe drives (4 TB).

I have Ceph enabled on the three hosts and all has been well, until I started digging into cluster-api and Proxmox.
Storage has its own 10 Gb NIC, with another NIC for everything else.

I've been setting up Kubernetes clusters and watching things progress, and I've noticed that some pods in running clusters hit liveness probe failures, across several different apps, whenever I clone a disk. Here, for example, is the cluster I'm running cluster-api from:
Code:
NAMESPACE                           NAME                                                            READY   STATUS    RESTARTS        AGE
capi-ipam-in-cluster-system         capi-ipam-in-cluster-controller-manager-556fb8d5dd-54hrc        1/1     Running   343 (42m ago)   4d21h
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-5df59854d-dtbsr       1/1     Running   290 (42m ago)   16d
capi-system                         capi-controller-manager-8574ccdd5b-5gfd8                        1/1     Running   297 (42m ago)   16d
capmox-system                       capmox-controller-manager-5444b4b979-6zn69                      1/1     Running   240 (42m ago)   3d17h
kube-system                         kube-controller-manager-k-clusterapi                            1/1     Running   417 (42m ago)   17d
kube-system                         kube-scheduler-k-clusterapi                                     1/1     Running   414 (42m ago)   17d
tigera-operator                     tigera-operator-77f994b5bb-kngbd                                1/1     Running   494             17d
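
To double-check that these restarts really are liveness-probe timeouts (and not crashes or OOM kills), I'm looking at the kubelet's Unhealthy events; something like this should show them (using the capi-controller-manager pod from the output above as the example):
Code:
# Events section at the bottom should show "Liveness probe failed: ..." warnings
kubectl -n capi-system describe pod capi-controller-manager-8574ccdd5b-5gfd8
# probe failures across all namespaces, newest last
kubectl get events -A --field-selector reason=Unhealthy --sort-by=.lastTimestamp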

I'm not positive the issue is triggered by cloning a disk; that's what I'm looking for help on. I thought NVMe drives and a dedicated 10 Gb NIC should be fine. Am I not running an ideal setup? If this setup really weren't enough, I'd expect everyone running Kubernetes on Ceph to be seeing this too. When I do a clone I see Ceph performance showing about 40-50 MiB/s read/write until it finishes. (Does that sound like it's maxed out?)
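
For a baseline, independent of the clone job, I figure a short rados bench against a throwaway pool should show what the cluster can actually push (the pool name "bench" is just an example, and the final delete needs pool deletion to be allowed on the mons):
Code:
ceph osd pool create bench 32
rados bench -p bench 10 write --no-cleanup    # 10s of sequential writes
rados bench -p bench 10 seq                   # read the same objects back
rados -p bench cleanup                        # remove the benchmark objects
ceph osd pool delete bench bench --yes-i-really-really-mean-it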

If this setup really isn't enough, I could see myself buying additional hardware so each server has two 10 Gb NICs dedicated to storage alone (each host already has two 10 Gb NICs; I'm just using one for storage and one for everything else).
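
If I do end up with a second dedicated NIC per host, my understanding is Ceph can keep client/mon traffic and OSD replication traffic on separate subnets via public_network / cluster_network in /etc/pve/ceph.conf. A sketch only (the 10.0.2.0/24 subnet is made up for the example, and the OSDs would need a restart to pick it up):
Code:
[global]
        # client + mon traffic (existing storage NIC)
        public_network = 10.0.1.0/24
        # OSD replication / heartbeats (hypothetical second storage NIC)
        cluster_network = 10.0.2.0/24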

Here are some pods restarting in a cluster stood up via proxmox cluster-api provider:
1729638717408.png
 
Some additional data: I see a spike in write traffic every now and then, and that spike lines up with some pods restarting due to liveness probe failures. I'm trying to time it. I'm not sure what it could be doing ... there isn't really anything going on, so I'm not sure what is writing. Still, shouldn't the writes be fast enough that this isn't an issue?

1729642528503.png
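
To line the write spikes up with the restarts, I'm watching Ceph client I/O and the probe-failure events side by side in two terminals; nothing fancy, just the built-in status commands:
Code:
# terminal 1: cluster I/O rates, refreshed every 2 seconds
watch -n 2 'ceph -s | grep -A 3 io:'
# terminal 2: stream liveness/readiness probe failures as they happen
kubectl get events -A --watch --field-selector reason=Unhealthy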
 
I didn't even know 25G NICs existed. OK, what to do... I already have two 10 Gb NICs, so first I'll try using both of those.

But it sounds like what I really need is three dual-port NICs at 25G or better, and then to hook all the servers together directly without a switch.

I'll start saving up.
 
Does this look right for 10 Gb? Or is it possible this is 1 Gb speed? (Or are these NVMes possibly a bottleneck?)

1729644813903.png
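
To rule the NVMes in or out as the bottleneck, a read-only fio run directly against one of the OSD disks should show whether the drives themselves can do far more than 40-50 MiB/s (the device name is just an example; --readonly keeps it from writing anything):
Code:
fio --name=seqread --filename=/dev/nvme0n1 --rw=read --bs=4M --direct=1 \
    --ioengine=libaio --iodepth=8 --runtime=10 --time_based --readonly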

This is my first time setting up CephFS. I enabled a mon on each host and a second mgr; 10.0.1.0/24 is the storage network. Here's the configuration:

1729645038883.png
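
As a sanity check that the mons and OSDs are actually using the storage subnet (and not the other NIC), the mon map and the configured networks should all point at 10.0.1.0/24:
Code:
# mon addresses should all be on the storage subnet
ceph mon dump
# networks configured for ceph (proxmox keeps ceph.conf here)
grep -i network /etc/pve/ceph.conf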
 
Oh my gosh, the 10 Gb NICs are all running at 1 Gb. Well, it's impressive it worked as well as it did. Hmmm....

Code:
root@pve-a:~# ethtool enp16s0f0
Settings for enp16s0f0:
        Supported ports: [ TP ]
        Supported link modes:   100baseT/Full
                                1000baseT/Full
                                10000baseT/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  100baseT/Full
                                1000baseT/Full
                                10000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: 1000Mb/s
        Duplex: Full
        Auto-negotiation: on
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        MDI-X: Unknown
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes
root@pve-a:~# lspci |grep -i ether
09:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
0b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 01)
10:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
10:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)

The cable is CAT6 and less than 6 ft long, and the switch is definitely 10 Gb on every port.
 
OK, I got the three NICs to use 10 Gb with:

Code:
ethtool -s enp1s0f1 autoneg on speed 10000 duplex full
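
One caveat I'll need to deal with (an assumption on my part, worth verifying): settings applied with ethtool -s don't survive a reboot by themselves, so the usual trick is a pre-up line on the interface stanza in /etc/network/interfaces, something like:
Code:
# /etc/network/interfaces (sketch) -- re-apply the forced speed at ifup time
iface enp1s0f1 inet manual
        pre-up ethtool -s enp1s0f1 autoneg on speed 10000 duplex full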
 
iperf3 is showing 1 Gb speeds rather than 10 Gb; I'll open a new question about getting 10 Gb to work:
Code:
root@pve-b:~# iperf3 -c 10.0.0.23
Connecting to host 10.0.0.23, port 5201
[  5] local 10.0.0.22 port 41866 connected to 10.0.0.23 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   113 MBytes   948 Mbits/sec    6    423 KBytes       
[  5]   1.00-2.00   sec   111 MBytes   930 Mbits/sec    0    486 KBytes       
[  5]   2.00-3.00   sec   106 MBytes   889 Mbits/sec    1    498 KBytes