Ceph Slow Ops

gyverchang · Jan 14, 2023

Good day everyone!

I have been using Proxmox for quite some time and I'm loving it. I bought a new Dell C6420 with the plan to deploy it at a datacenter to offload our local infrastructure (our site is very prone to power outages).

This is my setup:

Proxmox Packages

proxmox-ve: 7.3-1 (running kernel: 5.13.19-6-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-5.15: 7.3-1
pve-kernel-helper: 7.3-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
ceph: 17.2.5-pve1
ceph-fuse: 17.2.5-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-1
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u1
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-1
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-2
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.7-pve2

Kernel: Linux 5.13.19-6-pve #1 SMP PVE 5.13.19-15 (Tue, 29 Mar 2022 15:59:50 +0200)
I downgraded from 5.15 because some posts reported that downgrading to 5.13.19-6 solves this problem.

3 nodes have already been added to a cluster
All SSDs are connected to the UCEA-200 NVMe controller card
Node 1: 2x WUS4BB076D7P3E3 7.68 TB
Node 2: Pending setup
Node 3: Monitor, Manager and Metadata only
Node 4: 1x WUS4BB076D7P3E3 7.68 TB

The final setup would be 4 nodes with one NVMe SSD each and a maximum of 3 monitors.

All the nodes are connected via 10GbE Ethernet.

I am using the default Ceph config:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host sys01 {
    id -3    # do not change unnecessarily
    id -2 class ssd    # do not change unnecessarily
    # weight 13.97260
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 6.98630
    item osd.1 weight 6.98630
}
host sys04 {
    id -5    # do not change unnecessarily
    id -4 class ssd    # do not change unnecessarily
    # weight 6.98630
    alg straw2
    hash 0    # rjenkins1
    item osd.2 weight 6.98630
}
root default {
    id -1    # do not change unnecessarily
    id -6 class ssd    # do not change unnecessarily
    # weight 20.95889
    alg straw2
    hash 0    # rjenkins1
    item sys01 weight 13.97260
    item sys04 weight 6.98630
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
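One thing the CRUSH map implies can be sketched in a few lines of Python. The rule `step chooseleaf firstn 0 type host` places each replica on a different host, and only two hosts (sys01 and sys04) currently hold OSDs. The sketch below is a simplified model, not Ceph's actual CRUSH algorithm, and the pool sizes it tries (size 3 / min_size 2 and size 2 / min_size 1) are assumptions, since the thread does not state the pool values. It is not a diagnosis of the slow ops reported here.

```python
# Simplified model: can a replicated pool place every copy on a distinct host?
# Host and OSD names come from the CRUSH map in this post.
# The pool sizes used below are assumptions, not values from the thread.

CRUSH_HOSTS = {
    "sys01": {"osd.0": 6.98630, "osd.1": 6.98630},
    "sys04": {"osd.2": 6.98630},
}


def placeable_replicas(hosts, pool_size):
    """Return how many replicas a host-level chooseleaf rule can place.

    The rule picks at most one OSD per host, so the count is capped by the
    number of hosts that contain at least one OSD.
    """
    usable_hosts = [name for name, osds in hosts.items() if osds]
    return min(pool_size, len(usable_hosts))


def describe(hosts, pool_size, min_size):
    """Classify the expected placement result for a pool."""
    placed = placeable_replicas(hosts, pool_size)
    if placed >= pool_size:
        return "clean"
    if placed >= min_size:
        return "undersized but serving I/O"
    return "inactive"


if __name__ == "__main__":
    for size, min_size in [(3, 2), (2, 1)]:
        result = describe(CRUSH_HOSTS, size, min_size)
        print(f"size={size} min_size={min_size}: {result}")
```

Under these assumptions, a size 3 pool would stay undersized until a third host contributes an OSD, which would show up as PG warnings independently of any slow ops.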

OSD Setup

Default Pool Options

As this is a very simplified setup for testing before final deployment next month, pardon me if the settings are not optimized or are wrong (I'm still learning about Ceph).

As soon as I create a pool, OSD slow ops warnings appear almost immediately, and as soon as I write to it, lots of PG errors appear as well.

Connectivity between the nodes is fine, as they are on a separate network dedicated to this test setup.

I have recreated this Ceph cluster more than 3 times, but it always ends up in this state.

Would appreciate some insight on this issue and some assistance if possible.

Thank you in advance, guys!

jdancer · Jan 14, 2023

I run a 3-node and a 5-node all-SAS-HDD Ceph cluster. The 3-node cluster is on 12-year-old servers using a full-mesh 4 x 1GbE broadcast network. The 5-node Ceph cluster is on Dell 12th-gen servers using 2 x 10GbE networking to ToR switches.

It is not considered best practice, but Corosync and the Ceph public and private networks all run on a single 10GbE network. The other 10GbE link is for VM network traffic.

Write IOPS are in the hundreds, and read IOPS are about double the write IOPS.

I use the following optimizations, arrived at after much trial and error:

Set write cache enable to 1 on SAS drives (sdparm -s WCE=1 -S /dev/sd[x])
Set VM cache to none
Set VM to use the VirtIO SCSI single controller and enable the IO thread and discard options
Set VM CPU type to 'host'
Set VM CPU NUMA if the server has 2 or more physical CPU sockets
Set VM VirtIO Multiqueue to the number of cores/vCPUs
Set VM to have the qemu-guest-agent software installed
Set Linux VMs' IO scheduler to none/noop
Set RBD pool to use the 'krbd' option if using Ceph
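As a rough illustration, several items in the list above map onto `qm set` and `pvesm set` options. The Python sketch below only builds and prints the commands and does not run them. The VM ID, disk volume, storage name and device path are hypothetical placeholders, not values from this thread, and the multiqueue and in-guest scheduler settings are left out.

```python
import shlex


def tuning_commands(vmid, disk_volume, storage, sas_device, sockets=2):
    """Build (but do not run) commands for part of the tuning list.

    Every argument is a placeholder chosen for illustration.
    """
    cmds = [
        # Enable the drive write cache and save the setting (from the post).
        ["sdparm", "-s", "WCE=1", "-S", sas_device],
        # VirtIO SCSI single controller, host CPU type, guest agent flag.
        # The agent flag does not install qemu-guest-agent inside the guest.
        ["qm", "set", str(vmid), "--scsihw", "virtio-scsi-single",
         "--cpu", "host", "--agent", "1"],
        # Disk with cache=none, IO thread and discard enabled.
        ["qm", "set", str(vmid), "--scsi0",
         f"{disk_volume},cache=none,iothread=1,discard=on"],
        # Use the kernel RBD client for this storage.
        ["pvesm", "set", storage, "--krbd", "1"],
    ]
    if sockets >= 2:
        # Enable NUMA only on hosts with two or more physical sockets.
        cmds.insert(2, ["qm", "set", str(vmid), "--numa", "1"])
    return cmds


if __name__ == "__main__":
    # Hypothetical values; adjust to the real VM, volume, storage and drive.
    for cmd in tuning_commands(100, "ceph-vm:vm-100-disk-0", "ceph-vm", "/dev/sdX"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before running anything on a production node.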

gyverchang · Jan 20, 2023

Thanks for the info. I got my issue fixed after finding out that one of my Cisco 10GbE DAC cables was faulty. After replacing it, everything is green on my Ceph dashboard.

Thank you for your insight into this issue!
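A faulty cable like the one described in this thread can sometimes show up in the kernel's per-interface error counters. The following is a minimal Python sketch, assuming a Linux host that exposes `/sys/class/net/<iface>/statistics`. The function names are made up for illustration, and a nonzero counter is only a hint, since the counters are cumulative and can rise for other reasons.

```python
from pathlib import Path

# Counters that often rise with a bad cable or transceiver.
COUNTERS = ("rx_errors", "rx_crc_errors", "rx_dropped", "tx_errors", "tx_dropped")


def read_error_counters(iface, sysfs_root="/sys/class/net"):
    """Read the error counters that exist for one network interface."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    counters = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.is_file():
            counters[name] = int(path.read_text().strip())
    return counters


def nonzero(counters):
    """Keep only the counters that have recorded at least one event."""
    return {name: value for name, value in counters.items() if value > 0}


if __name__ == "__main__":
    root = Path("/sys/class/net")
    if root.is_dir():
        for iface_dir in sorted(root.iterdir()):
            if not iface_dir.is_dir():
                continue
            bad = nonzero(read_error_counters(iface_dir.name))
            print(iface_dir.name, bad if bad else "no errors counted")
```

Running it on each node and comparing the interfaces used for Ceph traffic may help narrow down which link to inspect first.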
