Ceph Slow Ops

gyverchang
Oct 7, 2022
Good day everyone!

I have been using Proxmox for quite some time and I'm loving it. I bought a new Dell C6420 with the plan to deploy it at a datacenter to offload our local infrastructure (our site is super prone to electricity outages).

This is my set up:

Proxmox Packages
proxmox-ve: 7.3-1 (running kernel: 5.13.19-6-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-5.15: 7.3-1
pve-kernel-helper: 7.3-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
ceph: 17.2.5-pve1
ceph-fuse: 17.2.5-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-1
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u1
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-1
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-2
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.7-pve2

Kernel: Linux 5.13.19-6-pve #1 SMP PVE 5.13.19-15 (Tue, 29 Mar 2022 15:59:50 +0200)
I downgraded from 5.15, as I read in some posts that downgrading to 5.13.19-6 solves this problem.
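For reference, pinning the older kernel so it stays the default across reboots can be done with proxmox-boot-tool on current PVE 7.x; a minimal sketch, assuming the 5.13.19-6-pve kernel package is still installed:

# list the kernels registered with the bootloader
proxmox-boot-tool kernel list

# pin the older kernel as the default boot entry
proxmox-boot-tool kernel pin 5.13.19-6-pve

# re-sync the boot configuration, then reboot
proxmox-boot-tool refresh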

3 nodes have already been added to the cluster
All SSDs are connected to a UCEA-200 NVMe controller card
Node 1: 2x WUS4BB076D7P3E3 7.68 TB
Node 2: Pending setup
Node 3: Monitor, Manager and Metadata only
Node 4: 1x WUS4BB076D7P3E3 7.68 TB

The final setup will be 4 nodes with one NVMe SSD each and a maximum of 3 monitors.

All the nodes are connected via 10GbE Ethernet.

I am using the default Ceph config:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host sys01 {
    id -3               # do not change unnecessarily
    id -2 class ssd     # do not change unnecessarily
    # weight 13.97260
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 6.98630
    item osd.1 weight 6.98630
}
host sys04 {
    id -5               # do not change unnecessarily
    id -4 class ssd     # do not change unnecessarily
    # weight 6.98630
    alg straw2
    hash 0  # rjenkins1
    item osd.2 weight 6.98630
}
root default {
    id -1               # do not change unnecessarily
    id -6 class ssd     # do not change unnecessarily
    # weight 20.95889
    alg straw2
    hash 0  # rjenkins1
    item sys01 weight 13.97260
    item sys04 weight 6.98630
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
# end crush map
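For anyone who wants to compare against their own cluster: a decompiled CRUSH map like the one above can be pulled straight from the cluster; a minimal sketch using the standard Ceph tooling:

# export the binary CRUSH map from the cluster
ceph osd getcrushmap -o crushmap.bin

# decompile it into the text form shown above
crushtool -d crushmap.bin -o crushmap.txt

# quick overview of hosts, OSDs and weights
ceph osd tree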

OSD Setup
[screenshot of the OSD list]

Default Pool Options
[screenshot of the default pool options]

As this is a very simplified setup for testing before the final deployment next month, pardon me if the settings are not optimized or wrong (I'm still learning about Ceph).

Upon creating a pool, errors about OSD slow ops appeared almost immediately, and as soon as I started writing to it, lots of PG errors showed up as well.

[screenshot of the OSD slow ops and PG errors]
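For reference, the slow ops and PG warnings in the screenshot can also be pulled up on the command line; a minimal sketch using the standard Ceph CLI:

# overall cluster health, including slow ops and degraded PGs
ceph -s
ceph health detail

# latency per OSD, to spot which OSD is reporting slow ops
ceph osd perf

# list PGs that are stuck inactive, unclean or undersized
ceph pg dump_stuck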

The connections between the nodes should be fine, as they are on a separate network dedicated to this test setup.
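For anyone who wants to rule out the network, a rough way to check throughput and latency between two nodes (assuming iperf3 is installed; 10.10.10.1 is just a placeholder for the Ceph network address of the other node):

# on the first node: start an iperf3 server
iperf3 -s

# on the second node: measure throughput towards the first node
iperf3 -c 10.10.10.1 -t 30

# check latency and packet loss on the same link
ping -c 100 -i 0.2 10.10.10.1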

I have recreated this Ceph cluster more than three times, but it always ends up in this state.

I would appreciate some insight into this issue and some assistance if possible.

Thank you in advance, guys!
 
I run a 3-node and a 5-node all-SAS-HDD Ceph cluster. The 3-node cluster is on 12-year-old servers using a full-mesh 4 x 1GbE broadcast network. The 5-node Ceph cluster is on Dell 12th-gen servers using 2 x 10GbE networking to ToR switches.

It's not considered best practice, but Corosync and the Ceph public & private networks all run on a single 10GbE link. The other 10GbE link is for VM network traffic.

Write IOPS are in the hundreds, and read IOPS are about double that.

I use the following optimizations, found after much trial and error (a rough sketch of the matching commands follows the list):

Set write cache enable to 1 on SAS drives (sdparm -s WCE=1 -S /dev/sd[x])
Set VM cache to none
Set VM to use VirtIO-single SCSI controller and enable IO thread and discard option
Set VM CPU type to 'host'
Set VM CPU NUMA if server has 2 or more physical CPU sockets
Set VM VirtIO Multiqueue to number of cores/vCPUs
Set VM to have qemu-guest-agent software installed
Set Linux VMs IO scheduler to none/noop
Set RBD pool to use the 'krbd' option if using Ceph
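A rough sketch of how most of these map to commands (VMID 100, storage name 'ceph-vm', /dev/sdb and the volume name are placeholder examples only; adjust them to your own setup):

# enable the write cache on a SAS drive
sdparm -s WCE=1 -S /dev/sdb

# VirtIO SCSI single controller, host CPU type, NUMA, guest agent
qm set 100 --scsihw virtio-scsi-single --cpu host --numa 1 --agent enabled=1

# attach the disk with IO thread, discard and cache=none
qm set 100 --scsi0 ceph-vm:vm-100-disk-0,iothread=1,discard=on,cache=none

# VirtIO NIC with multiqueue matching the vCPU count (e.g. 4)
qm set 100 --net0 virtio,bridge=vmbr0,queues=4

# inside a Linux guest: set the IO scheduler to none
echo none > /sys/block/sda/queue/scheduler

# enable krbd on the Proxmox RBD storage definition
pvesm set ceph-vm --krbd 1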
 
Thanks for the info. I got my issue fixed after finding out that one of my Cisco 10GbE DACs was faulty. After replacing it, I have all greens on my Ceph dashboard.
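For anyone hitting the same thing: a flaky DAC or NIC often shows up as errors in the interface counters; a quick way to check, assuming the Ceph network runs over an interface named something like eno1:

# negotiated speed and link state
ethtool eno1

# per-interface error and drop counters
ip -s link show eno1

# NIC-level error statistics (counter names vary by driver)
ethtool -S eno1 | grep -iE 'err|drop|crc'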

Thank you for your insight into this issue!
 
