Good day everyone!
I have been using Proxmox for quite some time and I'm loving it. I bought a new Dell C6420 with the plan to deploy it at a datacenter to offload our local infrastructure (our site is very prone to power outages).
This is my setup:
Proxmox Packages
proxmox-ve: 7.3-1 (running kernel: 5.13.19-6-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-5.15: 7.3-1
pve-kernel-helper: 7.3-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
ceph: 17.2.5-pve1
ceph-fuse: 17.2.5-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-1
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u1
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-1
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-2
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.7-pve2
Kernel: Linux 5.13.19-6-pve #1 SMP PVE 5.13.19-15 (Tue, 29 Mar 2022 15:59:50 +0200)
I downgraded from 5.15 as I read in some posts that downgrading to 5.13.19-6 solves this problem.
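In case it matters, this is roughly how the older kernel can be kept active (just a sketch, assuming proxmox-boot-tool manages the boot entries; the version string has to match the installed kernel):

apt install pve-kernel-5.13.19-6-pve     # make sure the older kernel is installed
proxmox-boot-tool kernel pin 5.13.19-6-pve
reboot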
3 nodes are already joined to the cluster.
All SSDs are connected to the UCEA-200 NVMe controller card.
Node 1: 2x WUS4BB076D7P3E3 7.68 TB
Node 2: Pending setup
Node 3: Monitor, Manager and Metadata only
Node 4: 1x WUS4BB076D7P3E3 7.68 TB
The final setup will be 4 nodes with one NVMe SSD each and a maximum of 3 monitors.
All the nodes are connected via 10 GbE Ethernet.
I am using the default Ceph config
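For reference, the CRUSH map below was dumped roughly like this (a sketch; the file names are arbitrary placeholders):

ceph osd getcrushmap -o crushmap.bin     # grab the compiled CRUSH map
crushtool -d crushmap.bin -o crushmap.txt    # decompile it to readable text
cat crushmap.txt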
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host sys01 {
    id -3           # do not change unnecessarily
    id -2 class ssd # do not change unnecessarily
    # weight 13.97260
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 6.98630
    item osd.1 weight 6.98630
}
host sys04 {
    id -5           # do not change unnecessarily
    id -4 class ssd # do not change unnecessarily
    # weight 6.98630
    alg straw2
    hash 0  # rjenkins1
    item osd.2 weight 6.98630
}
root default {
    id -1           # do not change unnecessarily
    id -6 class ssd # do not change unnecessarily
    # weight 20.95889
    alg straw2
    hash 0  # rjenkins1
    item sys01 weight 13.97260
    item sys04 weight 6.98630
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
OSD Setup
Default Pool Options
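The OSDs and the pool were created through the GUI with the default options; I believe the CLI equivalent is roughly the following (a sketch; the device path and pool name are placeholders, and size 3 / min_size 2 / pg_num 128 are the GUI defaults as far as I know):

pveceph osd create /dev/nvme0n1      # one OSD per NVMe SSD
pveceph pool create testpool --size 3 --min_size 2 --pg_num 128 --application rbd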
As this is a very simplified setup for testing before the final deployment next month, pardon me if the settings are not optimized or wrong (I'm still learning about Ceph).
Upon creation of the pool, errors about OSD slow ops appeared almost immediately, and upon writing to it, lots of PG errors showed up as well.
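For what it's worth, these are the commands I have been using to look at the errors (output omitted here):

ceph -s               # overall cluster status
ceph health detail    # lists the slow ops and the affected OSDs/PGs
ceph osd perf         # per-OSD commit/apply latency
ceph pg stat          # quick PG state summary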
The connections between the nodes are fine, as they are on a separate network dedicated to this test setup.
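This is roughly how I checked the dedicated network between the nodes (a sketch; 10.10.10.11 is just a placeholder for a node's cluster-network address, and iperf3 has to be installed via apt first):

iperf3 -s                # on one node
iperf3 -c 10.10.10.11    # on another node, against the first one
ping -c 5 10.10.10.11    # basic reachability/latency check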
I have recreated this Ceph cluster more than three times, but it always ends up in this state.
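The teardown before each recreate looks roughly like this on each node (a sketch; I'm not sure it's the cleanest way to purge, so corrections are welcome; the network is a placeholder):

ceph osd out 0                           # repeat per OSD id on the node
systemctl stop ceph-osd@0
pveceph osd destroy 0 --cleanup
pveceph purge --crash --logs             # after destroying the node's mon/mgr first
pveceph init --network 10.10.10.0/24     # then recreate mons, mgrs and OSDs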
I would appreciate some insight into this issue and some assistance if possible.
Thank you in advance, guys!