Hi Everyone,
We have a constant-write application (4-10 MB/s total) across ~120 VMs and we burned through SSDs in just under 2 years with our dedicated hosting provider, so they recommended we go back to spindles. We have built 5 new servers as follows:
- high core count, 256 GB RAM, a 480 GB SSD for the OS, and 4 x 2 TB spinning disks
- one OSD per spindle
- 1 x 10G network for Ceph, cluster, and VM traffic, each running in its own VLAN
- 1 x 10G network for our internet connection, which reports a speed of 10G but is throttled to 1G for internet
We had Ceph running on our last 5-node cluster, but there were three front ends and two file servers acting as the primary Ceph nodes, with smaller OSDs on the three front-end nodes, so again not optimal, but we got good performance until the disks started to drop like flies after around 1.5 years.
VMs are set up as follows (a rough config sketch follows the list):
- SCSI controller: VirtIO SCSI single
- disks: iothread=1, discard=0
- both VMs imported from the old cluster and a newly created VM were tested, with the same results
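For reference, the relevant lines from one of the test VMs look roughly like this (VM ID 100, the storage name ceph-vm, and the disk size are placeholders, not our actual values):

# qm config 100   (relevant lines only)
scsihw: virtio-scsi-single
scsi0: ceph-vm:vm-100-disk-0,iothread=1,size=100G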
PVE Host: %Cpu(s): 0.2 us, 0.2 sy, 0.0 ni, 98.8 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st
Proxmox: 7.4-3 (all nodes updated and running the same version)
During Ceph rebalance we see an average of 60 MiB/s.
Those average write rates were seen when iowait was recorded above 10-15%.
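If it helps to compare, raw pool write throughput can be checked from a node with something like the following (the pool name ceph-vm is a placeholder; --no-cleanup keeps the benchmark objects so a sequential read test can follow, and the last command removes them):

rados bench -p ceph-vm 60 write -b 4M -t 16 --no-cleanup
rados bench -p ceph-vm 60 seq -t 16
rados -p ceph-vm cleanup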
Ceph:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.62.2.97/27
fsid = 460edbc2-56d9-44d0-a07d-891edffe6f0b
mon_allow_pool_delete = true
mon_host = 10.62.2.97 10.62.2.99 10.62.2.101 10.62.2.98 10.62.2.100
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.62.2.97/27

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pve1]
host = pve1
mds_standby_for_name = pve

[mds.pve2]
host = pve2
mds_standby_for_name = pve

[mds.pve3]
host = pve3
mds_standby_for_name = pve

[mds.pve4]
host = pve4
mds_standby_for_name = pve

[mds.pve5]
host = pve5
mds_standby_for_name = pve

[mon.pve1]
public_addr = 10.62.2.97

[mon.pve2]
public_addr = 10.62.2.98

[mon.pve3]
public_addr = 10.62.2.99
OSD max_capacity values range from 191 to 430.
Weight per node is ~7.2.
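Those per-OSD max_capacity figures can be double-checked with something like the commands below (osd.0 is just an example ID; this assumes the option in question is the mclock scheduler's osd_mclock_max_capacity_iops_hdd):

ceph config dump | grep mclock_max_capacity
ceph config show osd.0 osd_mclock_max_capacity_iops_hdd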
Not sure what else would be helpful.
The cluster is not in production yet, but we need to get this sorted. It feels like a VM config issue; we cannot find any evidence of a bottleneck in Ceph, and even with 40 VMs running we do not see any more iowait than when we are running just 1.
All help would be appreciated.