Proxmox Extremely slow recovery & Ceph

infinityM · Sep 14, 2020

Hey Guys,

Can you please assist?
I'm struggling with Ceph. Everything was running smoothly until Friday.

Suddenly my Proxmox cluster is dead slow, and if I enable Ceph's backfill as normal, the load on the VPSs jumps to 100+ and no services are able to start up...
Nothing much changed on Ceph except for reinstalling one node and re-adding it to the cluster...

Can you please help me figure out why the sync speed is only around 12 MiB/s?

infinityM · Sep 14, 2020

As a note: what I did now was set up 2 bonds on each server (eno1 + eno3 and eno2 + eno4) for redundancy, and configured the bridge on those bonds to see if it helps performance, but it remains extremely slow... If I enable backfill, all VPSs immediately become unavailable...

Any idea what I can check?

Alwin · Sep 14, 2020

Please include the status and other config options in your opening post, so people can have a look right away.

In what state is the cluster now (ceph -s)?

infinityM · Sep 14, 2020

Alwin said:
Please include the status and other config options in your opening post, so people can have a look right away.

In what state is the cluster now (ceph -s)?

Sorry about that... Here's the status of ceph as it stands now...

root@pve:~# ceph -s
  cluster:
    id:     248fab2c-bd08-43fb-a562-08144c019785
    health: HEALTH_WARN
            Degraded data redundancy: 903844/5938684 objects degraded (15.220%), 358 pgs degraded, 358 pgs undersized
            2 daemons have recently crashed

  services:
    mon: 3 daemons, quorum pve,c4,c6 (age 4h)
    mgr: c6(active, since 5h), standbys: pve
    osd: 25 osds: 25 up (since 4h), 25 in (since 4h); 390 remapped pgs

  data:
    pools:   1 pools, 1024 pgs
    objects: 2.97M objects, 11 TiB
    usage:   21 TiB used, 14 TiB / 35 TiB avail
    pgs:     903844/5938684 objects degraded (15.220%)
             357240/5938684 objects misplaced (6.015%)
             634 active+clean
             352 active+undersized+degraded+remapped+backfill_wait
             32  active+remapped+backfill_wait
             6   active+undersized+degraded+remapped+backfilling

  io:
    client:   77 MiB/s rd, 4.7 MiB/s wr, 73 op/s rd, 243 op/s wr
    recovery: 17 MiB/s, 4 objects/s

I did limit recovery to 1 per OSD, though, so I suspect that's slowing things down, but if I raise it, the VMs become unresponsive.
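For a rough sense of scale, the outstanding object counts in the status output can be divided by the reported recovery rate. This is only a back-of-the-envelope sketch: it assumes the rate stays constant and that the degraded and misplaced counts do not overlap, neither of which is guaranteed, and the function name `estimate_recovery_hours` is made up for illustration.

```python
# Figures copied from the `ceph -s` output in this post.
degraded_objects = 903_844
misplaced_objects = 357_240
recovery_objects_per_s = 4   # "recovery: ... 4 objects/s"
recovery_mib_per_s = 17      # "recovery: 17 MiB/s"


def estimate_recovery_hours(objects_to_move: int, objects_per_s: float) -> float:
    """Hours needed to process `objects_to_move` at a constant object rate."""
    return objects_to_move / objects_per_s / 3600


# Assumption: degraded and misplaced objects are counted separately.
pending = degraded_objects + misplaced_objects
hours = estimate_recovery_hours(pending, recovery_objects_per_s)

# Average object size implied by the two reported recovery rates.
avg_object_mib = recovery_mib_per_s / recovery_objects_per_s

print(f"objects still to recover or move: {pending}")
print(f"estimated duration at the current rate: {hours:.1f} h ({hours / 24:.1f} days)")
print(f"average object size implied by the rate: {avg_object_mib:.2f} MiB")
```

At the posted rate this comes out at several days, which is why the recovery limit and the network layout both matter here.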

Alwin · Sep 14, 2020

infinityM said:
As a note: what I did now was set up 2 bonds on each server (eno1 + eno3 and eno2 + eno4) for redundancy, and configured the bridge on those bonds to see if it helps performance, but it remains extremely slow... If I enable backfill, all VPSs immediately become unavailable...

What type of bond did you set up? Are Ceph's public & cluster networks on the same bond?

infinityM · Sep 14, 2020

Alwin said:
What type of bond did you set up? Are Ceph's public & cluster networks on the same bond?

That's a good question. I know my corosync is on a separate IP range, but I don't know how to check the Ceph sync IP range, since I'd like to make it the internal IP range... How can I check?

Also, in PM, if I bond the ports using balance-rr, does that mean a bond of 2 x 10 Gb ports can do 20 Gb/s? I'm not sure how bonds work with Ceph yet... Trying to figure it out D:

Alwin · Sep 14, 2020

infinityM said:
Also, in PM, if I bond the ports using balance-rr, does that mean a bond of 2 x 10 Gb ports can do 20 Gb/s? I'm not sure how bonds work with Ceph yet... Trying to figure it out D:

Outgoing traffic can use the 20 Gb/s, but incoming traffic cannot. Hence, it will still be limited to 10 Gb/s.

infinityM said:
That's a good question. I know my corosync is on a separate IP range, but I don't know how to check the Ceph sync IP range, since I'd like to make it the internal IP range... How can I check?

Compare your network configuration with the IP ranges in ceph.conf.

infinityM · Sep 14, 2020

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 129.232.156.12/28
fsid = 248fab2c-bd08-43fb-a562-08144c019785
mon_allow_pool_delete = true
mon_host = 129.232.156.15 129.232.156.19 129.232.156.16
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 129.232.156.12/28

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.c4]
public_addr = 129.232.156.19

[mon.c6]
public_addr = 129.232.156.16

I'm not sure how to differentiate between the data sync IP range and the communication IP range, though. I'd like those to be separate...
And how can one safely change it without risk of data corruption?
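The comparison suggested earlier (network configuration versus the ranges in ceph.conf) can be sketched with Python's standard `ipaddress` module. The values below are copied from the posted configuration; the helper name `hosts_outside` is made up for illustration. This is only address arithmetic: how Ceph itself treats a network written with host bits set, and whether the result matters for this cluster, is not established in the thread.

```python
import ipaddress

# Values copied from the posted ceph.conf.
public_network = "129.232.156.12/28"
cluster_network = "129.232.156.12/28"
mon_hosts = ["129.232.156.15", "129.232.156.19", "129.232.156.16"]

# strict=False accepts an address with host bits set and normalises it
# to the enclosing network instead of raising ValueError.
public = ipaddress.ip_network(public_network, strict=False)
cluster = ipaddress.ip_network(cluster_network, strict=False)

print(f"public network : {public} ({public[0]} - {public[-1]})")
print(f"cluster network: {cluster}")
print(f"public and cluster share one subnet: {public == cluster}")


def hosts_outside(network, addresses):
    """Return the addresses that do not fall inside `network`."""
    return [a for a in addresses if ipaddress.ip_address(a) not in network]


outside = hosts_outside(public, mon_hosts)
print(f"mon_host entries outside {public}: {outside}")
```

With the posted values, both settings normalise to the same /28, so public and cluster traffic are not separated, and two of the three `mon_host` addresses fall outside that range. Separating the networks would mean giving `cluster_network` a different subnet that every OSD node has an address in.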

David Herselman · Sep 14, 2020

Also, don't run balance-rr...
Use a hashing-based director, where the traffic for each stream follows a consistent path...

infinityM · Sep 14, 2020

Alwin said:
Outgoing traffic can use the 20 Gb/s, but incoming traffic cannot. Hence, it will still be limited to 10 Gb/s.

Theoretically, though, doesn't that mean Ceph can send out 2 x 10 Gb/s to different servers while re-balancing, even though each of those servers would only receive 10 Gb/s?
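To put the link speed and the observed recovery rate in the same unit, here is a small conversion sketch. It uses the raw line rate of one 10 Gb/s port and ignores protocol overhead; the function name `gbit_to_mib_per_s` is made up for illustration. The result says nothing by itself about where the bottleneck is, only how far apart the two numbers are.

```python
LINK_GBIT_PER_S = 10       # one 10 Gb/s port, as discussed in this thread
RECOVERY_MIB_PER_S = 17    # "recovery: 17 MiB/s" from the posted ceph -s


def gbit_to_mib_per_s(gbit: float) -> float:
    """Convert gigabits per second (10^9 bits) to MiB per second (2^20 bytes)."""
    return gbit * 1e9 / 8 / 2**20


link_mib = gbit_to_mib_per_s(LINK_GBIT_PER_S)
share = RECOVERY_MIB_PER_S / link_mib

print(f"one 10 Gb/s link, raw line rate: {link_mib:.0f} MiB/s")
print(f"reported recovery rate: {RECOVERY_MIB_PER_S} MiB/s ({share:.1%} of one link)")
```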

infinityM · Sep 14, 2020

David Herselman said:
Also, don't run balance-rr...
Use a hashing-based director, where the traffic for each stream follows a consistent path...

Please elaborate? I'm not sure what you're suggesting?

Alwin · Sep 15, 2020

infinityM said:
I'm not sure how to differentiate between the data sync IP range and the communication IP range, though. I'd like those to be separate...
And how can one safely change it without risk of data corruption?

Since Proxmox VE is client and server at the same time, a separation will split the used bandwidth. But simply changing the IP of the cluster_network is not a problem. You will need to configure a second IP address on the bond before you restart all the OSD daemons.

In any case, these changes don't touch the written data on Ceph. If something doesn't work, then you simply change back your configuration.

infinityM said:
Please elaborate? I'm not sure what you're suggesting?

In short, use either LACP (which offers different hash policies) or active-backup. For LACP, your switch needs to support it as well. If it is a small cluster, you could also run a full-mesh network, without any switch involved.
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server

David Herselman · Sep 16, 2020

infinityM said:
Please elaborate? I'm not sure what you're suggesting?

Balance-rr round-robins packet delivery, which will 100% lead to packets arriving at the destination out of order. This is not necessarily because the cables are different lengths, but simply because packets arrive on different NICs and are collected by separate threads running on differently loaded cores.

LACP hashes information such as source and destination IPs, and even layer 4 port information, to ensure streams follow a consistent path.

On RR, a brute-force UDP iperf test will show excellent results, but data transfers will suffer many retransmits, and window scaling subsequently never kicks in.

Elevator pitch:
Less is more: hashing multiple flows over LACP uplinks is super reliable and consistent when used on nearly all switches.
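The difference between the two approaches can be illustrated with a toy model. This is a sketch only: the flows are hypothetical, the link names are taken from one of the bonds mentioned earlier in the thread, and the CRC32-based hash is a stand-in, not the hash policy of any particular switch or of the Linux bonding driver.

```python
from collections import defaultdict
from itertools import cycle
import zlib

LINKS = ["eno1", "eno3"]  # two bond members, as named earlier in the thread


def round_robin(packets, links):
    """Assign each packet to the next link in turn, ignoring its flow."""
    turn = cycle(links)
    return [(flow, next(turn)) for flow in packets]


def flow_hash(packets, links):
    """Assign each packet by a hash of its flow tuple (toy hash: CRC32)."""
    assignment = []
    for flow in packets:
        key = "|".join(map(str, flow)).encode()
        assignment.append((flow, links[zlib.crc32(key) % len(links)]))
    return assignment


def links_per_flow(assignment):
    """Collect the set of links each flow ended up using."""
    used = defaultdict(set)
    for flow, link in assignment:
        used[flow].add(link)
    return dict(used)


# Hypothetical flows: (source IP, destination IP, source port, destination port).
flow_a = ("10.0.0.1", "10.0.0.2", 50000, 6800)
flow_b = ("10.0.0.1", "10.0.0.3", 50001, 6801)
packets = [flow_a, flow_b, flow_a, flow_a, flow_b, flow_a]

rr = links_per_flow(round_robin(packets, LINKS))
hashed = links_per_flow(flow_hash(packets, LINKS))

for flow in (flow_a, flow_b):
    print(f"{flow}: round-robin uses {sorted(rr[flow])}, hashing uses {sorted(hashed[flow])}")
```

In the round-robin case each flow is spread over both links, which is what allows its packets to overtake each other. With hashing, every packet of a flow takes the same link, so a single flow never exceeds one link's speed but stays in order.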

From https://wiki.mikrotik.com/wiki/Manual:Interface/Bonding

802.3ad
802.3ad mode is an IEEE standard also called LACP (Link Aggregation Control Protocol). It includes automatic configuration of the aggregates, so minimal configuration of the switch is needed. This standard also mandates that frames will be delivered in order and connections should not see mis-ordering of packets. The standard also mandates that all devices in the aggregate must operate at the same speed and duplex mode.

LACP balances outgoing traffic across the active ports based on hashed protocol header information and accepts incoming traffic from any active port. The hash includes the Ethernet source and destination address and if available, the VLAN tag, and the IPv4/IPv6 source and destination address. How this is calculated depends on transmit-hash-policy parameter. The ARP link monitoring is not recommended, because the ARP replies might arrive only on one slave port due to transmit hash policy on the LACP peer device. This can result in unbalanced transmitted traffic, so MII link monitoring is the recommended option.

balance-rr
If this mode is set, packets are transmitted in sequential order from the first available slave to the last.
Balance-rr is the only mode that will send packets across multiple interfaces that belong to the same TCP/IP connection.
When utilizing multiple sending and multiple receiving links, packets are often received out of order, which result in segment retransmission, for other protocols such as UDP it is not a problem if client software can tolerate out-of-order packets.
If switch is used to aggregate links together, then appropriate switch port configuration is required, however many switches do not support balance-rr.
