[SOLVED] Ceph cluster very slow when one node is offline

erwann

New Member
Oct 14, 2019
Hi,

We have a cluster of 4 servers (all of them up to date, version 6.0.7).
We have configured Ceph on all of them, with 3 monitors.
All servers work well when they are all online, but when one host is down, the others are very, very slow.

Have you ever seen this?

Regards,
 
All servers work well when they are all online, but when one host is down, the others are very, very slow.
How so? Can you please elaborate on your cluster and that issue?
 
Our cluster is a new cluster and there are a few VMs on it.
Most of them are webservers.
When all nodes are up, Ceph health is OK.
All servers are responding.
But when I reboot a node for maintenance or other reasons, all VMs stop responding. The other three nodes have high CPU load, and there seem to be too many IOPS on Ceph.
 
What does your ceph osd tree look like? And what does ceph -s show when one node is down? What hardware did you use to build the cluster?
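For reference, that information can usually be gathered on any node with something like:

# OSD topology and states
ceph osd tree
# overall cluster status (PG states, degraded objects, blocked requests)
ceph -s
# more detail on current warnings
ceph health detail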
 
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 16.34796 root default
-3 4.08699 host cog-pve01
0 hdd 0.81740 osd.0 up 1.00000 1.00000
1 hdd 0.81740 osd.1 up 1.00000 1.00000
2 hdd 0.81740 osd.2 up 1.00000 1.00000
3 hdd 0.81740 osd.3 up 1.00000 1.00000
4 hdd 0.81740 osd.4 up 1.00000 1.00000
-5 4.08699 host cog-pve02
5 hdd 0.81740 osd.5 up 1.00000 1.00000
6 hdd 0.81740 osd.6 up 1.00000 1.00000
7 hdd 0.81740 osd.7 up 1.00000 1.00000
8 hdd 0.81740 osd.8 up 1.00000 1.00000
9 hdd 0.81740 osd.9 up 1.00000 1.00000
-7 4.08699 host cog-pve03
10 hdd 0.81740 osd.10 up 1.00000 1.00000
11 hdd 0.81740 osd.11 up 1.00000 1.00000
12 hdd 0.81740 osd.12 up 1.00000 1.00000
13 hdd 0.81740 osd.13 up 1.00000 1.00000
14 hdd 0.81740 osd.14 up 1.00000 1.00000
-9 4.08699 host cog-pve04
15 hdd 0.81740 osd.15 up 1.00000 1.00000
16 hdd 0.81740 osd.16 up 1.00000 1.00000
17 hdd 0.81740 osd.17 up 1.00000 1.00000
18 hdd 0.81740 osd.18 up 1.00000 1.00000
19 hdd 0.81740 osd.19 up 1.00000 1.00000

For hardware, we have 4 Fujitsu servers with 2x Intel Xeon 4114, 128 GB RAM and 4 network interfaces (2x 10G for the Ceph network and 2x 10G for the VM network), and 5x 900 GB SAS disks per server.

I have to schedule a maintenance window to get the "ceph -s" output.
 
For hardware, we have 4 Fujitsu servers with 2x Intel Xeon 4114, 128 GB RAM and 4 network interfaces (2x 10G for the Ceph network and 2x 10G for the VM network), and 5x 900 GB SAS disks per server.
What about the disk controller? And where is the corosync traffic located?

Can you please also post the crushmap?
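For reference, the crush map can usually be exported and decompiled with something along these lines (crushtool ships with the ceph packages):

# dump the compiled crush map from the cluster
ceph osd getcrushmap -o crushmap.bin
# decompile it into readable text
crushtool -d crushmap.bin -o crushmap.txt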
 
The crushmap is:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host cog-pve01 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 4.087
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.817
item osd.1 weight 0.817
item osd.2 weight 0.817
item osd.3 weight 0.817
item osd.4 weight 0.817
}
host cog-pve02 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 4.087
alg straw2
hash 0 # rjenkins1
item osd.5 weight 0.817
item osd.6 weight 0.817
item osd.7 weight 0.817
item osd.8 weight 0.817
item osd.9 weight 0.817
}
host cog-pve03 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 4.087
alg straw2
hash 0 # rjenkins1
item osd.10 weight 0.817
item osd.11 weight 0.817
item osd.12 weight 0.817
item osd.13 weight 0.817
item osd.14 weight 0.817
}
host cog-pve04 {
id -9 # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
# weight 4.087
alg straw2
hash 0 # rjenkins1
item osd.15 weight 0.817
item osd.16 weight 0.817
item osd.17 weight 0.817
item osd.18 weight 0.817
item osd.19 weight 0.817
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 16.348
alg straw2
hash 0 # rjenkins1
item cog-pve01 weight 4.087
item cog-pve02 weight 4.087
item cog-pve03 weight 4.087
item cog-pve04 weight 4.087
}

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map



The disk controller is a CP400i RAID Controller, 12Gb/s, 8 ports, based on the LSI SAS3008.
The corosync traffic is on the VM network (not on the dedicated Ceph network).
 
The disk controller is a CP400i RAID Controller, 12Gb/s, 8 ports, based on the LSI SAS3008.
Hm. RAID controllers are a no-go for Ceph; please find more information in our docs.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
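As a rough sanity check (not a substitute for the docs), you can look at how the disks are presented to the OS; behind a RAID controller the reported model/serial is often the controller's logical volume rather than the physical drive:

# model and type of the block devices as seen by the kernel
lsblk -o NAME,MODEL,SIZE,ROTA
# SMART identity of one disk; behind an LSI RAID controller this may need -d megaraid,N
smartctl -i /dev/sda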

The corosync traffic is on the VM network (not on the dedicated Ceph network).
This will cause interference and might sooner or later disrupt corosync communication, which will lead to a non-writable '/etc/pve' and node resets if HA is activated. Separate the corosync traffic onto its own physical NIC ports and add a second ring for redundancy. Bandwidth is less important for corosync; what matters is low and stable latency.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_cluster_network
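As a minimal sketch (the addresses are placeholders, not your actual subnets), a second corosync link is added in /etc/pve/corosync.conf by giving each node entry a ring1_addr on the dedicated network and bumping config_version in the totem section:

nodelist {
  node {
    name: cog-pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.0.2.11    # existing corosync address (placeholder)
    ring1_addr: 198.51.100.11 # second, dedicated corosync network (placeholder)
  }
  # ... same pattern for cog-pve02, cog-pve03 and cog-pve04 ...
}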
 
Every SAS disk is configured as a RAID 0 with no cache.

For the corosync network, the physical NICs dedicated to VMs are 2x 10Gb fiber. Our link between nodes is under 0.15 ms (whereas the documentation says "This needs a reliable network with latencies under 2 milliseconds (LAN performance) to work properly").
 
Our link between nodes is under 0.15 ms (whereas the documentation says "This needs a reliable network with latencies under 2 milliseconds (LAN performance) to work properly").
This is correct, but as I tried to explain above, the other traffic on that interface might sooner or later interfere with corosync. For example, if one of the webservers comes under a DoS attack, the interface might simply be overloaded and no other traffic will be able to pass, including corosync.

Every SAS disk is configured as a RAID 0 with no cache.
This is not the same as an HBA. From experience, RAID controllers mask properties of the physical disks, still optimize reads/writes, and often their features can't be deactivated due to their technical layout. This is why we also cover this in our precondition section.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

And from your original description, it sounds like blocked IO (slow requests), where the recovery/backfill of Ceph overloads the RAID controller. There should be messages in the Ceph logs (/var/log/ceph/) about what was going on at the time of the slowdown.
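For example, something along these lines should typically surface them:

# blocked/slow requests reported in the cluster log
grep -i slow /var/log/ceph/ceph.log
# current health warnings in detail
ceph health detail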
 
Hi,
I took a maintenance window to get the "ceph -s" output.
Regards,
 

Attachments

  • ceph.png (26.6 KB)
>> And from your original description, it sounds like blocked IO (slow requests), where the recovery/backfill of Ceph overloads the RAID controller.
>> There should be messages in the Ceph logs (/var/log/ceph/) about what was going on at the time of the slowdown.

Looking at the last screenshot, the osd noout flag is set, so no rebalance should occur.

@erwann: what are your size and min_size for the different pools?
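For reference, those can be checked with, for example:

# currently set cluster flags (noout, nodown, ...)
ceph osd dump | grep flags
# replica settings of all pools
ceph osd pool ls detail
# or for a single pool
ceph osd pool get <poolname> size
ceph osd pool get <poolname> min_size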
 
Yes, this morning I tested the noout flag and the nodown flag, but I see the same behavior with either flag.
I have only one pool, with size 2 and min_size 2 (1024 placement groups).
 
Hello,
we have the same problem. We have a 4-node cluster with Ceph and HA for VMs. When all 4 nodes are up, performance is good:

Object prefix: benchmark_data_vm-host5_509634
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 239 223 891.928 892 0.0874671 0.0706119
2 16 454 438 875.92 860 0.0648146 0.0710034
3 16 672 656 874.575 872 0.0482636 0.0717775
4 16 900 884 883.904 912 0.0464978 0.0716867
5 16 1129 1113 890.303 916 0.024642 0.0715882


But when 1 node goes down, this happens:

Object prefix: benchmark_data_vm-host5_509929
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 34 18 71.9934 72 0.0223113 0.0443496
2 16 34 18 35.995 0 - 0.0443496
3 16 34 18 23.9968 0 - 0.0443496
4 16 34 18 17.9976 0 - 0.0443496
5 16 34 18 14.3981 0 - 0.0443496
6 16 34 18 11.9984 0 - 0.0443496

We use pve-manager/6.0-7/28984024 (running kernel: 5.0.21-2-pve) on our 4 nodes.

CEPH.CONF
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.37.28.0/24
fsid = d2a41b91-e4ec-4e4c-bab9-c08c9fedc78c
mon_allow_pool_delete = true
mon_host = 10.37.28.6 10.37.28.5 10.37.28.4 10.37.28.3
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.37.27.0/24
ms_bind_port_max = 8300
debug asok = 0/0
debug auth = 0/0
debug bdev = 0/0
debug bluefs = 0/0
debug bluestore = 0/0
debug buffer = 0/0
debug civetweb = 0/0
debug client = 0/0
debug compressor = 0/0
debug context = 0/0
debug crush = 0/0
debug crypto = 0/0
debug dpdk = 0/0
debug eventtrace = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug fuse = 0/0
debug heartbeatmap = 0/0
debug javaclient = 0/0
debug journal = 0/0
debug journaler = 0/0
debug kinetic = 0/0
debug kstore = 0/0
debug leveldb = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug memdb = 0/0
debug mgr = 0/0
debug mgrc = 0/0
debug mon = 0/0
debug monc = 0/00
debug ms = 0/0
debug none = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rbd mirror = 0/0
debug rbd replay = 0/0
debug refs = 0/0
debug reserver = 0/0
debug rgw = 0/0
debug rocksdb = 0/0
debug striper = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0
debug xio = 0/0

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon]
mon allow pool delete = True
mon health preluminous compat = True
mon osd down out interval = 300

[osd]
bluestore cache autotune = 0
bluestore cache kv ratio = 0.2
bluestore cache meta ratio = 0.8
bluestore cache size ssd = 8G
bluestore csum type = none
bluestore extent map shard max size = 200
bluestore extent map shard min size = 50
bluestore extent map shard target size = 100
bluestore rocksdb options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
osd map share max epochs = 100
osd max backfills = 5
osd memory target = 4294967296
osd op num shards = 8
osd op num threads per shard = 2
osd min pg log entries = 10
osd max pg log entries = 10
osd pg log dups tracked = 10
osd pg log trim min = 10
 

Attachments

  • crushmapdump-decompiled.txt (5.1 KB)
I have only one pool, with size 2 and min_size 2 (1024 placement groups).
Size is the target replica count for Ceph, and min_size is the replica count needed to keep the pool in read/write mode. As one node is down, only one replica exists and the pool is read-only, so IO is blocked until at least two replicas exist.
 
When we changed min_size from 2 to 3, it didn't change the read/write performance when 1 node goes down.
 
When we changed min_size from 2 to 3, it didn't change the read/write performance when 1 node goes down.
The min_size is the minimum number of replicas of a PG that need to exist on the cluster to keep the pool in read/write operation. As one node is down, some of those PGs are only available once. Those replicas have to be recovered first.

When you set size = 3, then 3 replicas will be created and two will still be available if one node is down. But this will reduce the amount of available space.
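For illustration (the pool name is a placeholder), the usual 3/2 setup would be applied like this:

# keep three replicas, stay writable with two
ceph osd pool set <poolname> size 3
ceph osd pool set <poolname> min_size 2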
 
@Alwin: Thanks to you, our latest tests are OK.
During a reboot of a node, the webservers are unavailable for a few seconds (switching to the other replica of the data, maybe?), but all VMs are fine after that.
And when the node is back up, the rebuild is automatic and works perfectly.
We can work with one node down, as expected.

PS: I will reconfigure the cluster network onto dedicated gigabit ports (a bond of 2 per server), as you explained in #8 (rough sketch below).
Thank you!
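For that reconfiguration, a minimal sketch of a dedicated corosync bond in /etc/network/interfaces could look like this (interface names and the address are placeholders):

auto bond1
iface bond1 inet static
    address 198.51.100.11/24   # placeholder corosync address
    bond-slaves eno3 eno4      # placeholder NIC names
    bond-miimon 100
    bond-mode active-backup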
 
@Alwin: Thanks for your help. We set our pool to size 4 / min_size 3, but it still doesn't work like we expected.
This is our test:
We have one pool with 8 OSDs: size 4, min_size 3.
We disabled ens2f1 (the 10G cluster interface) and ran the command on another node:

rados bench -p SATA 60 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_vm-host5_1920940
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 17 1 3.99979 4 0.20863 0.20863
2 16 17 1 1.99975 0 - 0.20863
3 16 17 1 1.33318 0 - 0.20863
4 16 17 1 0.999871 0 - 0.20863
5 16 17 1 0.799901 0 - 0.20863
6 16 17 1 0.666586 0 - 0.20863
7 16 17 1 0.571363 0 - 0.20863
We have enabled IGMP and IGMP snooping on our switch, an Extreme Networks Summit X670-G2-72x (could that have anything to do with our problem?), and then did this:
post-up ( echo 1 > /sys/devices/virtual/net/$IFACE/bridge/multicast_querier )
post-up ( echo 0 > /sys/class/net/$IFACE/bridge/multicast_snooping )
We also tried disabling it, without results.
Where can we look for the problem? Thanks in advance, you are awesome.
 
