[SOLVED] Ceph cluster very slow when one node is offline

erwann

New Member
Oct 14, 2019
Hi,

We have a cluster of 4 servers (all of them up to date on version 6.0.7).
We have configured Ceph on all of them, with 3 monitors.
All servers work well when they are all online, but when one host is down, the others are very, very slow.

Have you ever seen this?

Regards,
 
All servers work well when they are all online, but when one host is down, the others are very, very slow.
How so? Can you please elaborate on your cluster and that issue?
 
Our cluster is a new cluster and there are only a few VMs on it.
Most of them are webservers.
When all nodes are up, Ceph health is OK.
All servers are responding.
But when I reboot a node for maintenance or other reasons, none of the VMs respond. The other three nodes have a high CPU load, and there seem to be too many IOPS on Ceph.
 
What does your ceph osd tree look like? And what does ceph -s show when one node is down? What hardware did you use to build the cluster?
 
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 16.34796 root default
-3 4.08699 host cog-pve01
0 hdd 0.81740 osd.0 up 1.00000 1.00000
1 hdd 0.81740 osd.1 up 1.00000 1.00000
2 hdd 0.81740 osd.2 up 1.00000 1.00000
3 hdd 0.81740 osd.3 up 1.00000 1.00000
4 hdd 0.81740 osd.4 up 1.00000 1.00000
-5 4.08699 host cog-pve02
5 hdd 0.81740 osd.5 up 1.00000 1.00000
6 hdd 0.81740 osd.6 up 1.00000 1.00000
7 hdd 0.81740 osd.7 up 1.00000 1.00000
8 hdd 0.81740 osd.8 up 1.00000 1.00000
9 hdd 0.81740 osd.9 up 1.00000 1.00000
-7 4.08699 host cog-pve03
10 hdd 0.81740 osd.10 up 1.00000 1.00000
11 hdd 0.81740 osd.11 up 1.00000 1.00000
12 hdd 0.81740 osd.12 up 1.00000 1.00000
13 hdd 0.81740 osd.13 up 1.00000 1.00000
14 hdd 0.81740 osd.14 up 1.00000 1.00000
-9 4.08699 host cog-pve04
15 hdd 0.81740 osd.15 up 1.00000 1.00000
16 hdd 0.81740 osd.16 up 1.00000 1.00000
17 hdd 0.81740 osd.17 up 1.00000 1.00000
18 hdd 0.81740 osd.18 up 1.00000 1.00000
19 hdd 0.81740 osd.19 up 1.00000 1.00000

For hardware, we have 4 Fujitsu servers, each with 2 Intel Xeon 4114 CPUs, 128 GB RAM, 4 network interfaces (2 × 10G for the Ceph network and 2 × 10G for the VM network), and 5 × 900 GB SAS disks.

I will have to schedule a maintenance window to get the "ceph -s" output.
 
For hardware, we have 4 Fujitsu servers, each with 2 Intel Xeon 4114 CPUs, 128 GB RAM, 4 network interfaces (2 × 10G for the Ceph network and 2 × 10G for the VM network), and 5 × 900 GB SAS disks.
What about the disk controller? And where is the corosync traffic located?

Can you please also post the crushmap?
 
The crushmap is:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host cog-pve01 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 4.087
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.817
item osd.1 weight 0.817
item osd.2 weight 0.817
item osd.3 weight 0.817
item osd.4 weight 0.817
}
host cog-pve02 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 4.087
alg straw2
hash 0 # rjenkins1
item osd.5 weight 0.817
item osd.6 weight 0.817
item osd.7 weight 0.817
item osd.8 weight 0.817
item osd.9 weight 0.817
}
host cog-pve03 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 4.087
alg straw2
hash 0 # rjenkins1
item osd.10 weight 0.817
item osd.11 weight 0.817
item osd.12 weight 0.817
item osd.13 weight 0.817
item osd.14 weight 0.817
}
host cog-pve04 {
id -9 # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
# weight 4.087
alg straw2
hash 0 # rjenkins1
item osd.15 weight 0.817
item osd.16 weight 0.817
item osd.17 weight 0.817
item osd.18 weight 0.817
item osd.19 weight 0.817
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 16.348
alg straw2
hash 0 # rjenkins1
item cog-pve01 weight 4.087
item cog-pve02 weight 4.087
item cog-pve03 weight 4.087
item cog-pve04 weight 4.087
}

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map



The disk controller is a CP400i RAID Controller 12Gb/s 8 port, based on the LSI SAS3008.
The corosync traffic is on the VM network (not the dedicated Ceph network).
 
The disk controller is a CP400i RAID Controller 12Gb/s 8 port, based on the LSI SAS3008.
Hm. RAID controllers are a no-go for Ceph; please find more information in our docs.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

The corosync traffic is on the VM network (not the dedicated Ceph network).
This will cause interference and might sooner or later disrupt corosync communication, which leads to a non-writable '/etc/pve' and to node resets if HA is activated. Separate the corosync traffic onto its own physical NIC ports and add a second ring for redundancy. Bandwidth is less important for corosync; what matters more is low and stable latency.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_cluster_network
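For illustration only, a second corosync link per node in /etc/pve/corosync.conf could look roughly like this; the addresses and subnets below are placeholders, not values from this cluster:

nodelist {
  node {
    name: cog-pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # dedicated corosync subnet (placeholder)
    ring1_addr: 10.10.20.1   # second, independent link for redundancy (placeholder)
  }
  # ... same pattern for the other three nodes ...
}
# remember to increase config_version in the totem section whenever you edit this file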
 
Every SAS disk is configured as a single-disk RAID 0 with no cache.

For the corosync network, the physical NICs dedicated to the VMs are 2 × 10 Gb fibre. Our link between nodes is under 0.15 ms (where the documentation says "This needs a reliable network with latencies under 2 milliseconds (LAN performance) to work properly").
 
Our link between nodes is under 0.15 ms (where the documentation says "This needs a reliable network with latencies under 2 milliseconds (LAN performance) to work properly").
This is correct, but as I tried to explain above, the other traffic on that interface might sooner or later interfere with corosync. For example, if one of the webservers comes under a DoS attack, the interface might simply be overloaded and no other traffic would get through, including corosync.

Every SAS disk is configured as a single-disk RAID 0 with no cache.
This is not the same as an HBA. From experience, RAID controllers mask properties of the physical disks, still optimize reads/writes on their own, and often these features can't be deactivated due to the controller's technical layout. This is why we also have it in our precondition section.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
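As a quick, non-authoritative check of how much the controller hides, you can compare what the OS reports against the physical disks; behind some LSI-based controllers smartctl needs a vendor-specific -d option, and the device names below are placeholders:

lsblk -o NAME,MODEL,SERIAL,ROTA,SIZE   # does the OS see the real disk model/serial, or only a RAID volume?
smartctl -a /dev/sda                   # SMART data may be hidden behind the RAID layer
smartctl -a -d megaraid,0 /dev/sda     # possibly needed behind MegaRAID-type firmware (assumption, controller-dependent)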

And from your original description, it sounds like blocked IO (slow requests), where the recovery/backfill of Ceph overloads the RAID controller. There should be messages in the Ceph logs (/var/log/ceph/) about what was going on during the slowdown.
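For example (log and message names can vary slightly per release, so treat this as a sketch), blocked IO usually shows up like this:

ceph health detail                                    # lists slow/blocked requests while the problem is happening
grep -Ei 'slow (request|ops)' /var/log/ceph/ceph.log  # cluster log on the monitor nodes
grep -Ei 'slow (request|ops)' /var/log/ceph/ceph-osd.*.log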
 
Hi,
I took a maintenance window to get the "ceph -s" output.
Regards,
 

Attachments

  • ceph.png (26.6 KB)
>> And from your original description, it sounds like blocked IO (slow requests), where the recovery/backfill of Ceph overloads the RAID controller. There should be messages in the Ceph logs (/var/log/ceph/) about what was going on during the slowdown.

Looking at the last screenshot, the osd noout flag is set, so no rebalance should occur.

@erwann: what are your size and min_size for the different pools?
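You can read those values directly; the pool name below is a placeholder:

ceph osd pool ls detail            # size, min_size, pg_num and flags for every pool
ceph osd pool get <pool> size
ceph osd pool get <pool> min_size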
 
Yes, this morning I tested the noout flag and the nodown flag, but I have the same behavior with either flag.
I have only one pool, with size 2 and min_size 2 (1024 placement groups)
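For reference, those flags are set and cleared like this; they only stop Ceph from marking OSDs out/down during maintenance and do not help once PGs fall below min_size:

ceph osd set noout      # keep the OSDs of the rebooted node from being marked out
ceph osd set nodown     # optional, keeps them from being marked down
# ... reboot the node ...
ceph osd unset nodown
ceph osd unset noout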
 
Hello,
we have the same problem. We have a 4-node cluster with Ceph and HA for VMs. When all 4 nodes are up, performance is good:

Object prefix: benchmark_data_vm-host5_509634
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 239 223 891.928 892 0.0874671 0.0706119
2 16 454 438 875.92 860 0.0648146 0.0710034
3 16 672 656 874.575 872 0.0482636 0.0717775
4 16 900 884 883.904 912 0.0464978 0.0716867
5 16 1129 1113 890.303 916 0.024642 0.0715882


But when 1 node goes down, this happens:

Object prefix: benchmark_data_vm-host5_509929
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 34 18 71.9934 72 0.0223113 0.0443496
2 16 34 18 35.995 0 - 0.0443496
3 16 34 18 23.9968 0 - 0.0443496
4 16 34 18 17.9976 0 - 0.0443496
5 16 34 18 14.3981 0 - 0.0443496
6 16 34 18 11.9984 0 - 0.0443496

We use pve-manager/6.0-7/28984024 (running kernel: 5.0.21-2-pve) on all 4 nodes.
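(The tables above look like rados bench write output; for reference, a write test with later cleanup is usually run like this, the pool name being a placeholder:)

rados bench -p <pool> 60 write --no-cleanup
rados bench -p <pool> 60 seq        # optional read test against the objects just written
rados -p <pool> cleanup             # remove the benchmark objects afterwards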

CEPH.CONF
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.37.28.0/24
fsid = d2a41b91-e4ec-4e4c-bab9-c08c9fedc78c
mon_allow_pool_delete = true
mon_host = 10.37.28.6 10.37.28.5 10.37.28.4 10.37.28.3
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.37.27.0/24
ms_bind_port_max = 8300
debug asok = 0/0
debug auth = 0/0
debug bdev = 0/0
debug bluefs = 0/0
debug bluestore = 0/0
debug buffer = 0/0
debug civetweb = 0/0
debug client = 0/0
debug compressor = 0/0
debug context = 0/0
debug crush = 0/0
debug crypto = 0/0
debug dpdk = 0/0
debug eventtrace = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug fuse = 0/0
debug heartbeatmap = 0/0
debug javaclient = 0/0
debug journal = 0/0
debug journaler = 0/0
debug kinetic = 0/0
debug kstore = 0/0
debug leveldb = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug memdb = 0/0
debug mgr = 0/0
debug mgrc = 0/0
debug mon = 0/0
debug monc = 0/00
debug ms = 0/0
debug none = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rbd mirror = 0/0
debug rbd replay = 0/0
debug refs = 0/0
debug reserver = 0/0
debug rgw = 0/0
debug rocksdb = 0/0
debug striper = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0
debug xio = 0/0

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon]
mon allow pool delete = True
mon health preluminous compat = True
mon osd down out interval = 300

[osd]
bluestore cache autotune = 0
bluestore cache kv ratio = 0.2
bluestore cache meta ratio = 0.8
bluestore cache size ssd = 8G
bluestore csum type = none
bluestore extent map shard max size = 200
bluestore extent map shard min size = 50
bluestore extent map shard target size = 100
bluestore rocksdb options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
osd map share max epochs = 100
osd max backfills = 5
osd memory target = 4294967296
osd op num shards = 8
osd op num threads per shard = 2
osd min pg log entries = 10
osd max pg log entries = 10
osd pg log dups tracked = 10
osd pg log trim min = 10
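To check which recovery/backfill values the running daemons actually use (the OSD id below is just an example), they can be queried at runtime:

ceph config show osd.0 | grep -E 'backfill|recovery'
# or via the admin socket on the node that hosts the OSD:
ceph daemon osd.0 config show | grep -E 'backfill|recovery'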
 


I have only one pool, with size 2 and min_size 2 (1024 placement groups)
Size is the target replica count for Ceph and min_size is the replica count needed to keep the pool in read/write mode. As one node is down, only one replica exists and the pool goes read-only, so IO is blocked until at least two replicas exist again.
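For example (the pool name is a placeholder), moving the pool to 3/2 looks like this; Ceph then creates the missing third replicas in the background:

ceph osd pool set <pool> size 3       # keep three copies of every object
ceph osd pool set <pool> min_size 2   # keep IO running as long as two copies are available
ceph osd pool get <pool> size         # verify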
 
When we changed min_size from 2 to 3, it didn't change the read/write performance when 1 node goes down.
 
When we changed min_size from 2 to 3, it didn't change the read/write performance when 1 node goes down.
The min_size is the minimum number of replicas of a PG that need to exist on the cluster to keep the pool in read/write operation. As one node is down, some of those PGs are only available once. Those replicas have to be recovered first.

When you set size = 3, then 3 replicas will be created and two will still be available if one node is down. But this will reduce the amount of available space.
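As a rough illustration based on the osd tree above: 20 OSDs × ~0.82 TiB ≈ 16.3 TiB raw, so with size = 3 roughly 16.3 / 3 ≈ 5.4 TiB are usable (before nearfull/full ratios), compared to about 8.2 TiB with size = 2.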
 
@Alwin: Thanks to you, our latest tests are OK.
During a reboot of a node, the webservers are unavailable for a few seconds (switching to the other replica of the data, maybe?), but all VMs are fine after that.
And when the node is back up, the rebuild is automatic and works perfectly.
We can work with one node down, as expected.

PS: I will reconfigure the cluster network onto dedicated gigabit ports (a bond of 2 per server), as you explained in #8.
Thank you !
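For reference, a dedicated bonded corosync network in /etc/network/interfaces could look roughly like this; interface names, bond mode and the address are placeholders:

auto bond1
iface bond1 inet static
        address 10.10.10.1
        netmask 255.255.255.0
        bond-slaves eno3 eno4
        bond-miimon 100
        bond-mode active-backup
        # dedicated to corosync only, no VM or Ceph traffic on this bond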
 
@Alwin: Thanks for your help. We made our pool size 4/3, but it still doesn't work like we expected.
This is our test:
We have one pool with 8 OSDs: size 4, min_size 3.
We disabled ens2f1 (the 10G cluster interface) and ran the command on another node:
rados bench -p SATA 60 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_vm-host5_1920940
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 17 1 3.99979 4 0.20863 0.20863
2 16 17 1 1.99975 0 - 0.20863
3 16 17 1 1.33318 0 - 0.20863
4 16 17 1 0.999871 0 - 0.20863
5 16 17 1 0.799901 0 - 0.20863
6 16 17 1 0.666586 0 - 0.20863
7 16 17 1 0.571363 0 - 0.20863
We have enabled IGMP and IGMP snooping on our switch (Extreme Networks Summit X670-G2-72x) (could this have anything to do with our problem?) and then added this:
post-up ( echo 1 > /sys/devices/virtual/net/$IFACE/bridge/multicast_querier )
post-up ( echo 0 > /sys/class/net/$IFACE/bridge/multicast_snooping )
We also tried disabling it, without results.
Where can we look for the problem? Thanks in advance, you are awesome.
 
