Unstable Ceph behaviour causing VMs to hang

ermanishchawla

Well-Known Member
Mar 23, 2020
I have a 12-node Ceph cluster running on Proxmox 6, where each node has 4 x 1.92 TB SSDs configured as OSDs in the pool. The nodes are deployed in blade chassis with 4 nodes per chassis, so there are 3 chassis of 4 nodes each.

I have made the following observations:

- With all nodes up, rados bench gives around 1000 MB/s and there are no write issues.
- After powering off one of the chassis (leaving 8 nodes running), rados bench initially shows 1000 MB/s, but it then drops to 0 MB/s and no writes happen at all.

The behaviour is erratic: all running VMs become unable to write and go into a hung state.

What could be the reason?
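For reference, when writes stall like this, the PGs that are blocking I/O can usually be pinpointed with the standard health commands; a minimal sketch:

ceph health detail
ceph pg dump_stuck inactive

PGs reported as inactive/peered are the ones refusing client I/O until enough replicas come back or recovery restores them.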
 
The benchmark result after a 2-node failure:


rados bench -p vmstore 10 write

  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
1 16 247 231 923.907 924 0.021279 0.0457412
2 16 400 384 767.907 612 0.0238713 0.0384574
3 16 477 461 614.583 308 0.0219978 0.0358692
4 16 522 506 505.935 180 0.0260516 0.0346615
5 16 525 509 407.146 12 0.0252096 0.034593
6 16 525 509 339.286 0 - 0.034593
7 16 525 509 290.817 0 - 0.034593
8 16 525 509 254.465 0 - 0.034593
9 16 525 509 226.19 0 - 0.034593
10 16 525 509 203.57 0 - 0.034593
11 16 525 509 185.063 0 - 0.034593
12 16 525 509 169.641 0 - 0.034593
13 16 525 509 156.592 0 - 0.034593
14 16 525 509 145.406 0 - 0.034593
15 16 525 509 135.713 0 - 0.034593
16 16 525 509 127.23 0 - 0.034593
17 16 525 509 119.746 0 - 0.034593
18 16 525 509 113.094 0 - 0.034593
19 16 525 509 107.141 0 - 0.034593
 
ceph -s
cluster:
id: b020e833-3252-416a-b904-40bb4c97af5e
health: HEALTH_WARN
8 osds down
2 hosts (8 osds) down
Reduced data availability: 15 pgs inactive
Degraded data redundancy: 62825/412215 objects degraded (15.241%), 220 pgs degraded, 220 pgs undersized
4 daemons have recently crashed
too few PGs per OSD (27 < min 30)
2/12 mons down, quorum inc1pve25,inc1pve26,inc1pve27,inc1pve28,inc1pve29,inc1pve30,inc1pve31,inc1pve32,inc1pve33,inc1pve34

services:
mon: 12 daemons, quorum inc1pve25,inc1pve26,inc1pve27,inc1pve28,inc1pve29,inc1pve30,inc1pve31,inc1pve32,inc1pve33,inc1pve34 (age 7m), out of quorum: inc1pve35, inc1pve36
mgr: inc1pve27(active, since 8h), standbys: inc1pve31, inc1pve30, inc1pve28, inc1pve29, inc1pve25, inc1pve26, inc1pve32, inc1pve34, inc1pve33
osd: 48 osds: 40 up (since 7m), 48 in (since 10m)

data:
pools: 1 pools, 512 pgs
objects: 137.41k objects, 537 GiB
usage: 1.6 TiB used, 82 TiB / 84 TiB avail
pgs: 2.930% pgs not active
62825/412215 objects degraded (15.241%)
292 active+clean
205 active+undersized+degraded
15 undersized+degraded+peered
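The "too few PGs per OSD (27 < min 30)" warning above is the one that disappears later in the thread, where the pool shows 2048 PGs instead of 512. For reference, a PG-count increase like that is normally applied with something like the following (a sketch; vmstore is the pool name used in the benchmark, and on Nautilus pgp_num is brought in line with pg_num automatically):

ceph osd pool set vmstore pg_num 2048
ceph osd pool set vmstore pgp_num 2048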
 
root@inc1pve25:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-network-perl: 0.4-4
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
OK, I made the changes. The error is still the same: after shutting down 2 nodes, writes stop.

# ceph -s
cluster:
id: b020e833-3252-416a-b904-40bb4c97af5e
health: HEALTH_WARN
8 osds down
2 hosts (8 osds) down
Reduced data availability: 95 pgs inactive
Degraded data redundancy: 3007/17169 objects degraded (17.514%), 876 pgs degraded, 941 pgs undersized
19 daemons have recently crashed
2/12 mons down, quorum inc1pve25,inc1pve29,inc1pve32,inc1pve26,inc1pve33,inc1pve34,inc1pve28,inc1pve30,inc1pve31,inc1pve27

services:
mon: 12 daemons, quorum inc1pve25,inc1pve29,inc1pve32,inc1pve26,inc1pve33,inc1pve34,inc1pve28,inc1pve30,inc1pve31,inc1pve27 (age 2m), out of quorum: inc1pve35, inc1pve36
mgr: inc1pve34(active, since 35m), standbys: inc1pve32, inc1pve25, inc1pve29, inc1pve26, inc1pve27, inc1pve30, inc1pve31, inc1pve28, inc1pve33
osd: 48 osds: 40 up (since 2m), 48 in (since 39m)

data:
pools: 1 pools, 2048 pgs
objects: 5.72k objects, 22 GiB
usage: 115 GiB used, 84 TiB / 84 TiB avail
pgs: 4.639% pgs not active
3007/17169 objects degraded (17.514%)
1107 active+clean
787 active+undersized+degraded
89 undersized+degraded+peered
59 active+undersized
6 undersized+peered
 
Exactly 10 minutes later:

2020-06-04 17:15:48.182222 mon.inc1pve25 [WRN] Health check update: Degraded data redundancy: 3176/18288 objects degraded (17.367%), 885 pgs degraded, 941 pgs undersized (PG_DEGRADED)
2020-06-04 17:15:57.616732 mon.inc1pve25 [WRN] Health check update: Degraded data redundancy: 3243/18795 objects degraded (17.255%), 885 pgs degraded, 941 pgs undersized (PG_DEGRADED)

2020-06-04 17:18:21.993904 mon.inc1pve25 [WRN] Health check update: Degraded data redundancy: 3282/19020 objects degraded (17.256%), 885 pgs degraded, 941 pgs undersized (PG_DEGRADED)
2020-06-04 17:18:27.654785 mon.inc1pve25 [WRN] Health check update: Degraded data redundancy: 3432/20067 objects degraded (17.103%), 890 pgs degraded, 941 pgs undersized (PG_DEGRADED)
2020-06-04 17:18:57.662356 mon.inc1pve25 [INF] Marking osd.40 out (has been down for 604 seconds)
2020-06-04 17:18:57.662415 mon.inc1pve25 [INF] Marking osd.41 out (has been down for 604 seconds)
2020-06-04 17:18:57.662425 mon.inc1pve25 [INF] Marking osd.42 out (has been down for 604 seconds)
2020-06-04 17:18:57.662440 mon.inc1pve25 [INF] Marking osd.43 out (has been down for 604 seconds)
2020-06-04 17:18:57.662470 mon.inc1pve25 [INF] Marking osd.44 out (has been down for 604 seconds)
2020-06-04 17:18:57.662484 mon.inc1pve25 [INF] Marking osd.45 out (has been down for 604 seconds)
2020-06-04 17:18:57.662497 mon.inc1pve25 [INF] Marking osd.46 out (has been down for 604 seconds)
2020-06-04 17:18:57.662510 mon.inc1pve25 [INF] Marking osd.47 out (has been down for 604 seconds)
2020-06-04 17:18:57.662779 mon.inc1pve25 [INF] Health check cleared: OSD_DOWN (was: 8 osds down)
2020-06-04 17:18:57.662796 mon.inc1pve25 [INF] Health check cleared: OSD_HOST_DOWN (was: 2 hosts (8 osds) down)
2020-06-04 17:19:02.918432 mon.inc1pve25 [WRN] Health check update: Degraded data redundancy: 455266/20871 objects degraded (2181.333%), 935 pgs degraded (PG_DEGRADED)
2020-06-04 17:19:02.918465 mon.inc1pve25 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 95 pgs inactive)
2020-06-04 17:19:12.687450 mon.inc1pve25 [WRN] Health check update: Degraded data redundancy: 398031/24726 objects degraded (1609.767%), 861 pgs degraded (PG_DEGRADED)
2020-06-04 17:19:17.694680 mon.inc1pve25 [WRN] Health check update: Degraded data redundancy: 333925/32493 objects degraded (1027.683%), 743 pgs degraded (PG_DEGRADED)
2020-06-04 17:19:22.744439 mon.inc1pve25 [WRN] Health check update: Degraded data redundancy: 303033/36321 objects degraded (834.319%), 682 pgs degraded (PG_DEGRADED)


Now writes are happening again, once the failed nodes' OSDs are marked out and removed from the OSD map.
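That matches the "has been down for 604 seconds" messages above: by default the monitors mark a down OSD out after mon_osd_down_out_interval = 600 seconds, and only then does recovery re-replicate the data. A minimal sketch for inspecting or changing that timer on Nautilus (the value 300 is only an illustration, not a recommendation):

ceph config get mon mon_osd_down_out_interval
ceph config set global mon_osd_down_out_interval 300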
 
Exactly 15 minutes later:

# ceph -w
cluster:
id: b020e833-3252-416a-b904-40bb4c97af5e
health: HEALTH_WARN
19 daemons have recently crashed
2/12 mons down, quorum inc1pve25,inc1pve29,inc1pve32,inc1pve26,inc1pve33,inc1pve34,inc1pve28,inc1pve30,inc1pve31,inc1pve27

services:
mon: 12 daemons, quorum inc1pve25,inc1pve29,inc1pve32,inc1pve26,inc1pve33,inc1pve34,inc1pve28,inc1pve30,inc1pve31,inc1pve27 (age 12m), out of quorum: inc1pve35, inc1pve36
mgr: inc1pve34(active, since 45m), standbys: inc1pve32, inc1pve25, inc1pve29, inc1pve26, inc1pve27, inc1pve30, inc1pve31, inc1pve28, inc1pve33
osd: 48 osds: 40 up (since 12m), 40 in (since 2m)

data:
pools: 1 pools, 2048 pgs
objects: 38.09k objects, 149 GiB
usage: 485 GiB used, 69 TiB / 70 TiB avail
pgs: 2048 active+clean

io:
client: 1.3 GiB/s wr, 0 op/s rd, 332 op/s wr


Note: the 2 nodes are still down.
 
When a Ceph node is down, Ceph needs some time to recover the data from the other nodes and bring all PGs back to the "active+clean" state; that is regular Ceph behaviour.

Assuming you have 12 Ceph nodes (4 x 1.92 TB SSDs each), each node holds about 8.33% of your total data. When two nodes are offline, you need to give Ceph some time to rebuild its state by copying data from the other 10 nodes. So with 2 nodes offline, roughly 16.67% of your data is under-replicated, and Ceph will fix that by itself.

Are you testing how Ceph behaves under bigger failures? How are you "losing" these 2 nodes (power or network outage, hardware failure, etc.), or are you just disconnecting their power cables to see how Ceph handles it?
 
Yes, I am disconnecting them. My test cases are as follows:

1. Planned shutdown: I follow a procedure, marking the OSDs out first and then shutting the node down (no issues).
2. Unplanned shutdown: I cut the power to see how Ceph behaves. With 1 node down my writes continue; with 2 nodes failed my writes stop and only resume after 10 minutes, once Ceph marks the OSDs out of the CRUSH map.

I just want to understand whether there is a better way to withstand a 2-node failure without writes being stopped.
 
I don't understand. It did withstand the failure, didn't it?

I meant that even with 2 nodes failed, writes should not be halted.

The OSDs are marked out after 10 minutes (the default).

Yes, I know that, and I have tweaked the value to understand the behaviour. I just want to understand how we can make sure writes keep going even when 2 of my nodes fail, without the 10-minute interruption.
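For what it is worth, the settings that decide whether a PG keeps accepting writes during such a failure are the pool's replica count and minimum replica count: a PG blocks client I/O as soon as fewer than min_size copies are available. A minimal sketch for checking them (assuming the vmstore pool from the benchmark):

ceph osd pool get vmstore size
ceph osd pool get vmstore min_size

Whether two simultaneous node failures can push a PG below min_size then depends on how CRUSH distributes the replicas, which is why the crush map below matters.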
 
19 daemons have recently crashed
Which daemons crashed? That seems like an awful lot.

2020-06-04 17:19:02.918432 mon.inc1pve25 [WRN] Health check update: Degraded data redundancy: 455266/20871 objects degraded (2181.333%), 935 pgs degraded (PG_DEGRADED)
This looks odd: 2181% of objects degraded. What does your crush map look like?
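For reference, the map posted below is a decompiled crush map; it can be exported and decompiled with the standard tools:

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt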
 
Crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd
device 19 osd.19 class ssd
device 20 osd.20 class ssd
device 21 osd.21 class ssd
device 22 osd.22 class ssd
device 23 osd.23 class ssd
device 24 osd.24 class ssd
device 25 osd.25 class ssd
device 26 osd.26 class ssd
device 27 osd.27 class ssd
device 28 osd.28 class ssd
device 29 osd.29 class ssd
device 30 osd.30 class ssd
device 31 osd.31 class ssd
device 32 osd.32 class ssd
device 33 osd.33 class ssd
device 34 osd.34 class ssd
device 35 osd.35 class ssd
device 36 osd.36 class ssd
device 37 osd.37 class ssd
device 38 osd.38 class ssd
device 39 osd.39 class ssd
device 40 osd.40 class ssd
device 41 osd.41 class ssd
device 42 osd.42 class ssd
device 43 osd.43 class ssd
device 44 osd.44 class ssd
device 45 osd.45 class ssd
device 46 osd.46 class ssd
device 47 osd.47 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host inc1pve25 {
    id -3           # do not change unnecessarily
    id -4 class ssd # do not change unnecessarily
    # weight 6.986
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 1.747
    item osd.1 weight 1.747
    item osd.2 weight 1.747
    item osd.3 weight 1.747
}
host inc1pve26 {
    id -5           # do not change unnecessarily
    id -6 class ssd # do not change unnecessarily
    # weight 6.986
    alg straw2
    hash 0  # rjenkins1
    item osd.4 weight 1.747
    item osd.5 weight 1.747
    item osd.6 weight 1.747
    item osd.7 weight 1.747
}
host inc1pve27 {
    id -7           # do not change unnecessarily
    id -8 class ssd # do not change unnecessarily
    # weight 6.986
    alg straw2
    hash 0  # rjenkins1
    item osd.8 weight 1.747
    item osd.9 weight 1.747
    item osd.10 weight 1.747
    item osd.11 weight 1.747
}
host inc1pve28 {
    id -9            # do not change unnecessarily
    id -10 class ssd # do not change unnecessarily
    # weight 6.986
    alg straw2
    hash 0  # rjenkins1
    item osd.12 weight 1.747
    item osd.13 weight 1.747
    item osd.14 weight 1.747
    item osd.15 weight 1.747
}
host inc1pve29 {
    id -11           # do not change unnecessarily
    id -12 class ssd # do not change unnecessarily
    # weight 6.986
    alg straw2
    hash 0  # rjenkins1
    item osd.16 weight 1.747
    item osd.17 weight 1.747
    item osd.18 weight 1.747
    item osd.19 weight 1.747
}
host inc1pve30 {
    id -13           # do not change unnecessarily
    id -14 class ssd # do not change unnecessarily
    # weight 6.986
    alg straw2
    hash 0  # rjenkins1
    item osd.20 weight 1.747
    item osd.21 weight 1.747
    item osd.22 weight 1.747
    item osd.23 weight 1.747
}
host inc1pve31 {
    id -15           # do not change unnecessarily
    id -16 class ssd # do not change unnecessarily
    # weight 6.986
    alg straw2
    hash 0  # rjenkins1
    item osd.24 weight 1.747
    item osd.25 weight 1.747
    item osd.26 weight 1.747
    item osd.27 weight 1.747
}
host inc1pve32 {
    id -17           # do not change unnecessarily
    id -18 class ssd # do not change unnecessarily
    # weight 6.986
    alg straw2
    hash 0  # rjenkins1
    item osd.28 weight 1.747
    item osd.29 weight 1.747
    item osd.30 weight 1.747
    item osd.31 weight 1.747
}
host inc1pve33 {
    id -19           # do not change unnecessarily
    id -20 class ssd # do not change unnecessarily
    # weight 6.986
    alg straw2
    hash 0  # rjenkins1
    item osd.32 weight 1.747
    item osd.33 weight 1.747
    item osd.34 weight 1.747
    item osd.35 weight 1.747
}
host inc1pve34 {
    id -21           # do not change unnecessarily
    id -22 class ssd # do not change unnecessarily
    # weight 6.986
    alg straw2
    hash 0  # rjenkins1
    item osd.36 weight 1.747
    item osd.37 weight 1.747
    item osd.38 weight 1.747
    item osd.39 weight 1.747
}
host inc1pve35 {
    id -23           # do not change unnecessarily
    id -24 class ssd # do not change unnecessarily
    # weight 6.986
    alg straw2
    hash 0  # rjenkins1
    item osd.40 weight 1.747
    item osd.41 weight 1.747
    item osd.42 weight 1.747
    item osd.43 weight 1.747
}
host inc1pve36 {
    id -25           # do not change unnecessarily
    id -26 class ssd # do not change unnecessarily
    # weight 6.986
    alg straw2
    hash 0  # rjenkins1
    item osd.44 weight 1.747
    item osd.45 weight 1.747
    item osd.46 weight 1.747
    item osd.47 weight 1.747
}
root default {
    id -1           # do not change unnecessarily
    id -2 class ssd # do not change unnecessarily
    # weight 83.837
    alg straw2
    hash 0  # rjenkins1
    item inc1pve25 weight 6.986
    item inc1pve26 weight 6.986
    item inc1pve27 weight 6.986
    item inc1pve28 weight 6.986
    item inc1pve29 weight 6.986
    item inc1pve30 weight 6.986
    item inc1pve31 weight 6.986
    item inc1pve32 weight 6.986
    item inc1pve33 weight 6.986
    item inc1pve34 weight 6.986
    item inc1pve35 weight 6.986
    item inc1pve36 weight 6.986
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule erasure-code {
    id 1
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}

# choose_args
choose_args 18446744073709551615 {
    {
        bucket_id -1
        weight_set [ [ 7.321 7.398 7.296 7.188 7.481 6.802 6.680 7.360 6.990 6.979 7.537 6.726 ] ]
    }
    {
        bucket_id -2
        weight_set [ [ 7.321 7.398 7.296 7.188 7.481 6.802 6.680 7.360 6.990 6.979 7.537 6.726 ] ]
    }
    {
        bucket_id -3
        weight_set [ [ 1.664 1.880 1.944 1.833 ] ]
    }
    {
        bucket_id -4
        weight_set [ [ 1.664 1.880 1.944 1.833 ] ]
    }
    {
        bucket_id -5
        weight_set [ [ 1.794 1.779 1.877 1.946 ] ]
    }
    {
        bucket_id -6
        weight_set [ [ 1.794 1.779 1.877 1.946 ] ]
    }
    {
        bucket_id -7
        weight_set [ [ 1.726 1.869 1.796 1.906 ] ]
    }
    {
        bucket_id -8
        weight_set [ [ 1.726 1.869 1.796 1.906 ] ]
    }
    {
        bucket_id -9
        weight_set [ [ 1.678 1.638 1.769 2.103 ] ]
    }
    {
        bucket_id -10
        weight_set [ [ 1.678 1.638 1.769 2.103 ] ]
    }
    {
        bucket_id -11
        weight_set [ [ 1.882 1.832 1.786 1.980 ] ]
    }
    {
        bucket_id -12
        weight_set [ [ 1.882 1.832 1.786 1.980 ] ]
    }
    {
        bucket_id -13
        weight_set [ [ 1.624 1.922 1.727 1.528 ] ]
    }
    {
        bucket_id -14
        weight_set [ [ 1.624 1.922 1.727 1.528 ] ]
    }
    {
        bucket_id -15
        weight_set [ [ 1.791 1.724 1.727 1.438 ] ]
    }
    {
        bucket_id -16
        weight_set [ [ 1.791 1.724 1.727 1.438 ] ]
    }
    {
        bucket_id -17
        weight_set [ [ 1.755 1.725 2.134 1.745 ] ]
    }
    {
        bucket_id -18
        weight_set [ [ 1.755 1.725 2.134 1.745 ] ]
    }
    {
        bucket_id -19
        weight_set [ [ 1.903 1.829 1.665 1.593 ] ]
    }
    {
        bucket_id -20
        weight_set [ [ 1.903 1.829 1.665 1.593 ] ]
    }
    {
        bucket_id -21
        weight_set [ [ 1.779 1.686 1.796 1.718 ] ]
    }
    {
        bucket_id -22
        weight_set [ [ 1.779 1.686 1.796 1.718 ] ]
    }
    {
        bucket_id -23
        weight_set [ [ 1.915 1.864 1.591 2.167 ] ]
    }
    {
        bucket_id -24
        weight_set [ [ 1.915 1.864 1.591 2.167 ] ]
    }
    {
        bucket_id -25
        weight_set [ [ 1.728 1.794 1.760 1.444 ] ]
    }
    {
        bucket_id -26
        weight_set [ [ 1.728 1.794 1.760 1.444 ] ]
    }
}

# end crush map
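One thing worth noting against the physical layout described at the top of the thread: the chassis bucket type exists, but no chassis buckets are defined, and replicated_rule chooses replicas per host only. With 3 replicas spread over 12 hosts, nothing prevents all copies of a PG from landing on hosts inside the same chassis. A hedged sketch (not from this thread, with a hypothetical id and an illustrative host-to-chassis grouping) of what a chassis-level failure domain would look like in the decompiled map, assuming the hosts were moved under chassis buckets and the chassis placed under root default:

chassis chassis1 {
    id -30          # hypothetical id
    alg straw2
    hash 0  # rjenkins1
    item inc1pve25 weight 6.986
    item inc1pve26 weight 6.986
    item inc1pve27 weight 6.986
    item inc1pve28 weight 6.986
}

rule replicated_chassis {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type chassis
    step emit
}

In practice the buckets are usually created with ceph osd crush add-bucket and ceph osd crush move rather than by hand-editing the map. With 3 chassis and size 3, a rule like this keeps exactly one replica per chassis.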
 
