health: HEALTH_ERR (problem with ceph)

kupitman

New Member
Jul 5, 2019
Hi all,
today I had some problems with our Ceph cluster.

root@monitor01: # ceph -s
cluster:
id: 2dc4a5fc-8acd-480e-a444-f091d02271b8
health: HEALTH_ERR
noup,nodown,noout,noscrub,nodeep-scrub flag(s) set
Reduced data availability: 195 pgs inactive
Degraded data redundancy: 870/5334314610 objects degraded (0.000%), 196 pgs unclean, 1 pg degraded, 1 pg undersized
410 slow requests are blocked > 32 sec
1510 stuck requests are blocked > 4096 sec
too many PGs per OSD (212 > max 200)
clock skew detected on mon.monitor02, mon.monitor05, mon.monitor06
mons monitor01,monitor03,monitor04,monitor05,monitor06,monitor07 are low on available space

services:
mon: 7 daemons, quorum monitor01,monitor02,monitor03,monitor04,monitor05,monitor06,monitor07
mgr: monitor05(active, starting), standbys: monitor02, monitor06, monitor04, monitor07
osd: 238 osds: 236 up, 236 in; 1 remapped pgs
flags noup,nodown,noout,noscrub,nodeep-scrub

data:
pools: 17 pools, 16936 pgs
objects: 1695M objects, 124 TB
usage: 400 TB used, 400 TB / 801 TB avail
pgs: 1.151% pgs unknown
870/5334314610 objects degraded (0.000%)
16740 active+clean
195 unknown
1 active+undersized+degraded+remapped+backfilling

io:
client: 1350 B/s rd, 182 kB/s wr, 0 op/s rd, 32 op/s wr
 
root@monitor01:/etc/ceph$ ceph health detail
HEALTH_ERR noup,nodown,noout,noscrub,nodeep-scrub flag(s) set; Reduced data availability: 195 pgs inactive; Degraded data redundancy: 870/5334314610 objects degraded (0.000%), 196 pgs unclean, 1 pg degraded, 1 pg undersized; 410 slow requests are blocked > 32 sec; 1510 stuck requests are blocked > 4096 sec; too many PGs per OSD (212 > max 200); clock skew detected on mon.monitor02, mon.monitor05, mon.monitor06; mons monitor01,monitor03,monitor04,monitor05,monitor06,monitor07 are low on available space
OSDMAP_FLAGS noup,nodown,noout,noscrub,nodeep-scrub flag(s) set
PG_AVAILABILITY Reduced data availability: 195 pgs inactive
pg 7.338 is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.33f is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.39c is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.3a5 is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.3f3 is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.46b is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.48a is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.49b is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.4dc is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.4de is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.5b7 is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.5c7 is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.5de is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.610 is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.61f is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.e3c is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.e4f is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.ef9 is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.f5c is stuck inactive for 634.390245, current state unknown, last acting []
pg 7.fc3 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.2e0 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.2e6 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.30d is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.488 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.4c0 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.4c5 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.4e2 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.4f8 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.511 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.52a is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.543 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.573 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.58a is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.654 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.e33 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.e3e is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.e8f is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.eab is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.ec9 is stuck inactive for 634.390245, current state unknown, last acting []
pg 19.fd9 is stuck inactive for 634.390245, current state unknown, last acting []
pg 28.2d3 is stuck inactive for 634.390245, current state unknown, last acting []
pg 28.33e is stuck inactive for 634.390245, current state unknown, last acting []
pg 28.33f is stuck inactive for 634.390245, current state unknown, last acting []
pg 28.3db is stuck inactive for 634.390245, current state unknown, last acting []
pg 28.426 is stuck inactive for 634.390245, current state unknown, last acting []
pg 28.43d is stuck inactive for 634.390245, current state unknown, last acting []
pg 28.47d is stuck inactive for 634.390245, current state unknown, last acting []
pg 28.4af is stuck inactive for 634.390245, current state unknown, last acting []
pg 28.5a9 is stuck inactive for 634.390245, current state unknown, last acting []
pg 28.60c is stuck inactive for 634.390245, current state unknown, last acting []
pg 28.61a is stuck inactive for 634.390245, current state unknown, last acting []
PG_DEGRADED Degraded data redundancy: 870/5334314610 objects degraded (0.000%), 196 pgs unclean, 1 pg degraded, 1 pg undersized
pg 7.338 is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.33f is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.39c is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.3a5 is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.3f3 is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.46b is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.48a is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.49b is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.4dc is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.4de is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.5b7 is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.5c7 is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.5de is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.610 is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.61f is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.e3c is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.e4f is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.ef9 is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.f5c is stuck unclean for 634.390245, current state unknown, last acting []
pg 7.fc3 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.2e0 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.2e6 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.30d is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.488 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.4c0 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.4c5 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.4e2 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.4f8 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.511 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.52a is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.543 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.573 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.58a is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.654 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.e33 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.e3e is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.e8f is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.eab is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.ec9 is stuck unclean for 634.390245, current state unknown, last acting []
pg 19.fd9 is stuck unclean for 634.390245, current state unknown, last acting []
pg 28.2d3 is stuck unclean for 634.390245, current state unknown, last acting []
pg 28.33e is stuck unclean for 634.390245, current state unknown, last acting []
pg 28.33f is stuck unclean for 634.390245, current state unknown, last acting []
pg 28.3db is stuck unclean for 634.390245, current state unknown, last acting []
pg 28.426 is stuck unclean for 634.390245, current state unknown, last acting []
pg 28.43d is stuck unclean for 634.390245, current state unknown, last acting []
pg 28.47d is stuck unclean for 634.390245, current state unknown, last acting []
pg 28.4af is stuck unclean for 634.390245, current state unknown, last acting []
pg 28.5a9 is stuck unclean for 634.390245, current state unknown, last acting []
pg 28.60c is stuck unclean for 634.390245, current state unknown, last acting []
pg 28.61a is stuck unclean for 634.390245, current state unknown, last acting []
REQUEST_SLOW 410 slow requests are blocked > 32 sec
5 ops are blocked > 2097.15 sec
7 ops are blocked > 1048.58 sec
2 ops are blocked > 524.288 sec
390 ops are blocked > 262.144 sec
3 ops are blocked > 131.072 sec
1 ops are blocked > 65.536 sec
2 ops are blocked > 32.768 sec
osd.84 has blocked requests > 262.144 sec
REQUEST_STUCK 1510 stuck requests are blocked > 4096 sec
290 ops are blocked > 67108.9 sec
4 ops are blocked > 33554.4 sec
1049 ops are blocked > 16777.2 sec
95 ops are blocked > 8388.61 sec
72 ops are blocked > 4194.3 sec
osds 3,6,18,27,28,30,32,40,42,44,46,61,66,69,76,79,81,87,88,93,100,105,107,111,116,124,127,131,133,137,148,149,150,152,161,165,166,167,168,169,171,182,184,187,193,194,195,210,211,218,224,226,232 have stuck requests > 16777.2 sec
osds 23,159,202 have stuck requests > 33554.4 sec
osds 47,52,64,74,90,96,101,106,188,189,199,234 have stuck requests > 67108.9 sec
TOO_MANY_PGS too many PGs per OSD (212 > max 200)
MON_CLOCK_SKEW clock skew detected on mon.monitor02, mon.monitor05, mon.monitor06
mon.monitor02 addr 10.17.12.132:6789/0 clock skew 0.0790575s > max 0.05s (latency 0.00166923s)
mon.monitor05 addr 10.17.12.143:6789/0 clock skew 0.0814445s > max 0.05s (latency 0.00163222s)
mon.monitor06 addr 10.17.12.144:6789/0 clock skew 0.0626219s > max 0.05s (latency 0.00185302s)
MON_DISK_LOW mons monitor01,monitor03,monitor04,monitor05,monitor06,monitor07 are low on available space
mon.monitor01 has 30% avail
mon.monitor03 has 21% avail
mon.monitor04 has 29% avail
mon.monitor05 has 29% avail
mon.monitor06 has 29% avail
mon.monitor07 has 29% avail
 
First of all, how many datacenters do you have running Ceph servers? 7 MONs is huge overkill. 3 MONs are the minimum; if you grow to many racks, then 5 MONs are okay too. But 7? Maybe if you have multiple locations that is needed.

Check the OSDs that the stuck PGs map to. If you find the primary, try restarting that OSD; the PGs may peer again and recover. For example:
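(Just a sketch: pg 7.338 is taken from the health output above, and osd.84 is only an example id; restart whichever OSD the map/query actually reports as primary, on the node that hosts it.)

ceph pg dump_stuck inactive
ceph pg map 7.338
ceph pg 7.338 query
systemctl restart ceph-osd@84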

But you have too many PGs on your disks; you should add more disks or reduce your pools.
 
Ouch, what the hell did you do to your poor Ceph cluster? Who set all of the noout, noup, nodown, etc. flags, and why? How long has it been like that? I agree with the other poster that 7 MONs really is overkill. You have 238 OSDs, which is large enough that there might be some incentive to run 5 MONs, but 7 is only necessary if you want your cluster to withstand 3 MONs failing at the same time. The same goes for MDS nodes; you should only need two or three of them.

If I were in your shoes, I would start by unsetting the nodown flag, because right now we don't even know which OSDs are actually running: your cluster is set to pretend that all of them are up regardless of any failures. For the time being, don't touch the scrub-related flags and don't unset noout. You probably want noout set right now, because the last thing your cluster needs is a pile of unnecessary recovery work triggered by assuming an OSD won't come back up, when the fix is probably as simple as restarting a single node. The main issue I see here is that someone left noup and nodown set, which is a terrible idea outside of planned maintenance that takes down the entire Ceph cluster. Run the following commands and report back with which OSDs go down afterwards.

ceph osd unset nodown
ceph osd unset noup
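
Then a quick check like this (nothing specific to your setup) will show what actually went down:

ceph osd stat
ceph osd tree | grep -w down
ceph health detail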
 
As for the rest of the issues, don't even start trying to fix them before the cluster is back up and all of the PGs are active. You really should fix them, but they aren't the cause of your current problem, and trying to fix them now, on top of whatever is keeping your PGs stuck, could just make things worse.

First there's the bunch of flags that were set. I'm assuming someone was trying to do some maintenance and was just copying and pasting commands without understanding what they were doing. Leave them be until you get the cluster back up, but then unset all of those flags. There's a reason having them set generates a warning: you essentially eliminate the redundancy that Ceph provides and disable any sort of recovery other than stuff like checksum errors.
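
Once everything is back to active+clean, clearing the rest uses the same syntax as the commands above (hold off until recovery has finished):

ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub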

Then there's your PG count. Don't do anything drastic here, and don't follow some out-of-date guide on how to fix it. You'll need to look at all of your Ceph pools and figure out roughly how much data you're going to store in each one, then use the calculator at https://ceph.com/pgcalc/ to work out how many PGs each pool should have. In older versions of Ceph, like the Luminous release you're probably running, you can increase a pool's PG count but not decrease it. Ceph Nautilus, which is about to ship with PVE 6, does let you decrease the number of PGs in a pool, so sit tight: wait until PVE 6 is stable, upgrade to it and to Nautilus, and then take the numbers from the PG calculator and resize your pools. Nautilus also has an autoscaler that will try to pick a good PG count for you; you might prefer that over picking values manually from the calculator, but it's new in Nautilus and needs to be enabled and configured, so YMMV. I haven't had the opportunity to try it myself.
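
As a rough sketch of what that looks like once you're on Nautilus (the pool name is a placeholder, and the target pg_num should come from the calculator, not from this example):

ceph df
ceph osd pool ls detail
ceph mgr module enable pg_autoscaler
ceph osd pool set <poolname> pg_autoscale_mode on
ceph osd pool autoscale-status
ceph osd pool set <poolname> pg_num 1024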

You've also got way too many MONs and MDSes. Like I mentioned before, I really doubt there's any point in running 7 MONs; 5 is plenty, and 3 would probably be totally fine. MDSes are not like MONs: you only need one of them up for CephFS to work, so unless literally all of your MDSes fail at the same time it will keep working. You should probably cut back to just 2 or 3 MDS nodes.
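
If you do trim the MON count, do it one at a time, keep an odd number, and make sure quorum holds. Roughly like this (monitor07 is just an example; stop the daemon on that node first):

systemctl stop ceph-mon@monitor07
ceph mon remove monitor07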

And last, there's the clock skew between your MONs. All of your MONs should be running NTP, and since Ceph is finicky about clock skew between MONs, you should have all of them sync to the same NTP server and have that server sync to a public NTP source. You can run an NTP server on one of your MONs just fine; don't feel like you need an extra VM or extra hardware for it, it's pretty trivial. NTP isn't a single point of failure either: if the NTP server goes down, Ceph will keep running just fine until you fix it, and even if you don't notice that it failed, you'll eventually get a warning from Ceph well before anything bad happens.
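
You can check where you stand with ceph time-sync-status, which reports each MON's skew. A minimal chrony setup on every MON pointing at one internal server would look roughly like this (the hostname is a placeholder):

ceph time-sync-status
# /etc/chrony/chrony.conf on each MON
server ntp01.example.internal iburst
# then
systemctl restart chrony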
 
8 servers in ceph
ceph osd unset nodown
ceph osd unset noup

Yes, it has helped me.
Thanks
 
8 servers in ceph

You need 3 MONs and 2 or maybe 3 MDSes. What you have right now is just wasting resources and probably putting you at more risk of downtime incidents like this one. One thing: you only have 8 servers yet 238 OSDs?? Do you really have around 30 disks per server? You should have at most 1 OSD per physical disk, and while you can leave your MONs on your root fs, you should not have a MON running on the same disk as an OSD. That might seem to work fine at first, but you'll run into big issues if the OSD sharing the disk gets swamped with IO. Let's say you make some large change to your CRUSH map and the cluster needs to move a lot of data around, and you decide to speed up the recovery rate because by default it can be pretty slow. Now not only is that OSD slower to respond to requests, the MON is also going to start hanging on IO, and that can take your MONs down because your OSDs are swamped with data.
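
For reference, the recovery rate tuning I mean is the kind of thing below; crank it up on shared MON/OSD disks and the MON store IO will suffer (the values are only illustrative, and on Luminous they're applied with injectargs):

ceph tell 'osd.*' injectargs '--osd-max-backfills 2 --osd-recovery-max-active 2'
# and to slow recovery back down when the cluster is struggling
ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-sleep 0.1'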

Above all else, there's a mountain of tutorials and documentation out there. Plenty of it is on ceph.com, and there are videos on YouTube if you're more into that sort of thing. Even the Proxmox documentation has a chapter on Ceph: https://pve.proxmox.com/pve-docs/chapter-pveceph.html Please learn the fundamentals and best practices of running Ceph. Storage is at the heart of your virtualization stack, and it's something you absolutely need a good handle on. If you think this incident was bad, it sounds like you didn't lose any data and just had some downtime. As annoying as this was to deal with, you got off easy. Mistakes at the storage layer can be absolutely catastrophic. Figure this out before you have an even worse time learning it the hard way.
 