PVE Ceph High IO and CPU Utilization

Null()

May 4, 2016
Hello Proxmox community. I'm posting here today as I'm out of ideas on an issue I'm having. I'm currently running PVE 4.1-1/2f9650d4 in a 3-node cluster with pveceph (Hammer). Every now and then a node's CPU will go to 99.9%, then another node will follow, and so on; during this time quorum degrades, OSDs are unnecessarily marked down, and the storage network bandwidth goes to 15-20GB/s. Rebooting each node fixes it: I reboot one, wait for quorum, then reboot the next, until all three are rebooted and back online (a sketch of this kind of staged reboot is included after the notes below). The healing/rebuild process then begins and all is fine until next time; it could be a day, a few days, or a week before the issue hits again. I'm not sure which logs you would need to help me identify the cause, but here are some notes on the setup:

  • Storage network is InfiniBand DDR using IPoIB
  • Management network is GigE on a separate VLAN with its own dedicated subnet
  • The subnet has its own NTP server, with nodes syncing every 3 hours
  • 3 nodes in the cluster, with 5 OSDs each (3 TB SATA drives)
  • Node servers: 16-bay chassis, dual quad-core Xeon, 16 GB RAM, one SSD for PVE and one SSD for the journal
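For reference, a minimal sketch of this kind of staged reboot, using only the standard Ceph and PVE command-line tools (setting noout is optional, but it keeps Ceph from marking the down OSDs out and kicking off a full rebuild while a node is intentionally offline):

Code:
# Staged node reboot (sketch; standard Ceph/PVE CLI)
ceph osd set noout      # pause automatic out-marking while nodes are intentionally down
pvecm status            # confirm PVE cluster quorum before touching the next node
ceph -s                 # wait until all OSDs are up and PGs are active+clean again
reboot
# ...repeat per node; once all three are back online:
ceph osd unset noout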
Thank you for all your help and suggestions.
 
What processes are hammering the CPU during this time? Are syslog / the Ceph logs filling up with anything?
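For the next occurrence, a few quick spot checks along these lines would show what is eating the CPU and whether the logs are growing (standard Debian/PVE paths assumed):

Code:
ps aux --sort=-%cpu | head -15        # top CPU consumers, sorted
tail -n 200 /var/log/syslog           # recent syslog entries
tail -n 200 /var/log/ceph/ceph.log    # recent Ceph cluster log entries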


Thank you for your reply. Here is the syslog from when it all begins:

Code:
May  3 22:49:22 nas00 pmxcfs[1571]: [dcdb] notice: data verification successful

May  3 22:49:22 nas00 rrdcached[1510]: flushing old values

May  3 22:49:22 nas00 rrdcached[1510]: rotating journals

May  3 22:49:22 nas00 rrdcached[1510]: started new journal /var/lib/rrdcached/journal/rrd.journal.1462333762.759154

May  3 22:49:22 nas00 rrdcached[1510]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1462326562.759133

May  3 22:57:44 nas00 systemd-timesyncd[1043]: interval/delta/delay/jitter/drift 2048s/-0.001s/0.000s/0.003s/+26ppm

May  3 23:17:01 nas00 CRON[30912]: (root) CMD (  cd / && run-parts --report /etc/cron.hourly)

May  3 23:20:49 nas00 bash[2757]: 2016-05-03 23:20:49.700044 7f17b73cc700 -1 osd.2 2616 heartbeat_check: no reply from osd.6 since back 2016-0$

May  3 23:20:49 nas00 bash[2995]: 2016-05-03 23:20:49.972661 7f3ff586f700 -1 osd.3 2616 heartbeat_check: no reply from osd.8 since back 2016-0$

May  3 23:20:49 nas00 bash[2078]: 2016-05-03 23:20:49.973741 7fc58dabf700 -1 osd.5 2616 heartbeat_check: no reply from osd.10 since back 2016-$

May  3 23:20:50 nas00 bash[2757]: 2016-05-03 23:20:50.700201 7f17b73cc700 -1 osd.2 2616 heartbeat_check: no reply from osd.6 since back 2016-0$

May  3 23:20:50 nas00 bash[2757]: 2016-05-03 23:20:50.763161 7f179bb4b700 -1 osd.2 2616 heartbeat_check: no reply from osd.6 since back 2016-0$

May  3 23:20:50 nas00 bash[2995]: 2016-05-03 23:20:50.972811 7f3ff586f700 -1 osd.3 2616 heartbeat_check: no reply from osd.8 since back 2016-0$


This continues up until


Code:
May  3 23:21:22 nas00 bash[2078]: 2016-05-03 23:21:22.982782 7fc58dabf700 -1 osd.5 2621 heartbeat_check: no reply from osd.7 since back 2016-0$

May  3 23:21:23 nas00 bash[2290]: 2016-05-03 23:21:23.144480 7fcb123a4700 -1 osd.0 2621 heartbeat_check: no reply from osd.4 since back 2016-0$

May  3 23:31:52 nas00 systemd-timesyncd[1043]: interval/delta/delay/jitter/drift 2048s/-0.002s/0.000s/0.003s/+25ppm

May  3 23:49:22 nas00 pmxcfs[1571]: [dcdb] notice: data verification successful

May  3 23:49:22 nas00 rrdcached[1510]: flushing old values

May  3 23:49:22 nas00 rrdcached[1510]: rotating journals

May  3 23:49:22 nas00 rrdcached[1510]: started new journal /var/lib/rrdcached/journal/rrd.journal.1462337362.759141

May  3 23:49:22 nas00 rrdcached[1510]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1462330162.759161

May  4 00:06:00 nas00 systemd-timesyncd[1043]: interval/delta/delay/jitter/drift 2048s/+0.001s/0.000s/0.003s/+26ppm

May  4 00:17:01 nas00 CRON[2965]: (root) CMD (  cd / && run-parts --report /etc/cron.hourly)

May  4 00:36:58 nas00 bash[2078]: 2016-05-04 00:36:58.450372 7fc571e9e700 -1 osd.5 2624 heartbeat_check: no reply from osd.7 since back 2016-0$

May  4 00:36:58 nas00 bash[2078]: 2016-05-04 00:36:58.579536 7fc58dabf700 -1 osd.5 2624 heartbeat_check: no reply from osd.7 since back 2016-0$

May  4 00:36:59 nas00 bash[2078]: 2016-05-04 00:36:59.579895 7fc58dabf700 -1 osd.5 2624 heartbeat_check: no reply from osd.7 since back 2016-0$

May  4 00:37:00 nas00 bash[2078]: 2016-05-04 00:37:00.579994 7fc58dabf700 -1 osd.5 2624 heartbeat_check: no reply from osd.7 since back 2016-0$

After this it goes on until 6 AM, when I started rebooting. I don't have any logs showing which process caused this; next time I will have to run top before I reboot and see, if I can even get to SSH.
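One way to make sure that data exists even if SSH becomes unreachable is to let atop record raw samples to a file in the background (the file name below is just an example; the atop package is assumed to be installed):

Code:
# Record one sample every 60 seconds to a raw file (file name is only an example)
atop -w /var/log/atop_capture.raw 60 &
# After the next incident, replay the window around it:
atop -r /var/log/atop_capture.raw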


Ceph log during issue


Code:
2016-05-04 00:36:26.102267 7fc4fd5da700  0 mon.0@0(leader).data_health(5760) update_stats avail 69% total 7179 MB, used 1830 MB, avail 4961 MB

2016-05-04 00:36:28.730577 7fc4fed77700  0 log_channel(cluster) log [INF] : pgmap v4335111: 1024 pgs: 1024 active+clean; 716 GB data, 2107 GB used, 39790 GB / 41897 GB avail; 60502 B/s wr, 9 op/s

2016-05-04 00:36:30.613788 7fc4fed77700  0 log_channel(cluster) log [INF] : pgmap v4335112: 1024 pgs: 1024 active+clean; 716 GB data, 2107 GB used, 39790 GB / 41897 GB avail; 181 kB/s wr, 21 op/s

2016-05-04 00:36:31.756561 7fc4fed77700  0 log_channel(cluster) log [INF] : pgmap v4335113: 1024 pgs: 1024 active+clean; 716 GB data, 2107 GB used, 39790 GB / 41897 GB avail; 181 kB/s wr, 23 op/s

2016-05-04 00:36:32.769424 7fc4fed77700  0 log_channel(cluster) log [INF] : pgmap v4335114: 1024 pgs: 1024 active+clean; 716 GB data, 2107 GB used, 39790 GB / 41897 GB avail; 177 kB/s wr, 29 op/s

2016-05-04 00:36:33.774125 7fc4fed77700  0 log_channel(cluster) log [INF] : pgmap v4335115: 1024 pgs: 1024 active+clean; 716 GB data, 2107 GB used, 39790 GB / 41897 GB avail; 178 kB/s wr, 30 op/s

2016-05-04 00:36:35.741452 7fc4fed77700  0 log_channel(cluster) log [INF] : pgmap v4335116: 1024 pgs: 1024 active+clean; 716 GB data, 2107 GB used, 39790 GB / 41897 GB avail; 32310 B/s wr, 8 op/s

2016-05-04 00:36:38.286279 7fc4fed77700  0 log_channel(cluster) log [INF] : pgmap v4335117: 1024 pgs: 1024 active+clean; 716 GB data, 2107 GB used, 39790 GB / 41897 GB avail; 44783 B/s wr, 12 op/s

2016-05-04 00:36:39.292440 7fc4fed77700  0 log_channel(cluster) log [INF] : pgmap v4335118: 1024 pgs: 1024 active+clean; 716 GB data, 2107 GB used, 39790 GB / 41897 GB avail; 80299 B/s wr, 18 op/s

2016-05-04 00:36:44.867137 7fc4fed77700  0 log_channel(cluster) log [INF] : pgmap v4335119: 1024 pgs: 1024 active+clean; 716 GB data, 2107 GB used, 39790 GB / 41897 GB avail; 742 kB/s rd, 179 kB/s wr, 26 op/s

2016-05-04 00:36:52.743909 7fc4fed77700  0 log_channel(cluster) log [INF] : pgmap v4335120: 1024 pgs: 1024 active+clean; 716 GB data, 2107 GB used, 39790 GB / 41897 GB avail; 503 kB/s rd, 115 kB/s wr, 13 op/s

2016-05-04 00:37:02.748911 7fc4fcdd9700  0 log_channel(cluster) log [INF] : mon.0 calling new monitor election

2016-05-04 00:37:02.749011 7fc4fcdd9700  1 mon.0@0(electing).elector(5761) init, last seen epoch 5761

2016-05-04 00:37:04.630362 7fc4fcdd9700  1 mon.0@0(electing).data_health(5760) service_dispatch not in quorum -- drop message

2016-05-04 00:37:07.752222 7fc4fd5da700  0 log_channel(cluster) log [INF] : mon.0@0 won leader election with quorum 0,2

2016-05-04 00:37:07.754002 7fc4fd5da700  0 log_channel(cluster) log [INF] : HEALTH_WARN; 1 mons down, quorum 0,2 0,2

2016-05-04 00:37:07.765461 7fc4fed77700  0 log_channel(cluster) log [INF] : monmap e3: 3 mons at {0=192.xxx.x.50:6789/0,1=192.xxx.x.51:6789/0,2=192.xxx.x.52:6789/0}

2016-05-04 00:37:07.765568 7fc4fed77700  0 log_channel(cluster) log [INF] : pgmap v4335121: 1024 pgs: 1024 active+clean; 716 GB data, 2107 GB used, 39790 GB / 41897 GB avail; 326 B/s rd, 76477 B/s wr, 5 op/s

2016-05-04 00:37:07.765625 7fc4fed77700  0 log_channel(cluster) log [INF] : mdsmap e1: 0/0/0 up

2016-05-04 00:37:07.765722 7fc4fed77700  0 log_channel(cluster) log [INF] : osdmap e2624: 15 osds: 15 up, 15 in

2016-05-04 00:37:07.765857 7fc4fed77700  1 mon.0@0(leader).osd e2624 prepare_failure osd.4 192.xxx.x.51:6816/2933 from osd.1 192.xxx.x.50:6808/2527 is reporting failure:1

2016-05-04 00:37:07.765873 7fc4fed77700  0 log_channel(cluster) log [DBG] : osd.4 192.xxx.x.51:6816/2933 reported failed by osd.1 192.xxx.x.50:6808/2527

2016-05-04 00:37:07.766210 7fc4fed77700  1 mon.0@0(leader).osd e2624 prepare_failure osd.4 192.xxx.x.51:6816/2933 from osd.1 192.xxx.x.50:6808/2527 is reporting failure:1

2016-05-04 00:37:07.766225 7fc4fed77700  0 log_channel(cluster) log [DBG] : osd.4 192.xxx.x.51:6816/2933 reported failed by osd.1 192.xxx.x.50:6808/2527

2016-05-04 00:37:07.766263 7fc4fed77700  1 mon.0@0(leader).osd e2624 prepare_failure osd.8 192.xxx.x.51:6812/2699 from osd.1 192.xxx.x.50:6808/2527 is reporting failure:1

2016-05-04 00:37:07.766278 7fc4fed77700  0 log_channel(cluster) log [DBG] : osd.8 192.xxx.x.51:6812/2699 reported failed by osd.1 192.xxx.x.50:6808/2527

2016-05-04 00:37:07.766344 7fc4fed77700  1 mon.0@0(leader).osd e2624 prepare_failure osd.7 192.xxx.x.51:6804/2247 from osd.0 192.xxx.x.50:6804/2293 is reporting failure:1

2016-05-04 00:37:07.766358 7fc4fed77700  0 log_channel(cluster) log [DBG] : osd.7 192.xxx.x.51:6804/2247 reported failed by osd.0 192.xxx.x.50:6804/2293

2016-05-04 00:37:07.766394 7fc4fed77700  1 mon.0@0(leader).osd e2624 prepare_failure osd.8 192.xxx.x.51:6812/2699 from osd.0 192.xxx.x.50:6804/2293 is reporting failure:1

2016-05-04 00:37:07.766408 7fc4fed77700  0 log_channel(cluster) log [DBG] : osd.8 192.xxx.x.51:6812/2699 reported failed by osd.0 192.xxx.x.50:6804/2293

2016-05-04 00:37:07.766469 7fc4fed77700  1 mon.0@0(leader).osd e2624 prepare_failure osd.8 192.xxx.x.51:6812/2699 from osd.1 192.xxx.x.50:6808/2527 is reporting failure:0

2016-05-04 00:37:07.766483 7fc4fed77700  0 log_channel(cluster) log [DBG] : osd.8 192.xxx.x.51:6812/2699 failure report canceled by osd.1 192.xxx.x.50:6808/2527

~~~~~TRUNCATED~~~~~




Thank You!!!
 
I think the issue has something to do with timing. After setting up the cluster I installed the NTP client and had it sync the time on the hour; however, I noticed that the NTP servers used were from the Debian pool and resolved to servers in Europe with bad jitter and delay. This is strange, as I had pointed this at the local NTP server in the past. Anyway, I changed it back to the local server and I just ordered a serial GPS unit with PPS. I really hope this fixes my issues; if anyone has any comments on this, feel free. I know Ceph allows for a 0.05s deviation in timing.
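A few ways to verify the clock side of this, assuming the standard Debian tools and the default Ceph log location:

Code:
ntpq -p                                       # peer offsets and jitter, if ntpd is in use
timedatectl status                            # systemd-timesyncd state, as seen in the syslog above
ceph health detail | grep -i clock            # Ceph warns with "clock skew detected" per monitor
grep -i "clock skew" /var/log/ceph/ceph.log   # any historical skew warnings
# The 0.05s figure corresponds to Ceph's mon_clock_drift_allowed default of 0.05.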
 
Hi guys, I would like to thank everyone for their help in resolving this issue. I just had the same issue happen again this morning. This time only one node was affected (nas2), and it self-healed before I could even get to it. The other nodes did not follow as before, since I have now made a number of the clock improvements that were suggested. Please see the Ceph log below.

Code:
2016-05-11 08:47:15.962323 mon.0 192.168.1.50:6789/0 564411 : cluster [INF] pgmap v4892543: 1024 pgs: 1023 active+clean, 1 active+clean+scrubbing+deep; 740 GB data, 2182 GB used, 39715 GB / 41897 GB avail; 456 kB/s wr, 71 op/s
2016-05-11 08:47:19.362619 mon.0 192.168.1.50:6789/0 564412 : cluster [INF] pgmap v4892544: 1024 pgs: 1023 active+clean, 1 active+clean+scrubbing+deep; 740 GB data, 2182 GB used, 39715 GB / 41897 GB avail; 1906 B/s rd, 857 kB/s wr, 137 op/s
2016-05-11 08:49:16.685174 mon.1 192.168.1.51:6789/0 10097 : cluster [INF] mon.1 calling new monitor election
2016-05-11 08:49:16.688931 mon.0 192.168.1.50:6789/0 565092 : cluster [INF] mon.0 calling new monitor election
2016-05-11 08:49:16.960194 mon.0 192.168.1.50:6789/0 565093 : cluster [INF] mon.0@0 won leader election with quorum 0,1,2
2016-05-11 08:49:16.962658 mon.0 192.168.1.50:6789/0 565094 : cluster [INF] HEALTH_WARN; 233 pgs degraded; 153 pgs peering; 130 pgs stale; 42 pgs stuck inactive; 216 pgs stuck unclean; 233 pgs undersized; 12 requests are blocked > 32 sec; recovery 44130/582777 objects degraded (7.572%); 3/15 in osds are down
2016-05-11 08:49:16.967103 mon.0 192.168.1.50:6789/0 565095 : cluster [INF] monmap e3: 3 mons at {0=192.168.1.50:6789/0,1=192.168.1.51:6789/0,2=xxx.xxx.xx.xx:6789/0}
2016-05-11 08:49:16.967208 mon.0 192.168.1.50:6789/0 565096 : cluster [INF] pgmap v4892597: 1024 pgs: 130 stale+active+clean, 153 peering, 231 activating, 276 active+clean, 1 active+clean+scrubbing+deep, 233 active+undersized+degraded; 740 GB data, 2182 GB used, 39715 GB / 41897 GB avail; 44130/582777 objects degraded (7.572%)
2016-05-11 08:49:16.967279 mon.0 192.168.1.50:6789/0 565097 : cluster [INF] mdsmap e1: 0/0/0 up
2016-05-11 08:49:16.967400 mon.0 192.168.1.50:6789/0 565098 : cluster [INF] osdmap e3359: 15 osds: 12 up, 15 in
2016-05-11 08:49:11.122130 mon.2 xxx.xxx.xx.xx:6789/0 6259 : cluster [INF] mon.2 calling new monitor election
2016-05-11 08:49:18.688747 mon.0 192.168.1.50:6789/0 565099 : cluster [INF] osdmap e3360: 15 osds: 12 up, 15 in
2016-05-11 08:49:18.976685 mon.0 192.168.1.50:6789/0 565100 : cluster [INF] pgmap v4892598: 1024 pgs: 130 stale+active+clean, 151 peering, 233 activating, 276 active+clean, 1 active+clean+scrubbing+deep, 233 active+undersized+degraded; 740 GB data, 2182 GB used, 39715 GB / 41897 GB avail; 44130/582777 objects degraded (7.572%)
2016-05-11 08:49:18.502736 osd.13 xxx.xxx.xx.xx:6816/3007 825 : cluster [WRN] map e3359 wrongly marked me down
2016-05-11 08:49:20.910874 mon.0 192.168.1.50:6789/0 565101 : cluster [INF] pgmap v4892599: 1024 pgs: 130 stale+active+clean, 123 peering, 257 activating, 276 active+clean, 1 active+clean+scrubbing+deep, 4 activating+degraded, 233 active+undersized+degraded; 740 GB data, 2182 GB used, 39715 GB / 41897 GB avail; 15323 B/s wr, 6 op/s; 44130/582777 objects degraded (7.572%)
2016-05-11 08:49:22.037268 mon.0 192.168.1.50:6789/0 565102 : cluster [INF] osd.13 xxx.xxx.xx.xx:6816/3007 boot
2016-05-11 08:49:22.037367 mon.0 192.168.1.50:6789/0 565103 : cluster [INF] osdmap e3361: 15 osds: 13 up, 15 in
2016-05-11 08:49:05.521982 osd.11 xxx.xxx.xx.xx:6804/2301 788 : cluster [WRN] map e3355 wrongly marked me down
2016-05-11 08:49:19.200711 osd.8 192.168.1.51:6804/2248 724 : cluster [WRN] 4 slow requests, 1 included below; oldest blocked for > 56.283578 secs
2016-05-11 08:49:19.200721 osd.8 192.168.1.51:6804/2248 725 : cluster [WRN] slow request 30.294555 seconds old, received at 2016-05-11 08:48:48.906115: osd_op(client.949105.1:80929963 rbd_data.15e6362ae8944a.0000000000001cb0 [set-alloc-hint object_size 4194304 write_size 4194304,write 2363392~4096] 3.bb0985cf ondisk+write e3350) currently reached_pg
2016-05-11 08:49:25.153766 mon.0 192.168.1.50:6789/0 565104 : cluster [INF] osd.11 xxx.xxx.xx.xx:6804/2301 boot
2016-05-11 08:49:25.154019 mon.0 192.168.1.50:6789/0 565105 : cluster [INF] osdmap e3362: 15 osds: 14 up, 15 in
2016-05-11 08:49:25.157353 mon.0 192.168.1.50:6789/0 565106 : cluster [INF] pgmap v4892601: 1024 pgs: 15 stale+active+clean, 63 peering, 273 activating, 90 active+clean, 7 activating+degraded, 576 active+undersized+degraded; 740 GB data, 2182 GB used, 39715 GB / 41897 GB avail; 30542 B/s wr, 10 op/s; 109226/582777 objects degraded (18.742%)
2016-05-11 08:49:25.225003 mon.0 192.168.1.50:6789/0 565107 : cluster [INF] pgmap v4892602: 1024 pgs: 15 stale+active+clean, 63 peering, 273 activating, 90 active+clean, 7 activating+degraded, 576 active+undersized+degraded; 740 GB data, 2182 GB used, 39715 GB / 41897 GB avail; 5047 B/s wr, 2 op/s; 109226/582777 objects degraded (18.742%)
2016-05-11 08:49:19.546220 osd.3 192.168.1.50:6816/2950 921 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 60.563593 secs
2016-05-11 08:49:19.546230 osd.3 192.168.1.50:6816/2950 922 : cluster [WRN] slow request 60.563593 seconds old, received at 2016-05-11 08:48:18.982582: osd_op(client.122779.1:45985884 rbd_data.e97492ae8944a.00000000000f7df0 [set-alloc-hint object_size 4194304 write_size 4194304,write 839680~4096] 3.13d3f242 ondisk+write e3348) currently reached_pg
2016-05-11 08:49:21.628654 osd.12 xxx.xxx.xx.xx:6808/2533 751 : cluster [WRN] map e3360 wrongly marked me down
2016-05-11 08:49:22.465839 osd.2 192.168.1.50:6808/2489 894 : cluster [INF] 3.1b1 scrub starts
2016-05-11 08:49:22.718815 osd.5 192.168.1.50:6800/2040 920 : cluster [INF] 3.287 scrub starts
 
Code:
2016-05-11 08:49:23.502445 osd.9 xxx.xxx.xx.xx:6812/2764 798 : cluster [WRN] 2 slow requests, 1 included below; oldest blocked for > 47.148499 secs
2016-05-11 08:49:23.502473 osd.9 xxx.xxx.xx.xx:6812/2764 799 : cluster [WRN] slow request 30.198103 seconds old, received at 2016-05-11 08:48:53.304259: osd_op(client.949105.1:80929965 rbd_data.130dd674b0dc51.000000000000007d [set-alloc-hint object_size 4194304 write_size 4194304,write 2101248~4096] 3.6590fe7f ondisk+write e3350) currently waiting for subops from 5,7
2016-05-11 08:49:24.799762 osd.4 192.168.1.51:6800/2025 1002 : cluster [WRN] 6 slow requests, 2 included below; oldest blocked for > 86.323139 secs
2016-05-11 08:49:24.800017 osd.4 192.168.1.51:6800/2025 1003 : cluster [WRN] slow request 30.787218 seconds old, received at 2016-05-11 08:48:54.012476: osd_op(client.122779.1:45985977 rbd_header.e06af74b0dc51 [watch ??? cookie 3] 3.73fc9bb9 reassert_version=126'36851 ondisk+write e3351) currently reached_pg
2016-05-11 08:49:24.800026 osd.4 192.168.1.51:6800/2025 1004 : cluster [WRN] slow request 30.784797 seconds old, received at 2016-05-11 08:48:54.014897: osd_op(client.898065.1:57175014 rbd_header.1aec72ae8944a [watch ??? cookie 6] 3.36e830fb reassert_version=124'45943 ondisk+write e3351) currently reached_pg
2016-05-11 08:49:24.845818 osd.14 xxx.xxx.xx.xx:6800/2081 824 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.832190 secs
2016-05-11 08:49:24.845828 osd.14 xxx.xxx.xx.xx:6800/2081 825 : cluster [WRN] slow request 30.832190 seconds old, received at 2016-05-11 08:48:54.013542: osd_op(client.122779.1:45985975 rbd_header.e97492ae8944a [watch ??? cookie 5] 3.cb668771 reassert_version=126'77118 ondisk+write e3351) currently reached_pg
2016-05-11 08:49:24.845837 osd.14 xxx.xxx.xx.xx:6800/2081 826 : cluster [WRN] slow request 30.831711 seconds old, received at 2016-05-11 08:48:54.014022: osd_op(client.122779.1:45985976 rbd_header.1c49f2ae8944a [watch ??? cookie 1] 3.14b768ab reassert_version=88'1753 ondisk+write e3351) currently reached_pg
2016-05-11 08:49:25.503758 osd.9 xxx.xxx.xx.xx:6812/2764 800 : cluster [WRN] 3 slow requests, 1 included below; oldest blocked for > 49.149755 secs
2016-05-11 08:49:25.503782 osd.9 xxx.xxx.xx.xx:6812/2764 801 : cluster [WRN] slow request 30.826865 seconds old, received at 2016-05-11 08:48:54.676753: osd_op(client.898065.1:57175017 rbd_data.d4d62ae8944a.000000000001821e [set-alloc-hint object_size 4194304 write_size 4194304,write 2031616~4096] 3.61a77b11 ondisk+write e3351) currently waiting for subops from 2,7
2016-05-11 08:49:25.846066 osd.14 xxx.xxx.xx.xx:6800/2081 827 : cluster [WRN] 4 slow requests, 2 included below; oldest blocked for > 31.832466 secs
2016-05-11 08:49:25.846077 osd.14 xxx.xxx.xx.xx:6800/2081 828 : cluster [WRN] slow request 30.821160 seconds old, received at 2016-05-11 08:48:55.024848: osd_op(client.122779.1:45985980 rbd_header.e97492ae8944a [watch ??? cookie 5] 3.cb668771 reassert_version=126'77118 ondisk+write e3352) currently reached_pg
2016-05-11 08:49:25.846082 osd.14 xxx.xxx.xx.xx:6800/2081 829 : cluster [WRN] slow request 30.820611 seconds old, received at 2016-05-11 08:48:55.025397: osd_op(client.122779.1:45985981 rbd_header.1c49f2ae8944a [watch ??? cookie 1] 3.14b768ab reassert_version=88'1753 ondisk+write e3352) currently reached_pg
2016-05-11 08:49:26.509080 osd.9 xxx.xxx.xx.xx:6812/2764 802 : cluster [WRN] 4 slow requests, 1 included below; oldest blocked for > 50.155074 secs
2016-05-11 08:49:26.509102 osd.9 xxx.xxx.xx.xx:6812/2764 803 : cluster [WRN] slow request 30.186267 seconds old, received at 2016-05-11 08:48:56.322671: osd_op(client.949105.1:80929969 rbd_data.2d5d822ae8944a.0000000000010e29 [set-alloc-hint object_size 4194304 write_size 4194304,write 2428928~4096] 3.a12a9711 ondisk+write e3352) currently waiting for subops from 2,7
2016-05-11 08:49:26.742338 mon.0 192.168.1.50:6789/0 565108 : cluster [INF] osd.12 xxx.xxx.xx.xx:6808/2533 boot
2016-05-11 08:49:26.743922 mon.0 192.168.1.50:6789/0 565109 : cluster [INF] osdmap e3363: 15 osds: 15 up, 15 in
2016-05-11 08:49:26.801073 osd.4 192.168.1.51:6800/2025 1005 : cluster [WRN] 2 slow requests, 1 included below; oldest blocked for > 32.786130 secs
2016-05-11 08:49:26.801089 osd.4 192.168.1.51:6800/2025 1006 : cluster [WRN] slow request 30.280147 seconds old, received at 2016-05-11 08:48:56.520880: osd_op(client.898065.1:57175018 rbd_header.1aec72ae8944a [watch ??? cookie 6] 3.36e830fb reassert_version=124'45943 ondisk+write e3352) currently reached_pg
2016-05-11 08:49:28.950766 osd.2 192.168.1.50:6808/2489 895 : cluster [INF] 3.1b1 scrub ok
 
Code:
2016-05-11 08:49:29.801572 osd.4 192.168.1.51:6800/2025 1007 : cluster [WRN] 3 slow requests, 1 included below; oldest blocked for > 35.786622 secs
2016-05-11 08:49:29.801583 osd.4 192.168.1.51:6800/2025 1008 : cluster [WRN] slow request 30.103860 seconds old, received at 2016-05-11 08:48:59.697659: osd_op(client.122779.1:45985987 rbd_data.e78572ae8944a.0000000000000674 [set-alloc-hint object_size 4194304 write_size 4194304,write 2809856~4096] 3.9f5909d8 ondisk+write e3353) currently reached_pg
2016-05-11 08:49:31.802241 osd.4 192.168.1.51:6800/2025 1009 : cluster [WRN] 5 slow requests, 2 included below; oldest blocked for > 37.787283 secs
2016-05-11 08:49:31.802255 osd.4 192.168.1.51:6800/2025 1010 : cluster [WRN] slow request 30.619423 seconds old, received at 2016-05-11 08:49:01.182758: osd_op(client.898065.1:57175033 rbd_header.1aec72ae8944a [watch ??? cookie 6] 3.36e830fb reassert_version=124'45943 ondisk+write e3355) currently reached_pg
2016-05-11 08:49:31.802261 osd.4 192.168.1.51:6800/2025 1011 : cluster [WRN] slow request 30.860453 seconds old, received at 2016-05-11 08:49:00.941728: osd_op(client.898065.1:57175028 rbd_header.1aec72ae8944a [watch ??? cookie 6] 3.36e830fb reassert_version=124'45943 ondisk+write e3354) currently reached_pg
2016-05-11 08:49:33.412083 mon.0 192.168.1.50:6789/0 565110 : cluster [INF] osdmap e3364: 15 osds: 15 up, 15 in
2016-05-11 08:49:33.463430 mon.0 192.168.1.50:6789/0 565111 : cluster [INF] pgmap v4892604: 1024 pgs: 15 stale+active+clean, 62 peering, 265 activating, 112 active+clean, 7 activating+degraded, 563 active+undersized+degraded; 740 GB data, 2182 GB used, 39715 GB / 41897 GB avail; 0 B/s wr, 1 op/s; 106719/582777 objects degraded (18.312%)
2016-05-11 08:49:33.591222 mon.0 192.168.1.50:6789/0 565112 : cluster [INF] pgmap v4892605: 1024 pgs: 225 peering, 352 activating, 433 active+clean, 1 active+clean+scrubbing, 2 active+degraded, 11 activating+degraded; 740 GB data, 2182 GB used, 39715 GB / 41897 GB avail; 80150 B/s wr, 15 op/s; 2/582777 objects degraded (0.000%); 2853 kB/s, 0 objects/s recovering
2016-05-11 08:49:30.767902 osd.2 192.168.1.50:6808/2489 896 : cluster [INF] 3.165 scrub starts
2016-05-11 08:49:31.805920 osd.6 192.168.1.51:6812/2716 773 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.622076 secs
2016-05-11 08:49:31.805935 osd.6 192.168.1.51:6812/2716 774 : cluster [WRN] slow request 30.622076 seconds old, received at 2016-05-11 08:49:01.183802: osd_op(client.898065.1:57175035 rbd_data.18d44d2ae8944a.0000000000001611 [set-alloc-hint object_size 4194304 write_size 4194304,write 1515008~32768] 3.d043c9e8 ondisk+write e3355) currently reached_pg
2016-05-11 08:49:31.994223 osd.7 192.168.1.51:6808/2484 781 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.635419 secs
2016-05-11 08:49:31.994231 osd.7 192.168.1.51:6808/2484 782 : cluster [WRN] slow request 30.635419 seconds old, received at 2016-05-11 08:49:01.358758: osd_op(client.122779.1:45985995 rbd_data.e97492ae8944a.000000000007b985 [set-alloc-hint object_size 4194304 write_size 4194304,write 2539520~57344] 3.6376da14 ondisk+write e3355) currently reached_pg
2016-05-11 08:49:34.529271 osd.0 192.168.1.50:6812/2707 812 : cluster [INF] 3.178 scrub starts
2016-05-11 08:49:34.574830 osd.2 192.168.1.50:6808/2489 897 : cluster [INF] 3.165 scrub ok
2016-05-11 08:49:39.656002 osd.0 192.168.1.50:6812/2707 813 : cluster [INF] 3.178 scrub ok
2016-05-11 08:49:40.334785 mon.0 192.168.1.50:6789/0 565113 : cluster [INF] pgmap v4892606: 1024 pgs: 225 peering, 330 activating, 455 active+clean, 1 active+clean+scrubbing, 2 active+degraded, 11 activating+degraded; 740 GB data, 2182 GB used, 39715 GB / 41897 GB avail; 233 kB/s wr, 46 op/s; 2/582777 objects degraded (0.000%); 8438 kB/s, 2 objects/s recovering
2016-05-11 08:50:04.807956 osd.4 192.168.1.51:6800/2025 1020 : cluster [WRN] 6 slow requests, 1 included below; oldest blocked for > 65.110253 secs
2016-05-11 08:50:04.808228 osd.4 192.168.1.51:6800/2025 1021 : cluster [WRN] slow request 30.367437 seconds old, received at 2016-05-11 08:49:34.440475: osd_op(client.898065.1:57175139 rbd_data.106fd2ae8944a.00000000000048da [set-alloc-hint object_size 4194304 write_size 4194304,write 2277376~4096] 3.dcb20d16 ondisk+write e3364) currently waiting for degraded object
2016-05-11 09:07:16.635974 mon.0 192.168.1.50:6789/0 567843 : cluster [INF] pgmap v4892846: 1024 pgs: 1023 active+clean, 1 active+clean+scrubbing+deep; 740 GB data, 2182 GB used, 39714 GB / 41897 GB avail; 62436 B/s rd, 587 kB/s wr, 98 op/s
2016-05-11 09:07:11.568948 osd.2 192.168.1.50:6808/2489 1277 : cluster [INF] 3.6b scrub ok
2016-05-11 09:07:17.979713 mon.0 192.168.1.50:6789/0 567844 : cluster [INF] mon.0 calling new monitor election
2016-05-11 09:07:18.016441 mon.0 192.168.1.50:6789/0 567845 : cluster [INF] mon.0@0 won leader election with quorum 0,1,2
2016-05-11 09:07:18.018000 mon.0 192.168.1.50:6789/0 567846 : cluster [INF] HEALTH_OK
2016-05-11 09:07:21.598397 mon.2 xxx.xxx.xx.xx:6789/0 6260 : cluster [INF] mon.2 calling new monitor election
2016-05-11 09:07:21.603212 mon.0 192.168.1.50:6789/0 567847 : cluster [INF] mon.0 calling new monitor election
2016-05-11 09:07:21.603254 mon.1 192.168.1.51:6789/0 10102 : cluster [INF] mon.1 calling new monitor election
2016-05-11 09:07:21.606796 mon.0 192.168.1.50:6789/0 567848 : cluster [INF] mon.0@0 won leader election with quorum 0,1,2
2016-05-11 09:07:21.608486 mon.0 192.168.1.50:6789/0 567849 : cluster [INF] HEALTH_OK
2016-05-11 09:07:21.612223 mon.0 192.168.1.50:6789/0 567850 : cluster [INF] monmap e3: 3 mons at {0=192.168.1.50:6789/0,1=192.168.1.51:6789/0,2=xxx.xxx.xx.xx:6789/0}
2016-05-11 09:07:21.612294 mon.0 192.168.1.50:6789/0 567851 : cluster [INF] pgmap v4892846: 1024 pgs: 1023 active+clean, 1 active+clean+scrubbing+deep; 740 GB data, 2182 GB used, 39714 GB / 41897 GB avail; 62436 B/s rd, 587 kB/s wr, 98 op/s
2016-05-11 09:07:21.612649 mon.0 192.168.1.50:6789/0 567852 : cluster [INF] mdsmap e1: 0/0/0 up
2016-05-11 09:07:21.612809 mon.0 192.168.1.50:6789/0 567853 : cluster [INF] osdmap e3415: 15 osds: 15 up, 15 in
 
Hi,
I don't think it's a timing issue! In that case you would first see a mon clock skew message in the Ceph log.

To me it looks like a network or CPU performance issue.
Do you see deep-scrub messages before the issue starts in any of these cases?

What does your ceph.conf look like?

Udo
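One quick way to check for those deep-scrub messages around the incident time (default log location assumed):

Code:
grep -E "deep-scrub (starts|ok)" /var/log/ceph/ceph.log   # deep-scrub activity from the cluster log
ceph pg dump | head -20                                   # per-PG scrub timestamps are in the PG dump as well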
 

Udo, thank you for your reply. I thought it could be a CPU issue related to the energy-saving core power-down/idle features that are enabled by default in the BIOS these days, but I have no such setting in the BIOS on these machines. They are dual quad-core Xeon servers, 8 cores total, and all three machines are exactly the same. Due to the post limit on this forum I have attached the log; it does appear it goes into a deep scrub. Regarding a network issue, I ran iperf for a while before I put this into production and did not see any problems; I get 6.8Gbps with iperf.
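For what it's worth, these are the kinds of checks that cover the power-management and network points (the target IP below is only illustrative, and the cpufreq path exists only if a frequency driver is loaded):

Code:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # "powersave" vs "performance", if cpufreq is active
grep -i "cpu mhz" /proc/cpuinfo | sort | uniq -c            # spot cores stuck at a low clock
iperf -c 192.168.1.51 -t 60 -P 4                            # sustained multi-stream test of the link (IP illustrative)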
 

Attachments

  • Ceph_Log.txt (70 KB)
Hi,
sorry for the delay.

What about the ceph.conf I asked for?

About your logfile (first part):
Code:
2016-05-12 10:33:29.790572 osd.1 192.xxx.xxx.50:6804/2264 14 : cluster [WRN] 5 slow requests, 1 included below; oldest blocked for > 64.825911 secs
2016-05-12 10:33:29.790578 osd.1 192.xxx.xxx.50:6804/2264 15 : cluster [WRN] slow request 60.679060 seconds old, received at 2016-05-12 10:32:29.111483: osd_op(client.122779.1:47110705 rbd_data.e97492ae8944a.0000000000074b5d [set-alloc-hint object_size 4194304 write_size 4194304,write 3211264~4096] 3.d70d55b6 ondisk+write e3955) currently waiting for subops from 10,11
2016-05-12 10:33:34.394049 mon.0 192.xxx.xxx.50:6789/0 924 : cluster [INF] osd.10 192.xxx.xxx.51:6816/2950 failed (106 reports from 10 peers after 68.917934 >= grace 68.020971)
2016-05-12 10:33:25.308310 osd.5 192.xxx.xxx.50:6816/2938 19 : cluster [WRN] 18 slow requests, 6 included below; oldest blocked for > 58.095831 secs
...
2016-05-12 10:33:35.091229 mon.0 192.xxx.xxx.50:6789/0 930 : cluster [INF] osdmap e3956: 15 osds: 14 up, 15 in
Your IO system (or a disk, I guess) has a problem (too much IO; check with atop), so you get slow requests.
osd.10 timed out and was marked down.

And the same then happens with the next OSDs...

Perhaps you have too many housekeeping processes (scrubbing) running at once? That is the reason why I asked for the ceph.conf.

Udo
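To illustrate the housekeeping point: scrub impact can be throttled with a handful of OSD options. The values below are only examples, not recommendations, and the same keys can be made permanent under [osd] in ceph.conf:

Code:
ceph tell osd.* injectargs '--osd_max_scrubs 1'              # at most one scrub per OSD at a time (the default)
ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'           # pause between scrub chunks to ease IO pressure
ceph tell osd.* injectargs '--osd_scrub_load_threshold 0.5'  # skip starting new scrubs when system load is high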
 

Udo, thank you for your help. I could have sworn I posted the config, but I was wrong. I apologize, and thank you again for your help.

Code:
[global]
        auth client required = cephx
        auth cluster required = cephx
        auth service required = cephx
        cluster network = xxx.xxx.1.0/24
        filestore xattr use omap = true
        fsid = cad72e57-aeb1-4912-bbdd-4b71fff512d9
        keyring = /etc/pve/priv/$cluster.$name.keyring
        osd journal size = 5120
        osd pool default min size = 1
        public network = xxx.xxx.1.0/24

[osd]
        keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.1]
        host = nas01
        mon addr = xxx.xxx.1.51:6789

[mon.2]
        host = nas02
        mon addr = xxx.xxx.1.52:6789

[mon.0]
        host = nas00
        mon addr = xxx.xxx.1.50:6789
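For completeness, the values each OSD daemon is actually running with can be confirmed through its admin socket on the node hosting it (osd.0 below is just an example id):

Code:
ceph daemon osd.0 config show | grep -E "scrub|heartbeat"   # effective scrub and heartbeat settings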
 
