[SOLVED] Added disks, updated pg_num, recovery going very slow

yavuz

Hi,

Yesterday we added 2 new disks to our cluster. Immediately afterwards it started rebalancing. Based on pgcalc, I decided to also update pg_num and pgp_num from 512 to 800 (which is the maximum for my setup according to the warning).
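For reference, the change was made with commands along these lines (the pool name 'rbd' here is just a placeholder, not necessarily our actual pool name):

Code:
# raise the placement group counts on the pool (pool name is a placeholder)
ceph osd pool set rbd pg_num 800
ceph osd pool set rbd pgp_num 800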

Recovery is running, but it's very slow. It has been recovering since 2018-02-14 around 20:00 CET and still isn't finished.

Setup:
2 x 10Gbps in LACP for CEPH with Intel X520-DA2 adapter

Code:
# uname -a
Linux hv02 4.13.13-5-pve #1 SMP PVE 4.13.13-36 (Mon, 15 Jan 2018 12:36:49 +0100) x86_64 GNU/Linux

Code:
# dpkg -l |grep pve-
ii  libpve-access-control                5.0-7                          amd64        Proxmox VE access control library
ii  libpve-common-perl                   5.0-25                         all          Proxmox VE base library
ii  libpve-guest-common-perl             2.0-14                         all          Proxmox VE common guest-related modules
ii  libpve-http-server-perl              2.0-8                          all          Proxmox Asynchrounous HTTP Server Implementation
ii  libpve-storage-perl                  5.0-17                         all          Proxmox VE storage management library
ii  pve-cluster                          5.0-19                         amd64        Cluster Infrastructure for Proxmox Virtual Environment
ii  pve-container                        2.0-18                         all          Proxmox VE Container management tool
ii  pve-docs                             5.1-16                         all          Proxmox VE Documentation
ii  pve-firewall                         3.0-5                          amd64        Proxmox VE Firewall
ii  pve-firmware                         2.0-3                          all          Binary firmware code for the pve-kernel
ii  pve-ha-manager                       2.0-4                          amd64        Proxmox VE HA Manager
ii  pve-kernel-4.13.13-5-pve             4.13.13-38                     amd64        The Proxmox PVE Kernel Image
ii  pve-libspice-server1                 0.12.8-3                       amd64        SPICE remote display system server library
ii  pve-manager                          5.1-43                         amd64        Proxmox Virtual Environment Management Tools
ii  pve-qemu-kvm                         2.9.1-6                        amd64        Full virtualization on x86 hardware
ii  pve-xtermjs                          1.0-2                          amd64        HTML/JS Shell client

Code:
# ceph -s
  cluster:
    id:     550fea30-6116-4190-abcf-c29882bdb9af
    health: HEALTH_ERR
            93506/2046162 objects misplaced (4.570%)
            Reduced data availability: 142 pgs inactive
            Degraded data redundancy: 220 pgs unclean
            2081 slow requests are blocked > 32 sec
            45469 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum hv01,1,2
    mgr: 1(active), standbys: hv01, 2
    osd: 12 osds: 12 up, 12 in; 211 remapped pgs

  data:
    pools:   1 pools, 800 pgs
    objects: 666k objects, 2535 GB
    usage:   6955 GB used, 11822 GB / 18778 GB avail
    pgs:     1.125% pgs unknown
             16.625% pgs not active
             93506/2046162 objects misplaced (4.570%)
             580 active+clean
             133 activating+remapped
             77  active+remapped+backfill_wait
             9   unknown
             1   active+remapped+backfilling

  io:
    recovery: 23526 kB/s, 6 objects/s

Code:
# ceph -w | head -40
  cluster:
    id:     550fea30-6116-4190-abcf-c29882bdb9af
    health: HEALTH_ERR
            92876/2046162 objects misplaced (4.539%)
            Reduced data availability: 142 pgs inactive
            Degraded data redundancy: 220 pgs unclean
            2080 slow requests are blocked > 32 sec
            45510 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum hv01,1,2
    mgr: 1(active), standbys: hv01, 2
    osd: 12 osds: 12 up, 12 in; 211 remapped pgs

  data:
    pools:   1 pools, 800 pgs
    objects: 666k objects, 2535 GB
    usage:   6957 GB used, 11820 GB / 18778 GB avail
    pgs:     1.125% pgs unknown
             16.625% pgs not active
             92876/2046162 objects misplaced (4.539%)
             580 active+clean
             133 activating+remapped
             77  active+remapped+backfill_wait
             9   unknown
             1   active+remapped+backfilling

  io:
    recovery: 21822 kB/s, 6 objects/s


2018-02-15 09:20:20.903510 mon.hv01 [ERR] Health check update: 45504 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
2018-02-15 09:20:25.903899 mon.hv01 [WRN] Health check update: 92876/2046162 objects misplaced (4.539%) (OBJECT_MISPLACED)
2018-02-15 09:20:25.903940 mon.hv01 [WRN] Health check update: 2080 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-15 09:20:25.903956 mon.hv01 [ERR] Health check update: 45510 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
2018-02-15 09:20:17.042601 osd.3 [WRN] 18962 slow requests, 5 included below; oldest blocked for > 47406.514226 secs
2018-02-15 09:20:17.042607 osd.3 [WRN] slow request 7680.141415 seconds old, received at 2018-02-15 07:12:16.898480: osd_op(client.10165274.0:164441637 0.2d1 0.d83416d1 (undecoded) ondisk+write+known_if_redirected e5352) currently waiting for peered
2018-02-15 09:20:17.042609 osd.3 [WRN] slow request 15360.137745 seconds old, received at 2018-02-15 05:04:16.902151: osd_op(client.11354927.0:16076274 0.267 0.802d9a67 (undecoded) ondisk+write+known_if_redirected e5126) currently waiting for peered
2018-02-15 09:20:17.042611 osd.3 [WRN] slow request 483.691826 seconds old, received at 2018-02-15 09:12:13.348069: osd_op(client.10165274.0:164443076 0.2d1 0.d83416d1 (undecoded) ondisk+write+known_if_redirected e5566) currently waiting for peered
2018-02-15 09:20:17.042625 osd.3 [WRN] slow request 482.645457 seconds old, received at 2018-02-15 09:12:14.394438: osd_op(client.11354927.0:16079249 0.267 0.802d9a67 (undecoded) ondisk+write+known_if_redirected e5566) currently waiting for peered
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored

The IOError exceptions just keep coming.

Code:
# ceph osd tree
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       18.33806 root default
-2        6.11269     host hv01
 0   ssd  0.87320         osd.0      up  1.00000 1.00000
 1   ssd  0.87320         osd.1      up  1.00000 1.00000
 8   ssd  3.49309         osd.8      up  1.00000 1.00000
 9   ssd  0.87320         osd.9      up  1.00000 1.00000
-3        6.11269     host hv02
 2   ssd  0.87320         osd.2      up  1.00000 1.00000
 3   ssd  0.87320         osd.3      up  1.00000 1.00000
 6   ssd  0.87320         osd.6      up  1.00000 1.00000
10   ssd  3.49309         osd.10     up  1.00000 1.00000
-4        6.11269     host hv03
 4   ssd  0.87320         osd.4      up  1.00000 1.00000
 5   ssd  0.87320         osd.5      up  1.00000 1.00000
 7   ssd  0.87320         osd.7      up  1.00000 1.00000
11   ssd  3.49309         osd.11     up  1.00000 1.00000

No SMART errors, and no jumbo frames. Does anyone have any ideas?
 
You probably bumped pg_num while the cluster was still rebalancing, which can trigger problematic behaviour.
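For what it's worth, a simple way to make sure rebalancing has finished before touching pg_num again is to wait until every PG is active+clean, for example:

Code:
# watch cluster health; only change pg_num once all PGs report active+clean
watch -n 10 'ceph -s | grep -E "health:|active\+clean"'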

Yes I did. Now I know I shouldn't have.

Thank you. Recovery is almost finished, so I will wait a couple more hours.
 
After setting mon_max_pg_per_osd = 1000 in the global section of /etc/ceph/ceph.conf and restarting all monitors and OSDs, the problem has been resolved.
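For anyone running into the same thing, the change looks roughly like this (the restart commands assume the standard systemd targets used by Proxmox/Ceph):

Code:
# /etc/ceph/ceph.conf
[global]
     mon_max_pg_per_osd = 1000

# then restart monitors and OSDs, one node at a time
systemctl restart ceph-mon.target
systemctl restart ceph-osd.target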
 
After setting mon_max_pg_per_osd = 1000 in the global section of /etc/ceph/ceph.conf and restarting all monitors and OSDs, the problem has been resolved.

You probably want to lower that again now, once your PGs are properly distributed.
 
Unfortunately I can't go back to the default of 200, I guess:

Code:
# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE   AVAIL  %USE  VAR  PGS
 0   ssd 0.87320  1.00000   894G  317G   576G 35.50 0.96 106
 1   ssd 0.87320  1.00000   894G  341G   552G 38.19 1.03 118
 8   ssd 3.49309  1.00000  3576G 1334G  2242G 37.31 1.01 459
 9   ssd 0.87320  1.00000   894G  328G   566G 36.68 0.99 117
 2   ssd 0.87320  1.00000   894G  342G   551G 38.30 1.03 120
 3   ssd 0.87320  1.00000   894G  342G   551G 38.29 1.03 120
 6   ssd 0.87320  1.00000   894G  372G   522G 41.61 1.12 126
10   ssd 3.49309  1.00000  3576G 1264G  2312G 35.34 0.95 434
 4   ssd 0.87320  1.00000   894G  292G   601G 32.68 0.88 100
 5   ssd 0.87320  1.00000   894G  395G   498G 44.27 1.19 132
 7   ssd 0.87320  1.00000   894G  335G   558G 37.57 1.01 117
11   ssd 3.49309  1.00000  3576G 1298G  2278G 36.31 0.98 451
                    TOTAL 18778G 6965G 11813G 37.09
MIN/MAX VAR: 0.88/1.19  STDDEV: 2.92

So I will lower it to 500 for the time being, and keep an eye on it as more OSDs are added.
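Roughly what I have in mind, applying it at runtime and then persisting it (the value of 500 is just my choice for now):

Code:
# lower the limit again at runtime...
ceph tell mon.* injectargs '--mon_max_pg_per_osd 500'
# ...and persist it in the [global] section of /etc/ceph/ceph.conf:
# mon_max_pg_per_osd = 500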
 
My 2 cents:
1. pg_num should be a power of 2 (in this case, 1024).
2. You did too many jobs at once. You should:
a) add the first OSD;
b) wait for HEALTH_OK;
c) add the second OSD;
d) wait for HEALTH_OK;
...
z) increase pg_num.
3. The 'too many PGs per OSD' warning points at a real problem; keep the PG count per OSD near 100 to avoid recovery problems (see the quick check below).
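In this cluster, for example, 800 PGs with what looks like 3 replicas spread over 12 OSDs averages 200 PGs per OSD, which is why the default limit of 200 can't be met here. A quick way to eyeball the per-OSD counts from 'ceph osd df' (assuming the Luminous layout where PGS is the last column):

Code:
# list each OSD with its PG count (PGS is the last column in 'ceph osd df')
ceph osd df | awk '$1 ~ /^[0-9]+$/ {print "osd."$1, "pgs="$NF}'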
 
