[SOLVED] Added disks, updated pg_num, recovery going very slow

yavuz

Hi,

Yesterday we added 2 new disks to our cluster. Immediately afterwards it started rebalancing. Based on pgcalc, I decided to also update pg_num and pgp_num from 512 to 800 (which is the maximum for my setup according to the warning).
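For reference, the change was made with commands along these lines (the pool name 'rbd' here is just a placeholder, not necessarily our actual pool name):

Code:
# raise the placement group counts on the pool (pool name is a placeholder)
ceph osd pool set rbd pg_num 800
ceph osd pool set rbd pgp_num 800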

Recovery is running, but it's very slow. It has been recovering since 2018-02-14 around 20:00 CET and still isn't finished.

Setup:
2 x 10Gbps in LACP for CEPH with Intel X520-DA2 adapter

Code:
# uname -a
Linux hv02 4.13.13-5-pve #1 SMP PVE 4.13.13-36 (Mon, 15 Jan 2018 12:36:49 +0100) x86_64 GNU/Linux

Code:
# dpkg -l |grep pve-
ii  libpve-access-control                5.0-7                          amd64        Proxmox VE access control library
ii  libpve-common-perl                   5.0-25                         all          Proxmox VE base library
ii  libpve-guest-common-perl             2.0-14                         all          Proxmox VE common guest-related modules
ii  libpve-http-server-perl              2.0-8                          all          Proxmox Asynchrounous HTTP Server Implementation
ii  libpve-storage-perl                  5.0-17                         all          Proxmox VE storage management library
ii  pve-cluster                          5.0-19                         amd64        Cluster Infrastructure for Proxmox Virtual Environment
ii  pve-container                        2.0-18                         all          Proxmox VE Container management tool
ii  pve-docs                             5.1-16                         all          Proxmox VE Documentation
ii  pve-firewall                         3.0-5                          amd64        Proxmox VE Firewall
ii  pve-firmware                         2.0-3                          all          Binary firmware code for the pve-kernel
ii  pve-ha-manager                       2.0-4                          amd64        Proxmox VE HA Manager
ii  pve-kernel-4.13.13-5-pve             4.13.13-38                     amd64        The Proxmox PVE Kernel Image
ii  pve-libspice-server1                 0.12.8-3                       amd64        SPICE remote display system server library
ii  pve-manager                          5.1-43                         amd64        Proxmox Virtual Environment Management Tools
ii  pve-qemu-kvm                         2.9.1-6                        amd64        Full virtualization on x86 hardware
ii  pve-xtermjs                          1.0-2                          amd64        HTML/JS Shell client

Code:
# ceph -s
  cluster:
    id:     550fea30-6116-4190-abcf-c29882bdb9af
    health: HEALTH_ERR
            93506/2046162 objects misplaced (4.570%)
            Reduced data availability: 142 pgs inactive
            Degraded data redundancy: 220 pgs unclean
            2081 slow requests are blocked > 32 sec
            45469 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum hv01,1,2
    mgr: 1(active), standbys: hv01, 2
    osd: 12 osds: 12 up, 12 in; 211 remapped pgs

  data:
    pools:   1 pools, 800 pgs
    objects: 666k objects, 2535 GB
    usage:   6955 GB used, 11822 GB / 18778 GB avail
    pgs:     1.125% pgs unknown
             16.625% pgs not active
             93506/2046162 objects misplaced (4.570%)
             580 active+clean
             133 activating+remapped
             77  active+remapped+backfill_wait
             9   unknown
             1   active+remapped+backfilling

  io:
    recovery: 23526 kB/s, 6 objects/s

Code:
# ceph -w | head -40
  cluster:
    id:     550fea30-6116-4190-abcf-c29882bdb9af
    health: HEALTH_ERR
            92876/2046162 objects misplaced (4.539%)
            Reduced data availability: 142 pgs inactive
            Degraded data redundancy: 220 pgs unclean
            2080 slow requests are blocked > 32 sec
            45510 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum hv01,1,2
    mgr: 1(active), standbys: hv01, 2
    osd: 12 osds: 12 up, 12 in; 211 remapped pgs

  data:
    pools:   1 pools, 800 pgs
    objects: 666k objects, 2535 GB
    usage:   6957 GB used, 11820 GB / 18778 GB avail
    pgs:     1.125% pgs unknown
             16.625% pgs not active
             92876/2046162 objects misplaced (4.539%)
             580 active+clean
             133 activating+remapped
             77  active+remapped+backfill_wait
             9   unknown
             1   active+remapped+backfilling

  io:
    recovery: 21822 kB/s, 6 objects/s


2018-02-15 09:20:20.903510 mon.hv01 [ERR] Health check update: 45504 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
2018-02-15 09:20:25.903899 mon.hv01 [WRN] Health check update: 92876/2046162 objects misplaced (4.539%) (OBJECT_MISPLACED)
2018-02-15 09:20:25.903940 mon.hv01 [WRN] Health check update: 2080 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-15 09:20:25.903956 mon.hv01 [ERR] Health check update: 45510 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
2018-02-15 09:20:17.042601 osd.3 [WRN] 18962 slow requests, 5 included below; oldest blocked for > 47406.514226 secs
2018-02-15 09:20:17.042607 osd.3 [WRN] slow request 7680.141415 seconds old, received at 2018-02-15 07:12:16.898480: osd_op(client.10165274.0:164441637 0.2d1 0.d83416d1 (undecoded) ondisk+write+known_if_redirected e5352) currently waiting for peered
2018-02-15 09:20:17.042609 osd.3 [WRN] slow request 15360.137745 seconds old, received at 2018-02-15 05:04:16.902151: osd_op(client.11354927.0:16076274 0.267 0.802d9a67 (undecoded) ondisk+write+known_if_redirected e5126) currently waiting for peered
2018-02-15 09:20:17.042611 osd.3 [WRN] slow request 483.691826 seconds old, received at 2018-02-15 09:12:13.348069: osd_op(client.10165274.0:164443076 0.2d1 0.d83416d1 (undecoded) ondisk+write+known_if_redirected e5566) currently waiting for peered
2018-02-15 09:20:17.042625 osd.3 [WRN] slow request 482.645457 seconds old, received at 2018-02-15 09:12:14.394438: osd_op(client.11354927.0:16079249 0.267 0.802d9a67 (undecoded) ondisk+write+known_if_redirected e5566) currently waiting for peered
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored

The IOError exceptions just keep coming.

Code:
# ceph osd tree
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       18.33806 root default
-2        6.11269     host hv01
 0   ssd  0.87320         osd.0      up  1.00000 1.00000
 1   ssd  0.87320         osd.1      up  1.00000 1.00000
 8   ssd  3.49309         osd.8      up  1.00000 1.00000
 9   ssd  0.87320         osd.9      up  1.00000 1.00000
-3        6.11269     host hv02
 2   ssd  0.87320         osd.2      up  1.00000 1.00000
 3   ssd  0.87320         osd.3      up  1.00000 1.00000
 6   ssd  0.87320         osd.6      up  1.00000 1.00000
10   ssd  3.49309         osd.10     up  1.00000 1.00000
-4        6.11269     host hv03
 4   ssd  0.87320         osd.4      up  1.00000 1.00000
 5   ssd  0.87320         osd.5      up  1.00000 1.00000
 7   ssd  0.87320         osd.7      up  1.00000 1.00000
11   ssd  3.49309         osd.11     up  1.00000 1.00000

No SMART errors, and no jumbo frames. Does anyone have any ideas?
 
You probably bumped pg_num while the cluster was still rebalancing, which can trigger problematic behaviour.
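For what it's worth, a simple way to make sure rebalancing has finished before touching pg_num again is to wait until every PG is active+clean, for example:

Code:
# watch cluster health; only change pg_num once all PGs report active+clean
watch -n 10 'ceph -s | grep -E "health:|active\+clean"'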

Yes I did. Now I know I shouldn't have.

Thank you. Recovery is almost finished, so I will wait a couple more hours.
 
After setting mon_max_pg_per_osd = 1000 in the global section of /etc/ceph/ceph.conf and restarting all monitors and OSDs, the problem has been resolved.
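For anyone running into the same thing, the change looks roughly like this (the restart commands assume the standard systemd targets used by Proxmox/Ceph):

Code:
# /etc/ceph/ceph.conf
[global]
     mon_max_pg_per_osd = 1000

# then restart monitors and OSDs, one node at a time
systemctl restart ceph-mon.target
systemctl restart ceph-osd.target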
 
After setting mon_max_pg_per_osd = 1000 in the global section of /etc/ceph/ceph.conf and restarting all monitors and OSDs, the problem has been resolved.

You probably want to lower that again now, once your PGs are properly distributed.
 
Unfortunately I can't go back to the default of 200, I guess:

Code:
# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE   AVAIL  %USE  VAR  PGS
 0   ssd 0.87320  1.00000   894G  317G   576G 35.50 0.96 106
 1   ssd 0.87320  1.00000   894G  341G   552G 38.19 1.03 118
 8   ssd 3.49309  1.00000  3576G 1334G  2242G 37.31 1.01 459
 9   ssd 0.87320  1.00000   894G  328G   566G 36.68 0.99 117
 2   ssd 0.87320  1.00000   894G  342G   551G 38.30 1.03 120
 3   ssd 0.87320  1.00000   894G  342G   551G 38.29 1.03 120
 6   ssd 0.87320  1.00000   894G  372G   522G 41.61 1.12 126
10   ssd 3.49309  1.00000  3576G 1264G  2312G 35.34 0.95 434
 4   ssd 0.87320  1.00000   894G  292G   601G 32.68 0.88 100
 5   ssd 0.87320  1.00000   894G  395G   498G 44.27 1.19 132
 7   ssd 0.87320  1.00000   894G  335G   558G 37.57 1.01 117
11   ssd 3.49309  1.00000  3576G 1298G  2278G 36.31 0.98 451
                    TOTAL 18778G 6965G 11813G 37.09
MIN/MAX VAR: 0.88/1.19  STDDEV: 2.92

So I will lower it to 500 for the time being, and keep an eye on it as more OSDs are added.
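Roughly what I have in mind, applying it at runtime and then persisting it (the value of 500 is just my choice for now):

Code:
# lower the limit again at runtime...
ceph tell mon.* injectargs '--mon_max_pg_per_osd 500'
# ...and persist it in the [global] section of /etc/ceph/ceph.conf:
# mon_max_pg_per_osd = 500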
 
My 2 cents:
1. pg_num should be a power of 2 (in this case, 1024).
2. You did too many jobs at once. You should:
a) add the first OSD;
b) wait for HEALTH_OK;
c) add the second OSD;
d) wait for HEALTH_OK;
...
z) increase pg_num.
3. The 'too many PGs per OSD' warning points at a real problem; keep the PG count per OSD near 100 to avoid recovery problems (see the quick check below).
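In this cluster, for example, 800 PGs with what looks like 3 replicas spread over 12 OSDs averages 200 PGs per OSD, which is why the default limit of 200 can't be met here. A quick way to eyeball the per-OSD counts from 'ceph osd df' (assuming the Luminous layout where PGS is the last column):

Code:
# list each OSD with its PG count (PGS is the last column in 'ceph osd df')
ceph osd df | awk '$1 ~ /^[0-9]+$/ {print "osd."$1, "pgs="$NF}'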
 
