Hi,
Yesterday we added 2 new disks to our cluster, and it immediately started rebalancing. Based on pgcalc I decided to also increase pg_num and pgp_num from 512 to 800 (the maximum for my setup according to the warning).
Recovery is running, but it is very slow: it has been going since 14-02-2018 around 20:00 CET and still isn't finished.
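For reference, the PG increase itself was just the usual pool set commands (the pool name below is a placeholder for my actual pool):

Code:
# ceph osd pool set <pool> pg_num 800
# ceph osd pool set <pool> pgp_num 800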
Setup:
2 x 10Gbps in LACP for CEPH with Intel X520-DA2 adapter
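The bond is a plain 802.3ad setup in /etc/network/interfaces, roughly like this (interface names and addresses here are placeholders, not my exact config):

Code:
auto bond0
iface bond0 inet static
        address 10.10.10.2
        netmask 255.255.255.0
        bond-slaves enp3s0f0 enp3s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4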
Code:
# uname -a
Linux hv02 4.13.13-5-pve #1 SMP PVE 4.13.13-36 (Mon, 15 Jan 2018 12:36:49 +0100) x86_64 GNU/Linux
Code:
# dpkg -l |grep pve-
ii  libpve-access-control                5.0-7                          amd64        Proxmox VE access control library
ii  libpve-common-perl                   5.0-25                         all          Proxmox VE base library
ii  libpve-guest-common-perl             2.0-14                         all          Proxmox VE common guest-related modules
ii  libpve-http-server-perl              2.0-8                          all          Proxmox Asynchrounous HTTP Server Implementation
ii  libpve-storage-perl                  5.0-17                         all          Proxmox VE storage management library
ii  pve-cluster                          5.0-19                         amd64        Cluster Infrastructure for Proxmox Virtual Environment
ii  pve-container                        2.0-18                         all          Proxmox VE Container management tool
ii  pve-docs                             5.1-16                         all          Proxmox VE Documentation
ii  pve-firewall                         3.0-5                          amd64        Proxmox VE Firewall
ii  pve-firmware                         2.0-3                          all          Binary firmware code for the pve-kernel
ii  pve-ha-manager                       2.0-4                          amd64        Proxmox VE HA Manager
ii  pve-kernel-4.13.13-5-pve             4.13.13-38                     amd64        The Proxmox PVE Kernel Image
ii  pve-libspice-server1                 0.12.8-3                       amd64        SPICE remote display system server library
ii  pve-manager                          5.1-43                         amd64        Proxmox Virtual Environment Management Tools
ii  pve-qemu-kvm                         2.9.1-6                        amd64        Full virtualization on x86 hardware
ii  pve-xtermjs                          1.0-2                          amd64        HTML/JS Shell client
Code:
# ceph -s
  cluster:
    id:     550fea30-6116-4190-abcf-c29882bdb9af
    health: HEALTH_ERR
            93506/2046162 objects misplaced (4.570%)
            Reduced data availability: 142 pgs inactive
            Degraded data redundancy: 220 pgs unclean
            2081 slow requests are blocked > 32 sec
            45469 stuck requests are blocked > 4096 sec
  services:
    mon: 3 daemons, quorum hv01,1,2
    mgr: 1(active), standbys: hv01, 2
    osd: 12 osds: 12 up, 12 in; 211 remapped pgs
  data:
    pools:   1 pools, 800 pgs
    objects: 666k objects, 2535 GB
    usage:   6955 GB used, 11822 GB / 18778 GB avail
    pgs:     1.125% pgs unknown
             16.625% pgs not active
             93506/2046162 objects misplaced (4.570%)
             580 active+clean
             133 activating+remapped
             77  active+remapped+backfill_wait
             9   unknown
             1   active+remapped+backfilling
  io:
    recovery: 23526 kB/s, 6 objects/s
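I can pull more detail on the inactive/unclean PGs with the standard queries if that helps, e.g.:

Code:
# ceph health detail | head -30
# ceph pg dump_stuck inactive
# ceph pg dump_stuck unclean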
Code:
# ceph -w | head -40
  cluster:
    id:     550fea30-6116-4190-abcf-c29882bdb9af
    health: HEALTH_ERR
            92876/2046162 objects misplaced (4.539%)
            Reduced data availability: 142 pgs inactive
            Degraded data redundancy: 220 pgs unclean
            2080 slow requests are blocked > 32 sec
            45510 stuck requests are blocked > 4096 sec
  services:
    mon: 3 daemons, quorum hv01,1,2
    mgr: 1(active), standbys: hv01, 2
    osd: 12 osds: 12 up, 12 in; 211 remapped pgs
  data:
    pools:   1 pools, 800 pgs
    objects: 666k objects, 2535 GB
    usage:   6957 GB used, 11820 GB / 18778 GB avail
    pgs:     1.125% pgs unknown
             16.625% pgs not active
             92876/2046162 objects misplaced (4.539%)
             580 active+clean
             133 activating+remapped
             77  active+remapped+backfill_wait
             9   unknown
             1   active+remapped+backfilling
  io:
    recovery: 21822 kB/s, 6 objects/s
2018-02-15 09:20:20.903510 mon.hv01 [ERR] Health check update: 45504 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
2018-02-15 09:20:25.903899 mon.hv01 [WRN] Health check update: 92876/2046162 objects misplaced (4.539%) (OBJECT_MISPLACED)
2018-02-15 09:20:25.903940 mon.hv01 [WRN] Health check update: 2080 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-15 09:20:25.903956 mon.hv01 [ERR] Health check update: 45510 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
2018-02-15 09:20:17.042601 osd.3 [WRN] 18962 slow requests, 5 included below; oldest blocked for > 47406.514226 secs
2018-02-15 09:20:17.042607 osd.3 [WRN] slow request 7680.141415 seconds old, received at 2018-02-15 07:12:16.898480: osd_op(client.10165274.0:164441637 0.2d1 0.d83416d1 (undecoded) ondisk+write+known_if_redirected e5352) currently waiting for peered
2018-02-15 09:20:17.042609 osd.3 [WRN] slow request 15360.137745 seconds old, received at 2018-02-15 05:04:16.902151: osd_op(client.11354927.0:16076274 0.267 0.802d9a67 (undecoded) ondisk+write+known_if_redirected e5126) currently waiting for peered
2018-02-15 09:20:17.042611 osd.3 [WRN] slow request 483.691826 seconds old, received at 2018-02-15 09:12:13.348069: osd_op(client.10165274.0:164443076 0.2d1 0.d83416d1 (undecoded) ondisk+write+known_if_redirected e5566) currently waiting for peered
2018-02-15 09:20:17.042625 osd.3 [WRN] slow request 482.645457 seconds old, received at 2018-02-15 09:12:14.394438: osd_op(client.11354927.0:16079249 0.267 0.802d9a67 (undecoded) ondisk+write+known_if_redirected e5566) currently waiting for peered
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored

These Exception IOErrors continue to flow.
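The slow requests above all seem to be waiting on the same few PGs to peer (0.2d1 and 0.267 in the log), so querying those directly might show more, e.g.:

Code:
# ceph pg 0.2d1 query | head -40
# ceph pg map 0.2d1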
Code:
# ceph osd tree
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       18.33806 root default
-2        6.11269     host hv01
 0   ssd  0.87320         osd.0      up  1.00000 1.00000
 1   ssd  0.87320         osd.1      up  1.00000 1.00000
 8   ssd  3.49309         osd.8      up  1.00000 1.00000
 9   ssd  0.87320         osd.9      up  1.00000 1.00000
-3        6.11269     host hv02
 2   ssd  0.87320         osd.2      up  1.00000 1.00000
 3   ssd  0.87320         osd.3      up  1.00000 1.00000
 6   ssd  0.87320         osd.6      up  1.00000 1.00000
10   ssd  3.49309         osd.10     up  1.00000 1.00000
-4        6.11269     host hv03
 4   ssd  0.87320         osd.4      up  1.00000 1.00000
 5   ssd  0.87320         osd.5      up  1.00000 1.00000
 7   ssd  0.87320         osd.7      up  1.00000 1.00000
11   ssd  3.49309         osd.11     up  1.00000 1.00000

No errors in SMART, no jumbo frames. Anyone have any ideas?