Hi,
Yesterday we added 2 new disks to our cluster, and it immediately started rebalancing. Based on pgcalc I also decided to increase pg_num and pgp_num from 512 to 800 (the maximum for my setup according to the warning).
Recovery is running, but it is very slow: it has been going since 14-02-2018 around 20:00 CET and still isn't finished.
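For reference, this is roughly what I ran to bump the placement groups (<pool> is just a placeholder for our single pool):
Code:
# ceph osd pool set <pool> pg_num 800
# ceph osd pool set <pool> pgp_num 800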
Setup:
2 x 10 Gbps in LACP for Ceph, using an Intel X520-DA2 adapter
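If the network is a suspect, I can also check the LACP status of the bond (assuming the bond interface is named bond0 on our hosts):
Code:
# cat /proc/net/bonding/bond0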
Code:
# uname -a
Linux hv02 4.13.13-5-pve #1 SMP PVE 4.13.13-36 (Mon, 15 Jan 2018 12:36:49 +0100) x86_64 GNU/Linux
Code:
# dpkg -l |grep pve-
ii libpve-access-control 5.0-7 amd64 Proxmox VE access control library
ii libpve-common-perl 5.0-25 all Proxmox VE base library
ii libpve-guest-common-perl 2.0-14 all Proxmox VE common guest-related modules
ii libpve-http-server-perl 2.0-8 all Proxmox Asynchrounous HTTP Server Implementation
ii libpve-storage-perl 5.0-17 all Proxmox VE storage management library
ii pve-cluster 5.0-19 amd64 Cluster Infrastructure for Proxmox Virtual Environment
ii pve-container 2.0-18 all Proxmox VE Container management tool
ii pve-docs 5.1-16 all Proxmox VE Documentation
ii pve-firewall 3.0-5 amd64 Proxmox VE Firewall
ii pve-firmware 2.0-3 all Binary firmware code for the pve-kernel
ii pve-ha-manager 2.0-4 amd64 Proxmox VE HA Manager
ii pve-kernel-4.13.13-5-pve 4.13.13-38 amd64 The Proxmox PVE Kernel Image
ii pve-libspice-server1 0.12.8-3 amd64 SPICE remote display system server library
ii pve-manager 5.1-43 amd64 Proxmox Virtual Environment Management Tools
ii pve-qemu-kvm 2.9.1-6 amd64 Full virtualization on x86 hardware
ii pve-xtermjs 1.0-2 amd64 HTML/JS Shell client
Code:
# ceph -s
  cluster:
    id:     550fea30-6116-4190-abcf-c29882bdb9af
    health: HEALTH_ERR
            93506/2046162 objects misplaced (4.570%)
            Reduced data availability: 142 pgs inactive
            Degraded data redundancy: 220 pgs unclean
            2081 slow requests are blocked > 32 sec
            45469 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum hv01,1,2
    mgr: 1(active), standbys: hv01, 2
    osd: 12 osds: 12 up, 12 in; 211 remapped pgs

  data:
    pools:   1 pools, 800 pgs
    objects: 666k objects, 2535 GB
    usage:   6955 GB used, 11822 GB / 18778 GB avail
    pgs:     1.125% pgs unknown
             16.625% pgs not active
             93506/2046162 objects misplaced (4.570%)
             580 active+clean
             133 activating+remapped
             77  active+remapped+backfill_wait
             9   unknown
             1   active+remapped+backfilling

  io:
    recovery: 23526 kB/s, 6 objects/s
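The 133 PGs stuck in activating+remapped (plus the 9 unknown) seem to be what is blocking client I/O. If it helps, I can post exactly which PGs are stuck inactive; I believe these are the right commands to list them:
Code:
# ceph health detail
# ceph pg dump_stuck inactive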
Code:
# ceph -w | head -40
  cluster:
    id:     550fea30-6116-4190-abcf-c29882bdb9af
    health: HEALTH_ERR
            92876/2046162 objects misplaced (4.539%)
            Reduced data availability: 142 pgs inactive
            Degraded data redundancy: 220 pgs unclean
            2080 slow requests are blocked > 32 sec
            45510 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum hv01,1,2
    mgr: 1(active), standbys: hv01, 2
    osd: 12 osds: 12 up, 12 in; 211 remapped pgs

  data:
    pools:   1 pools, 800 pgs
    objects: 666k objects, 2535 GB
    usage:   6957 GB used, 11820 GB / 18778 GB avail
    pgs:     1.125% pgs unknown
             16.625% pgs not active
             92876/2046162 objects misplaced (4.539%)
             580 active+clean
             133 activating+remapped
             77  active+remapped+backfill_wait
             9   unknown
             1   active+remapped+backfilling

  io:
    recovery: 21822 kB/s, 6 objects/s
2018-02-15 09:20:20.903510 mon.hv01 [ERR] Health check update: 45504 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
2018-02-15 09:20:25.903899 mon.hv01 [WRN] Health check update: 92876/2046162 objects misplaced (4.539%) (OBJECT_MISPLACED)
2018-02-15 09:20:25.903940 mon.hv01 [WRN] Health check update: 2080 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-15 09:20:25.903956 mon.hv01 [ERR] Health check update: 45510 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
2018-02-15 09:20:17.042601 osd.3 [WRN] 18962 slow requests, 5 included below; oldest blocked for > 47406.514226 secs
2018-02-15 09:20:17.042607 osd.3 [WRN] slow request 7680.141415 seconds old, received at 2018-02-15 07:12:16.898480: osd_op(client.10165274.0:164441637 0.2d1 0.d83416d1 (undecoded) ondisk+write+known_if_redirected e5352) currently waiting for peered
2018-02-15 09:20:17.042609 osd.3 [WRN] slow request 15360.137745 seconds old, received at 2018-02-15 05:04:16.902151: osd_op(client.11354927.0:16076274 0.267 0.802d9a67 (undecoded) ondisk+write+known_if_redirected e5126) currently waiting for peered
2018-02-15 09:20:17.042611 osd.3 [WRN] slow request 483.691826 seconds old, received at 2018-02-15 09:12:13.348069: osd_op(client.10165274.0:164443076 0.2d1 0.d83416d1 (undecoded) ondisk+write+known_if_redirected e5566) currently waiting for peered
2018-02-15 09:20:17.042625 osd.3 [WRN] slow request 482.645457 seconds old, received at 2018-02-15 09:12:14.394438: osd_op(client.11354927.0:16079249 0.267 0.802d9a67 (undecoded) ondisk+write+known_if_redirected e5566) currently waiting for peered
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
Exception IOError: (32, 'Broken pipe') in 'rados.__monitor_callback2' ignored
These IOError exceptions just keep coming.
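All the slow request warnings above come from osd.3 and say "currently waiting for peered". If it helps, I can also dump the ops that are actually blocked on that OSD via its admin socket (run on hv02, where osd.3 lives):
Code:
# ceph daemon osd.3 dump_ops_in_flight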
Code:
# ceph osd tree
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       18.33806 root default
-2        6.11269     host hv01
 0   ssd  0.87320         osd.0      up  1.00000 1.00000
 1   ssd  0.87320         osd.1      up  1.00000 1.00000
 8   ssd  3.49309         osd.8      up  1.00000 1.00000
 9   ssd  0.87320         osd.9      up  1.00000 1.00000
-3        6.11269     host hv02
 2   ssd  0.87320         osd.2      up  1.00000 1.00000
 3   ssd  0.87320         osd.3      up  1.00000 1.00000
 6   ssd  0.87320         osd.6      up  1.00000 1.00000
10   ssd  3.49309         osd.10     up  1.00000 1.00000
-4        6.11269     host hv03
 4   ssd  0.87320         osd.4      up  1.00000 1.00000
 5   ssd  0.87320         osd.5      up  1.00000 1.00000
 7   ssd  0.87320         osd.7      up  1.00000 1.00000
11   ssd  3.49309         osd.11     up  1.00000 1.00000
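If it is useful I can also post the per-OSD utilization, in case the mix of 0.87 and 3.49 weighted disks matters:
Code:
# ceph osd df tree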
SMART shows no errors and we are not using jumbo frames. Does anyone have any ideas?