Ceph went near full and now I can't start any VMs

hassoon

Jan 27, 2020
Hi, I have 4 PVE nodes and they are pretty much identical in configuration.
Each of them has 6 drives.
I noticed that one drive failed, and as the VMs kept writing data the cluster became near full.
Here is the output of ceph -s:
root@pve2:~# ceph -s
cluster:
id: a9926f78-4366-4be5-a77c-7db26a419e86
health: HEALTH_ERR
Reduced data availability: 434 pgs inactive, 434 pgs peering
920 stuck requests are blocked > 4096 sec. Implicated osds 6,7,8,9,10,12,20,21,22,23

services:
mon: 4 daemons, quorum pve1,pve2,pve3,pve4
mgr: pve1(active), standbys: pve2, pve4, pve3
osd: 24 osds: 23 up, 23 in; 15 remapped pgs

data:
pools: 4 pools, 832 pgs
objects: 419k objects, 1575 GB
usage: 4792 GB used, 2003 GB / 6795 GB avail
pgs: 52.163% pgs not active
419 peering
398 active+clean
15 remapped+peering

As you can see, we do have plenty of available space, but I don't know what went wrong. Is it the failed drive, or what exactly?
Is there a way to get things sorted out again?

Here is an extract of the latest logs:
2020-01-27 07:00:00.000267 mon.pve1 mon.0 10.10.10.11:6789/0 7895062 : cluster [WRN] overall HEALTH_WARN 1 backfillfull osd(s); 1 nearfull osd(s); 4 pool(s) backfillfull
2020-01-27 08:00:00.000180 mon.pve1 mon.0 10.10.10.11:6789/0 7897154 : cluster [WRN] overall HEALTH_WARN 1 backfillfull osd(s); 1 nearfull osd(s); 4 pool(s) backfillfull
2020-01-27 09:00:00.000129 mon.pve1 mon.0 10.10.10.11:6789/0 7899344 : cluster [WRN] overall HEALTH_WARN 1 backfillfull osd(s); 1 nearfull osd(s); 4 pool(s) backfillfull
2020-01-27 09:58:49.569805 mon.pve1 mon.0 10.10.10.11:6789/0 7901486 : cluster [WRN] Health check failed: 1 slow requests are blocked > 32 sec. Implicated osds 18 (REQUEST_SLOW)
2020-01-27 09:58:54.618459 mon.pve1 mon.0 10.10.10.11:6789/0 7901509 : cluster [WRN] Health check update: 2 slow requests are blocked > 32 sec. Implicated osds 10,18 (REQUEST_SLOW)
2020-01-27 09:59:11.953819 mon.pve1 mon.0 10.10.10.11:6789/0 7901526 : cluster [WRN] Health check update: 3 slow requests are blocked > 32 sec. Implicated osds 8,10,18 (REQUEST_SLOW)
2020-01-27 09:59:34.272930 mon.pve1 mon.0 10.10.10.11:6789/0 7901543 : cluster [WRN] Health check update: 4 slow requests are blocked > 32 sec. Implicated osds 8,10,18,21 (REQUEST_SLOW)
2020-01-27 10:00:00.000200 mon.pve1 mon.0 10.10.10.11:6789/0 7901568 : cluster [WRN] overall HEALTH_WARN 1 backfillfull osd(s); 1 nearfull osd(s); 4 pool(s) backfillfull; 4 slow requests are blocked > 32 sec. Implicated osds 8,10,18,21
2020-01-27 10:00:10.700322 mon.pve1 mon.0 10.10.10.11:6789/0 7901578 : cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec. Implicated osds 8,10,18,21 (REQUEST_SLOW)
2020-01-27 10:00:24.757803 mon.pve1 mon.0 10.10.10.11:6789/0 7901594 : cluster [WRN] Health check update: 7 slow requests are blocked > 32 sec. Implicated osds 8,10,18,21,23 (REQUEST_SLOW)
2020-01-27 10:00:44.928742 mon.pve1 mon.0 10.10.10.11:6789/0 7901605 : cluster [WRN] Health check update: 12 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:00:51.428743 mon.pve1 mon.0 10.10.10.11:6789/0 7901610 : cluster [WRN] Health check update: 14 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:00:58.856057 mon.pve1 mon.0 10.10.10.11:6789/0 7901613 : cluster [WRN] Health check update: 17 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:01:33.662488 mon.pve1 mon.0 10.10.10.11:6789/0 7901648 : cluster [WRN] Health check update: 18 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:01:51.480059 mon.pve1 mon.0 10.10.10.11:6789/0 7901663 : cluster [WRN] Health check update: 20 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:02:56.537558 mon.pve1 mon.0 10.10.10.11:6789/0 7901701 : cluster [WRN] Health check update: 22 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:03:20.659258 mon.pve1 mon.0 10.10.10.11:6789/0 7901721 : cluster [WRN] Health check update: 23 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:03:29.051215 mon.pve1 mon.0 10.10.10.11:6789/0 7901731 : cluster [WRN] Health check update: 24 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:03:50.807869 mon.pve1 mon.0 10.10.10.11:6789/0 7901752 : cluster [WRN] Health check update: 26 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:04:35.452388 mon.pve1 mon.0 10.10.10.11:6789/0 7901787 : cluster [WRN] Health check update: 27 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:04:51.427521 mon.pve1 mon.0 10.10.10.11:6789/0 7901798 : cluster [WRN] Health check update: 29 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:05:33.483019 mon.pve1 mon.0 10.10.10.11:6789/0 7901835 : cluster [WRN] Health check update: 30 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:05:39.786841 mon.pve1 mon.0 10.10.10.11:6789/0 7901841 : cluster [WRN] Health check update: 35 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:05:51.889322 mon.pve1 mon.0 10.10.10.11:6789/0 7901855 : cluster [WRN] Health check update: 37 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:06:50.338031 mon.pve1 mon.0 10.10.10.11:6789/0 7901897 : cluster [WRN] Health check update: 39 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:07:28.856721 mon.pve1 mon.0 10.10.10.11:6789/0 7901922 : cluster [WRN] Health check update: 40 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:07:50.854370 mon.pve1 mon.0 10.10.10.11:6789/0 7901938 : cluster [WRN] Health check update: 42 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:08:17.174128 mon.pve1 mon.0 10.10.10.11:6789/0 7901955 : cluster [WRN] Health check update: 43 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:08:39.751586 mon.pve1 mon.0 10.10.10.11:6789/0 7901972 : cluster [WRN] Health check update: 44 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:08:51.876339 mon.pve1 mon.0 10.10.10.11:6789/0 7901982 : cluster [WRN] Health check update: 46 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:09:28.717695 mon.pve1 mon.0 10.10.10.11:6789/0 7902012 : cluster [WRN] Health check update: 47 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:09:52.040808 mon.pve1 mon.0 10.10.10.11:6789/0 7902032 : cluster [WRN] Health check update: 49 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:10:24.348908 mon.pve1 mon.0 10.10.10.11:6789/0 7902058 : cluster [WRN] Health check update: 50 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:10:30.408543 mon.pve1 mon.0 10.10.10.11:6789/0 7902060 : cluster [WRN] Health check update: 51 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:10:51.358077 mon.pve1 mon.0 10.10.10.11:6789/0 7902077 : cluster [WRN] Health check update: 53 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:10:58.912309 mon.pve1 mon.0 10.10.10.11:6789/0 7902079 : cluster [WRN] Health check update: 60 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:11:04.877804 mon.pve1 mon.0 10.10.10.11:6789/0 7902088 : cluster [WRN] Health check update: 61 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:11:25.444557 mon.pve1 mon.0 10.10.10.11:6789/0 7902103 : cluster [WRN] Health check update: 63 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:11:34.880921 mon.pve1 mon.0 10.10.10.11:6789/0 7902110 : cluster [WRN] Health check update: 64 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:11:51.527392 mon.pve1 mon.0 10.10.10.11:6789/0 7902124 : cluster [WRN] Health check update: 66 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:11:59.883620 mon.pve1 mon.0 10.10.10.11:6789/0 7902135 : cluster [WRN] Health check update: 67 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:12:10.056527 mon.pve1 mon.0 10.10.10.11:6789/0 7902144 : cluster [WRN] Health check update: 68 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
 
Nowhere in the errors does it state that an OSD is full; you just have a large number of PGs peering.

Does this progress at all, or does it show the same number of PGs in the same state?
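
If it is not progressing, it is worth pulling more detail on which PGs and OSDs are involved. A minimal sketch using the standard Ceph CLI (replace <pgid> with one of the stuck PG IDs shown by ceph health detail):

ceph health detail            # which PGs/OSDs are behind the warnings
ceph pg dump_stuck inactive   # list only the stuck/inactive PGs
ceph pg <pgid> query          # what a single PG is waiting on while peering
ceph osd df tree              # per-OSD fill level, laid out by host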
 
I have managed to replace the failed drive and recreate the OSD, so osd.3 is back.
Now here is what I'm getting; it has been like this for 2 hours:
Reduced data availability: 298 pgs inactive, 298 pgs peering

pg 4.d3 is stuck peering for 14410.764777, current state remapped+peering, last acting [23,3,17]
pg 4.d8 is stuck peering for 14829.650075, current state remapped+peering, last acting [10,0]
pg 4.da is stuck peering for 15383.092685, current state peering, last acting [19,17,9]
pg 4.e0 is stuck peering for 14410.954324, current state peering, last acting [23,15,6]
pg 4.e6 is stuck peering for 15382.753628, current state peering, last acting [22,3,7]
pg 4.e8 is stuck peering for 14829.629297, current state peering, last acting [10,23,3]
pg 4.ea is stuck peering for 15382.755377, current state peering, last acting [22,13,7]
pg 4.ec is stuck peering for 14410.982316, current state remapped+peering, last acting [22,3,11]
pg 4.f2 is stuck peering for 14829.668551, current state remapped+peering, last acting [7,1]
pg 4.f3 is stuck peering for 14829.628602, current state peering, last acting [12,19,15]
pg 4.f5 is stuck peering for 15383.094561, current state peering, last acting [19,14,10]
pg 4.f6 is stuck peering for 15382.754040, current state peering, last acting [22,8,17]
pg 4.f7 is stuck peering for 14829.622334, current state peering, last acting [9,18,14]
pg 4.f8 is stuck peering for 15382.754727, current state peering, last acting [22,14,9]
pg 4.fc is stuck peering for 15382.768575, current state peering, last acting [20,4,12]
pg 4.ff is stuck peering for 15382.755662, current state peering, last acting [22,8,17]
pg 5.d3 is stuck inactive for 15381.997538, current state peering, last acting [19,12,13]
pg 5.d4 is stuck peering for 14829.665244, current state peering, last acting [9,4,21]
pg 5.d7 is stuck peering for 15383.092859, current state peering, last acting [19,9,1]
pg 5.db is stuck peering for 14829.644457, current state peering, last acting [9,18,11]
pg 5.de is stuck peering for 14410.954617, current state peering, last acting [23,6,14]
pg 5.e0 is stuck peering for 14829.669181, current state peering, last acting [9,0,22]
pg 5.e4 is stuck peering for 14410.985542, current state peering, last acting [20,6,11]
pg 5.e9 is stuck peering for 14829.668217, current state peering, last acting [7,19,5]
pg 5.ea is stuck peering for 15383.042571, current state peering, last acting [23,17,8]
pg 5.ec is stuck peering for 14829.662013, current state peering, last acting [6,5,22]
pg 5.ed is stuck peering for 14410.991642, current state peering, last acting [21,17,6]
pg 5.ef is stuck peering for 14829.664410, current state peering, last acting [7,2,23]
pg 5.f3 is stuck peering for 14829.628689, current state peering, last acting [12,19,11]
pg 5.f4 is stuck peering for 14829.664044, current state peering, last acting [7,2,23]
pg 5.f5 is stuck peering for 15383.040456, current state peering, last acting [23,15,12]
pg 5.ff is stuck peering for 14829.661330, current state peering, last acting [8,21,1]
pg 8.d2 is stuck peering for 14829.622633, current state peering, last acting [9,22,13]
pg 8.d3 is stuck peering for 14829.664954, current state remapped+peering, last acting [6,18,1]
pg 8.d7 is stuck peering for 14410.991133, current state peering, last acting [21,5,6]
pg 8.d9 is stuck peering for 15375.984101, current state remapped+peering, last acting [23,0]
pg 8.e1 is stuck peering for 14829.662031, current state peering, last acting [6,4,23]
pg 8.e4 is stuck peering for 15382.754499, current state peering, last acting [22,14,8]
pg 8.e5 is stuck peering for 15382.767053, current state peering, last acting [20,16,9]
pg 8.e6 is stuck peering for 14829.675082, current state peering, last acting [12,23,4]
pg 8.e8 is stuck peering for 14829.675050, current state peering, last acting [10,5,22]
pg 8.ea is stuck peering for 14829.628134, current state remapped+peering, last acting [8,13]
pg 8.eb is stuck peering for 14829.664656, current state peering, last acting [6,19,2]
pg 8.ed is stuck peering for 15378.095744, current state remapped+peering, last acting [21,3,4]
pg 8.ee is stuck peering for 14829.628179, current state remapped+peering, last acting [7,13]
pg 8.ef is stuck peering for 15383.094191, current state peering, last acting [19,16,10]
pg 8.f1 is stuck peering for 14410.981549, current state peering, last acting [22,6,16]
pg 8.f4 is stuck peering for 15382.759103, current state peering, last acting [21,10,17]
pg 8.f6 is stuck peering for 14829.635972, current state remapped+peering, last acting [10,1]
pg 8.f7 is stuck peering for 14829.670302, current state peering, last acting [10,0,23]
pg 8.f8 is stuck peering for 14829.633940, current state peering, last acting [10,16,20]

and


root@pve1:~# ceph -s
cluster:
id: a9926f78-4366-4be5-a77c-7db26a419e86
health: HEALTH_WARN
Reduced data availability: 298 pgs inactive, 298 pgs peering

services:
mon: 4 daemons, quorum pve1,pve2,pve3,pve4
mgr: pve1(active), standbys: pve2, pve4, pve3
osd: 24 osds: 24 up, 24 in; 48 remapped pgs

data:
pools: 4 pools, 832 pgs
objects: 419k objects, 1575 GB
usage: 4862 GB used, 2212 GB / 7075 GB avail
pgs: 35.817% pgs not active
534 active+clean
250 peering
48 remapped+peering

root@pve1:~#
 
Can all your servers talk to all your other servers? No network issues or changes?
 
Can all your servers talk to all your other servers? No network issues or changes?
Yes, of course, everything is up and running with no issues; the only problem was the failed drive, which I have already replaced. What worries me is that usage went over the warning level because of this failed drive, even though overall usage was only 71%, which I don't think is critical.
Now it's 69% after replacing the drive.
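
Worth noting: the nearfull/backfillfull warnings in the log are triggered by the fullest single OSD, not by the cluster-wide average, so one unbalanced OSD can trip them even at ~70% overall usage. A quick way to check (a sketch, standard Ceph CLI):

ceph osd df tree             # per-OSD utilisation; look for one OSD far above the average
ceph osd dump | grep ratio   # thresholds in effect (defaults: nearfull 0.85, backfillfull 0.90, full 0.95)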
 

Attachments

  • Screenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
  • osdsScreenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
  • cephScreenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
What is your pool setup?
My main problem is that VMs have stopped starting at all.
Here is the error thrown:

TASK ERROR: start failed: command '/usr/bin/kvm -id 888 -name gocserver -chardev 'socket,id=qmp,path=/var/run/qemu-server/888.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/888.pid -daemonize -smbios 'type=1,uuid=f5dbdd1e-d9fa-4b3d-b98a-2870d47fd3d2' -smp '8,sockets=2,cores=4,maxcpus=8' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/888.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 16512 -vnc 0.0.0.0:100 -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -chardev 'socket,path=/var/run/qemu-server/888.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:6b8e42dcd33e' -drive 'file=rbd:wins/vm-888-disk-2:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/wins_vm.keyring,if=none,id=drive-virtio0,cache=writeback,format=raw,aio=threads,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap888i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=00:1C:C4:57:69:86,netdev=net0,bus=pci.0,addr=0x12,id=net0' -rtc 'base=localtime'' failed: got timeout
 
Yeah, they won't start while you have so many PGs down. You could try restarting all the OSD servers, as to be honest I can't see anything sticking out as to why they aren't peering.
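
If you try that, restarting just the OSD daemons is usually enough, one node at a time, waiting for ceph -s to settle in between. A sketch for a systemd-based Proxmox/Ceph node, with <id> being one of the implicated OSD numbers:

systemctl restart ceph-osd@<id>.service   # restart a single OSD daemon on its node
systemctl restart ceph-osd.target         # or restart all OSD daemons on that node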
 
I have waited 8 hours and nothing is changing; do I need to wait longer?
The numbers stay pretty much the same, they are not decreasing.
 
Yeah, they won't start while you have so many PGs down. You could try restarting all the OSD servers, as to be honest I can't see anything sticking out as to why they aren't peering.
Regarding the restart: I have restarted all 4 PVE nodes, one after the other, but there was no big change. Is there any recommended way to restart the servers?
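
(For reference, a sketch of the usual pattern: reboot the Ceph nodes one at a time with the noout flag set, so the cluster does not mark the rebooting node's OSDs out and start shuffling data while it is down.)

ceph osd set noout     # keep OSDs from being marked out during the reboots
# reboot one node, wait until its OSDs are back up and ceph -s has settled,
# then move on to the next node
ceph osd unset noout   # clear the flag once all nodes are back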
 
Here is what it was 8 hours ago:
root@pve1:~# ceph -s
cluster:
id: a9926f78-4366-4be5-a77c-7db26a419e86
health: HEALTH_WARN
Reduced data availability: 298 pgs inactive, 298 pgs peering

services:
mon: 4 daemons, quorum pve1,pve2,pve3,pve4
mgr: pve1(active), standbys: pve2, pve4, pve3
osd: 24 osds: 24 up, 24 in; 48 remapped pgs

data:
pools: 4 pools, 832 pgs
objects: 419k objects, 1575 GB
usage: 4862 GB used, 2212 GB / 7075 GB avail
pgs: 35.817% pgs not active
534 active+clean
250 peering
48 remapped+peering

root@pve1:~#


Now, after a few node restarts and some OSD restarts:



root@pve2:~# ceph -s
cluster:
id: a9926f78-4366-4be5-a77c-7db26a419e86
health: HEALTH_WARN
Reduced data availability: 292 pgs inactive, 292 pgs peering

services:
mon: 4 daemons, quorum pve1,pve2,pve3,pve4
mgr: pve1(active), standbys: pve2, pve4, pve3
osd: 24 osds: 24 up, 24 in; 42 remapped pgs

data:
pools: 4 pools, 832 pgs
objects: 419k objects, 1575 GB
usage: 4864 GB used, 2210 GB / 7075 GB avail
pgs: 35.096% pgs not active
540 active+clean
250 peering
42 remapped+peering



Keep in mind that the number only went down after the OSD restarts etc., not during the 8 hours of waiting.

The odd thing is that I'm still seeing the output below, but the status shows 0 IOPS / 0 read / 0 write.
I can't see any activity at all.


Reduced data availability: 292 pgs inactive, 292 pgs peering
pg 4.d1 is stuck inactive for 5474.558184, current state peering, last acting [12,2,22]
pg 4.d3 is stuck peering for 1636.985160, current state remapped+peering, last acting [23,3,17]
pg 4.d8 is stuck peering for 30728.217833, current state peering, last acting [10,16,18]
pg 4.da is stuck peering for 5004.454692, current state peering, last acting [19,17,9]
pg 4.e0 is stuck peering for 1636.985740, current state peering, last acting [23,15,6]
pg 4.e6 is stuck peering for 4051.095440, current state remapped+peering, last acting [22,3,13]
pg 4.e8 is stuck peering for 5305.753685, current state remapped+peering, last acting [23,3,17]
pg 4.ea is stuck peering for 4051.097889, current state peering, last acting [22,13,7]
pg 4.ec is stuck peering for 1636.991486, current state remapped+peering, last acting [22,3,11]
pg 4.f2 is stuck peering for 4051.104535, current state remapped+peering, last acting [23,1]
pg 4.f3 is stuck peering for 30728.196360, current state peering, last acting [12,19,15]
pg 4.f5 is stuck peering for 5305.757496, current state peering, last acting [19,14,10]
pg 4.f6 is stuck peering for 1301.424070, current state peering, last acting [22,8,17]
pg 4.f7 is stuck peering for 30728.190092, current state peering, last acting [9,18,14]
pg 4.f8 is stuck peering for 5004.447666, current state peering, last acting [22,14,9]
pg 4.fc is stuck peering for 4224.253849, current state peering, last acting [20,4,12]
pg 4.ff is stuck peering for 1301.424894, current state peering, last acting [22,8,17]
pg 5.d3 is stuck peering for 4220.669064, current state peering, last acting [19,12,13]
pg 5.d4 is stuck peering for 30728.233002, current state peering, last acting [9,4,21]
pg 5.d7 is stuck peering for 5004.454947, current state peering, last acting [19,9,1]
pg 5.db is stuck peering for 30728.212215, current state peering, last acting [9,18,11]
pg 5.de is stuck peering for 1636.985484, current state peering, last acting [23,6,14]
pg 5.e0 is stuck peering for 5497.629245, current state peering, last acting [9,0,22]
pg 5.e4 is stuck peering for 1636.996902, current state peering, last acting [20,6,11]
pg 5.e9 is stuck peering for 30728.235975, current state peering, last acting [7,19,5]
pg 5.ea is stuck peering for 1301.437421, current state peering, last acting [23,17,8]
pg 5.ec is stuck peering for 5497.631794, current state peering, last acting [6,5,22]
pg 5.ed is stuck peering for 1637.001455, current state peering, last acting [21,17,6]
pg 5.ef is stuck peering for 30728.232167, current state peering, last acting [7,2,23]
pg 5.f3 is stuck peering for 30728.196446, current state peering, last acting [12,19,11]
pg 5.f4 is stuck peering for 30728.231802, current state peering, last acting [7,2,23]
pg 5.f5 is stuck peering for 4224.245381, current state peering, last acting [23,15,12]
pg 5.ff is stuck peering for 30728.229087, current state peering, last acting [8,21,1]
pg 8.d2 is stuck peering for 5497.629135, current state peering, last acting [9,22,13]
pg 8.d3 is stuck peering for 30728.232711, current state peering, last acting [6,2,18]
pg 8.d7 is stuck peering for 1637.000891, current state peering, last acting [21,5,6]
pg 8.de is stuck peering for 30728.201681, current state peering, last acting [10,19,14]
pg 8.e1 is stuck peering for 30728.229788, current state peering, last acting [6,4,23]
pg 8.e4 is stuck peering for 1301.424550, current state peering, last acting [22,14,8]
pg 8.e5 is stuck peering for 5004.454727, current state peering, last acting [20,16,9]
pg 8.e6 is stuck peering for 30728.242840, current state peering, last acting [12,23,4]
pg 8.e8 is stuck peering for 5497.577531, current state peering, last acting [10,5,22]
pg 8.ea is stuck peering for 1286.480822, current state remapped+peering, last acting [21,13]
pg 8.eb is stuck peering for 30728.232414, current state peering, last acting [6,19,2]
pg 8.ee is stuck peering for 4051.103116, current state remapped+peering, last acting [23,13]
pg 8.ef is stuck peering for 5305.758101, current state peering, last acting [19,16,10]
pg 8.f1 is stuck peering for 1636.990732, current state peering, last acting [22,6,16]
pg 8.f4 is stuck peering for 5305.592275, current state peering, last acting [21,10,17]
pg 8.f6 is stuck peering for 5305.752111, current state remapped+peering, last acting [23,11,1]
pg 8.f7 is stuck peering for 30728.238060, current state peering, last acting [10,0,23]
pg 8.f8 is stuck peering for 30728.201698, current state peering, last acting [10,16,20]
 
Quick note: I have found two snapshots and I want to delete them to free up disk space. Is there a clean and proper way to delete a snapshot to free up space while the PGs are peering, without affecting the recovery?
 
As @sg90 said, are you sure you don't have a network issue?

Quick note: I have found two snapshots and I want to delete them to free up disk space. Is there a clean and proper way to delete a snapshot to free up space while the PGs are peering, without affecting the recovery?
The deletion should work while the PGs are peering.

EDIT: You can try to restart the OSDs in question. This may help for peering PGs, but the space issue will still remain.
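
If those are RBD snapshots of VM disks, they can also be removed from the CLI when the GUI times out; a sketch, using the wins/vm-888-disk-2 image from the start error above purely as an example, so substitute your actual image and snapshot names:

rbd snap ls wins/vm-888-disk-2              # list the snapshots of the image
rbd snap rm wins/vm-888-disk-2@<snapname>   # delete one snapshot (can take a while on a loaded cluster)
# if the snapshot was created through Proxmox and is referenced in the VM config,
# running "qm delsnapshot <vmid> <snapname>" on the owning node keeps the config consistent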
 
As @sg90 said, are you sure you don't have a network issue?


The deletion should work while the PGs are peering.

EDIT: You can try to restart the OSDs in question. This may help for peering PGs, but the space issue will still remain.
Can you share info on how to properly delete the snapshots? The deletion via the GUI timed out.
I have restarted all the OSDs but didn't notice a big difference.
My question remains: I can't see any read or write ops. Is this normal, or should I be seeing reads and writes on Ceph while it is peering?
 
As @sg90 said, are you sure you don't have a network issue?


The deletion should work while the PGs are peering.

EDIT: You can try to restart the OSDs in question. This may help for peering PGs, but the space issue will still remain.
Just to recap on the network issues:
here are screenshots of the network settings.
I'm not sure how they were configured before, though.

Attaching all nodes' network settings.
 

Attachments

  • netScreenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
  • net2Screenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
  • net3Screenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
  • net4Screenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
All local bonded interfaces are pingable; I just did a ping test.
What type of bond is configured? And are all nodes running on the same switch? What are the MTU sizes of the bonds?

Can you please post a ceph osd df tree, the ceph.conf, and the crush map (best to get the latter from the GUI)?

My question remains: I can't see any read or write ops. Is this normal, or should I be seeing reads and writes on Ceph while it is peering?
What do the logs say? It seems that client traffic is stalled.
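
For reference, the requested information and a couple of quick checks for stalled traffic can be collected roughly like this (a sketch; 10.10.10.12 is only an example peer address on your 10.10.10.x Ceph network, and <id> is one of the implicated OSDs):

ceph osd df tree                                     # per-OSD usage and CRUSH layout
cat /etc/pve/ceph.conf                               # the Ceph config used by PVE
ceph osd getcrushmap -o /tmp/crushmap.bin            # dump the CRUSH map...
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt  # ...and decompile it to text
ping -M do -s 8972 -c 3 10.10.10.12                  # MTU check with a don't-fragment, near-jumbo-sized ping
                                                     # (use -s 1472 instead if the bonds run MTU 1500)
ceph daemon osd.<id> dump_ops_in_flight              # on the node hosting that OSD: what its requests are stuck on
journalctl -u ceph-osd@<id> --since "1 hour ago"     # recent log of that OSD daemon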
 
Any clue?
Another odd thing: doing a standard listing of the pool froze the nodes.
 
