Ceph went near full and now I can't start any VMs

hassoon

Jan 27, 2020
Hi, I have 4 PVE nodes and they are pretty much identical in configuration.
Each of them has 6 drives.
I noticed that one drive failed, and as the VMs kept writing data the cluster became near full.
Here is the output of ceph -s:
root@pve2:~# ceph -s
cluster:
id: a9926f78-4366-4be5-a77c-7db26a419e86
health: HEALTH_ERR
Reduced data availability: 434 pgs inactive, 434 pgs peering
920 stuck requests are blocked > 4096 sec. Implicated osds 6,7,8,9,10,12,20,21,22,23

services:
mon: 4 daemons, quorum pve1,pve2,pve3,pve4
mgr: pve1(active), standbys: pve2, pve4, pve3
osd: 24 osds: 23 up, 23 in; 15 remapped pgs

data:
pools: 4 pools, 832 pgs
objects: 419k objects, 1575 GB
usage: 4792 GB used, 2003 GB / 6795 GB avail
pgs: 52.163% pgs not active
419 peering
398 active+clean
15 remapped+peering

As you can see, we do have plenty of available space, but I don't know what went wrong. Is it the failed drive, or what exactly?
Is there a way to get things sorted out again?

Here is an extract of the latest logs:
2020-01-27 07:00:00.000267 mon.pve1 mon.0 10.10.10.11:6789/0 7895062 : cluster [WRN] overall HEALTH_WARN 1 backfillfull osd(s); 1 nearfull osd(s); 4 pool(s) backfillfull
2020-01-27 08:00:00.000180 mon.pve1 mon.0 10.10.10.11:6789/0 7897154 : cluster [WRN] overall HEALTH_WARN 1 backfillfull osd(s); 1 nearfull osd(s); 4 pool(s) backfillfull
2020-01-27 09:00:00.000129 mon.pve1 mon.0 10.10.10.11:6789/0 7899344 : cluster [WRN] overall HEALTH_WARN 1 backfillfull osd(s); 1 nearfull osd(s); 4 pool(s) backfillfull
2020-01-27 09:58:49.569805 mon.pve1 mon.0 10.10.10.11:6789/0 7901486 : cluster [WRN] Health check failed: 1 slow requests are blocked > 32 sec. Implicated osds 18 (REQUEST_SLOW)
2020-01-27 09:58:54.618459 mon.pve1 mon.0 10.10.10.11:6789/0 7901509 : cluster [WRN] Health check update: 2 slow requests are blocked > 32 sec. Implicated osds 10,18 (REQUEST_SLOW)
2020-01-27 09:59:11.953819 mon.pve1 mon.0 10.10.10.11:6789/0 7901526 : cluster [WRN] Health check update: 3 slow requests are blocked > 32 sec. Implicated osds 8,10,18 (REQUEST_SLOW)
2020-01-27 09:59:34.272930 mon.pve1 mon.0 10.10.10.11:6789/0 7901543 : cluster [WRN] Health check update: 4 slow requests are blocked > 32 sec. Implicated osds 8,10,18,21 (REQUEST_SLOW)
2020-01-27 10:00:00.000200 mon.pve1 mon.0 10.10.10.11:6789/0 7901568 : cluster [WRN] overall HEALTH_WARN 1 backfillfull osd(s); 1 nearfull osd(s); 4 pool(s) backfillfull; 4 slow requests are blocked > 32 sec. Implicated osds 8,10,18,21
2020-01-27 10:00:10.700322 mon.pve1 mon.0 10.10.10.11:6789/0 7901578 : cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec. Implicated osds 8,10,18,21 (REQUEST_SLOW)
2020-01-27 10:00:24.757803 mon.pve1 mon.0 10.10.10.11:6789/0 7901594 : cluster [WRN] Health check update: 7 slow requests are blocked > 32 sec. Implicated osds 8,10,18,21,23 (REQUEST_SLOW)
2020-01-27 10:00:44.928742 mon.pve1 mon.0 10.10.10.11:6789/0 7901605 : cluster [WRN] Health check update: 12 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:00:51.428743 mon.pve1 mon.0 10.10.10.11:6789/0 7901610 : cluster [WRN] Health check update: 14 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:00:58.856057 mon.pve1 mon.0 10.10.10.11:6789/0 7901613 : cluster [WRN] Health check update: 17 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:01:33.662488 mon.pve1 mon.0 10.10.10.11:6789/0 7901648 : cluster [WRN] Health check update: 18 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:01:51.480059 mon.pve1 mon.0 10.10.10.11:6789/0 7901663 : cluster [WRN] Health check update: 20 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:02:56.537558 mon.pve1 mon.0 10.10.10.11:6789/0 7901701 : cluster [WRN] Health check update: 22 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:03:20.659258 mon.pve1 mon.0 10.10.10.11:6789/0 7901721 : cluster [WRN] Health check update: 23 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:03:29.051215 mon.pve1 mon.0 10.10.10.11:6789/0 7901731 : cluster [WRN] Health check update: 24 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:03:50.807869 mon.pve1 mon.0 10.10.10.11:6789/0 7901752 : cluster [WRN] Health check update: 26 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:04:35.452388 mon.pve1 mon.0 10.10.10.11:6789/0 7901787 : cluster [WRN] Health check update: 27 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:04:51.427521 mon.pve1 mon.0 10.10.10.11:6789/0 7901798 : cluster [WRN] Health check update: 29 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:05:33.483019 mon.pve1 mon.0 10.10.10.11:6789/0 7901835 : cluster [WRN] Health check update: 30 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:05:39.786841 mon.pve1 mon.0 10.10.10.11:6789/0 7901841 : cluster [WRN] Health check update: 35 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:05:51.889322 mon.pve1 mon.0 10.10.10.11:6789/0 7901855 : cluster [WRN] Health check update: 37 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:06:50.338031 mon.pve1 mon.0 10.10.10.11:6789/0 7901897 : cluster [WRN] Health check update: 39 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:07:28.856721 mon.pve1 mon.0 10.10.10.11:6789/0 7901922 : cluster [WRN] Health check update: 40 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:07:50.854370 mon.pve1 mon.0 10.10.10.11:6789/0 7901938 : cluster [WRN] Health check update: 42 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:08:17.174128 mon.pve1 mon.0 10.10.10.11:6789/0 7901955 : cluster [WRN] Health check update: 43 slow requests are blocked > 32 sec. Implicated osds 8,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:08:39.751586 mon.pve1 mon.0 10.10.10.11:6789/0 7901972 : cluster [WRN] Health check update: 44 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:08:51.876339 mon.pve1 mon.0 10.10.10.11:6789/0 7901982 : cluster [WRN] Health check update: 46 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:09:28.717695 mon.pve1 mon.0 10.10.10.11:6789/0 7902012 : cluster [WRN] Health check update: 47 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:09:52.040808 mon.pve1 mon.0 10.10.10.11:6789/0 7902032 : cluster [WRN] Health check update: 49 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:10:24.348908 mon.pve1 mon.0 10.10.10.11:6789/0 7902058 : cluster [WRN] Health check update: 50 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:10:30.408543 mon.pve1 mon.0 10.10.10.11:6789/0 7902060 : cluster [WRN] Health check update: 51 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:10:51.358077 mon.pve1 mon.0 10.10.10.11:6789/0 7902077 : cluster [WRN] Health check update: 53 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:10:58.912309 mon.pve1 mon.0 10.10.10.11:6789/0 7902079 : cluster [WRN] Health check update: 60 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:11:04.877804 mon.pve1 mon.0 10.10.10.11:6789/0 7902088 : cluster [WRN] Health check update: 61 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:11:25.444557 mon.pve1 mon.0 10.10.10.11:6789/0 7902103 : cluster [WRN] Health check update: 63 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:11:34.880921 mon.pve1 mon.0 10.10.10.11:6789/0 7902110 : cluster [WRN] Health check update: 64 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:11:51.527392 mon.pve1 mon.0 10.10.10.11:6789/0 7902124 : cluster [WRN] Health check update: 66 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:11:59.883620 mon.pve1 mon.0 10.10.10.11:6789/0 7902135 : cluster [WRN] Health check update: 67 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
2020-01-27 10:12:10.056527 mon.pve1 mon.0 10.10.10.11:6789/0 7902144 : cluster [WRN] Health check update: 68 slow requests are blocked > 32 sec. Implicated osds 8,9,10,18,20,21,23 (REQUEST_SLOW)
 
Nowhere in the errors does it state that an OSD is full; you just have a large number of PGs peering.

Does this progress at all, or does it show the same number of PGs in the same state?
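
If it is not progressing, it is worth pulling more detail on which PGs and OSDs are involved. A minimal sketch using the standard Ceph CLI (replace <pgid> with one of the stuck PG IDs shown by ceph health detail):

ceph health detail            # which PGs/OSDs are behind the warnings
ceph pg dump_stuck inactive   # list only the stuck/inactive PGs
ceph pg <pgid> query          # what a single PG is waiting on while peering
ceph osd df tree              # per-OSD fill level, laid out by host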
 
I have managed to replace the failed drive and recreate the OSD, so osd.3 is back.
Now here is what I'm getting; it has been like this for 2 hours:
Reduced data availability: 298 pgs inactive, 298 pgs peering

pg 4.d3 is stuck peering for 14410.764777, current state remapped+peering, last acting [23,3,17]
pg 4.d8 is stuck peering for 14829.650075, current state remapped+peering, last acting [10,0]
pg 4.da is stuck peering for 15383.092685, current state peering, last acting [19,17,9]
pg 4.e0 is stuck peering for 14410.954324, current state peering, last acting [23,15,6]
pg 4.e6 is stuck peering for 15382.753628, current state peering, last acting [22,3,7]
pg 4.e8 is stuck peering for 14829.629297, current state peering, last acting [10,23,3]
pg 4.ea is stuck peering for 15382.755377, current state peering, last acting [22,13,7]
pg 4.ec is stuck peering for 14410.982316, current state remapped+peering, last acting [22,3,11]
pg 4.f2 is stuck peering for 14829.668551, current state remapped+peering, last acting [7,1]
pg 4.f3 is stuck peering for 14829.628602, current state peering, last acting [12,19,15]
pg 4.f5 is stuck peering for 15383.094561, current state peering, last acting [19,14,10]
pg 4.f6 is stuck peering for 15382.754040, current state peering, last acting [22,8,17]
pg 4.f7 is stuck peering for 14829.622334, current state peering, last acting [9,18,14]
pg 4.f8 is stuck peering for 15382.754727, current state peering, last acting [22,14,9]
pg 4.fc is stuck peering for 15382.768575, current state peering, last acting [20,4,12]
pg 4.ff is stuck peering for 15382.755662, current state peering, last acting [22,8,17]
pg 5.d3 is stuck inactive for 15381.997538, current state peering, last acting [19,12,13]
pg 5.d4 is stuck peering for 14829.665244, current state peering, last acting [9,4,21]
pg 5.d7 is stuck peering for 15383.092859, current state peering, last acting [19,9,1]
pg 5.db is stuck peering for 14829.644457, current state peering, last acting [9,18,11]
pg 5.de is stuck peering for 14410.954617, current state peering, last acting [23,6,14]
pg 5.e0 is stuck peering for 14829.669181, current state peering, last acting [9,0,22]
pg 5.e4 is stuck peering for 14410.985542, current state peering, last acting [20,6,11]
pg 5.e9 is stuck peering for 14829.668217, current state peering, last acting [7,19,5]
pg 5.ea is stuck peering for 15383.042571, current state peering, last acting [23,17,8]
pg 5.ec is stuck peering for 14829.662013, current state peering, last acting [6,5,22]
pg 5.ed is stuck peering for 14410.991642, current state peering, last acting [21,17,6]
pg 5.ef is stuck peering for 14829.664410, current state peering, last acting [7,2,23]
pg 5.f3 is stuck peering for 14829.628689, current state peering, last acting [12,19,11]
pg 5.f4 is stuck peering for 14829.664044, current state peering, last acting [7,2,23]
pg 5.f5 is stuck peering for 15383.040456, current state peering, last acting [23,15,12]
pg 5.ff is stuck peering for 14829.661330, current state peering, last acting [8,21,1]
pg 8.d2 is stuck peering for 14829.622633, current state peering, last acting [9,22,13]
pg 8.d3 is stuck peering for 14829.664954, current state remapped+peering, last acting [6,18,1]
pg 8.d7 is stuck peering for 14410.991133, current state peering, last acting [21,5,6]
pg 8.d9 is stuck peering for 15375.984101, current state remapped+peering, last acting [23,0]
pg 8.e1 is stuck peering for 14829.662031, current state peering, last acting [6,4,23]
pg 8.e4 is stuck peering for 15382.754499, current state peering, last acting [22,14,8]
pg 8.e5 is stuck peering for 15382.767053, current state peering, last acting [20,16,9]
pg 8.e6 is stuck peering for 14829.675082, current state peering, last acting [12,23,4]
pg 8.e8 is stuck peering for 14829.675050, current state peering, last acting [10,5,22]
pg 8.ea is stuck peering for 14829.628134, current state remapped+peering, last acting [8,13]
pg 8.eb is stuck peering for 14829.664656, current state peering, last acting [6,19,2]
pg 8.ed is stuck peering for 15378.095744, current state remapped+peering, last acting [21,3,4]
pg 8.ee is stuck peering for 14829.628179, current state remapped+peering, last acting [7,13]
pg 8.ef is stuck peering for 15383.094191, current state peering, last acting [19,16,10]
pg 8.f1 is stuck peering for 14410.981549, current state peering, last acting [22,6,16]
pg 8.f4 is stuck peering for 15382.759103, current state peering, last acting [21,10,17]
pg 8.f6 is stuck peering for 14829.635972, current state remapped+peering, last acting [10,1]
pg 8.f7 is stuck peering for 14829.670302, current state peering, last acting [10,0,23]
pg 8.f8 is stuck peering for 14829.633940, current state peering, last acting [10,16,20]

and


root@pve1:~# ceph -s
cluster:
id: a9926f78-4366-4be5-a77c-7db26a419e86
health: HEALTH_WARN
Reduced data availability: 298 pgs inactive, 298 pgs peering

services:
mon: 4 daemons, quorum pve1,pve2,pve3,pve4
mgr: pve1(active), standbys: pve2, pve4, pve3
osd: 24 osds: 24 up, 24 in; 48 remapped pgs

data:
pools: 4 pools, 832 pgs
objects: 419k objects, 1575 GB
usage: 4862 GB used, 2212 GB / 7075 GB avail
pgs: 35.817% pgs not active
534 active+clean
250 peering
48 remapped+peering

root@pve1:~#
 
Can all your servers talk to all your other servers? No network issues or changes?
 
Can all your servers talk to all your other servers? No network issues or changes?
Yes, of course, everything is up and running with no issues; the only problem was the failed drive, which I have already replaced. What worries me is that usage went over the warning level because of this failed drive, even though overall usage was only 71%, which I don't think is critical.
Now it's 69% after replacing the drive.
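
Worth noting: the nearfull/backfillfull warnings in the log are triggered by the fullest single OSD, not by the cluster-wide average, so one unbalanced OSD can trip them even at ~70% overall usage. A quick way to check (a sketch, standard Ceph CLI):

ceph osd df tree             # per-OSD utilisation; look for one OSD far above the average
ceph osd dump | grep ratio   # thresholds in effect (defaults: nearfull 0.85, backfillfull 0.90, full 0.95)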
 

Attachments

  • Screenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
  • osdsScreenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
  • cephScreenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
What is your pool setup?
My main problem is that VMs have stopped starting at all.
Here is the error thrown:

TASK ERROR: start failed: command '/usr/bin/kvm -id 888 -name gocserver -chardev 'socket,id=qmp,path=/var/run/qemu-server/888.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/888.pid -daemonize -smbios 'type=1,uuid=f5dbdd1e-d9fa-4b3d-b98a-2870d47fd3d2' -smp '8,sockets=2,cores=4,maxcpus=8' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/888.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 16512 -vnc 0.0.0.0:100 -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -chardev 'socket,path=/var/run/qemu-server/888.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:6b8e42dcd33e' -drive 'file=rbd:wins/vm-888-disk-2:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/wins_vm.keyring,if=none,id=drive-virtio0,cache=writeback,format=raw,aio=threads,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap888i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=00:1C:C4:57:69:86,netdev=net0,bus=pci.0,addr=0x12,id=net0' -rtc 'base=localtime'' failed: got timeout
 
Yeah, they won't start while you have so many PGs down. You could try restarting all the OSD servers, as to be honest I can't see anything sticking out as to why they aren't peering.
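
If you try that, restarting just the OSD daemons is usually enough, one node at a time, waiting for ceph -s to settle in between. A sketch for a systemd-based Proxmox/Ceph node, with <id> being one of the implicated OSD numbers:

systemctl restart ceph-osd@<id>.service   # restart a single OSD daemon on its node
systemctl restart ceph-osd.target         # or restart all OSD daemons on that node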
 
I have waited 8 hours and nothing is changing; do I need to wait longer?
The numbers stay pretty much the same, they are not decreasing.
 
Yeah, they won't start while you have so many PGs down. You could try restarting all the OSD servers, as to be honest I can't see anything sticking out as to why they aren't peering.
Regarding the restart: I have restarted all 4 PVE nodes, one after the other, but there was no big change. Is there any recommended way to restart the servers?
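
(For reference, a sketch of the usual pattern: reboot the Ceph nodes one at a time with the noout flag set, so the cluster does not mark the rebooting node's OSDs out and start shuffling data while it is down.)

ceph osd set noout     # keep OSDs from being marked out during the reboots
# reboot one node, wait until its OSDs are back up and ceph -s has settled,
# then move on to the next node
ceph osd unset noout   # clear the flag once all nodes are back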
 
Here is what it was 8 hours ago:
root@pve1:~# ceph -s
cluster:
id: a9926f78-4366-4be5-a77c-7db26a419e86
health: HEALTH_WARN
Reduced data availability: 298 pgs inactive, 298 pgs peering

services:
mon: 4 daemons, quorum pve1,pve2,pve3,pve4
mgr: pve1(active), standbys: pve2, pve4, pve3
osd: 24 osds: 24 up, 24 in; 48 remapped pgs

data:
pools: 4 pools, 832 pgs
objects: 419k objects, 1575 GB
usage: 4862 GB used, 2212 GB / 7075 GB avail
pgs: 35.817% pgs not active
534 active+clean
250 peering
48 remapped+peering

root@pve1:~#


Now, after a few node restarts and some OSD restarts:



root@pve2:~# ceph -s
cluster:
id: a9926f78-4366-4be5-a77c-7db26a419e86
health: HEALTH_WARN
Reduced data availability: 292 pgs inactive, 292 pgs peering

services:
mon: 4 daemons, quorum pve1,pve2,pve3,pve4
mgr: pve1(active), standbys: pve2, pve4, pve3
osd: 24 osds: 24 up, 24 in; 42 remapped pgs

data:
pools: 4 pools, 832 pgs
objects: 419k objects, 1575 GB
usage: 4864 GB used, 2210 GB / 7075 GB avail
pgs: 35.096% pgs not active
540 active+clean
250 peering
42 remapped+peering



Keep in mind that the number only went down after the OSD restarts etc., not during the 8 hours of waiting.

The odd thing is that I'm still seeing the output below, but the status shows 0 IOPS / 0 read / 0 write.
I can't see any activity at all.


Reduced data availability: 292 pgs inactive, 292 pgs peering
pg 4.d1 is stuck inactive for 5474.558184, current state peering, last acting [12,2,22]
pg 4.d3 is stuck peering for 1636.985160, current state remapped+peering, last acting [23,3,17]
pg 4.d8 is stuck peering for 30728.217833, current state peering, last acting [10,16,18]
pg 4.da is stuck peering for 5004.454692, current state peering, last acting [19,17,9]
pg 4.e0 is stuck peering for 1636.985740, current state peering, last acting [23,15,6]
pg 4.e6 is stuck peering for 4051.095440, current state remapped+peering, last acting [22,3,13]
pg 4.e8 is stuck peering for 5305.753685, current state remapped+peering, last acting [23,3,17]
pg 4.ea is stuck peering for 4051.097889, current state peering, last acting [22,13,7]
pg 4.ec is stuck peering for 1636.991486, current state remapped+peering, last acting [22,3,11]
pg 4.f2 is stuck peering for 4051.104535, current state remapped+peering, last acting [23,1]
pg 4.f3 is stuck peering for 30728.196360, current state peering, last acting [12,19,15]
pg 4.f5 is stuck peering for 5305.757496, current state peering, last acting [19,14,10]
pg 4.f6 is stuck peering for 1301.424070, current state peering, last acting [22,8,17]
pg 4.f7 is stuck peering for 30728.190092, current state peering, last acting [9,18,14]
pg 4.f8 is stuck peering for 5004.447666, current state peering, last acting [22,14,9]
pg 4.fc is stuck peering for 4224.253849, current state peering, last acting [20,4,12]
pg 4.ff is stuck peering for 1301.424894, current state peering, last acting [22,8,17]
pg 5.d3 is stuck peering for 4220.669064, current state peering, last acting [19,12,13]
pg 5.d4 is stuck peering for 30728.233002, current state peering, last acting [9,4,21]
pg 5.d7 is stuck peering for 5004.454947, current state peering, last acting [19,9,1]
pg 5.db is stuck peering for 30728.212215, current state peering, last acting [9,18,11]
pg 5.de is stuck peering for 1636.985484, current state peering, last acting [23,6,14]
pg 5.e0 is stuck peering for 5497.629245, current state peering, last acting [9,0,22]
pg 5.e4 is stuck peering for 1636.996902, current state peering, last acting [20,6,11]
pg 5.e9 is stuck peering for 30728.235975, current state peering, last acting [7,19,5]
pg 5.ea is stuck peering for 1301.437421, current state peering, last acting [23,17,8]
pg 5.ec is stuck peering for 5497.631794, current state peering, last acting [6,5,22]
pg 5.ed is stuck peering for 1637.001455, current state peering, last acting [21,17,6]
pg 5.ef is stuck peering for 30728.232167, current state peering, last acting [7,2,23]
pg 5.f3 is stuck peering for 30728.196446, current state peering, last acting [12,19,11]
pg 5.f4 is stuck peering for 30728.231802, current state peering, last acting [7,2,23]
pg 5.f5 is stuck peering for 4224.245381, current state peering, last acting [23,15,12]
pg 5.ff is stuck peering for 30728.229087, current state peering, last acting [8,21,1]
pg 8.d2 is stuck peering for 5497.629135, current state peering, last acting [9,22,13]
pg 8.d3 is stuck peering for 30728.232711, current state peering, last acting [6,2,18]
pg 8.d7 is stuck peering for 1637.000891, current state peering, last acting [21,5,6]
pg 8.de is stuck peering for 30728.201681, current state peering, last acting [10,19,14]
pg 8.e1 is stuck peering for 30728.229788, current state peering, last acting [6,4,23]
pg 8.e4 is stuck peering for 1301.424550, current state peering, last acting [22,14,8]
pg 8.e5 is stuck peering for 5004.454727, current state peering, last acting [20,16,9]
pg 8.e6 is stuck peering for 30728.242840, current state peering, last acting [12,23,4]
pg 8.e8 is stuck peering for 5497.577531, current state peering, last acting [10,5,22]
pg 8.ea is stuck peering for 1286.480822, current state remapped+peering, last acting [21,13]
pg 8.eb is stuck peering for 30728.232414, current state peering, last acting [6,19,2]
pg 8.ee is stuck peering for 4051.103116, current state remapped+peering, last acting [23,13]
pg 8.ef is stuck peering for 5305.758101, current state peering, last acting [19,16,10]
pg 8.f1 is stuck peering for 1636.990732, current state peering, last acting [22,6,16]
pg 8.f4 is stuck peering for 5305.592275, current state peering, last acting [21,10,17]
pg 8.f6 is stuck peering for 5305.752111, current state remapped+peering, last acting [23,11,1]
pg 8.f7 is stuck peering for 30728.238060, current state peering, last acting [10,0,23]
pg 8.f8 is stuck peering for 30728.201698, current state peering, last acting [10,16,20]
 
Quick note: I have found two snapshots and I want to delete them to free up disk space. Is there a clean and proper way to delete a snapshot to free up space while the PGs are peering, without affecting the recovery?
 
As @sg90 said, are you sure you don't have a network issue?

Quick note: I have found two snapshots and I want to delete them to free up disk space. Is there a clean and proper way to delete a snapshot to free up space while the PGs are peering, without affecting the recovery?
The deletion should work while the PGs are peering.

EDIT: You can try to restart the OSDs in question. This may help for peering PGs, but the space issue will still remain.
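
If those are RBD snapshots of VM disks, they can also be removed from the CLI when the GUI times out; a sketch, using the wins/vm-888-disk-2 image from the start error above purely as an example, so substitute your actual image and snapshot names:

rbd snap ls wins/vm-888-disk-2              # list the snapshots of the image
rbd snap rm wins/vm-888-disk-2@<snapname>   # delete one snapshot (can take a while on a loaded cluster)
# if the snapshot was created through Proxmox and is referenced in the VM config,
# running "qm delsnapshot <vmid> <snapname>" on the owning node keeps the config consistent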
 
As @sg90 said, are you sure you don't have a network issue?


The deletion should work while the PGs are peering.

EDIT: You can try to restart the OSDs in question. This may help for peering PGs, but the space issue will still remain.
Can you share info on how to properly delete the snapshots? The deletion via the GUI timed out.
I have restarted all the OSDs but didn't notice a big difference.
My question remains: I can't see any read or write ops. Is this normal, or should I be seeing reads and writes on Ceph while it is peering?
 
As @sg90 said, are you sure you don't have a network issue?


The deletion should work while the PGs are peering.

EDIT: You can try to restart the OSDs in question. This may help for peering PGs, but the space issue will still remain.
Just to recap on the network issues:
here are screenshots of the network settings.
I'm not sure how they were configured before, though.

Attaching all nodes' network settings.
 

Attachments

  • netScreenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
  • net2Screenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
  • net3Screenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
  • net4Screenshot_2020-01-28 pve1 - Proxmox Virtual Environment.png
All local bonded interfaces are pingable; I just did a ping test.
What type of bond is configured? And are all nodes running on the same switch? What are the MTU sizes of the bonds?

Can you please post a ceph osd df tree, the ceph.conf, and the crush map (best to get the latter from the GUI)?

My question remains: I can't see any read or write ops. Is this normal, or should I be seeing reads and writes on Ceph while it is peering?
What do the logs say? It seems that client traffic is stalled.
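
For reference, the requested information and a couple of quick checks for stalled traffic can be collected roughly like this (a sketch; 10.10.10.12 is only an example peer address on your 10.10.10.x Ceph network, and <id> is one of the implicated OSDs):

ceph osd df tree                                     # per-OSD usage and CRUSH layout
cat /etc/pve/ceph.conf                               # the Ceph config used by PVE
ceph osd getcrushmap -o /tmp/crushmap.bin            # dump the CRUSH map...
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt  # ...and decompile it to text
ping -M do -s 8972 -c 3 10.10.10.12                  # MTU check with a don't-fragment, near-jumbo-sized ping
                                                     # (use -s 1472 instead if the bonds run MTU 1500)
ceph daemon osd.<id> dump_ops_in_flight              # on the node hosting that OSD: what its requests are stuck on
journalctl -u ceph-osd@<id> --since "1 hour ago"     # recent log of that OSD daemon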
 
Any clue?
Another odd thing: doing a standard listing of the pool froze the nodes.
 
