Proxmox 7.1-8
Yesterday I executed a large delete operation on the CephFS pool (around 2 TB of data).
The operation finished successfully within a few seconds (without any noticeable errors), and then the following problem occurred:
7 out of 32 OSDs went down and out.
Trying to set them in and up did not work (setting them in worked, but they didn't come back up).
So I removed the OSDs and recreated them (roughly like the sketch below).
One of the OSD removals failed, and I could only resolve it with a full server reboot and a manual OSD removal via the GUI (DISK -> LVM -> More -> Destroy).
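Something like this per failed OSD (osd.12 and /dev/sdX are only placeholders here, not the exact commands I typed):
Code:
# try to bring the dead OSD back first (this is the part that did not work)
ceph osd in 12
systemctl restart ceph-osd@12

# give up, remove the OSD and recreate it on the same disk
pveceph osd destroy 12 --cleanup
ceph-volume lvm zap /dev/sdX --destroy
pveceph osd create /dev/sdX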
Now 7/7 are up and Ceph has finished rebalancing, but I have the following problems.
cluster:
The nodes are marked gray, and no LXC/VM are up, because all of them are stored on CephFS.
The quorum is OK.
ceph:
- 1 MDSs report slow metadata IOs
mds.pve-srv3(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 8317 secs
- Reduced data availability: 15 pgs inactive
pg 2.b is stuck inactive for 2h, current state unknown, last acting []
pg 2.d is stuck inactive for 2h, current state unknown, last acting []
pg 2.2b is stuck inactive for 2h, current state unknown, last acting []
pg 2.2d is stuck inactive for 2h, current state unknown, last acting []
pg 2.4b is stuck inactive for 2h, current state unknown, last acting []
pg 2.4d is stuck inactive for 2h, current state unknown, last acting []
pg 2.6b is stuck inactive for 2h, current state unknown, last acting []
pg 2.6d is stuck inactive for 2h, current state unknown, last acting []
pg 8.8d is stuck inactive for 2h, current state unknown, last acting []
pg 8.185 is stuck inactive for 2h, current state unknown, last acting []
pg 8.1e3 is stuck inactive for 2h, current state unknown, last acting []
pg 14.15 is stuck inactive for 2h, current state unknown, last acting []
pg 14.35 is stuck inactive for 2h, current state unknown, last acting []
pg 14.55 is stuck inactive for 2h, current state unknown, last acting []
pg 14.75 is stuck inactive for 2h, current state unknown, last acting []
- 296 slow ops, oldest one blocked for 7841 sec, daemons [osd.14,osd.23,osd.27,osd.30,osd.31,osd.7] have slow ops.
Some commands I ran in the shell and their output:
Code:
ceph pg dump_stuck
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
8.185 unknown [] -1 [] -1
2.6d unknown [] -1 [] -1
14.35 unknown [] -1 [] -1
8.8d unknown [] -1 [] -1
14.75 unknown [] -1 [] -1
2.2d unknown [] -1 [] -1
2.6b unknown [] -1 [] -1
2.4d unknown [] -1 [] -1
8.1e3 unknown [] -1 [] -1
14.15 unknown [] -1 [] -1
14.55 unknown [] -1 [] -1
2.4b unknown [] -1 [] -1
2.d unknown [] -1 [] -1
2.2b unknown [] -1 [] -1
2.b unknown [] -1 [] -1
ok
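A quick way to check all 15 at once (PG ids copied from the health warning above) would be a loop like this; below is just the single example for 2.b:
Code:
for pg in 2.b 2.d 2.2b 2.2d 2.4b 2.4d 2.6b 2.6d 8.8d 8.185 8.1e3 14.15 14.35 14.55 14.75; do
    ceph pg map $pg
done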
Code:
ceph pg 2.b mark_unfound_lost revert
Error ENOENT: i don't have pgid 2.b
ceph pg 2.b mark_unfound_lost delete
Error ENOENT: i don't have pgid 2.b
ceph pg map 2.b
osdmap e136516 pg 2.b (2.b) -> up [27,7,31] acting [27,7,31]
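Note that osd.27, osd.7 and osd.31 from that acting set are all among the daemons reporting slow ops. For reference, further inspection could look something like this (the admin-socket command has to run on the node that hosts osd.27):
Code:
# ask the cluster what it knows about the PG's state
ceph pg 2.b query

# on the node running osd.27: list the ops currently stuck on that daemon
ceph daemon osd.27 dump_ops_in_flight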
Code:
ceph -s
  cluster:
    id:     8ebca482-f985-4e74-9ff8-35e03a1af15e
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 15 pgs inactive
            347 slow ops, oldest one blocked for 8271 sec, daemons [osd.14,osd.23,osd.27,osd.30,osd.31,osd.7] have slow ops.

  services:
    mon: 3 daemons, quorum pve-srv2,pve-srv3,pve-srv4 (age 42m)
    mgr: pve-srv3(active, since 44m), standbys: pve-srv4, pve-srv2
    mds: 2/2 daemons up, 1 standby
    osd: 33 osds: 32 up (since 18m), 32 in (since 37m); 1 remapped pgs

  data:
    volumes: 2/2 healthy
    pools:   6 pools, 1393 pgs
    objects: 15.64M objects, 39 TiB
    usage:   116 TiB used, 65 TiB / 182 TiB avail
    pgs:     1.077% pgs unknown
             1378 active+clean
             15 unknown

  io:
    client: 8.0 KiB/s wr, 0 op/s rd, 1 op/s wr
Any ideas what I can do?