Hi,
here I describe one of the two major issues I'm currently facing in my 8-node ceph cluster (2x MDS, 6x OSD).
The issue is that I cannot start any virtual machine (KVM) or container (LXC); the boot process just hangs after a few seconds.
All of these KVMs and LXCs have in common that their virtual disks reside in the same pool: hdd.
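Just to illustrate how I verified the pool membership: the guests' disks are RBD images in pool hdd, which can be checked with e.g. (the image name below is only a placeholder, not one of my actual disks):

# list the RBD images in pool hdd and inspect one of them
rbd -p hdd ls
rbd info hdd/vm-100-disk-0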
This pool hdd is relatively small compared to the largest pool, hdb_backup:
root@ld3955:~# rados df
POOL_NAME            USED     OBJECTS   CLONES  COPIES     MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS     RD        WR_OPS     WR       USED COMPR  UNDER COMPR
backup                   0 B         0       0          0                   0        0         0          0       0 B          0      0 B         0 B          0 B
hdb_backup           589 TiB  51262212       0  153786636                   0        0    124895   12266095   4.3 TiB  247132863  463 TiB         0 B          0 B
hdd                  3.2 TiB    281884    6568     845652                   0        0      1658  275277357    16 TiB  208213922   10 TiB         0 B          0 B
pve_cephfs_data      955 GiB     91832       0     275496                   0        0      3038       2103  1021 MiB     102170  318 GiB         0 B          0 B
pve_cephfs_metadata  486 MiB        62       0        186                   0        0         7        860   1.4 GiB      12393  166 MiB         0 B          0 B

total_objects    51635990
total_used       597 TiB
total_avail      522 TiB
total_space      1.1 PiB
This is the current health status of the ceph cluster:
root@ld3955:~# ceph -s
  cluster:
    id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            78 nearfull osd(s)
            1 pool(s) nearfull
            Reduced data availability: 2 pgs inactive, 2 pgs peering
            Degraded data redundancy: 304136/153251136 objects degraded (0.198%), 57 pgs degraded, 57 pgs undersized
            Degraded data redundancy (low space): 265 pgs backfill_toofull
            3 pools have too many placement groups
            75 slow requests are blocked > 32 sec
            78 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 94m)
    mgr: ld5505(active, since 3d), standbys: ld5506, ld5507
    mds: pve_cephfs:1 {0=ld3976=up:active} 1 up:standby
    osd: 368 osds: 368 up, 367 in; 303 remapped pgs

  data:
    pools:   5 pools, 8868 pgs
    objects: 51.08M objects, 195 TiB
    usage:   590 TiB used, 563 TiB / 1.1 PiB avail
    pgs:     0.023% pgs not active
             304136/153251136 objects degraded (0.198%)
             1673548/153251136 objects misplaced (1.092%)
             8563 active+clean
             195  active+remapped+backfill_toofull
             57   active+undersized+degraded+remapped+backfill_toofull
             36   active+remapped+backfill_wait
             13   active+remapped+backfill_wait+backfill_toofull
             2    active+remapped+backfilling
             2    peering

  io:
    client:   264 KiB/s wr, 0 op/s rd, 0 op/s wr
    recovery: 18 MiB/s, 4 objects/s
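For completeness: the per-OSD utilization behind the nearfull / backfill_toofull warnings can be inspected with, for example (output omitted here):

# utilization per OSD, grouped by the CRUSH tree
ceph osd df tree
# raw and per-pool usage summary
ceph df detail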
I believe the cluster is busy rebalancing pool hdb_backup.
I set the balancer mode to upmap recently, after the 589 TiB of data had been written.
root@ld3955:~# ceph balancer status
{
"active": true,
"plans": [],
"mode": "upmap"
}
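One option I have been considering (but not yet tried) is to pause the balancer and throttle the backfill while the cluster recovers, roughly like this:

# stop the balancer from generating new plans for now
ceph balancer off
# limit concurrent backfill/recovery per OSD (Nautilus-style "ceph config";
# older releases would need "ceph tell osd.* injectargs ..." instead)
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1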
In order to resolve the issue with pool hdd I started some investigation.
The first step was to install the NIC drivers provided by Mellanox.
Then I configured some kernel parameters recommended by Mellanox
(<https://community.mellanox.com/s/article/linux-sysctl-tuning>).
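For reference, the parameters in question are TCP/network buffer settings along these lines (illustrative values only; the linked article contains the exact recommendations):

# example sysctl network tuning of the kind described in the Mellanox article
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 65536 4194304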
However, this didn't fix the issue.
In my opinion I must get rid of all the "slow requests are blocked" warnings.
When I check the output of ceph health detail, every OSD listed under REQUEST_SLOW belongs to pool hdd.
This means none of the disks belonging to pool hdb_backup shows comparable behavior.
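This is roughly how I checked which pool a slow OSD belongs to, with osd.8 as one example:

# which OSDs are currently reporting slow requests
ceph health detail | grep -i slow
# locate osd.8 (host / CRUSH position)
ceph osd find 8
# which PGs, and therefore which pools, osd.8 is serving
# (the PG ids start with the pool id; compare with "ceph osd pool ls detail")
ceph pg ls-by-osd 8 | head
ceph osd pool ls detail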
Then I checked the running processes on the different OSD nodes, using the tool "glances".
There I can see individual ceph-osd processes that have been running for hours and are consuming a lot of CPU, e.g.:
63.9 0.5 4.95G 3.81G 14894 ceph 6h14:22 58 0 S 16M 293K /usr/bin/ceph-osd -f --cluster ceph --id 8 --setuser ceph --setgroup ceph
Similar processes are running on four OSD nodes.
All of these processes have in common that the relevant OSD belongs to pool hdd.
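To see what such an OSD is actually busy with, I suppose I could dump its ops via the admin socket on the respective node (again osd.8 as an example):

# run on the node that hosts osd.8
ceph daemon osd.8 dump_ops_in_flight
ceph daemon osd.8 dump_historic_ops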
What can / should I do now?
Kill the long-running processes?
Stop the relevant OSDs?
Please advise.
THX
Thomas