I have been searching far and wide for this, but I can't seem to find a solution.
In one of our testing clusters we're experimenting with Ceph, but so far the journey has been bumpy at best.
The actual trigger for this post is that one of our VMs locked up and I can't get it to reboot: qm just times out, and I can't find any useful logging from either the cluster or Ceph.
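For context, this is roughly the sort of thing I've been running against the stuck VM, with no luck so far (VM ID 100 below is just a placeholder for the real one):
Code:
# Check the VM's state and config (100 is a placeholder VM ID)
qm status 100
qm config 100

# Attempt a clean shutdown with a longer timeout, then a hard stop
qm shutdown 100 --timeout 120
qm stop 100

# Look for anything relevant in the node's journal
journalctl -b -u pvedaemon -u pveproxy --since "1 hour ago"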
To outline our setup:
We're running 3 nodes, each with a Ryzen 7950X and 128 GB of RAM.
All nodes have both a 1 Gbit/s and a 10 Gbit/s NIC; the latter is used for Ceph and migration traffic. The 10 Gbit network has an MTU of 9000.
All storage is NVMe (PCIe 4.0). Each node contains 3x 2 TB NVMe drives (WD).
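In case the network is a suspect: this is how I'd check that jumbo frames actually work end to end on the 10 Gbit network (the interface name and peer IP below are placeholders for our actual ones):
Code:
# Confirm the MTU on the 10Gbit interface (enp1s0 is a placeholder)
ip link show enp1s0

# Ping a peer node over the 10Gbit network with a full-size, non-fragmented payload
# (8972 = 9000 minus 20 bytes IP header and 8 bytes ICMP header)
ping -M do -s 8972 -c 4 10.0.0.2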
ceph health detail
Code:
HEALTH_WARN Reduced data availability: 1 pg inactive; 256 slow ops, oldest one blocked for 607418 sec, osd.8 has slow ops
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    pg 1.0 is stuck inactive for 7d, current state unknown, last acting []
[WRN] SLOW_OPS: 256 slow ops, oldest one blocked for 607418 sec, osd.8 has slow ops
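From what I've read, the next step for the slow ops would be to ask osd.8 directly what it's stuck on via its admin socket (this has to be run on the node that hosts osd.8; I'm not sure these are the right commands):
Code:
# On the node hosting osd.8: list the ops currently in flight
ceph daemon osd.8 dump_ops_in_flight

# And the recorded history of slow ops
ceph daemon osd.8 dump_historic_slow_ops

# OSD-side log, in case something stands out
journalctl -u ceph-osd@8 --since "2 days ago"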
ceph -s
Code:
root@pve001:~# ceph -s
  cluster:
    id:     45c6e495-0fd6-48fe-8df9-90018537a237
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive
            256 slow ops, oldest one blocked for 607439 sec, osd.8 has slow ops

  services:
    mon: 3 daemons, quorum pve001,pve002,pve003 (age 7d)
    mgr: pve002(active, since 7d), standbys: pve001, pve003
    osd: 9 osds: 9 up (since 7d), 9 in (since 5w)

  data:
    pools:   2 pools, 129 pgs
    objects: 523.53k objects, 2.0 TiB
    usage:   4.0 TiB used, 12 TiB / 16 TiB avail
    pgs:     0.775% pgs unknown
             128 active+clean
             1 unknown
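For the PG that's stuck in unknown, this is what I was planning to run next to see where it is supposed to live (pg 1.0 is the ID from the health output; pool 1 should be the .mgr pool judging by ceph df below):
Code:
# Which OSDs should pg 1.0 map to?
ceph pg map 1.0

# Dump all stuck/inactive PGs
ceph pg dump_stuck inactive

# Full query of the PG (this may hang if the PG really is unknown)
ceph pg 1.0 query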
ceph df
Code:
root@pve001:~# ceph df
--- RAW STORAGE ---
CLASS   SIZE    AVAIL   USED     RAW USED  %RAW USED
nvme    16 TiB  12 TiB  4.0 TiB  4.0 TiB   24.21
TOTAL   16 TiB  12 TiB  4.0 TiB  4.0 TiB   24.21

--- POOLS ---
POOL   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr   1   1    0 B      0        0 B      0      5.3 TiB
ceph   4   128  2.0 TiB  523.53k  3.9 TiB  26.96  5.3 TiB
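Since I'm not sure the pools are configured the way I think they are, I was also going to double-check the replication and autoscaler settings (pool names taken from the output above):
Code:
# Replication settings for both pools
ceph osd pool ls detail

# Size / min_size of the VM pool specifically
ceph osd pool get ceph size
ceph osd pool get ceph min_size

# PG autoscaler status
ceph osd pool autoscale-status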
ceph osd df
Code:
root@pve001:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 6  nvme   1.81940  1.00000   1.8 TiB  397 GiB  395 GiB  21 KiB   2.2 GiB  1.4 TiB  21.32  0.88  25   up
 7  nvme   1.81940  1.00000   1.8 TiB  522 GiB  521 GiB  21 KiB   1.7 GiB  1.3 TiB  28.04  1.16  33   up
 8  nvme   1.81940  1.00000   1.8 TiB  430 GiB  428 GiB  28 KiB   1.6 GiB  1.4 TiB  23.07  0.95  27   up
 3  nvme   1.81940  1.00000   1.8 TiB  475 GiB  473 GiB  22 KiB   1.4 GiB  1.4 TiB  25.48  1.05  30   up
 4  nvme   1.81940  1.00000   1.8 TiB  444 GiB  442 GiB  27 KiB   1.7 GiB  1.4 TiB  23.84  0.98  28   up
 5  nvme   1.81940  1.00000   1.8 TiB  461 GiB  459 GiB  22 KiB   1.7 GiB  1.4 TiB  24.72  1.02  29   up
 0  nvme   1.81940  1.00000   1.8 TiB  413 GiB  412 GiB  18 KiB   1.5 GiB  1.4 TiB  22.19  0.92  26   up
 1  nvme   1.81940  1.00000   1.8 TiB  553 GiB  550 GiB  27 KiB   2.1 GiB  1.3 TiB  29.66  1.23  35   up
 2  nvme   1.81940  1.00000   1.8 TiB  365 GiB  363 GiB  23 KiB   1.7 GiB  1.5 TiB  19.58  0.81  23   up
           TOTAL              16 TiB   4.0 TiB  3.9 TiB  214 KiB  16 GiB   12 TiB   24.21
MIN/MAX VAR: 0.81/1.23  STDDEV: 3.01
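Usage looks evenly balanced to me, so I was also going to look at per-OSD latency to see whether osd.8 stands out at all (again, not sure this is the most useful view; the device path is a placeholder):
Code:
# Per-OSD commit/apply latency as reported by the cluster
ceph osd perf

# Health/wear of the NVMe behind osd.8 (device path is a placeholder)
smartctl -a /dev/nvme0n1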
I'm fairly new to Ceph and, to be honest, I'm not entirely sure how to debug this. I was hoping someone could guide me along.
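One thing I've considered but haven't dared to try yet is simply restarting osd.8, roughly like this. Is that safe with a 3-node cluster and an already-inactive PG, or could it make things worse?
Code:
# Keep the OSD from being marked out (and data from rebalancing) while it bounces
ceph osd set noout

# On the node hosting osd.8
systemctl restart ceph-osd@8.service

# Re-enable normal behaviour and watch the cluster settle
ceph osd unset noout
ceph -s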