Server crash, all VMs shut off, Ceph messages

300cpilot · Feb 2, 2023

The only error I found relates to the scrubs running. The same message keeps repeating for one node. Thanks for any thoughts.

Error:
2023-02-01T18:09:19.419960-0700 mgr.Node-C (mgr.34310) 1363602 : cluster [DBG] pgmap v1363335: 513 pgs: 513 active+clean; 2.6 TiB data, 5.1 TiB used, 8.4 TiB / 14 TiB avail; 17 MiB/s rd, 19 MiB/s wr, 203 op/s

Nodes A and B report overall health OK. Node C is logging entries similar to the one above thousands of times.

--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    11 TiB   6.8 TiB  4.1 TiB  4.1 TiB   37.47
ssd    2.6 TiB  1.6 TiB  1.0 TiB  1.0 TiB   38.94
TOTAL  14 TiB   8.4 TiB  5.1 TiB  5.1 TiB   37.76

--- POOLS ---
POOL  ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr  1   1    70 MiB   19       209 MiB  0      2.5 TiB
Main  2   512  2.5 TiB  677.10k  5.1 TiB  40.57  3.7 TiB

ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
20  hdd    0.90970  1.00000   932 GiB  348 GiB  347 GiB  230 KiB  903 MiB  583 GiB  37.39  0.99  69   up
21  hdd    0.90970  1.00000   932 GiB  346 GiB  345 GiB  90 KiB   870 MiB  586 GiB  37.14  0.98  68   up
22  hdd    0.90970  1.00000   932 GiB  353 GiB  352 GiB  222 KiB  859 MiB  579 GiB  37.89  1.00  69   up
23  hdd    0.90970  1.00000   932 GiB  327 GiB  326 GiB  226 KiB  868 MiB  605 GiB  35.08  0.93  64   up
18  ssd    0.43660  1.00000   447 GiB  163 GiB  163 GiB  123 KiB  375 MiB  284 GiB  36.45  0.97  32   up
19  ssd    0.43660  1.00000   447 GiB  178 GiB  177 GiB  87 KiB   444 MiB  269 GiB  39.74  1.05  35   up
14  hdd    0.90970  1.00000   932 GiB  342 GiB  341 GiB  224 KiB  1.6 GiB  589 GiB  36.76  0.97  67   up
15  hdd    0.90970  1.00000   932 GiB  342 GiB  341 GiB  224 KiB  962 MiB  590 GiB  36.69  0.97  67   up
16  hdd    0.90970  1.00000   932 GiB  342 GiB  341 GiB  199 KiB  1.2 GiB  590 GiB  36.70  0.97  67   up
17  hdd    0.90970  1.00000   932 GiB  360 GiB  359 GiB  92 KiB   1.4 GiB  572 GiB  38.64  1.02  70   up
12  ssd    0.43660  1.00000   447 GiB  178 GiB  177 GiB  132 KiB  1.2 GiB  269 GiB  39.84  1.06  35   up
13  ssd    0.43660  1.00000   447 GiB  174 GiB  173 GiB  66 KiB   1.3 GiB  273 GiB  38.96  1.03  35   up
8   hdd    0.90970  1.00000   932 GiB  354 GiB  353 GiB  164 KiB  1.1 GiB  578 GiB  37.96  1.01  69   up
9   hdd    0.90970  1.00000   932 GiB  358 GiB  356 GiB  106 KiB  1.7 GiB  574 GiB  38.40  1.02  70   up
10  hdd    0.90970  1.00000   932 GiB  359 GiB  358 GiB  104 KiB  1.4 GiB  572 GiB  38.57  1.02  70   up
11  hdd    0.90970  1.00000   932 GiB  358 GiB  357 GiB  101 KiB  1.1 GiB  574 GiB  38.42  1.02  71   up
6   ssd    0.43660  1.00000   447 GiB  178 GiB  177 GiB  124 KiB  1.3 GiB  269 GiB  39.91  1.06  35   up
7   ssd    0.43660  1.00000   447 GiB  173 GiB  172 GiB  120 KiB  1.3 GiB  274 GiB  38.76  1.03  34   up
TOTAL                         14 TiB   5.1 TiB  5.1 TiB  2.6 MiB  20 GiB   8.4 TiB  37.76
MIN/MAX VAR: 0.93/1.06 STDDEV: 1.30

  cluster:
    id:     1d2525d6-e7ac-4e17-8053-70a9dc21043d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum Node-A,Node-B,Node-C (age 50m)
    mgr: Node-C(active, since 4w), standbys: Node-B, Node-A
    osd: 18 osds: 18 up (since 55m), 18 in (since 4w)

  data:
    pools:   2 pools, 513 pgs
    objects: 677.12k objects, 2.6 TiB
    usage:   5.1 TiB used, 8.4 TiB / 14 TiB avail
    pgs:     512 active+clean
             1   active+clean+scrubbing+deep

  io:
    client:   24 KiB/s rd, 65 KiB/s wr, 5 op/s rd, 10 op/s wr

HEALTH_OK
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
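
The three ratios above are the thresholds at which Ceph flags an OSD as nearfull, backfillfull, or full. Below is a minimal sketch of comparing the per-OSD %USE values from this post against them. The function name `classify` is hypothetical, and the sample values are copied from the OSD table above; the point is only to illustrate how much headroom remains.

```python
# Thresholds as printed above (fractions of an OSD's capacity).
NEARFULL_RATIO = 0.85
BACKFILLFULL_RATIO = 0.90
FULL_RATIO = 0.95


def classify(pct_use):
    """Map a %USE value (e.g. 39.91) to the state the ratios imply."""
    frac = pct_use / 100.0
    if frac >= FULL_RATIO:
        return "full"
    if frac >= BACKFILLFULL_RATIO:
        return "backfillfull"
    if frac >= NEARFULL_RATIO:
        return "nearfull"
    return "ok"


# %USE of a few OSDs from the table in this post, including the
# highest (osd 6) and the lowest (osd 23).
osd_pct_use = {6: 39.91, 12: 39.84, 19: 39.74, 23: 35.08}

for osd_id, pct in sorted(osd_pct_use.items()):
    print(f"osd.{osd_id}: {pct:.2f}% -> {classify(pct)}")
```

With the fullest OSD at 39.91%, every OSD is well below the 85% nearfull threshold, so capacity does not look like the cause of the crash.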

gurubert · Feb 2, 2023

What you see is not an error but a debug message. If these messages show up too often (more than once a minute), try restarting mgr.Node-C.
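
To check how often the pgmap lines really appear, the cluster log can be parsed and the DBG entries counted per minute. This is a minimal sketch, assuming the log lines follow the format quoted in the first post; the name `pgmap_rate` is hypothetical, and the sample is the single line from that post.

```python
import re
from collections import Counter

# Format of the line quoted in the first post:
# <timestamp> <daemon> (<id>) <seq> : cluster [<LEVEL>] <message>
LINE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+[+-]\d{4} "
    r"(?P<daemon>\S+) \(\S+\) \d+ : cluster \[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)


def pgmap_rate(lines):
    """Count pgmap debug messages per (minute, daemon)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m["level"] == "DBG" and m["msg"].startswith("pgmap "):
            counts[(m["minute"], m["daemon"])] += 1
    return counts


sample = [
    "2023-02-01T18:09:19.419960-0700 mgr.Node-C (mgr.34310) 1363602 : "
    "cluster [DBG] pgmap v1363335: 513 pgs: 513 active+clean; 2.6 TiB data, "
    "5.1 TiB used, 8.4 TiB / 14 TiB avail; 17 MiB/s rd, 19 MiB/s wr, 203 op/s",
]

for (minute, daemon), n in pgmap_rate(sample).items():
    print(f"{minute} {daemon}: {n} pgmap message(s)")
```

The level in square brackets is what distinguishes a debug line from a warning or error. The quoted line is DBG and reports all 513 PGs as active+clean.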

300cpilot · Feb 4, 2023

I restarted Node C, and then the message named Node B, so I restarted Node B. Then it named Node A, and after restarting A it names Node C again.
Originally there were no VMs or CTs on Node C, but now I have them spread out across all the nodes. Just restarting the manager under Ceph, Monitor did the same thing.

300cpilot · Feb 7, 2023

I thought I had this fixed, but now nothing shows in Ceph. I can reboot some VMs and CTs, but not all of them.

Some help would be appreciated, thank you gurubert.

300cpilot · Feb 7, 2023

ceph health detail never returns any results. Ceph simply does not respond.

300cpilot · Feb 7, 2023

Also, some CTs are displaying a lock. When I try to unlock them, it states that the CT doesn't exist.

300cpilot · Feb 7, 2023

For those wondering, I was able to edit the config files and remove the locks and the deleted snapshot. I will have to figure out how to get rid of orphaned snapshots later. For now, after rebooting all nodes, everything is back up.
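
For anyone attempting the same manual edit, here is a minimal sketch of removing a stale lock from a guest config. It assumes the lock is stored as a `lock:` line in the main section of the config and that snapshot sections start with a `[name]` header. The function name `strip_lock` and the sample config are hypothetical. It only works on text, so back up the real file before changing anything.

```python
def strip_lock(config_text):
    """Return the config without `lock:` lines in the main section.

    Lines inside snapshot sections (after the first `[name]` header)
    are left untouched.
    """
    kept = []
    in_snapshot = False
    for line in config_text.splitlines():
        if line.startswith("["):
            in_snapshot = True
        if not in_snapshot and line.startswith("lock:"):
            continue  # drop the stale lock entry
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical container config with a leftover lock.
sample_config = (
    "arch: amd64\n"
    "lock: snapshot-delete\n"
    "hostname: ct101\n"
    "\n"
    "[snap1]\n"
    "arch: amd64\n"
)

print(strip_lock(sample_config), end="")
```

Removing the deleted snapshot's section and any reference to it would be a separate manual step, which this sketch does not attempt.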
