[SOLVED] Ceph data not available

Ax2020
Member
Jan 8, 2021
Hello,

We had a strange issue. We got a call asking for assistance on a Proxmox cluster: apparently the system restarted today (one node was already down due to "problems", and the other two nodes seem to have gone down around the same time).
When they restarted the system they tried to start the VMs again, but after a few seconds the console timed out; even though the VM was started, nothing seemed to work, and trying to navigate inside the Ceph storage resulted in another timeout.

Here is the result of ceph -s:
Bash:
root@SV3:~# ceph -s
  cluster:
    id:     89fd82e2-031d-4309-bbf9-454dcc2a4021
    health: HEALTH_WARN
            Reduced data availability: 345 pgs inactive
            Degraded data redundancy: 5956939/13902540 objects degraded (42.848%), 1003 pgs degraded, 1003 pgs undersized
            1023 pgs not deep-scrubbed in time
            1023 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum SV1,SV2,SV3 (age 90m)
    mgr: SV2(active, since 90m), standbys: SV3
    osd: 18 osds: 18 up (since 88m), 18 in (since 115m); 1003 remapped pgs

  data:
    pools:   1 pools, 1024 pgs
    objects: 4.63M objects, 18 TiB
    usage:   47 TiB used, 51 TiB / 98 TiB avail
    pgs:     33.691% pgs not active
             5956939/13902540 objects degraded (42.848%)
             656 active+undersized+degraded+remapped+backfill_wait
             344 undersized+degraded+remapped+backfill_wait+peered
             21  active+clean
             2   active+undersized+degraded+remapped+backfilling
             1   undersized+degraded+remapped+backfilling+peered

  io:
    recovery: 43 MiB/s, 10 objects/s
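With 345 PGs inactive, a useful first step is to see exactly which PGs are stuck and what their acting OSD sets look like. A minimal sketch using the standard ceph CLI (the PG id in the last command is only a placeholder):

Bash:
# Show health details, including which PGs are inactive and why
ceph health detail

# List PGs stuck in the inactive state, with their acting OSD sets
ceph pg dump_stuck inactive

# Query one specific stuck PG (replace 1.2f with a real PG id from the list above)
ceph pg 1.2f query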

Bash:
root@SV3:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       98.24387 root default
-3       32.74796     host SV1
 0   hdd  5.45799         osd.0      up  1.00000 1.00000
 1   hdd  5.45799         osd.1      up  1.00000 1.00000
 2   hdd  5.45799         osd.2      up  1.00000 1.00000
 3   hdd  5.45799         osd.3      up  1.00000 1.00000
 4   hdd  5.45799         osd.4      up  1.00000 1.00000
15   hdd  5.45799         osd.15     up  1.00000 1.00000
-5       32.74796     host SV2
 5   hdd  5.45799         osd.5      up  1.00000 1.00000
 6   hdd  5.45799         osd.6      up  1.00000 1.00000
 7   hdd  5.45799         osd.7      up  1.00000 1.00000
 8   hdd  5.45799         osd.8      up  1.00000 1.00000
 9   hdd  5.45799         osd.9      up  1.00000 1.00000
16   hdd  5.45799         osd.16     up  1.00000 1.00000
-7       32.74796     host SV3
10   hdd  5.45799         osd.10     up  1.00000 1.00000
11   hdd  5.45799         osd.11     up  1.00000 1.00000
12   hdd  5.45799         osd.12     up  1.00000 1.00000
13   hdd  5.45799         osd.13     up  1.00000 1.00000
14   hdd  5.45799         osd.14     up  1.00000 1.00000
17   hdd  5.45799         osd.17     up  1.00000 1.00000
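All 18 OSDs are up and in, yet 1003 PGs are undersized, which usually points at the pool's replication settings or CRUSH rule rather than at dead disks. A quick way to check, using standard ceph commands ("Storage" is the pool name from the rbd list below):

Bash:
# Replication settings: size = desired replicas, min_size = minimum replicas to serve I/O
ceph osd pool get Storage size
ceph osd pool get Storage min_size

# Inspect the CRUSH rule the pool uses to place its replicas
ceph osd pool get Storage crush_rule
ceph osd crush rule dump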

Not the best situation, but they absolutely need to access the files inside the storage:

Bash:
root@SV3:~# rbd list Storage
vm-101-disk-0
vm-101-disk-1
vm-101-disk-2
vm-102-disk-0
vm-102-disk-1
vm-103-disk-0
vm-103-disk-1
vm-103-disk-2

But it is not possible to do a backup or a move. Any ideas?

Thank you
 
Ax2020 said:
We had a strange issue. We got a call asking for assistance on a Proxmox cluster: apparently the system restarted today (one node was already down due to "problems", and the other two nodes seem to have gone down around the same time).
Disable HA, remove the VM/CT services, and restart the pve-ha-lrm service to disarm the watchdog; otherwise the load on the cluster might cause further resets. And, as a first guess: does corosync share the network with Ceph?
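A minimal sketch of that sequence (vm:101 is only an example service ID; check ha-manager status for the real ones):

Bash:
# See which VMs/CTs are currently managed by HA
ha-manager status

# Remove a service from HA management (example ID, repeat per VM/CT)
ha-manager remove vm:101

# Restart the local resource manager so the watchdog is disarmed
systemctl restart pve-ha-lrm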

And also see @Christian St.'s advice. :)
 
Hi,

yes, corosync and Ceph share the network... Thank you for all your suggestions. I was able to retrieve the files they needed: I removed the VM services and waited.
After that I shut down the cluster, deleted everything, reinstalled Proxmox, and separated the corosync and Ceph networks...
An interesting weekend, but now Ceph is working fine and the cluster is stable. The issues reported on node 1 were probably caused by network congestion; corosync was not happy about it.
Sorry I wasn't able to update the ticket in time, but I've only just finished.
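For anyone landing here later: a minimal sketch of what a dedicated corosync link can look like in /etc/pve/corosync.conf (the 10.10.10.x addresses are assumptions for illustration; Ceph stays on its own separate subnet):

Code:
# /etc/pve/corosync.conf (excerpt) - ring0_addr now points at a
# dedicated corosync subnet, separate from the Ceph public/cluster network
nodelist {
  node {
    name: SV1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: SV2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
  node {
    name: SV3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }
}

When editing corosync.conf on a live cluster, remember to increase the config_version field in the totem section before saving, so the change propagates.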
 
