[SOLVED] Ceph data not available

Ax2020
Member
Jan 8, 2021
Hello,

We had a strange issue. We got a call asking for assistance on a Proxmox cluster: apparently the system restarted today (one node was already down due to "problems", and the other two nodes seem to have gone down around the same time).
When they restarted the system they tried to start the VMs again, but after a few seconds the console timed out; even though the VM was started, nothing seemed to work, and trying to navigate inside the Ceph storage resulted in another timeout.

Here is the result of ceph -s:
Bash:
root@SV3:~# ceph -s
  cluster:
    id:     89fd82e2-031d-4309-bbf9-454dcc2a4021
    health: HEALTH_WARN
            Reduced data availability: 345 pgs inactive
            Degraded data redundancy: 5956939/13902540 objects degraded (42.848%), 1003 pgs degraded, 1003 pgs undersized
            1023 pgs not deep-scrubbed in time
            1023 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum SV1,SV2,SV3 (age 90m)
    mgr: SV2(active, since 90m), standbys: SV3
    osd: 18 osds: 18 up (since 88m), 18 in (since 115m); 1003 remapped pgs

  data:
    pools:   1 pools, 1024 pgs
    objects: 4.63M objects, 18 TiB
    usage:   47 TiB used, 51 TiB / 98 TiB avail
    pgs:     33.691% pgs not active
             5956939/13902540 objects degraded (42.848%)
             656 active+undersized+degraded+remapped+backfill_wait
             344 undersized+degraded+remapped+backfill_wait+peered
             21  active+clean
             2   active+undersized+degraded+remapped+backfilling
             1   undersized+degraded+remapped+backfilling+peered

  io:
    recovery: 43 MiB/s, 10 objects/s
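With 345 PGs inactive, a useful first step is to see exactly which PGs are stuck and what their acting OSD sets look like. A minimal sketch using the standard ceph CLI (the PG id in the last command is only a placeholder):

Bash:
# Show health details, including which PGs are inactive and why
ceph health detail

# List PGs stuck in the inactive state, with their acting OSD sets
ceph pg dump_stuck inactive

# Query one specific stuck PG (replace 1.2f with a real PG id from the list above)
ceph pg 1.2f query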

Bash:
root@SV3:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       98.24387 root default
-3       32.74796     host SV1
 0   hdd  5.45799         osd.0      up  1.00000 1.00000
 1   hdd  5.45799         osd.1      up  1.00000 1.00000
 2   hdd  5.45799         osd.2      up  1.00000 1.00000
 3   hdd  5.45799         osd.3      up  1.00000 1.00000
 4   hdd  5.45799         osd.4      up  1.00000 1.00000
15   hdd  5.45799         osd.15     up  1.00000 1.00000
-5       32.74796     host SV2
 5   hdd  5.45799         osd.5      up  1.00000 1.00000
 6   hdd  5.45799         osd.6      up  1.00000 1.00000
 7   hdd  5.45799         osd.7      up  1.00000 1.00000
 8   hdd  5.45799         osd.8      up  1.00000 1.00000
 9   hdd  5.45799         osd.9      up  1.00000 1.00000
16   hdd  5.45799         osd.16     up  1.00000 1.00000
-7       32.74796     host SV3
10   hdd  5.45799         osd.10     up  1.00000 1.00000
11   hdd  5.45799         osd.11     up  1.00000 1.00000
12   hdd  5.45799         osd.12     up  1.00000 1.00000
13   hdd  5.45799         osd.13     up  1.00000 1.00000
14   hdd  5.45799         osd.14     up  1.00000 1.00000
17   hdd  5.45799         osd.17     up  1.00000 1.00000
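All 18 OSDs are up and in, yet 1003 PGs are undersized, which usually points at the pool's replication settings or CRUSH rule rather than at dead disks. A quick way to check, using standard ceph commands ("Storage" is the pool name from the rbd list below):

Bash:
# Replication settings: size = desired replicas, min_size = minimum replicas to serve I/O
ceph osd pool get Storage size
ceph osd pool get Storage min_size

# Inspect the CRUSH rule the pool uses to place its replicas
ceph osd pool get Storage crush_rule
ceph osd crush rule dump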

Not the best situation, but they absolutely need to access the files inside the storage:

Bash:
root@SV3:~# rbd list Storage
vm-101-disk-0
vm-101-disk-1
vm-101-disk-2
vm-102-disk-0
vm-102-disk-1
vm-103-disk-0
vm-103-disk-1
vm-103-disk-2

But it is not possible to do a backup or a move. Any ideas?

Thank you
 
Ax2020 said:
We had a strange issue. We got a call asking for assistance on a Proxmox cluster: apparently the system restarted today (one node was already down due to "problems", and the other two nodes seem to have gone down around the same time).
Disable HA, remove the VM/CT services, and restart the pve-ha-lrm service to disarm the watchdog; otherwise the load on the cluster might cause further resets. And, as a first guess: does corosync share the network with Ceph?
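A minimal sketch of that sequence (vm:101 is only an example service ID; check ha-manager status for the real ones):

Bash:
# See which VMs/CTs are currently managed by HA
ha-manager status

# Remove a service from HA management (example ID, repeat per VM/CT)
ha-manager remove vm:101

# Restart the local resource manager so the watchdog is disarmed
systemctl restart pve-ha-lrm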

And also see @Christian St.'s advice. :)
 
Hi,

yes, corosync and Ceph share the network... Thank you for all your suggestions. I was able to retrieve the files they needed: I removed the VM services and waited.
After that I shut down the cluster, deleted everything, reinstalled Proxmox, and separated the corosync and Ceph networks...
An interesting weekend, but now Ceph is working fine and the cluster is stable. The issues reported on node 1 were probably caused by network congestion; corosync was not happy about it.
Sorry I wasn't able to update the ticket in time, but I've only just finished.
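For anyone landing here later: a minimal sketch of what a dedicated corosync link can look like in /etc/pve/corosync.conf (the 10.10.10.x addresses are assumptions for illustration; Ceph stays on its own separate subnet):

Code:
# /etc/pve/corosync.conf (excerpt) - ring0_addr now points at a
# dedicated corosync subnet, separate from the Ceph public/cluster network
nodelist {
  node {
    name: SV1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: SV2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
  node {
    name: SV3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }
}

When editing corosync.conf on a live cluster, remember to increase the config_version field in the totem section before saving, so the change propagates.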
 
