Lost data in ceph

miro-zamiro

Aug 18, 2025
Hi!

I've used ceph in my proxmox cluster successfully for some time, but now I'm having some problems.
All of a sudden my Nextcloud server couldn't mount CephFS. After tinkering a bit I managed to mount it (I had to reauthorize or something), but all the data is gone! There used to be a directory called nextcloud in that CephFS and it's gone. Also, when I open the nextcloud pool storage in the Proxmox dashboard I get this error: mount error: Job failed. See "journalctl -xe" for details. (500). After running systemctl status I see this:
Code:
root@4-dell:~# systemctl status mnt-pve-nextcloud.mount
× mnt-pve-nextcloud.mount - /mnt/pve/nextcloud
     Loaded: loaded (/run/systemd/system/mnt-pve-nextcloud.mount; static)
     Active: failed (Result: exit-code) since Mon 2025-08-18 15:44:52 CEST; 2s ago
      Where: /mnt/pve/nextcloud
       What: 10.10.10.4,10.20.10.1,10.20.10.2:/
        CPU: 45ms

Aug 18 15:44:52 4-dell systemd[1]: Mounting mnt-pve-nextcloud.mount - /mnt/pve/nextcloud...
Aug 18 15:44:52 4-dell mount[55978]: mount error: no mds (Metadata Server) is up. The cluster might be laggy, or you may not be authorized
Aug 18 15:44:52 4-dell systemd[1]: mnt-pve-nextcloud.mount: Mount process exited, code=exited, status=32/n/a
Aug 18 15:44:52 4-dell systemd[1]: mnt-pve-nextcloud.mount: Failed with result 'exit-code'.
Aug 18 15:44:52 4-dell systemd[1]: Failed to mount mnt-pve-nextcloud.mount - /mnt/pve/nextcloud.
root@4-dell:~#
When I go to Ceph / CephFS / Metadata Servers, I can see that the one for nextcloud is in replay status.
Can anyone help me out with this problem?
Best,
Miro
 
Code:
root@4-dell:~# ceph status
  cluster:
    id:     8bc79f4e-ce8c-4036-9384-9d9b6212d60e
    health: HEALTH_WARN
            Reduced data availability: 50 pgs inactive
            Degraded data redundancy: 142142/1128696 objects degraded (12.593%), 50 pgs degraded, 50 pgs undersized
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum 1-asus,2-hp,4-dell (age 99m)
    mgr: 1-asus(active, since 100m), standbys: 2-hp, 4-dell
    mds: 2/2 daemons up, 1 standby
    osd: 6 osds: 6 up (since 50m), 6 in (since 51m); 268 remapped pgs
 
  data:
    volumes: 2/2 healthy
    pools:   5 pools, 456 pgs
    objects: 564.35k objects, 784 GiB
    usage:   1.3 TiB used, 2.4 TiB / 3.7 TiB avail
    pgs:     10.965% pgs not active
             142142/1128696 objects degraded (12.593%)
             702337/1128696 objects misplaced (62.226%)
             188 active+clean
             118 active+remapped+backfill_wait
             100 active+clean+remapped
             47  undersized+degraded+remapped+backfill_wait+peered
             3   undersized+degraded+remapped+backfilling+peered
 
  io:
    recovery: 36 MiB/s, 16 objects/s
 
root@4-dell:~#
 
To help effectively, we need a concise summary of your current state (a few commands to gather most of this are sketched after the list):


  1. Cluster Health & Status
    • Output from ceph -s (are there degraded, undersized, or inactive PGs?).
    • Any warnings such as “inactive” or “incomplete” placement groups?
  2. Cluster Topology
    • Number of OSDs up/in vs. down/out.
    • Monitor (MON) quorum state.
    • Pool configuration (e.g., size, min_size).
  3. Error Messaging
    • Errors when listing or mapping RBDs (rbd ls)?
    • Behavior of MON, MGR, and OSD services?
  4. Previous Interventions
    • Have you attempted recovery already (e.g., ceph-volume lvm activate --all, rebuilding MON store, replacing disks, restoring configs)?
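For reference, most of the above can be collected with a handful of standard commands. Hostnames and OSD IDs below are placeholders, and the systemd unit names assume a Proxmox-managed Ceph install:
Code:
# overall health and PG states
ceph -s
ceph health detail
# MON quorum
ceph quorum_status -f json-pretty
ceph mon stat
# OSD layout plus per-pool size/min_size
ceph osd df tree
ceph osd pool ls detail
# CephFS / MDS state (shows active, standby, replay, etc.)
ceph fs status
# recent daemon crashes
ceph crash ls
# per-daemon service state on each node (adjust names/IDs per host)
systemctl status ceph-mon@<hostname> ceph-mgr@<hostname> ceph-osd@<id>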
 
Sorry, I'll post as much information as I can get.
6 OSDs, all up and in.
How do I check the monitor quorum state?
5 pools:
[screenshot: proxmox-pools.png]
I have no RBDs.
How do I check the behavior of the MON, MGR and OSD services?
Running ceph-volume lvm activate --all helped with mounting the nextcloud pool in Proxmox, but the data is still lost (mainly the nextcloud directory).
 
What's the output of ceph osd df tree and ceph health detail?
Also, ceph pg dump_stuck would be useful.
 
Here you are:

Code:
root@4-dell:~# ceph osd df tree
ID   CLASS   WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META      AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME     
 -1          3.66045         -  3.7 TiB  1.4 TiB  1.4 TiB  96 MiB   5.6 GiB  2.3 TiB  38.02  1.00    -          root default   
 -5          0.68230         -  699 GiB  195 GiB  194 GiB  26 KiB   731 MiB  504 GiB  27.88  0.73    -              host 1-asus
  1  cephfs  0.68230   1.00000  699 GiB  195 GiB  194 GiB  26 KiB   731 MiB  504 GiB  27.88  0.73  172      up          osd.1 
 -3          1.84119         -  1.8 TiB  575 GiB  574 GiB  24 MiB   1.6 GiB  1.3 TiB  30.51  0.80    -              host 2-hp 
  2  cephfs  0.90970   1.00000  932 GiB  176 GiB  175 GiB  24 KiB   601 MiB  756 GiB  18.84  0.50  168      up          osd.2 
  0      nc  0.93149   0.40015  954 GiB  400 GiB  399 GiB  24 MiB  1011 MiB  554 GiB  41.91  1.10  154      up          osd.0 
-13                0         -      0 B      0 B      0 B     0 B       0 B      0 B      0     0    -              host 3-dell
-10          1.13696         -  1.1 TiB  655 GiB  652 GiB  72 MiB   3.3 GiB  509 GiB  56.25  1.48    -              host 4-dell
  3      nc  0.45479   0.03357  466 GiB  263 GiB  262 GiB  21 MiB   786 MiB  203 GiB  56.41  1.48  115      up          osd.3 
  4      nc  0.45479   0.00778  466 GiB  324 GiB  324 GiB  22 MiB   799 MiB  141 GiB  69.64  1.83  133      up          osd.4 
  5      nc  0.22739   1.00000  233 GiB   68 GiB   66 GiB  30 MiB   1.7 GiB  165 GiB  29.14  0.77  138      up          osd.5 
                         TOTAL  3.7 TiB  1.4 TiB  1.4 TiB  96 MiB   5.6 GiB  2.3 TiB  38.02                                   
MIN/MAX VAR: 0.50/1.83  STDDEV: 12.92
root@4-dell:~#
 
Code:
root@4-dell:~# ceph health detail
HEALTH_WARN 1 OSD(s) experiencing slow operations in BlueStore; Reduced data availability: 32 pgs inactive; Degraded data redundancy: 94640/1128696 objects degraded (8.385%), 32 pgs degraded, 32 pgs undersized; 1 daemons have recently crashed
[WRN] BLUESTORE_SLOW_OP_ALERT: 1 OSD(s) experiencing slow operations in BlueStore
     osd.1 observed slow operation indications in BlueStore
[WRN] PG_AVAILABILITY: Reduced data availability: 32 pgs inactive
    pg 3.0 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.2 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [3]
    pg 3.5 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.8 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.e is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.16 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.17 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.1e is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.1f is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.28 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.2e is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.32 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.36 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.3f is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.40 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.49 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [3]
    pg 3.4e is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.50 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.52 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.56 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.5a is stuck inactive for 29h, current state undersized+degraded+remapped+backfilling+peered, last acting [1]
    pg 3.60 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.61 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.65 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.68 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.6c is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.70 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [3]
    pg 3.76 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.77 is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.7a is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [1]
    pg 3.7c is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.7e is stuck inactive for 29h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
[WRN] PG_DEGRADED: Degraded data redundancy: 94640/1128696 objects degraded (8.385%), 32 pgs degraded, 32 pgs undersized
    pg 3.0 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.2 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [3]
    pg 3.5 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.8 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.e is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.16 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.17 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.1e is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.1f is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.28 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.2e is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.32 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.36 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.3f is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.40 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.49 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [3]
    pg 3.4e is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.50 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.52 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.56 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.5a is stuck undersized for 2h, current state undersized+degraded+remapped+backfilling+peered, last acting [1]
    pg 3.60 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.61 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.65 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.68 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.6c is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.70 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [3]
    pg 3.76 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
    pg 3.77 is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.7a is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [1]
    pg 3.7c is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [4]
    pg 3.7e is stuck undersized for 2h, current state undersized+degraded+remapped+backfill_wait+peered, last acting [0]
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    osd.2 crashed on host 2-hp at 2025-08-16T22:05:00.815251Z
root@4-dell:~#
 
Code:
root@4-dell:~# ceph pg dump_stuck
PG_STAT  STATE                                              UP     UP_PRIMARY  ACTING  ACTING_PRIMARY
3.7e     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [0]               0
3.7f                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.7c     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [4]               4
4.7b                         active+remapped+backfill_wait    [5]           5   [0,3]               0
3.7d                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.7a     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [1]               1
4.7d                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.7b                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.78                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.79                         active+remapped+backfill_wait    [5]           5   [3,0]               3
3.76     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [0]               0
3.77     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [4]               4
3.74                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.75                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
3.72                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.73                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
3.70     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [3]               3
3.71                         active+remapped+backfill_wait    [5]           5   [3,0]               3
3.6f                         active+remapped+backfill_wait    [5]           5   [4,0]               4
4.68                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.6c     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [4]               4
4.6b                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.6d                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.31                         active+remapped+backfill_wait    [5]           5   [3,0]               3
3.33                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
3.32     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [4]               4
3.35                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
3.34                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.37                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.36     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [0]               0
3.39                         active+remapped+backfill_wait    [5]           5   [3,0]               3
3.38                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.3b                         active+remapped+backfill_wait    [5]           5   [3,4]               3
4.3d                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.3d                         active+remapped+backfill_wait    [5]           5   [4,0]               4
4.3b                         active+remapped+backfill_wait    [5]           5   [3,0]               3
3.3f     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [4]               4
3.3e                         active+remapped+backfill_wait    [5]           5   [0,3]               0
3.41                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
3.40     undersized+degraded+remapped+backfill_wait+peered  [0,5]           0     [0]               0
3.43                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.42                         active+remapped+backfill_wait    [5]           5   [3,0]               3
4.42                         active+remapped+backfill_wait    [5]           5   [0,3]               0
3.44                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.47                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.46                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.49     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [3]               3
4.4e                         active+remapped+backfill_wait  [5,0]           5   [0,4]               0
3.4b                         active+remapped+backfill_wait    [5]           5   [3,4]               3
4.4c                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.4a                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
3.4d                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.5a       undersized+degraded+remapped+backfilling+peered    [5]           5     [1]               1
4.5d                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.5f                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.5c                         active+remapped+backfill_wait    [5]           5   [4,0]               4
4.5b                         active+remapped+backfill_wait    [5]           5   [0,3]               0
3.48                         active+remapped+backfill_wait    [5]           5   [4,0]               4
4.82                         active+remapped+backfill_wait    [5]           5   [0,3]               0
3.4                          active+remapped+backfill_wait    [5]           5   [3,4]               3
4.48                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.4f                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.56     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [0]               0
3.53                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
4.9d                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.19                         active+remapped+backfill_wait    [5]           5   [3,0]               3
3.5d                         active+remapped+backfill_wait    [5]           5   [4,0]               4
4.9b                         active+remapped+backfill_wait    [5]           5   [0,3]               0
3.23                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.51                         active+remapped+backfill_wait    [5]           5   [3,0]               3
3.5b                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.58                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.59                         active+remapped+backfill_wait    [5]           5   [3,0]               3
4.8b                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.13                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
3.57                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.54                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.55                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
4.88                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.e      undersized+degraded+remapped+backfill_wait+peered    [5]           5     [0]               0
3.52     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [4]               4
3.50     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [0]               0
4.8c                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.a                          active+remapped+backfill_wait  [0,5]           0   [0,3]               0
3.4e     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [0]               0
4.8e                         active+remapped+backfill_wait  [5,0]           5   [0,4]               0
3.8      undersized+degraded+remapped+backfill_wait+peered    [5]           5     [4]               4
4.4b                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.4c                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.2e     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [0]               0
4.28                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.2f                         active+remapped+backfill_wait    [5]           5   [4,0]               4
4.2b                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.2c                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.2d                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.2a                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
4.2c                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.2b                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.28     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [4]               4
4.2e                         active+remapped+backfill_wait  [5,0]           5   [0,4]               0
3.29                         active+remapped+backfill_wait    [5]           5   [3,0]               3
3.26                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.27                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.24                         active+remapped+backfill_wait    [5]           5   [3,4]               3
4.22                         active+remapped+backfill_wait    [5]           5   [3,0]               3
3.22                         active+remapped+backfill_wait    [5]           5   [3,0]               3
3.20                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
3.21                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
3.18                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.1b                         active+remapped+backfill_wait    [5]           5   [3,4]               3
4.1d                         active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.15                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
3.14                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.17     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [4]               4
3.16     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [0]               0
3.11                         active+remapped+backfill_wait    [5]           5   [3,0]               3
3.d                          active+remapped+backfill_wait    [5]           5   [3,4]               3
4.b                          active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.c                          active+remapped+backfill_wait    [5]           5   [4,0]               4
4.8                          active+remapped+backfill_wait  [0,5]           0   [0,4]               0
3.f                          active+remapped+backfill_wait    [5]           5   [4,0]               4
3.2      undersized+degraded+remapped+backfill_wait+peered    [5]           5     [3]               3
3.1e     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [0]               0
3.3                          active+remapped+backfill_wait    [5]           5   [4,0]               4
3.0      undersized+degraded+remapped+backfill_wait+peered  [0,5]           0     [0]               0
3.1                          active+remapped+backfill_wait  [0,5]           0   [0,3]               0
3.6                          active+remapped+backfill_wait    [5]           5   [4,0]               4
3.7                          active+remapped+backfill_wait    [5]           5   [4,0]               4
4.2                          active+remapped+backfill_wait    [5]           5   [3,0]               3
3.5      undersized+degraded+remapped+backfill_wait+peered    [5]           5     [0]               0
4.c                          active+remapped+backfill_wait    [5]           5   [3,4]               3
3.b                          active+remapped+backfill_wait    [5]           5   [3,4]               3
4.e                          active+remapped+backfill_wait  [5,0]           5   [0,4]               0
3.9                          active+remapped+backfill_wait    [5]           5   [3,0]               3
3.1f     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [4]               4
4.1b                         active+remapped+backfill_wait    [5]           5   [3,0]               3
3.61     undersized+degraded+remapped+backfill_wait+peered  [0,5]           0     [0]               0
3.60     undersized+degraded+remapped+backfill_wait+peered  [0,5]           0     [0]               0
3.63                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.62                         active+remapped+backfill_wait    [5]           5   [3,0]               3
4.62                         active+remapped+backfill_wait    [5]           5   [0,3]               0
3.65     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [0]               0
3.64                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.67                         active+remapped+backfill_wait    [5]           5   [4,0]               4
3.66                         active+remapped+backfill_wait    [5]           5   [4,0]               4
4.6e                         active+remapped+backfill_wait  [5,0]           5   [0,4]               0
3.68     undersized+degraded+remapped+backfill_wait+peered    [5]           5     [4]               4
4.6c                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.6b                         active+remapped+backfill_wait    [5]           5   [3,4]               3
3.6a                         active+remapped+backfill_wait  [0,5]           0   [0,3]               0
ok
root@4-dell:~#
 
Interesting, even though you set a size/min_size of 2/2 (3/2 would be better, but needs more space), many PGs currently only have one replica o_O.
All the affected PGs want to move to OSD 5 as their single replica, but apparently can't.

Have you tried restarting OSD 5?

Also, what's going on with the different device classes? And what about host 3-dell? Is it gone for good, or is it expected to come back up?
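For what it's worth, those checks and the OSD restart would look roughly like this on a Proxmox node; the OSD ID comes from the df tree above, everything else is standard Ceph CLI:
Code:
# restart OSD 5 on the node that hosts it (4-dell, per the df tree)
systemctl restart ceph-osd@5
# list the device classes in use and the shadow trees they create
ceph osd crush class ls
ceph osd crush tree --show-shadow
# show each pool's size/min_size and which CRUSH rule it uses
ceph osd pool ls detail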
 
Interesting, even though you set a size/min_size of 2/2 (3/2 would be better, but needs more space), many PGs currently only have one replica o_O.
All the affected PGs want to move to OSD 5 as their single replica, but apparently can't.

Have you tried restarting OSD 5?
I restarted it, but how did you find that out?
Did you adjust REWEIGHT? Normally that is 1.0 for all drives... 0.007 and 0.03 seem quite small.
Yes, I did. The thing is, I have 3.6 TB of storage and only about 40% is used, but all pools are nearfull! I've spent a lot of time looking for an answer on how to fix it. It seems that adding a disk that's too small only makes it worse. I thought that reweighting was the solution...
It could be interesting to see your custom CRUSH rules.
Here they are:
Code:
root@3-dell:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "nc-rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -15,
                "item_name": "default~nc"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "cephfs-rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -14,
                "item_name": "default~cephfs"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

root@3-dell:~#
 
This is after the data redundancy was fixed:
Code:
root@3-dell:~# ceph -s
  cluster:
    id:     8bc79f4e-ce8c-4036-9384-9d9b6212d60e
    health: HEALTH_WARN
            1 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 12 pgs backfill_toofull
            3 pool(s) nearfull
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum 1-asus,2-hp,4-dell (age 80m)
    mgr: 1-asus(active, since 82m), standbys: 2-hp, 4-dell
    mds: 2/2 daemons up, 1 standby
    osd: 6 osds: 6 up (since 79m), 6 in (since 5h); 275 remapped pgs
 
  data:
    volumes: 2/2 healthy
    pools:   5 pools, 452 pgs
    objects: 564.40k objects, 784 GiB
    usage:   1.5 TiB used, 2.1 TiB / 3.7 TiB avail
    pgs:     527705/1128792 objects misplaced (46.750%)
             263 active+clean+remapped
             177 active+clean
             12  active+remapped+backfill_toofull
 
root@3-dell:~#
 
I restarted it, but how did you find that out?
In the ceph pg dump_stuck output you have the ACTING/ACTING_PRIMARY columns, which show where the replica(s) currently are, and the UP/UP_PRIMARY columns, which show where they should be.
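If it helps, the same mapping can be pulled for a single PG; the PG ID below is just one of the stuck ones from the dump above:
Code:
# prints the up set (where CRUSH wants the PG) and the acting set (where it currently is)
ceph pg map 3.7e
# much more detail, including peering/backfill state and recovery progress
ceph pg 3.7e query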
 
I thought that reweighting was the solution
As I recall, REWEIGHT is a temporary value and resets on reboot...

Setting OSD 4's reweight to 0.007 tells Ceph to use under 1% of that drive, and 0.03 on OSD 3 means about 3%. I'd think Ceph is trying to move the contents of OSDs 3 and 4 to OSD 5, which is your smallest drive, and it can't fit. You should have approximately the same amount of Ceph storage in each node. I don't think intentionally limiting OSD usage is helping you... I'd set those REWEIGHT values back to 1.0.

Setting min_size to 1 means that only one copy of each object is required, so if that OSD fails you have guaranteed data loss for those objects. This is why 3/2 is the default. If you are running out of space, I suggest adding more OSDs/space instead of lowering the replica count.
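If you go that route, the commands would look something like this; the OSD IDs are taken from the df tree above, the pool name is a placeholder, and resetting the weights will trigger a fair amount of backfill:
Code:
# put the override weights back to 1.0
ceph osd reweight 3 1.0
ceph osd reweight 4 1.0
# check the current replication settings per pool
ceph osd pool get <pool> size
ceph osd pool get <pool> min_size
# only if there is enough capacity for a third copy:
ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2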