Ceph Cluster - loss of one host caused storage to go offline

breakaway9000

Hi There,

We've got a five-node Ceph cluster with 10 Gbps networking. Every host has 3 x HDDs for "slow" storage, and every host except host 3 also has an SSD for fast storage.

  • Hosts 1, 2, 3, 4 and 5 each have 3 x HDDs - we use these for anything that needs "slow" storage.
  • Hosts 1, 2, 4 and 5 each have a single 1.2 TB enterprise SSD.

There are two replicated CRUSH rules - one for HDDs and one for SSDs - and each pool uses one rule or the other depending on what kind of storage we need.
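For reference, this is roughly how the rules and pools were set up (reconstructed, not the exact commands we ran, but the rule and pool names match the dumps further down):

Bash:
# One replicated rule per device class, failure domain = host
ceph osd crush rule create-replicated replicated-hdd default host hdd
ceph osd crush rule create-replicated replicated-ssd default host ssd
# Pools pinned to the matching rule
ceph osd pool create ceph_hdd 512 512 replicated replicated-hdd
ceph osd pool create ceph_pci_SSD 256 256 replicated replicated-ssd
ceph osd pool application enable ceph_hdd rbd
ceph osd pool application enable ceph_pci_SSD rbd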

Recently we had to take node #2 offline for maintenance. When we did this, all hosts lost access to the SSD-backed pools, while the HDD-backed pools kept working. All the VMs/CTs on the SSD pool were completely locked up, and I couldn't even display the contents of that storage from the web interface (the HDD-backed pools were fine the whole time).

If it matters, the person in charge of the maintenance forgot to set noout.
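(For completeness, the usual routine before planned maintenance looks roughly like this; a sketch, not what was actually run that day.)

Bash:
# Before shutting a node down: stop Ceph from marking its OSDs out and rebalancing
ceph osd set noout
# ... do the maintenance, bring the node and its OSDs back up ...
ceph osd unset noout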

Why did this happen? It was my understanding that our cluster can tolerate a host-level failure - our size/min_size is set to 3/2.

Let me know if you need any more info to help figure this out.
 
This depends on the CRUSH rules and how the pool is configured. Run ceph osd df tree, ceph osd dump and ceph osd crush rule dump so we can check whether the pool and the rules line up.
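Something along these lines, for example (the pool name is a placeholder):

Bash:
# OSD layout per host and device class
ceph osd df tree
# Pool settings (size, min_size, crush_rule) plus OSD state
ceph osd dump
# The CRUSH rules the pools reference
ceph osd crush rule dump
# Optionally, confirm which rule a given pool uses
ceph osd pool get <poolname> crush_rule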
 
Output of the commands is below. The ceph osd crush rule dump shows 3 rules in total, but the first one (the default rule that was created when Ceph was set up) isn't in use anymore.

Bash:
# ceph osd df tree
ID  CLASS WEIGHT   REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS TYPE NAME
 -1       86.25343        - 86.3TiB 48.5TiB 37.7TiB 56.27 1.00   - root default
 -6       17.46986        - 17.5TiB 11.7TiB 5.80TiB 66.80 1.19   -     host host1
  0   hdd  5.45799  1.00000 5.46TiB 3.54TiB 1.92TiB 64.89 1.15 118         osd.0
  1   hdd  5.45799  1.00000 5.46TiB 3.00TiB 2.46TiB 54.98 0.98 100         osd.1
  2   hdd  5.45799  0.90002 5.46TiB 4.53TiB  948GiB 83.04 1.48 151         osd.2
  9   ssd  1.09589  0.95001 1.10TiB  610GiB  512GiB 54.33 0.97 189         osd.9
-13       17.46986        - 17.5TiB 11.4TiB 6.08TiB 65.20 1.16   -     host host2
 17   hdd  5.45799  1.00000 5.46TiB 3.47TiB 1.99TiB 63.62 1.13 116         osd.17
 18   hdd  5.45799  1.00000 5.46TiB 3.62TiB 1.84TiB 66.31 1.18 121         osd.18
 19   hdd  5.45799  0.90002 5.46TiB 3.68TiB 1.77TiB 67.48 1.20 123         osd.19
 10   ssd  1.09589  1.00000 1.10TiB  629GiB  493GiB 56.09 1.00 200         osd.10
 -3       16.37398        - 16.4TiB 10.9TiB 5.44TiB 66.79 1.19   -     host host3
  3   hdd  5.45799  0.90002 5.46TiB 4.14TiB 1.32TiB 75.82 1.35 138         osd.3
  4   hdd  5.45799  1.00000 5.46TiB 3.48TiB 1.98TiB 63.80 1.13 116         osd.4
  5   hdd  5.45799  1.00000 5.46TiB 3.32TiB 2.14TiB 60.74 1.08 110         osd.5
-16       17.46986        - 17.5TiB 4.73TiB 12.7TiB 27.09 0.48   -     host host4
 13   hdd  5.45799  1.00000 5.46TiB 1.98TiB 3.48TiB 36.27 0.64  65         osd.13
 14   hdd  5.45799  1.00000 5.46TiB 1.07TiB 4.39TiB 19.56 0.35  35         osd.14
 15   hdd  5.45799  1.00000 5.46TiB 1.12TiB 4.34TiB 20.57 0.37  36         osd.15
 11   ssd  1.09589  0.95001 1.10TiB  576GiB  546GiB 51.32 0.91 183         osd.11
 -5       17.46986        - 17.5TiB 9.81TiB 7.66TiB 56.16 1.00   -     host host5
  6   hdd  5.45799  0.90002 5.46TiB 3.25TiB 2.21TiB 59.46 1.06 108         osd.6
  7   hdd  5.45799  1.00000 5.46TiB 3.03TiB 2.42TiB 55.58 0.99 101         osd.7
  8   hdd  5.45799  1.00000 5.46TiB 2.94TiB 2.52TiB 53.85 0.96  98         osd.8
 12   ssd  1.09589  0.95001 1.10TiB  607GiB  516GiB 54.05 0.96 193         osd.12
                      TOTAL 86.3TiB 48.5TiB 37.7TiB 56.27
MIN/MAX VAR: 0.35/1.48  STDDEV: 15.58

Bash:
# ceph osd dump
epoch 11782
fsid 5526ca13-287b-46ff-a302-9b1853a5fb25
created 2018-03-17 17:03:08.615625
modified 2020-10-28 13:12:50.793120
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 203
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release luminous
pool 9 'ceph_hdd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512 last_change 10523 flags hashpspool stripe_width 0 application rbd
        removed_snaps [1~3,9~4,f~114,124~14]
pool 20 'ceph_pci_SSD' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 256 pgp_num 256 last_change 10319 flags hashpspool stripe_width 0 application rbd
        removed_snaps [1~3]
max_osd 20
osd.0 up   in  weight 1 up_from 11388 up_thru 11768 down_at 11134 last_clean_interval [10052,11133) 10.10.10.1:6808/3856 10.10.10.1:6809/3856 10.10.10.1:6810/3856 10.10.10.1:6811/3856 exists,up 94be3a75-ffdf-42ca-86de-1e0c16f8bf76
osd.1 up   in  weight 1 up_from 11386 up_thru 11768 down_at 11134 last_clean_interval [10050,11133) 10.10.10.1:6812/3872 10.10.10.1:6813/3872 10.10.10.1:6814/3872 10.10.10.1:6815/3872 exists,up 7e0dcd82-a602-43bc-af8e-e3158ae7332a
osd.2 up   in  weight 0.900024 up_from 11391 up_thru 11781 down_at 11134 last_clean_interval [10058,11133) 10.10.10.1:6804/3768 10.10.10.1:6805/3768 10.10.10.1:6806/3768 10.10.10.1:6807/3768 exists,up b93a6236-5e9c-4ed5-a3ad-1e267a4b7bef
osd.3 up   in  weight 0.900024 up_from 11000 up_thru 11768 down_at 10967 last_clean_interval [5658,10997) 10.10.10.3:6801/1549034 10.10.10.3:6802/5549034 10.10.10.3:6803/5549034 10.10.10.3:6804/5549034 exists,up b9003899-49cf-41bf-bfb0-f32a2c3fa1ea
osd.4 up   in  weight 1 up_from 10998 up_thru 11773 down_at 10982 last_clean_interval [4662,10997) 10.10.10.3:6800/248112 10.10.10.3:6806/7248112 10.10.10.3:6807/7248112 10.10.10.3:6808/7248112 exists,up 8729d950-d123-4a2f-aa30-2c2674a495cc
osd.5 up   in  weight 1 up_from 10998 up_thru 11779 down_at 10995 last_clean_interval [4653,10997) 10.10.10.3:6805/1959155 10.10.10.3:6809/7959155 10.10.10.3:6810/7959155 10.10.10.3:6811/7959155 exists,up e83ca331-3404-4501-b70c-89fe0f62c6d9
osd.6 up   in  weight 0.900024 up_from 11556 up_thru 11768 down_at 11550 last_clean_interval [10484,11554) 10.10.10.5:6801/490864 10.10.10.5:6814/1490864 10.10.10.5:6811/1490864 10.10.10.5:6817/1490864 exists,up 4138a4f0-f450-4f32-8f3c-3d31b068de6b
osd.7 up   in  weight 1 up_from 11555 up_thru 11769 down_at 11552 last_clean_interval [5689,11554) 10.10.10.5:6804/3143472 10.10.10.5:6813/35143472 10.10.10.5:6816/35143472 10.10.10.5:6818/35143472 exists,up eed5435a-a696-4e37-bb4f-31b6e871b486
osd.8 up   in  weight 1 up_from 11554 up_thru 11769 down_at 11552 last_clean_interval [4731,11553) 10.10.10.5:6800/2569799 10.10.10.5:6807/34569799 10.10.10.5:6808/34569799 10.10.10.5:6812/34569799 exists,up bc84595f-b8e6-4d9c-8e10-570c9550d041
osd.9 up   in  weight 0.950012 up_from 11383 up_thru 11768 down_at 11134 last_clean_interval [10047,11133) 10.10.10.1:6800/3486 10.10.10.1:6801/3486 10.10.10.1:6802/3486 10.10.10.1:6803/3486 exists,up 7b3e9982-9ae3-4261-a293-850b390480ff
osd.10 up   in  weight 1 up_from 10945 up_thru 11768 down_at 10524 last_clean_interval [10503,10523) 10.10.10.2:6800/4593 10.10.10.2:6801/4593 10.10.10.2:6802/4593 10.10.10.2:6803/4593 exists,up 90a25d6e-c8f7-4a14-854f-f0a888ff5bf1
osd.11 up   in  weight 0.950012 up_from 11768 up_thru 11768 down_at 11766 last_clean_interval [5613,11767) 10.10.10.4:6800/4028539 10.10.10.4:6806/5028539 10.10.10.4:6807/5028539 10.10.10.4:6818/5028539 exists,up 900bb32c-1c92-4436-b69f-2130de3d3592
osd.12 up   in  weight 0.950012 up_from 11557 up_thru 11768 down_at 11549 last_clean_interval [5621,11554) 10.10.10.5:6806/2556763 10.10.10.5:6803/34556763 10.10.10.5:6805/34556763 10.10.10.5:6809/34556763 exists,up f44f2f65-f9e1-48ab-ab7c-d9967437b81f
osd.13 up   in  weight 1 up_from 11768 up_thru 11768 down_at 11764 last_clean_interval [11444,11767) 10.10.10.4:6804/1063316 10.10.10.4:6813/2063316 10.10.10.4:6814/2063316 10.10.10.4:6815/2063316 exists,up 413991f7-27c3-48aa-909c-f1e1c861b671
osd.14 up   in  weight 1 up_from 11768 up_thru 11768 down_at 11764 last_clean_interval [11454,11767) 10.10.10.4:6808/1066946 10.10.10.4:6805/2066946 10.10.10.4:6816/2066946 10.10.10.4:6817/2066946 exists,up a48bc9b1-8b28-43c7-a2cb-5545bad1c2a0
osd.15 up   in  weight 1 up_from 11768 up_thru 11777 down_at 11764 last_clean_interval [11462,11767) 10.10.10.4:6812/1069748 10.10.10.4:6809/2069748 10.10.10.4:6810/2069748 10.10.10.4:6811/2069748 exists,up 2d12772e-4586-4677-8f99-39648a7e41ab
osd.17 up   in  weight 1 up_from 10947 up_thru 11768 down_at 10524 last_clean_interval [10503,10523) 10.10.10.2:6812/5096 10.10.10.2:6813/5096 10.10.10.2:6814/5096 10.10.10.2:6815/5096 exists,up 33795bd6-6cc5-427d-accf-877fd10e6d53
osd.18 up   in  weight 1 up_from 10949 up_thru 11768 down_at 10524 last_clean_interval [10508,10523) 10.10.10.2:6808/5093 10.10.10.2:6809/5093 10.10.10.2:6810/5093 10.10.10.2:6811/5093 exists,up 72de0b32-b405-40d8-81f8-dcbe36813e38
osd.19 up   in  weight 0.900024 up_from 10957 up_thru 11768 down_at 10524 last_clean_interval [10505,10523) 10.10.10.2:6804/4865 10.10.10.2:6805/4865 10.10.10.2:6806/4865 10.10.10.2:6807/4865 exists,up 8f70789d-1cb0-45fc-99d0-6e0db78b6bda
pg_temp 9.0 [17,4,0]
pg_temp 9.1 [3,2,18]
pg_temp 9.2 [18,3,2]
pg_temp 9.3 [0,19,4]
pg_temp 9.5 [0,7,19]
pg_temp 9.8 [17,6,0]
pg_temp 9.9 [0,5,18]
pg_temp 9.e [3,2,17]
pg_temp 9.f [19,6,5]
pg_temp 9.11 [7,19,3]
pg_temp 9.13 [5,6,18]
pg_temp 9.14 [5,6,1]
pg_temp 9.16 [4,6,0]
pg_temp 9.17 [2,5,17]
pg_temp 9.19 [19,3,8]
pg_temp 9.1c [8,2,19]
pg_temp 9.1d [3,8,0]
pg_temp 9.1e [5,8,1]
pg_temp 9.1f [17,7,3]
pg_temp 9.20 [4,17,0]
pg_temp 9.28 [6,18,3]
pg_temp 9.2a [19,7,4]
pg_temp 9.2d [5,6,1]
pg_temp 9.32 [4,0,19]
pg_temp 9.33 [3,17,0]
pg_temp 9.34 [5,6,17]
pg_temp 9.35 [17,4,2]
pg_temp 9.36 [4,19,0]
pg_temp 9.38 [3,8,2]
pg_temp 9.3d [2,8,5]
pg_temp 9.3f [3,0,17]
pg_temp 9.43 [3,18,1]
pg_temp 9.44 [3,1,19]
pg_temp 9.45 [8,3,1]
pg_temp 9.46 [19,6,2]
pg_temp 9.48 [7,2,3]
pg_temp 9.49 [3,6,2]
pg_temp 9.4a [3,8,19]
pg_temp 9.4b [7,18,2]
pg_temp 9.5d [0,4,18]
pg_temp 9.5e [5,7,2]
pg_temp 9.62 [8,2,19]
pg_temp 9.69 [6,1,4]
pg_temp 9.6a [2,7,19]
pg_temp 9.6b [19,2,4]
pg_temp 9.6c [8,2,4]
pg_temp 9.6d [18,1,3]
pg_temp 9.70 [6,2,5]
pg_temp 9.71 [7,0,18]
pg_temp 9.72 [8,5,17]
pg_temp 9.76 [3,18,1]
pg_temp 9.7d [8,3,2]
pg_temp 9.82 [7,3,0]
pg_temp 9.86 [8,1,3]
pg_temp 9.89 [18,7,5]
pg_temp 9.8c [4,7,18]
pg_temp 9.8f [7,3,19]
pg_temp 9.90 [6,18,2]
pg_temp 9.92 [4,18,1]
pg_temp 9.94 [6,1,4]
pg_temp 9.97 [2,6,17]
pg_temp 9.9b [0,3,19]
pg_temp 9.9f [7,1,19]
pg_temp 9.a1 [4,17,0]
pg_temp 9.a9 [2,7,17]
pg_temp 9.ad [8,2,19]
pg_temp 9.ae [0,7,4]
pg_temp 9.af [7,2,5]
pg_temp 9.b2 [7,2,5]
pg_temp 9.b4 [17,0,6]
pg_temp 9.b6 [7,3,19]
pg_temp 9.bb [19,8,0]
pg_temp 9.c2 [6,4,2]
pg_temp 9.c3 [8,3,17]
pg_temp 9.c6 [8,18,2]
pg_temp 9.c8 [5,6,17]
pg_temp 9.c9 [2,3,17]
pg_temp 9.cb [6,0,19]
pg_temp 9.cd [7,17,5]
pg_temp 9.d3 [18,6,2]
pg_temp 9.da [6,19,5]
pg_temp 9.db [18,7,3]
pg_temp 9.dc [1,6,18]
pg_temp 9.e6 [18,7,2]
pg_temp 9.eb [3,1,19]
pg_temp 9.ef [2,18,3]
pg_temp 9.fc [3,7,17]
pg_temp 9.fd [2,7,18]
pg_temp 9.100 [2,4,18]
pg_temp 9.105 [19,4,6]
pg_temp 9.109 [2,7,3]
pg_temp 9.10a [19,3,1]
pg_temp 9.10b [3,2,18]
pg_temp 9.10f [8,3,19]
pg_temp 9.110 [18,5,0]
pg_temp 9.111 [19,1,3]
pg_temp 9.116 [17,1,3]
pg_temp 9.117 [4,7,19]
pg_temp 9.11a [7,5,17]
pg_temp 9.11c [5,8,2]
pg_temp 9.11d [4,18,8]
pg_temp 9.11f [4,6,19]
pg_temp 9.120 [18,6,4]
pg_temp 9.122 [18,2,3]
pg_temp 9.12b [18,7,1]
pg_temp 9.12c [7,4,0]
pg_temp 9.12d [5,8,17]
pg_temp 9.12f [17,5,8]
pg_temp 9.131 [18,5,2]
pg_temp 9.132 [4,7,2]
pg_temp 9.137 [5,7,1]
pg_temp 9.138 [6,19,4]
pg_temp 9.13d [4,0,18]
pg_temp 9.140 [3,1,18]
pg_temp 9.141 [19,4,8]
pg_temp 9.145 [0,6,18]
pg_temp 9.146 [8,19,2]
pg_temp 9.149 [7,0,17]
pg_temp 9.14c [4,1,17]
pg_temp 9.14e [8,19,5]
pg_temp 9.154 [7,3,2]
pg_temp 9.157 [8,0,19]
pg_temp 9.15a [17,3,2]
pg_temp 9.15c [17,2,8]
pg_temp 9.15e [18,3,2]
pg_temp 9.160 [18,2,6]
pg_temp 9.162 [0,5,17]
pg_temp 9.16d [3,1,18]
pg_temp 9.16e [5,8,19]
pg_temp 9.170 [7,5,2]
pg_temp 9.171 [2,8,3]
pg_temp 9.172 [3,17,2]
pg_temp 9.174 [17,1,3]
pg_temp 9.175 [6,5,19]
pg_temp 9.176 [7,19,2]
pg_temp 9.178 [17,0,6]
pg_temp 9.17a [8,0,17]
pg_temp 9.17b [3,17,1]
pg_temp 9.17f [3,2,19]
pg_temp 9.187 [17,7,3]
pg_temp 9.18c [4,6,18]
pg_temp 9.18d [19,8,4]
pg_temp 9.192 [7,17,4]
pg_temp 9.194 [2,8,19]
pg_temp 9.195 [1,8,18]
pg_temp 9.197 [17,6,4]
pg_temp 9.199 [6,2,4]
pg_temp 9.19b [3,7,0]
pg_temp 9.19c [7,18,2]
pg_temp 9.19f [7,2,19]
pg_temp 9.1a1 [4,8,17]
pg_temp 9.1a2 [6,18,5]
pg_temp 9.1a4 [18,7,2]
pg_temp 9.1a5 [8,19,0]
pg_temp 9.1a7 [3,6,17]
pg_temp 9.1aa [3,19,2]
pg_temp 9.1b0 [6,2,5]
pg_temp 9.1b1 [18,6,3]
pg_temp 9.1b3 [17,6,3]
pg_temp 9.1b5 [3,2,17]
pg_temp 9.1ba [4,18,2]
pg_temp 9.1bc [18,1,4]
pg_temp 9.1c3 [18,8,3]
pg_temp 9.1c4 [8,1,19]
pg_temp 9.1c9 [0,4,19]
pg_temp 9.1ca [7,0,3]
pg_temp 9.1cc [8,3,19]
pg_temp 9.1d2 [6,2,3]
pg_temp 9.1d3 [4,19,0]
pg_temp 9.1d4 [3,8,0]
pg_temp 9.1d7 [0,8,5]
pg_temp 9.1da [7,0,3]
pg_temp 9.1dd [4,0,19]
pg_temp 9.1e0 [6,17,2]
pg_temp 9.1e1 [6,4,0]
pg_temp 9.1e2 [6,19,5]
pg_temp 9.1e3 [2,5,17]
pg_temp 9.1e7 [7,0,19]
pg_temp 9.1e8 [3,7,17]
pg_temp 9.1ea [19,1,5]
pg_temp 9.1ed [8,0,4]
pg_temp 9.1f0 [7,2,19]
pg_temp 9.1f2 [8,1,5]
pg_temp 9.1f3 [17,7,0]
pg_temp 9.1f5 [6,19,4]
pg_temp 9.1f8 [0,6,17]
pg_temp 9.1f9 [3,8,17]
pg_temp 9.1fb [6,0,3]
pg_temp 9.1fd [6,2,19]

Bash:
# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "replicated-hdd",
        "ruleset": 1,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -8,
                "item_name": "default~hdd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "replicated-ssd",
        "ruleset": 2,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -12,
                "item_name": "default~ssd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
 
Have you tried raising the min_size on your pools? I think the issue is that you have min_size set to 1, meaning I/O keeps going even when only one copy of the data is available. I had the same issue a while ago with a test pool, and as soon as I set min_size to 2 it wasn't a problem anymore.
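To check and change it, something like this per pool (the pool name is a placeholder):

Bash:
# Show the current replica settings
ceph osd pool get <poolname> size
ceph osd pool get <poolname> min_size
# Require at least 2 copies online before serving I/O
ceph osd pool set <poolname> min_size 2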
 
That is not my problem. I have size 3 and min_size 2.


Code:
pool 9 'ceph_hdd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512 last_change 10523 flags hashpspool stripe_width 0 application rbd
        removed_snaps [1~3,9~4,f~114,124~14]
pool 20 'ceph_pci_SSD' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 256 pgp_num 256 last_change 10319 flags hashpspool stripe_width 0 application rbd
        removed_snaps [1~3]
 
Sorry, I misread! To me it looks like a problem with the CRUSH rule you are using... Did you try using the default one?
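If you want to test that, switching a pool to another rule is a single command. Keep in mind that the default replicated_rule is not class-aware, so data from the SSD pool would also land on HDDs, and each change triggers data movement (a sketch):

Bash:
# Point the SSD pool at the default (not class-aware) rule as a test ...
ceph osd pool set ceph_pci_SSD crush_rule replicated_rule
# ... and back to the class-based rule afterwards
ceph osd pool set ceph_pci_SSD crush_rule replicated-ssd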
 
Hi Alwin, as below

Bash:
# ceph versions
{
    "mon": {
        "ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)": 1,
        "ceph version 12.2.13 (98af9a6b9a46b2d562a0de4b09263d70aeb1c9dd) luminous (stable)": 2
    },
    "mgr": {
        "ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)": 1,
        "ceph version 12.2.13 (98af9a6b9a46b2d562a0de4b09263d70aeb1c9dd) luminous (stable)": 2
    },
    "osd": {
        "ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)": 3,
        "ceph version 12.2.13 (98af9a6b9a46b2d562a0de4b09263d70aeb1c9dd) luminous (stable)": 16
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)": 5,
        "ceph version 12.2.13 (98af9a6b9a46b2d562a0de4b09263d70aeb1c9dd) luminous (stable)": 20
    }
}

Bash:
# pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.4.98-2-pve: 4.4.98-101
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.76-1-pve: 4.4.76-94
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-43
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-37
pve-container: 2.0-39
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-52
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2
 
Proxmox VE 5.4 and Ceph Luminous are EoL. Please upgrade; otherwise we may just be chasing something that has already been fixed.
https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0
10 ssd 1.09589 1.00000 1.10TiB 629GiB 493GiB 56.09 1.00 200 osd.10
200 PGs is a bit too much; the optimal target is around 100 PGs per OSD. With Ceph Nautilus the pg_num of an existing pool can be adjusted.
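Once on Nautilus, reducing pg_num on the SSD pool (or handing it to the autoscaler) would look roughly like this; the target of 128 is just an example value:

Bash:
# Nautilus and later: pg_num of an existing pool can be decreased
ceph osd pool set ceph_pci_SSD pg_num 128
# Or let the autoscaler manage it
ceph mgr module enable pg_autoscaler
ceph osd pool set ceph_pci_SSD pg_autoscale_mode on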

pg_temp 9.0 [17,4,0]
pg_temp 9.1 [3,2,18]
pg_temp 9.2 [18,3,2]
These temp PGs shouldn't exist.

But none of this should affect availability while one node / OSD is down. Best check in the Ceph logs what happened during that time.
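A few places to look, roughly (default log paths; adjust the grep pattern to the actual maintenance window):

Bash:
# On a monitor node: cluster log around the time the node was down
grep -Ei "inactive|peering|slow request|down" /var/log/ceph/ceph.log
# Current state, in case anything is still unhappy
ceph health detail
ceph pg dump_stuck inactive unclean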
 
