Ceph became disconnected suddenly

parker0909

Dear All,

I have an urgent problem: Ceph has become inaccessible with the error "Error got timeout (500)". We updated the switch firmware just before the problem started. I have also checked that all nodes can reach each other without any problem. Could you provide some suggestions for this?

Parker
 

Attachments

  • mon_error.png (52.7 KB)
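The "got timeout (500)" in the GUI typically just means the status call could not reach a working monitor quorum. A first-pass check (a sketch, not from the thread; it assumes the standard ceph-mon@<hostname> systemd unit names and uses cccs01 as an example hostname taken from later in this thread) is whether the mon daemons are running at all:

Code:
# On each monitor node (example hostname cccs01; substitute the local hostname)
systemctl status ceph-mon@cccs01
journalctl -u ceph-mon@cccs01 -e

# From any node: does a quorum form? (this can hang or time out while no mon is reachable)
ceph -s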
Hello all,

We have tried to recreate the mon, but we found that some OSDs are up and some are down.
Please find the results for node 6 (cccs06) from the following checks (annotated, runnable versions are sketched below):
systemctl status ceph-osd@45
systemctl status ceph-osd@27
ceph osd tree

Does anyone have an idea of how to bring all the OSDs up?

Thank you
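For reference, a lightly annotated version of the checks listed above for cccs06; the journalctl line is an added suggestion, not something from the original post:

Code:
# Per-daemon service state on cccs06 for the two OSDs named above
systemctl status ceph-osd@45
systemctl status ceph-osd@27

# Recent log lines usually show why an OSD failed to come up (added suggestion)
journalctl -u ceph-osd@45 -e

# Cluster-wide view: which OSDs are up/down and in/out, grouped by host
ceph osd tree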
 
Hi,

As this case is somewhat urgent, could we get some direction on how to fix the issue soon? Thanks.
 
Providing some more details on this. What I know has happened:
- A network issue occurred and the Ceph storage went offline
- All servers were rebooted
- Upon boot, Ceph was still offline
- OSDs and managers were running, monitors were not
- Monitors were brought online, but we saw inconsistent processes across the servers
- We shut down all nodes and brought them online one at a time
- They all came up and everything has been consistent since, but consistently showing the problems below.

Problems:
- ceph -s shows hosts and OSDs down. This appears to be cached/old data; the actual daemons are not being checked.
- OSDs are stuck in a booting state ("state": "booting") (see the direct daemon checks sketched below)
- PGs are inactive and unknown

Any help would be appreciated
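To separate what the monitors merely remember from what the daemons are actually doing, it helps to query each daemon directly rather than relying only on ceph -s. A minimal sketch, assuming the standard Proxmox/Ceph unit names and admin sockets; osd.45 is just the ID used as an example later in this post:

Code:
# What the cluster map claims (this is the possibly stale view)
ceph osd tree

# What the daemon itself is doing on its host (direct, not cached)
systemctl status ceph-osd@45
ceph daemon /var/run/ceph/ceph-osd.45.asok status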


Code:
# pveversion -v
proxmox-ve: 5.4-2 (running kernel: 4.15.18-24-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-12
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-56
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-7
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-41
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-54
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2
Code:
# ceph -s
  cluster:
    id:     <<ID>>
    health: HEALTH_WARN
            21 osds down
            5 hosts (25 osds) down
            Reduced data availability: 1024 pgs inactive
            too few PGs per OSD (29 < min 30)
            clock skew detected on mon.cccs02

  services:
    mon: 3 daemons, quorum cccs01,cccs02,cccs06
    mgr: cccs02(active), standbys: cccs06, cccs01, cccs03, cccs04, cccs05, cccs07, cccs08, cccs09
    osd: 48 osds: 14 up, 35 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0B
    usage:   20.9TiB used, 18.6TiB / 39.6TiB avail
    pgs:     100.000% pgs unknown
             1024 unknown
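The clock skew warning in the ceph -s output above matters because the monitors only tolerate 0.05 s of skew by default. A minimal time-sync check, assuming the PVE 5.x default of systemd-timesyncd (substitute chrony or ntpd if that is what these nodes actually run):

Code:
# Is the local clock synchronised, and is the time-sync daemon healthy?
timedatectl status
systemctl status systemd-timesyncd

# Ask the monitors how far apart they think their clocks are
ceph time-sync-status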
Code:
/etc/ceph/ceph.conf
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 10.1.14.0/24
         fsid = <<ID>>
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon_allow_pool_delete = true
         osd_journal_size = 5120
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 10.1.14.0/24

[mds]
         keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring


[mon.cccs02]
         host = cccs02
         mon addr = 10.1.14.3:6789

[mon.cccs01]
         host = cccs01
         mon addr = 10.1.14.2:6789

[mon.cccs06]
         host = cccs06
         mon addr = 10.1.14.7:6789
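Since the problem started right after a switch firmware update and both public_network and cluster_network sit on 10.1.14.0/24, a basic network sanity check between the nodes may be worth running. The jumbo-frame ping is only relevant if MTU 9000 is actually configured on this network, which the thread does not state; the 10.1.14.3 address is taken from the ceph.conf above:

Code:
# Did the interface MTU on the Ceph network change after the switch update?
ip -d link show

# Plain reachability between nodes on 10.1.14.0/24
ping -c 3 10.1.14.3

# Only if jumbo frames are configured (assumption): a full-size, non-fragmenting
# ping fails if the switch no longer forwards MTU 9000 frames
ping -M do -s 8972 -c 3 10.1.14.3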
Code:
# ceph daemon /var/run/ceph/ceph-osd.45.asok status
{
    "cluster_fsid": "<<ID>>",
    "osd_fsid": "<<ID>>",
    "whoami": 45,
    "state": "booting",
    "oldest_map": 13578,
    "newest_map": 14334,
    "num_pgs": 133
}
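The "booting" state above generally means osd.45 is running as a process and has announced itself, but the monitors have not yet marked it up in the OSD map. A hedged next step (only after network and clock issues have been ruled out) is to check that OSD's recent log and restart it so it re-registers with the monitors:

Code:
# Recent log of the OSD stuck in "booting"
journalctl -u ceph-osd@45 -e

# Restart the daemon so it re-sends its boot message (judgment call, not from the thread)
systemctl restart ceph-osd@45

# Watch whether the monitors now mark it up
ceph osd tree | grep 'osd\.45'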
 
and
Code:
# pveceph status
{
   "mgrmap" : {
      "available" : true,
      "standbys" : [
         {
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ],
            "gid" : 42994125,
            "name" : "cccs06"
         },
         {
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ],
            "gid" : 42994134,
            "name" : "cccs01"
         },
         {
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ],
            "gid" : 42995169,
            "name" : "cccs03"
         },
         {
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ],
            "gid" : 42999430,
            "name" : "cccs04"
         },
         {
            "name" : "cccs05",
            "gid" : 42999801,
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ]
         },
         {
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ],
            "gid" : 43000743,
            "name" : "cccs07"
         },
         {
            "gid" : 43004584,
            "name" : "cccs08",
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ]
         },
         {
            "gid" : 43005028,
            "name" : "cccs09",
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ]
         }
      ],
      "epoch" : 352,
      "available_modules" : [
         "balancer",
         "dashboard",
         "influx",
         "localpool",
         "prometheus",
         "restful",
         "selftest",
         "status",
         "zabbix"
      ],
      "services" : {},
      "active_name" : "cccs02",
      "modules" : [
         "balancer",
         "restful",
         "status"
      ],
      "active_gid" : 42994111,
      "active_addr" : "10.1.14.3:6820/2632"
   },
   "osdmap" : {
      "osdmap" : {
         "nearfull" : false,
         "full" : false,
         "epoch" : 14279,
         "num_remapped_pgs" : 0,
         "num_up_osds" : 14,
         "num_in_osds" : 35,
         "num_osds" : 48
      }
   },
   "monmap" : {
      "fsid" : "<<ID>>",
      "features" : {
         "optional" : [],
         "persistent" : [
            "kraken",
            "luminous"
         ]
      },
      "created" : "2019-04-04 15:46:14.879083",
      "mons" : [
         {
            "public_addr" : "10.1.14.2:6789/0",
            "addr" : "10.1.14.2:6789/0",
            "name" : "cccs01",
            "rank" : 0
         },
         {
            "public_addr" : "10.1.14.3:6789/0",
            "name" : "cccs02",
            "addr" : "10.1.14.3:6789/0",
            "rank" : 1
         },
         {
            "public_addr" : "10.1.14.7:6789/0",
            "rank" : 2,
            "name" : "cccs06",
            "addr" : "10.1.14.7:6789/0"
         }
      ],
      "epoch" : 34,
      "modified" : "2020-03-23 07:49:10.312320"
   },
   "election_epoch" : 1810,
   "quorum" : [
      0,
      1,
      2
   ],
   "fsmap" : {
      "epoch" : 17,
      "by_rank" : []
   },
   "quorum_names" : [
      "cccs01",
      "cccs02",
      "cccs06"
   ],
   "pgmap" : {
      "unknown_pgs_ratio" : 1,
      "bytes_total" : 43505946312704,
      "data_bytes" : 0,
      "num_pgs" : 1024,
      "num_pools" : 1,
      "num_objects" : 0,
      "bytes_avail" : 20500081999872,
      "pgs_by_state" : [
         {
            "state_name" : "unknown",
            "count" : 1024
         }
      ],
      "bytes_used" : 23005864312832
   },
   "health" : {
      "detail" : [
         "'ceph health' JSON format has changed in luminous. If you see this your monitoring system is scraping the wrong fields. Disable this with 'mon health preluminous compat warning = false'"
      ],
      "status" : "HEALTH_WARN",
      "checks" : {
         "OSD_DOWN" : {
            "detail" : [
               {
                  "message" : "osd.13 (root=default,host=cccs03) is down"
               },
               {
                  "message" : "osd.14 (root=default,host=cccs03) is down"
               },
...
                {
                  "message" : "osd.33 (root=default,host=cccs07) is down"
               }
            ],
            "severity" : "HEALTH_WARN",
            "summary" : {
               "message" : "21 osds down"
            }
         },
         "TOO_FEW_PGS" : {
            "detail" : [],
            "summary" : {
               "message" : "too few PGs per OSD (29 < min 30)"
            },
            "severity" : "HEALTH_WARN"
         },
         "OSD_HOST_DOWN" : {
            "summary" : {
               "message" : "5 hosts (25 osds) down"
            },
            "severity" : "HEALTH_WARN",
            "detail" : [
               {
                  "message" : "host cccs01 (root=default) (5 osds) is down"
               },
               {
                  "message" : "host cccs02 (root=default) (5 osds) is down"
               },
               {
                  "message" : "host cccs03 (root=default) (5 osds) is down"
               },
               {
                  "message" : "host cccs04 (root=default) (5 osds) is down"
               },
               {
                  "message" : "host cccs05 (root=default) (5 osds) is down"
               }
            ]
         },
         "PG_AVAILABILITY" : {
            "summary" : {
               "message" : "Reduced data availability: 1024 pgs inactive"
            },
            "severity" : "HEALTH_WARN",
            "detail" : [
               {
                  "message" : "pg 8.3cd is stuck inactive for 25210.465088, current state unknown, last acting []"
               },
...
               {
                  "message" : "pg 8.3ff is stuck inactive for 25210.465088, current state unknown, last acting []"
               }
            ]
         },
         "MON_CLOCK_SKEW" : {
            "detail" : [
               {
                  "message" : "mon.cccs02 addr 10.1.14.3:6789/0 clock skew 1.08092s > max 0.05s (latency 0.000677307s)"
               }
            ],
            "summary" : {
               "message" : "clock skew detected on mon.cccs02"
            },
            "severity" : "HEALTH_WARN"
         }
      },
      "overall_status" : "HEALTH_WARN",
      "summary" : [
         {
            "severity" : "HEALTH_WARN",
            "summary" : "'ceph health' JSON format has changed in luminous. If you see this your monitoring system is scraping the wrong fields. Disable this with 'mon health preluminous compat warning = false'"
         }
      ]
   },
   "servicemap" : {
      "services" : {},
      "epoch" : 1,
      "modified" : "0.000000"
   },
   "fsid" : "<<ID>>"
}
 
