Ceph became disconnected suddenly

parker0909

Dear All,

I have an urgent problem: Ceph has become inaccessible with the error "Error got timeout (500)". We updated the switch firmware just before the problem started. I have also checked that all nodes can reach each other without any problem. Could you provide some suggestions for this?

Parker
 

Attachments

  • mon_error.png (52.7 KB)
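The "got timeout (500)" in the GUI typically just means the status call could not reach a working monitor quorum. A first-pass check (a sketch, not from the thread; it assumes the standard ceph-mon@<hostname> systemd unit names and uses cccs01 as an example hostname taken from later in this thread) is whether the mon daemons are running at all:

Code:
# On each monitor node (example hostname cccs01; substitute the local hostname)
systemctl status ceph-mon@cccs01
journalctl -u ceph-mon@cccs01 -e

# From any node: does a quorum form? (this can hang or time out while no mon is reachable)
ceph -s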
Hello all,

We have tried to recreate the mon, but we found that some OSDs are up and some are down.
Please find the results for node 6 (cccs06) from the following checks (annotated, runnable versions are sketched below):
systemctl status ceph-osd@45
systemctl status ceph-osd@27
ceph osd tree

Does anyone have an idea of how to bring all the OSDs up?

Thank you
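For reference, a lightly annotated version of the checks listed above for cccs06; the journalctl line is an added suggestion, not something from the original post:

Code:
# Per-daemon service state on cccs06 for the two OSDs named above
systemctl status ceph-osd@45
systemctl status ceph-osd@27

# Recent log lines usually show why an OSD failed to come up (added suggestion)
journalctl -u ceph-osd@45 -e

# Cluster-wide view: which OSDs are up/down and in/out, grouped by host
ceph osd tree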
 
Hi,

As this case is somewhat urgent, could we get some direction on how to fix the issue soon? Thanks.
 
Providing some more details on this. What I know has happened:
- A network issue occurred and the Ceph storage went offline
- All servers were rebooted
- Upon boot, Ceph was still offline
- OSDs and managers were running, monitors were not
- Monitors were brought online, but we saw inconsistent processes across the servers
- We shut down all nodes and brought them online one at a time
- They all came up and everything has been consistent since, but consistently showing the problems below.

Problems:
- ceph -s shows hosts and OSDs down. This appears to be cached/old data; the actual daemons are not being checked.
- OSDs are stuck in a booting state ("state": "booting") (see the direct daemon checks sketched below)
- PGs are inactive and unknown

Any help would be appreciated
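To separate what the monitors merely remember from what the daemons are actually doing, it helps to query each daemon directly rather than relying only on ceph -s. A minimal sketch, assuming the standard Proxmox/Ceph unit names and admin sockets; osd.45 is just the ID used as an example later in this post:

Code:
# What the cluster map claims (this is the possibly stale view)
ceph osd tree

# What the daemon itself is doing on its host (direct, not cached)
systemctl status ceph-osd@45
ceph daemon /var/run/ceph/ceph-osd.45.asok status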


Code:
# pveversion -v
proxmox-ve: 5.4-2 (running kernel: 4.15.18-24-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-12
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-56
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-7
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-41
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-54
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2
Code:
# ceph -s
  cluster:
    id:     <<ID>>
    health: HEALTH_WARN
            21 osds down
            5 hosts (25 osds) down
            Reduced data availability: 1024 pgs inactive
            too few PGs per OSD (29 < min 30)
            clock skew detected on mon.cccs02

  services:
    mon: 3 daemons, quorum cccs01,cccs02,cccs06
    mgr: cccs02(active), standbys: cccs06, cccs01, cccs03, cccs04, cccs05, cccs07, cccs08, cccs09
    osd: 48 osds: 14 up, 35 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 0 objects, 0B
    usage:   20.9TiB used, 18.6TiB / 39.6TiB avail
    pgs:     100.000% pgs unknown
             1024 unknown
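The clock skew warning in the ceph -s output above matters because the monitors only tolerate 0.05 s of skew by default. A minimal time-sync check, assuming the PVE 5.x default of systemd-timesyncd (substitute chrony or ntpd if that is what these nodes actually run):

Code:
# Is the local clock synchronised, and is the time-sync daemon healthy?
timedatectl status
systemctl status systemd-timesyncd

# Ask the monitors how far apart they think their clocks are
ceph time-sync-status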
Code:
/etc/ceph/ceph.conf
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 10.1.14.0/24
         fsid = <<ID>>
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon_allow_pool_delete = true
         osd_journal_size = 5120
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 10.1.14.0/24

[mds]
         keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring


[mon.cccs02]
         host = cccs02
         mon addr = 10.1.14.3:6789

[mon.cccs01]
         host = cccs01
         mon addr = 10.1.14.2:6789

[mon.cccs06]
         host = cccs06
         mon addr = 10.1.14.7:6789
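Since the problem started right after a switch firmware update and both public_network and cluster_network sit on 10.1.14.0/24, a basic network sanity check between the nodes may be worth running. The jumbo-frame ping is only relevant if MTU 9000 is actually configured on this network, which the thread does not state; the 10.1.14.3 address is taken from the ceph.conf above:

Code:
# Did the interface MTU on the Ceph network change after the switch update?
ip -d link show

# Plain reachability between nodes on 10.1.14.0/24
ping -c 3 10.1.14.3

# Only if jumbo frames are configured (assumption): a full-size, non-fragmenting
# ping fails if the switch no longer forwards MTU 9000 frames
ping -M do -s 8972 -c 3 10.1.14.3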
Code:
# ceph daemon /var/run/ceph/ceph-osd.45.asok status
{
    "cluster_fsid": "<<ID>>",
    "osd_fsid": "<<ID>>",
    "whoami": 45,
    "state": "booting",
    "oldest_map": 13578,
    "newest_map": 14334,
    "num_pgs": 133
}
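The "booting" state above generally means osd.45 is running as a process and has announced itself, but the monitors have not yet marked it up in the OSD map. A hedged next step (only after network and clock issues have been ruled out) is to check that OSD's recent log and restart it so it re-registers with the monitors:

Code:
# Recent log of the OSD stuck in "booting"
journalctl -u ceph-osd@45 -e

# Restart the daemon so it re-sends its boot message (judgment call, not from the thread)
systemctl restart ceph-osd@45

# Watch whether the monitors now mark it up
ceph osd tree | grep 'osd\.45'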
 
and
Code:
# pveceph status
{
   "mgrmap" : {
      "available" : true,
      "standbys" : [
         {
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ],
            "gid" : 42994125,
            "name" : "cccs06"
         },
         {
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ],
            "gid" : 42994134,
            "name" : "cccs01"
         },
         {
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ],
            "gid" : 42995169,
            "name" : "cccs03"
         },
         {
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ],
            "gid" : 42999430,
            "name" : "cccs04"
         },
         {
            "name" : "cccs05",
            "gid" : 42999801,
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ]
         },
         {
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ],
            "gid" : 43000743,
            "name" : "cccs07"
         },
         {
            "gid" : 43004584,
            "name" : "cccs08",
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ]
         },
         {
            "gid" : 43005028,
            "name" : "cccs09",
            "available_modules" : [
               "balancer",
               "dashboard",
               "influx",
               "localpool",
               "prometheus",
               "restful",
               "selftest",
               "status",
               "zabbix"
            ]
         }
      ],
      "epoch" : 352,
      "available_modules" : [
         "balancer",
         "dashboard",
         "influx",
         "localpool",
         "prometheus",
         "restful",
         "selftest",
         "status",
         "zabbix"
      ],
      "services" : {},
      "active_name" : "cccs02",
      "modules" : [
         "balancer",
         "restful",
         "status"
      ],
      "active_gid" : 42994111,
      "active_addr" : "10.1.14.3:6820/2632"
   },
   "osdmap" : {
      "osdmap" : {
         "nearfull" : false,
         "full" : false,
         "epoch" : 14279,
         "num_remapped_pgs" : 0,
         "num_up_osds" : 14,
         "num_in_osds" : 35,
         "num_osds" : 48
      }
   },
   "monmap" : {
      "fsid" : "<<ID>>",
      "features" : {
         "optional" : [],
         "persistent" : [
            "kraken",
            "luminous"
         ]
      },
      "created" : "2019-04-04 15:46:14.879083",
      "mons" : [
         {
            "public_addr" : "10.1.14.2:6789/0",
            "addr" : "10.1.14.2:6789/0",
            "name" : "cccs01",
            "rank" : 0
         },
         {
            "public_addr" : "10.1.14.3:6789/0",
            "name" : "cccs02",
            "addr" : "10.1.14.3:6789/0",
            "rank" : 1
         },
         {
            "public_addr" : "10.1.14.7:6789/0",
            "rank" : 2,
            "name" : "cccs06",
            "addr" : "10.1.14.7:6789/0"
         }
      ],
      "epoch" : 34,
      "modified" : "2020-03-23 07:49:10.312320"
   },
   "election_epoch" : 1810,
   "quorum" : [
      0,
      1,
      2
   ],
   "fsmap" : {
      "epoch" : 17,
      "by_rank" : []
   },
   "quorum_names" : [
      "cccs01",
      "cccs02",
      "cccs06"
   ],
   "pgmap" : {
      "unknown_pgs_ratio" : 1,
      "bytes_total" : 43505946312704,
      "data_bytes" : 0,
      "num_pgs" : 1024,
      "num_pools" : 1,
      "num_objects" : 0,
      "bytes_avail" : 20500081999872,
      "pgs_by_state" : [
         {
            "state_name" : "unknown",
            "count" : 1024
         }
      ],
      "bytes_used" : 23005864312832
   },
   "health" : {
      "detail" : [
         "'ceph health' JSON format has changed in luminous. If you see this your monitoring system is scraping the wrong fields. Disable this with 'mon health preluminous compat warning = false'"
      ],
      "status" : "HEALTH_WARN",
      "checks" : {
         "OSD_DOWN" : {
            "detail" : [
               {
                  "message" : "osd.13 (root=default,host=cccs03) is down"
               },
               {
                  "message" : "osd.14 (root=default,host=cccs03) is down"
               },
...
                {
                  "message" : "osd.33 (root=default,host=cccs07) is down"
               }
            ],
            "severity" : "HEALTH_WARN",
            "summary" : {
               "message" : "21 osds down"
            }
         },
         "TOO_FEW_PGS" : {
            "detail" : [],
            "summary" : {
               "message" : "too few PGs per OSD (29 < min 30)"
            },
            "severity" : "HEALTH_WARN"
         },
         "OSD_HOST_DOWN" : {
            "summary" : {
               "message" : "5 hosts (25 osds) down"
            },
            "severity" : "HEALTH_WARN",
            "detail" : [
               {
                  "message" : "host cccs01 (root=default) (5 osds) is down"
               },
               {
                  "message" : "host cccs02 (root=default) (5 osds) is down"
               },
               {
                  "message" : "host cccs03 (root=default) (5 osds) is down"
               },
               {
                  "message" : "host cccs04 (root=default) (5 osds) is down"
               },
               {
                  "message" : "host cccs05 (root=default) (5 osds) is down"
               }
            ]
         },
         "PG_AVAILABILITY" : {
            "summary" : {
               "message" : "Reduced data availability: 1024 pgs inactive"
            },
            "severity" : "HEALTH_WARN",
            "detail" : [
               {
                  "message" : "pg 8.3cd is stuck inactive for 25210.465088, current state unknown, last acting []"
               },
...
               {
                  "message" : "pg 8.3ff is stuck inactive for 25210.465088, current state unknown, last acting []"
               }
            ]
         },
         "MON_CLOCK_SKEW" : {
            "detail" : [
               {
                  "message" : "mon.cccs02 addr 10.1.14.3:6789/0 clock skew 1.08092s > max 0.05s (latency 0.000677307s)"
               }
            ],
            "summary" : {
               "message" : "clock skew detected on mon.cccs02"
            },
            "severity" : "HEALTH_WARN"
         }
      },
      "overall_status" : "HEALTH_WARN",
      "summary" : [
         {
            "severity" : "HEALTH_WARN",
            "summary" : "'ceph health' JSON format has changed in luminous. If you see this your monitoring system is scraping the wrong fields. Disable this with 'mon health preluminous compat warning = false'"
         }
      ]
   },
   "servicemap" : {
      "services" : {},
      "epoch" : 1,
      "modified" : "0.000000"
   },
   "fsid" : "<<ID>>"
}
 
