Recent update this morning breaks Ceph

Gerhard W. Recher

Well-Known Member
Mar 10, 2017
Munich
Hi,

I installed the v5 beta and then the v5 release.

I had no problems with updates so far, except this morning: I scanned for new updates, a lot of Ceph updates popped up, and I installed them on all 4 machines.
Now I have no active mgr in the GUI, so I suppose I have shredded Ceph completely...
OSDs and MONs are running.

Any help is highly appreciated :(

Code:
root@pve01:~# ps axuw|grep cep
ceph      2333  0.2  0.0 467544 63708 ?        Ssl  12:05   0:04 /usr/bin/ceph-mon -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
ceph      2811  1.2  0.0 1255396 407168 ?      Ssl  12:05   0:26 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph      3080  1.0  0.0 1190028 347184 ?      Ssl  12:06   0:22 /usr/bin/ceph-osd -f --cluster ceph --id 4 --setuser ceph --setgroup ceph
ceph      3303  1.4  0.0 1292596 453400 ?      Ssl  12:06   0:31 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
ceph      3523  1.1  0.0 1249440 409896 ?      Ssl  12:06   0:23 /usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph
ceph      3728  1.1  0.0 1262032 421520 ?      Ssl  12:07   0:23 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
ceph      4060  0.8  0.0 1193312 344960 ?      Ssl  12:07   0:18 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
ceph      4240  0.9  0.0 1233052 393476 ?      Ssl  12:07   0:18 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
www-data  4649  0.0  0.0 531456 94456 ?        Ss   12:07   0:00 spiceproxy
www-data  4650  0.0  0.0 533924 99208 ?        S    12:07   0:00 spiceproxy worker

Code:
root@pve01:~# ceph -s
  cluster:
    id:     cb0aba69-bad9-4d30-b163-c19f0fd1ec53
    health: HEALTH_WARN
            no active mgr

  services:
    mon: 4 daemons, quorum 0,1,2,3
    mgr: no daemons active
    osd: 28 osds: 28 up, 28 in

  data:
    pools:   3 pools, 2112 pgs
    objects: 286k objects, 1143 GB
    usage:   2584 GB used, 34226 GB / 36811 GB avail
    pgs:     219831/878742 objects degraded (25.017%)
             1578 active+undersized+degraded
             534  active+clean

  io:
    client:   1362 B/s rd, 3405 B/s wr, 0 op/s rd, 0 op/s wr

Code:
pveceph status
{
   "monmap" : {
      "fsid" : "cb0aba69-bad9-4d30-b163-c19f0fd1ec53",
      "features" : {
         "persistent" : [
            "kraken",
            "luminous"
         ],
         "optional" : []
      },
      "epoch" : 5,
      "mons" : [
         {
            "rank" : 0,
            "public_addr" : "192.168.100.141:6789/0",
            "addr" : "192.168.100.141:6789/0",
            "name" : "0"
         },
         {
            "rank" : 1,
            "name" : "1",
            "public_addr" : "192.168.100.142:6789/0",
            "addr" : "192.168.100.142:6789/0"
         },
         {
            "rank" : 2,
            "public_addr" : "192.168.100.143:6789/0",
            "addr" : "192.168.100.143:6789/0",
            "name" : "2"
         },
         {
            "rank" : 3,
            "public_addr" : "192.168.100.144:6789/0",
            "name" : "3",
            "addr" : "192.168.100.144:6789/0"
         }
      ],
      "modified" : "2017-06-21 19:50:59.946144",
      "created" : "2017-06-21 19:36:06.835226"
   },
   "fsmap" : {
      "by_rank" : [],
      "epoch" : 1
   },
   "quorum" : [
      0,
      1,
      2,
      3
   ],
   "health" : {
      "checks" : {
         "MGR_DOWN" : {
            "severity" : "HEALTH_WARN",
            "detail" : [],
            "message" : "no active mgr"
         }
      },
      "status" : "HEALTH_WARN"
   },
   "pgmap" : {
      "bytes_avail" : 36750487359488,
      "degraded_objects" : 219831,
      "num_pools" : 3,
      "write_bytes_sec" : 3405,
      "num_objects" : 292914,
      "degraded_total" : 878742,
      "degraded_ratio" : 0.250166,
      "pgs_by_state" : [
         {
            "state_name" : "active+undersized+degraded",
            "count" : 1578
         },
         {
            "count" : 534,
            "state_name" : "active+clean"
         }
      ],
      "bytes_total" : 39526092644352,
      "bytes_used" : 2775605284864,
      "read_op_per_sec" : 0,
      "read_bytes_sec" : 1362,
      "data_bytes" : 1227584198774,
      "write_op_per_sec" : 0,
      "num_pgs" : 2112
   },
   "servicemap" : {
      "services" : {},
      "epoch" : 0,
      "modified" : "0.000000"
   },
   "osdmap" : {
      "osdmap" : {
         "nearfull" : false,
         "full" : false,
         "epoch" : 750,
         "num_in_osds" : 28,
         "num_osds" : 28,
         "num_remapped_pgs" : 0,
         "num_up_osds" : 28
      }
   },
   "fsid" : "cb0aba69-bad9-4d30-b163-c19f0fd1ec53",
   "mgrmap" : {
      "available" : false,
      "active_name" : "",
      "active_gid" : 0,
      "epoch" : 763,
      "standbys" : [],
      "modules" : [],
      "available_modules" : [],
      "active_addr" : "-"
   },
   "quorum_names" : [
      "0",
      "1",
      "2",
      "3"
   ],
   "election_epoch" : 176
}
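
The mgrmap above shows no active and no standby managers at all, while the MONs and OSDs look healthy. A quick way to check whether manager instances exist on a node and are merely stopped (a sketch; /var/lib/ceph/mgr is the standard Ceph data directory and the unit glob is plain systemd syntax):
Code:
# list the manager data directories created on this node (one per mgr instance)
ls /var/lib/ceph/mgr/
# show all ceph-mgr systemd instance units and whether they are running
systemctl list-units --all 'ceph-mgr@*'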

Code:
root@pve01:~# dpkg --list ceph*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                             Version               Architecture          Description
+++-================================-=====================-=====================-=====================================================================
ii  ceph                             12.1.1-pve1           amd64                 distributed storage and file system
ii  ceph-base                        12.1.1-pve1           amd64                 common ceph daemon libraries and management tools
un  ceph-client-tools                <none>                <none>                (no description available)
ii  ceph-common                      12.1.1-pve1           amd64                 common utilities to mount and interact with a ceph storage cluster
un  ceph-fs-common                   <none>                <none>                (no description available)
un  ceph-mds                         <none>                <none>                (no description available)
ii  ceph-mgr                         12.1.1-pve1           amd64                 manager for the ceph distributed storage system
ii  ceph-mon                         12.1.1-pve1           amd64                 monitor server for the ceph storage system
ii  ceph-osd                         12.1.1-pve1           amd64                 OSD server for the ceph storage system
un  ceph-test                        <none>                <none>                (no description available)
root@pve01:~# dpkg --list pve*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                             Version               Architecture          Description
+++-================================-=====================-=====================-=====================================================================
ii  pve-cluster                      5.0-12                amd64                 Cluster Infrastructure for Proxmox Virtual Environment
ii  pve-container                    2.0-15                all                   Proxmox VE Container management tool
ii  pve-docs                         5.0-9                 all                   Proxmox VE Documentation
ii  pve-firewall                     3.0-2                 amd64                 Proxmox VE Firewall
ii  pve-firmware                     2.0-2                 all                   Binary firmware code for the pve-kernel
ii  pve-ha-manager                   2.0-2                 amd64                 Proxmox VE HA Manager
un  pve-kernel                       <none>                <none>                (no description available)
ii  pve-kernel-4.10.11-1-pve         4.10.11-9             amd64                 The Proxmox PVE Kernel Image
ii  pve-kernel-4.10.15-1-pve         4.10.15-15            amd64                 The Proxmox PVE Kernel Image
ii  pve-kernel-4.10.17-1-pve         4.10.17-18            amd64                 The Proxmox PVE Kernel Image
un  pve-kvm                          <none>                <none>                (no description available)
ii  pve-libspice-server1             0.12.8-3              amd64                 SPICE remote display system server library
ii  pve-manager                      5.0-29                amd64                 Proxmox Virtual Environment Management Tools
ii  pve-qemu-kvm                     2.9.0-2               amd64                 Full virtualization on x86 hardware
un  pve-qemu-kvm-2.6.18              <none>                <none>                (no description available)

And after manually starting ceph-mgr it seems to be OK ... but I have no clue what happened at all!

Code:
root@pve01:~# ceph-mgr
root@pve01:~# ceph-mgr --help
usage: ceph-mgr -i <ID> [flags]

  --conf/-c FILE    read configuration from the given configuration file
  --id/-i ID        set ID portion of my name
  --name/-n TYPE.ID set name
  --cluster NAME    set cluster name (default: ceph)
  --setuser USER    set uid to user or uid (and gid to user's gid)
  --setgroup GROUP  set gid to group or gid
  --version         show version and quit

  -d                run in foreground, log to stderr.
  -f                run in foreground, log to usual location.
  --debug_ms N      set message debug level (e.g. 1)
root@pve01:~# ps axuw|grep cep
ceph      2333  0.1  0.0 484952 85240 ?        Ssl  12:05   0:07 /usr/bin/ceph-mon -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
ceph      2811  0.7  0.0 1255396 407692 ?      Ssl  12:05   0:30 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph      3080  0.6  0.0 1190028 347384 ?      Ssl  12:06   0:26 /usr/bin/ceph-osd -f --cluster ceph --id 4 --setuser ceph --setgroup ceph
ceph      3303  0.8  0.0 1292596 453616 ?      Ssl  12:06   0:36 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
ceph      3523  0.6  0.0 1249440 409924 ?      Ssl  12:06   0:28 /usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph
ceph      3728  0.6  0.0 1262032 422044 ?      Ssl  12:07   0:26 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
ceph      4060  0.5  0.0 1193312 345008 ?      Ssl  12:07   0:22 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
ceph      4240  0.5  0.0 1233052 393888 ?      Ssl  12:07   0:22 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
www-data  4649  0.0  0.0 531456 94456 ?        Ss   12:07   0:00 spiceproxy
www-data  4650  0.0  0.0 533924 99208 ?        S    12:07   0:00 spiceproxy worker
root     24457  0.4  0.0 407612 32064 ?        Ssl  13:15   0:00 ceph-mgr
root     24794  0.0  0.0  12784   944 pts/0    S+   13:16   0:00 grep cep
root@pve01:~# ceph -s
  cluster:
    id:     cb0aba69-bad9-4d30-b163-c19f0fd1ec53
    health: HEALTH_WARN
            72 pgs not deep-scrubbed for 86400
            446 pgs not scrubbed for 86400

  services:
    mon: 4 daemons, quorum 0,1,2,3
    mgr: admin(active)
    osd: 28 osds: 28 up, 28 in

  data:
    pools:   3 pools, 2112 pgs
    objects: 286k objects, 1143 GB
    usage:   3447 GB used, 45634 GB / 49082 GB avail
    pgs:     2112 active+clean

  io:
    client:   1022 B/s wr, 0 op/s rd, 0 op/s wr
 
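Note that in the ps output above the manually launched ceph-mgr runs as root and without an instance ID, unlike the packaged MON/OSD daemons, which drop to the ceph user via --setuser/--setgroup. A quick, generic check that a manager actually registered with the monitors (nothing Proxmox-specific assumed here):
Code:
# the mgr line should show an active daemon instead of "no daemons active"
ceph -s | grep mgr
# show user, PID and command line of any running ceph-mgr processes
ps -o user,pid,cmd -C ceph-mgr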
We pushed an upgrade to the second release candidate for Luminous, which no longer starts the ceph-mgr by default.

In your case, you should already have a manager on each node that has a monitor; you just have to enable and start it:
Code:
systemctl start ceph-mgr@<monid>.service
systemctl enable ceph-mgr@<monid>.service
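
The monmap earlier in the thread shows monitors with the IDs 0 through 3, one per node, so a concrete sketch for the node hosting monitor "0" would look like this (repeat with 1, 2 and 3 on the other nodes; which ID lives on which node is an assumption you should check against your own monmap):
Code:
# enable and start the manager instance matching this node's monitor ID
systemctl enable ceph-mgr@0.service
systemctl start ceph-mgr@0.service
# verify that it came up and registered with the cluster
systemctl status ceph-mgr@0.service
ceph -s | grep mgr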
 
OK, I have done this on all 4 nodes now.

Shall I wait for the end of scrubbing, and then reboot the whole cluster?
Code:
 ceph -s
  cluster:
    id:     cb0aba69-bad9-4d30-b163-c19f0fd1ec53
    health: HEALTH_WARN
            68 pgs not deep-scrubbed for 86400
            417 pgs not scrubbed for 86400

  services:
    mon: 4 daemons, quorum 0,1,2,3
    mgr: 2(active)
    osd: 28 osds: 28 up, 28 in

  data:
    pools:   3 pools, 2112 pgs
    objects: 286k objects, 1143 GB
    usage:   3447 GB used, 45634 GB / 49082 GB avail
    pgs:     2112 active+clean
 
Just started a scrub of all PGs ... to force things to be clean ... hopefully :)
Code:
# tell every active+clean PG to deep-scrub (the first column of "ceph pg dump" is the PG id)
ceph pg dump | grep -i active+clean | awk '{print $1}' | while read i; do ceph pg deep-scrub ${i}; done
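
To watch the backlog drain while the scrubs run, simply polling the cluster status is enough (the 30-second interval below is arbitrary):
Code:
# re-run the status every 30 seconds; the "not scrubbed" warnings shrink as PGs finish
watch -n 30 ceph -s
# count the PGs that are scrubbing right now (their state contains "scrubbing")
ceph pg dump | grep -c scrubbing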
 
Using 4 monitor nodes does not help with quorum, but produces more traffic than 3 nodes.

Traffic on the Mellanox 56 GbE interconnect is no issue, I guess. Why should I remove redundancy?
 

Attachments

  • cluster_scrubbing.PNG (103.1 KB)
If you run 4 monitors, you need 3 of them online to have quorum.
If you run 3 monitors, you only need 2 monitor nodes online; the other 2 nodes can be offline, so that setup provides higher availability!
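
For reference, monitor quorum always needs a strict majority, i.e. floor(N/2) + 1 monitors; a quick sketch of the arithmetic:
Code:
# quorum size and tolerated monitor failures for small monitor counts
for n in 3 4 5; do
    q=$(( n / 2 + 1 ))
    echo "monitors=$n  quorum=$q  tolerated failures=$(( n - q ))"
done
With 3 monitors and with 4 monitors the cluster tolerates exactly one monitor failure either way, which is why the fourth monitor only adds traffic without improving fault tolerance.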