Ceph-mon down

Frank bartels

Member
Hello everyone,

I hope we can find an approach here to solving our problem with a Ceph mon.
We are running a 3-node cluster with Ceph and HA.
On one of the nodes the Ceph mon can no longer be started:

pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.16-pve1
ceph-fuse: 14.2.16-pve1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-4
pve-cluster: 6.2-1
pve-container: 3.3-3
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-4
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

The error message:

systemctl status ceph-mon@justus.service
● ceph-mon@justus.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
           └─ceph-after-pve-cluster.conf
   Active: failed (Result: signal) since Sat 2021-02-13 16:09:28 CET; 47min ago
  Process: 57396 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id justus --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
 Main PID: 57396 (code=killed, signal=ABRT)

Feb 13 16:09:28 justus systemd[1]: ceph-mon@justus.service: Service RestartSec=10s expired, scheduling restart.
Feb 13 16:09:28 justus systemd[1]: ceph-mon@justus.service: Scheduled restart job, restart counter is at 5.
Feb 13 16:09:28 justus systemd[1]: Stopped Ceph cluster monitor daemon.
Feb 13 16:09:28 justus systemd[1]: ceph-mon@justus.service: Start request repeated too quickly.
Feb 13 16:09:28 justus systemd[1]: ceph-mon@justus.service: Failed with result 'signal'.
Feb 13 16:09:28 justus systemd[1]: Failed to start Ceph cluster monitor daemon.
Feb 13 16:22:01 justus systemd[1]: ceph-mon@justus.service: Start request repeated too quickly.
Feb 13 16:22:01 justus systemd[1]: ceph-mon@justus.service: Failed with result 'signal'.
Feb 13 16:22:01 justus systemd[1]: Failed to start Ceph cluster monitor daemon.
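The "Start request repeated too quickly" lines only mean that systemd's restart rate limit tripped after five crashes in a row; clearing that state allows one more manual start attempt with a fresh log. A minimal sketch, assuming the mon id justus:

systemctl reset-failed ceph-mon@justus.service   # clear systemd's failed state / restart counter
systemctl start ceph-mon@justus.service          # trigger one fresh start attempt
journalctl -u ceph-mon@justus.service -n 50      # see what that attempt logged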

The other two Ceph mons are running fine.

I did not really find much in the monitor log:

If any more information is needed, please just post here.

Many thanks in advance for the help.
 
Hello,

thank you for your reply. The command has already been run and did not lead to the desired result.
Do you have any further suggestions?

Thanks for the help.
 
Anything that might give you an idea in /var/log/ceph/ceph-mon....log?
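As a minimal sketch of what could be looked at, assuming the default log location and the mon id justus (path and grep pattern are only examples):

tail -n 200 /var/log/ceph/ceph-mon.justus.log                                           # last entries of the mon log
grep -iE 'error|abort|assert|corrupt' /var/log/ceph/ceph-mon.justus.log | tail -n 50    # recent suspicious lines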
 
Thanks for your reply:

I cannot see an error there. There are no "Corruption" entries in the log file.


2021-02-13 16:07:15.260 7fe1df182400 4 rocksdb: Options.error_if_exists: 0
-208> 2021-02-13 16:07:15.260 7fe1df182400 4 rocksdb: Options.error_if_exists: 0

2021-02-13 14:34:30.813 7f10053f6400 -1 *** Caught signal (Aborted) **
2: (gsignal()+0x10b) [0x7f100591f7bb]
0> 2021-02-13 14:34:30.813 7f10053f6400 -1 *** Caught signal (Aborted) **
2: (gsignal()+0x10b) [0x7f100591f7bb]
2021-02-13 14:34:41.629 7f2fde575400 -1 *** Caught signal (Aborted) **
2: (gsignal()+0x10b) [0x7f2fdea9e7bb]
0> 2021-02-13 14:34:41.629 7f2fde575400 -1 *** Caught signal (Aborted) **
2: (gsignal()+0x10b) [0x7f2fdea9e7bb]
2021-02-13 14:35:10.072 7fa0e7ed8400 -1 *** Caught signal (Aborted) **
2: (gsignal()+0x10b) [0x7fa0e84017bb]
0> 2021-02-13 14:35:10.072 7fa0e7ed8400 -1 *** Caught signal (Aborted) **

Size of mon db:

root@justus:~# du -sch /var/lib/ceph/mon/ceph-justus/store.db/
38M /var/lib/ceph/mon/ceph-justus/store.db/
38M total
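At 38M the store is small, so an offline copy before any further action is cheap. A minimal sketch, with the paths taken from the output above and the backup destination only an example:

systemctl stop ceph-mon@justus.service                          # make sure the mon is not running
cp -a /var/lib/ceph/mon/ceph-justus /root/ceph-justus-mon.bak   # offline copy of the mon data dir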
 
I have found another entry in the ceph-mon log:

2021-02-13 14:34:30.805 7f10053f6400 -1 /build/ceph/ceph-14.2.16/src/mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f10053f6400 time 2021-02-13 14:34:30.809894
/build/ceph/ceph-14.2.16/src/mon/AuthMonitor.cc: 278: FAILED ceph_assert(ret == 0)
 
root@justus:/var/log/ceph# dpkg -l | grep ceph
ii  ceph                  14.2.16-pve1  amd64  distributed storage and file system
ii  ceph-base             14.2.16-pve1  amd64  common ceph daemon libraries and management tools
ii  ceph-common           14.2.16-pve1  amd64  common utilities to mount and interact with a ceph storage cluster
ii  ceph-fuse             14.2.16-pve1  amd64  FUSE-based client for the Ceph distributed file system
ii  ceph-mds              14.2.16-pve1  amd64  metadata server for the ceph distributed file system
ii  ceph-mgr              14.2.16-pve1  amd64  manager for the ceph distributed storage system
ii  ceph-mon              14.2.16-pve1  amd64  monitor server for the ceph storage system
ii  ceph-osd              14.2.16-pve1  amd64  OSD server for the ceph storage system
ii  libcephfs2            14.2.16-pve1  amd64  Ceph distributed file system client library
ii  python-ceph-argparse  14.2.16-pve1  all    Python 2 utility libraries for Ceph CLI
ii  python-cephfs         14.2.16-pve1  amd64  Python 2 libraries for the Ceph libcephfs library
 
I have found another log entry:

root@justus:~# ceph crash info 2021-02-13_15:09:17.891220Z_66035827-6722-47ce-8137-bf05cf815342
{
    "os_version_id": "10",
    "assert_condition": "ret == 0",
    "utsname_release": "5.4.78-2-pve",
    "os_name": "Debian GNU/Linux 10 (buster)",
    "entity_name": "mon.justus",
    "assert_file": "/build/ceph/ceph-14.2.16/src/mon/AuthMonitor.cc",
    "timestamp": "2021-02-13 15:09:17.891220Z",
    "process_name": "ceph-mon",
    "utsname_machine": "x86_64",
    "assert_line": 278,
    "utsname_sysname": "Linux",
    "os_version": "10 (buster)",
    "os_id": "10",
    "assert_thread_name": "ceph-mon",
    "utsname_version": "#1 SMP PVE 5.4.78-2 (Thu, 03 Dec 2020 14:26:17 +0100)",
    "backtrace": [
        "(()+0x12730) [0x7efe55b3a730]",
        "(gsignal()+0x10b) [0x7efe5561d7bb]",
        "(abort()+0x121) [0x7efe55608535]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7efe56c6bd43]",
        "(()+0x277eca) [0x7efe56c6beca]",
        "(AuthMonitor::update_from_paxos(bool*)+0x1f3f) [0x55e9fa87480f]",
        "(PaxosService::refresh(bool*)+0x10a) [0x55e9fa90963a]",
        "(Monitor::refresh_from_paxos(bool*)+0x19c) [0x55e9fa7f3cfc]",
        "(Monitor::init_paxos()+0xfc) [0x55e9fa7f3fbc]",
        "(Monitor::preinit()+0xa08) [0x55e9fa826278]",
        "(main()+0x2614) [0x55e9fa7ae0c4]",
        "(__libc_start_main()+0xeb) [0x7efe5560a09b]",
        "(_start()+0x2a) [0x55e9fa7dd58a]"
    ],
    "utsname_hostname": "justus",
    "assert_msg": "/build/ceph/ceph-14.2.16/src/mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7efe550f4400 time 2021-02-13 16:09:17.888981\n/build/ceph/ceph-14.2.16/src/mon/AuthMonitor.cc: 278: FAILED ceph_assert(ret == 0)\n",
    "crash_id": "2021-02-13_15:09:17.891220Z_66035827-6722-47ce-8137-bf05cf815342",
    "assert_func": "virtual void AuthMonitor::update_from_paxos(bool*)",
    "ceph_version": "14.2.16"
}
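The assert in AuthMonitor::update_from_paxos fires while the mon reads the auth data from its own local store, so the store on this node looks damaged while the other two monitors are fine. In that situation a common approach is to discard the broken monitor and recreate it so it resyncs from the healthy quorum. A minimal sketch using the PVE tooling, assuming the remaining two monitors hold quorum and a backup of /var/lib/ceph/mon/ceph-justus has been taken first:

pveceph mon destroy justus   # remove the broken monitor from the cluster and this node
pveceph mon create           # recreate the monitor here; it will sync its store from the quorum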
 
