Ceph-mon down

Frank bartels

Member
Hello everyone,

I hope we can find an approach here to solving our problem with a Ceph mon.
We are running a 3-node cluster with Ceph and HA.
On one of the nodes the Ceph mon can no longer be started:

pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.16-pve1
ceph-fuse: 14.2.16-pve1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-4
pve-cluster: 6.2-1
pve-container: 3.3-3
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-4
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

The error message:

systemctl status ceph-mon@justus.service
● ceph-mon@justus.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
           └─ceph-after-pve-cluster.conf
   Active: failed (Result: signal) since Sat 2021-02-13 16:09:28 CET; 47min ago
  Process: 57396 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id justus --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
 Main PID: 57396 (code=killed, signal=ABRT)

Feb 13 16:09:28 justus systemd[1]: ceph-mon@justus.service: Service RestartSec=10s expired, scheduling restart.
Feb 13 16:09:28 justus systemd[1]: ceph-mon@justus.service: Scheduled restart job, restart counter is at 5.
Feb 13 16:09:28 justus systemd[1]: Stopped Ceph cluster monitor daemon.
Feb 13 16:09:28 justus systemd[1]: ceph-mon@justus.service: Start request repeated too quickly.
Feb 13 16:09:28 justus systemd[1]: ceph-mon@justus.service: Failed with result 'signal'.
Feb 13 16:09:28 justus systemd[1]: Failed to start Ceph cluster monitor daemon.
Feb 13 16:22:01 justus systemd[1]: ceph-mon@justus.service: Start request repeated too quickly.
Feb 13 16:22:01 justus systemd[1]: ceph-mon@justus.service: Failed with result 'signal'.
Feb 13 16:22:01 justus systemd[1]: Failed to start Ceph cluster monitor daemon.
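The "Start request repeated too quickly" lines only mean that systemd's restart rate limit tripped after five crashes in a row; clearing that state allows one more manual start attempt with a fresh log. A minimal sketch, assuming the mon id justus:

systemctl reset-failed ceph-mon@justus.service   # clear systemd's failed state / restart counter
systemctl start ceph-mon@justus.service          # trigger one fresh start attempt
journalctl -u ceph-mon@justus.service -n 50      # see what that attempt logged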

The other two Ceph mons are running fine.

I did not really find much in the monitor log:

If any more information is needed, please just post here.

Many thanks in advance for the help.
 
Hello,

thank you for your reply. The command has already been run and did not lead to the desired result.
Do you have any further suggestions?

Thanks for the help.
 
Anything that might give you an idea in /var/log/ceph/ceph-mon....log?
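As a minimal sketch of what could be looked at, assuming the default log location and the mon id justus (path and grep pattern are only examples):

tail -n 200 /var/log/ceph/ceph-mon.justus.log                                           # last entries of the mon log
grep -iE 'error|abort|assert|corrupt' /var/log/ceph/ceph-mon.justus.log | tail -n 50    # recent suspicious lines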
 
Thanks for your reply:

I cannot see an error there. There are no "Corruption" entries in the log file.


2021-02-13 16:07:15.260 7fe1df182400 4 rocksdb: Options.error_if_exists: 0
-208> 2021-02-13 16:07:15.260 7fe1df182400 4 rocksdb: Options.error_if_exists: 0

2021-02-13 14:34:30.813 7f10053f6400 -1 *** Caught signal (Aborted) **
2: (gsignal()+0x10b) [0x7f100591f7bb]
0> 2021-02-13 14:34:30.813 7f10053f6400 -1 *** Caught signal (Aborted) **
2: (gsignal()+0x10b) [0x7f100591f7bb]
2021-02-13 14:34:41.629 7f2fde575400 -1 *** Caught signal (Aborted) **
2: (gsignal()+0x10b) [0x7f2fdea9e7bb]
0> 2021-02-13 14:34:41.629 7f2fde575400 -1 *** Caught signal (Aborted) **
2: (gsignal()+0x10b) [0x7f2fdea9e7bb]
2021-02-13 14:35:10.072 7fa0e7ed8400 -1 *** Caught signal (Aborted) **
2: (gsignal()+0x10b) [0x7fa0e84017bb]
0> 2021-02-13 14:35:10.072 7fa0e7ed8400 -1 *** Caught signal (Aborted) **

Size of mon db:

root@justus:~# du -sch /var/lib/ceph/mon/ceph-justus/store.db/
38M /var/lib/ceph/mon/ceph-justus/store.db/
38M total
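At 38M the store is small, so an offline copy before any further action is cheap. A minimal sketch, with the paths taken from the output above and the backup destination only an example:

systemctl stop ceph-mon@justus.service                          # make sure the mon is not running
cp -a /var/lib/ceph/mon/ceph-justus /root/ceph-justus-mon.bak   # offline copy of the mon data dir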
 
I have found another entry in the ceph-mon log:

2021-02-13 14:34:30.805 7f10053f6400 -1 /build/ceph/ceph-14.2.16/src/mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f10053f6400 time 2021-02-13 14:34:30.809894
/build/ceph/ceph-14.2.16/src/mon/AuthMonitor.cc: 278: FAILED ceph_assert(ret == 0)
 
root@justus:/var/log/ceph# dpkg -l | grep ceph
ii  ceph                  14.2.16-pve1  amd64  distributed storage and file system
ii  ceph-base             14.2.16-pve1  amd64  common ceph daemon libraries and management tools
ii  ceph-common           14.2.16-pve1  amd64  common utilities to mount and interact with a ceph storage cluster
ii  ceph-fuse             14.2.16-pve1  amd64  FUSE-based client for the Ceph distributed file system
ii  ceph-mds              14.2.16-pve1  amd64  metadata server for the ceph distributed file system
ii  ceph-mgr              14.2.16-pve1  amd64  manager for the ceph distributed storage system
ii  ceph-mon              14.2.16-pve1  amd64  monitor server for the ceph storage system
ii  ceph-osd              14.2.16-pve1  amd64  OSD server for the ceph storage system
ii  libcephfs2            14.2.16-pve1  amd64  Ceph distributed file system client library
ii  python-ceph-argparse  14.2.16-pve1  all    Python 2 utility libraries for Ceph CLI
ii  python-cephfs         14.2.16-pve1  amd64  Python 2 libraries for the Ceph libcephfs library
 
I have found another log entry:

root@justus:~# ceph crash info 2021-02-13_15:09:17.891220Z_66035827-6722-47ce-8137-bf05cf815342
{
    "os_version_id": "10",
    "assert_condition": "ret == 0",
    "utsname_release": "5.4.78-2-pve",
    "os_name": "Debian GNU/Linux 10 (buster)",
    "entity_name": "mon.justus",
    "assert_file": "/build/ceph/ceph-14.2.16/src/mon/AuthMonitor.cc",
    "timestamp": "2021-02-13 15:09:17.891220Z",
    "process_name": "ceph-mon",
    "utsname_machine": "x86_64",
    "assert_line": 278,
    "utsname_sysname": "Linux",
    "os_version": "10 (buster)",
    "os_id": "10",
    "assert_thread_name": "ceph-mon",
    "utsname_version": "#1 SMP PVE 5.4.78-2 (Thu, 03 Dec 2020 14:26:17 +0100)",
    "backtrace": [
        "(()+0x12730) [0x7efe55b3a730]",
        "(gsignal()+0x10b) [0x7efe5561d7bb]",
        "(abort()+0x121) [0x7efe55608535]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7efe56c6bd43]",
        "(()+0x277eca) [0x7efe56c6beca]",
        "(AuthMonitor::update_from_paxos(bool*)+0x1f3f) [0x55e9fa87480f]",
        "(PaxosService::refresh(bool*)+0x10a) [0x55e9fa90963a]",
        "(Monitor::refresh_from_paxos(bool*)+0x19c) [0x55e9fa7f3cfc]",
        "(Monitor::init_paxos()+0xfc) [0x55e9fa7f3fbc]",
        "(Monitor::preinit()+0xa08) [0x55e9fa826278]",
        "(main()+0x2614) [0x55e9fa7ae0c4]",
        "(__libc_start_main()+0xeb) [0x7efe5560a09b]",
        "(_start()+0x2a) [0x55e9fa7dd58a]"
    ],
    "utsname_hostname": "justus",
    "assert_msg": "/build/ceph/ceph-14.2.16/src/mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7efe550f4400 time 2021-02-13 16:09:17.888981\n/build/ceph/ceph-14.2.16/src/mon/AuthMonitor.cc: 278: FAILED ceph_assert(ret == 0)\n",
    "crash_id": "2021-02-13_15:09:17.891220Z_66035827-6722-47ce-8137-bf05cf815342",
    "assert_func": "virtual void AuthMonitor::update_from_paxos(bool*)",
    "ceph_version": "14.2.16"
}
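The assert in AuthMonitor::update_from_paxos fires while the mon reads the auth data from its own local store, so the store on this node looks damaged while the other two monitors are fine. In that situation a common approach is to discard the broken monitor and recreate it so it resyncs from the healthy quorum. A minimal sketch using the PVE tooling, assuming the remaining two monitors hold quorum and a backup of /var/lib/ceph/mon/ceph-justus has been taken first:

pveceph mon destroy justus   # remove the broken monitor from the cluster and this node
pveceph mon create           # recreate the monitor here; it will sync its store from the quorum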
 
