[SOLVED] Ceph won't start anymore

intellq

Member
Sep 8, 2021
After a power outage, Ceph on the main node of the cluster (pve1) won't start anymore.

As far as I can tell, the only service failing to start is ceph-mon@pve1.service.

All the others seem normal.

Nodes:
pve1 (3 osd - 1tb each)
pbs (only acting as mgr)
pve4 (only acting as mgr)
pve2 (3 osd - 1tb each)

pve1 and pve2 hold the data in a RAID 5-like layout across each other (I forgot the exact name). From these 1 TB disks (3 TB total) I created a 2 TB volume (1 TB for parity) and shared it with the whole cluster. That volume is now offline.

If I select any of the nodes in the GUI and click Ceph, I get "got timeout (500)".

Any help to restore my data will be highly appreciated :)

journalctl for ceph-mon@pve1.service (newest entries first):

Code:
fev 20 09:42:19 pve1 systemd[1]: Failed to start ceph-mon@pve1.service - Ceph cluster monitor daemon.

fev 20 09:42:19 pve1 systemd[1]: ceph-mon@pve1.service: Failed with result 'signal'.
fev 20 09:42:19 pve1 systemd[1]: ceph-mon@pve1.service: Start request repeated too quickly.
fev 20 09:42:19 pve1 systemd[1]: Stopped ceph-mon@pve1.service - Ceph cluster monitor daemon.
fev 20 09:42:19 pve1 systemd[1]: ceph-mon@pve1.service: Scheduled restart job, restart counter is at 5.
fev 20 09:42:09 pve1 systemd[1]: ceph-mon@pve1.service: Failed with result 'signal'.
fev 20 09:42:09 pve1 systemd[1]: ceph-mon@pve1.service: Main process exited, code=killed, status=6/ABRT
fev 20 09:42:09 pve1 ceph-mon[25962]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
fev 20 09:42:09 pve1 ceph-mon[25962]:  17: _start()
fev 20 09:42:09 pve1 ceph-mon[25962]:  16: __libc_start_main()
fev 20 09:42:09 pve1 ceph-mon[25962]:  15: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7e7432e4624a]
fev 20 09:42:09 pve1 ceph-mon[25962]:  14: main()
fev 20 09:42:09 pve1 ceph-mon[25962]:  13: (Monitor::preinit()+0x9a4) [0x5c50fc50be14]
fev 20 09:42:09 pve1 ceph-mon[25962]:  12: (Monitor::refresh_from_paxos(bool*)+0x10c) [0x5c50fc4df7fc]
fev 20 09:42:09 pve1 ceph-mon[25962]:  11: (LogMonitor::update_from_paxos(bool*)+0x50) [0x5c50fc56ba40]
fev 20 09:42:09 pve1 ceph-mon[25962]:  10: (LogMonitor::log_external_backlog()+0xf42) [0x5c50fc56acd2]
fev 20 09:42:09 pve1 ceph-mon[25962]:  9: (std::__throw_invalid_argument(char const*)+0x40) [0x7e7432ca0192]
fev 20 09:42:09 pve1 ceph-mon[25962]:  8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa90d8) [0x7e7432ca90d8]
fev 20 09:42:09 pve1 ceph-mon[25962]:  7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e85) [0x7e7432ca8e85]
fev 20 09:42:09 pve1 ceph-mon[25962]:  6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e1a) [0x7e7432ca8e1a]
fev 20 09:42:09 pve1 ceph-mon[25962]:  5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9d919) [0x7e7432c9d919]
fev 20 09:42:09 pve1 ceph-mon[25962]:  4: abort()
fev 20 09:42:09 pve1 ceph-mon[25962]:  3: gsignal()
fev 20 09:42:09 pve1 ceph-mon[25962]:  2: /lib/x86_64-linux-gnu/libc.so.6(+0x8aebc) [0x7e7432ea9ebc]
fev 20 09:42:09 pve1 ceph-mon[25962]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7e7432e5b050]
fev 20 09:42:09 pve1 ceph-mon[25962]:  ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
fev 20 09:42:09 pve1 ceph-mon[25962]:  in thread 7e7431a83d40 thread_name:ceph-mon
fev 20 09:42:09 pve1 ceph-mon[25962]:      0> 2025-02-20T09:42:09.065-0300 7e7431a83d40 -1 *** Caught signal (Aborted) **
fev 20 09:42:09 pve1 ceph-mon[25962]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
fev 20 09:42:09 pve1 ceph-mon[25962]:  17: _start()
fev 20 09:42:09 pve1 ceph-mon[25962]:  16: __libc_start_main()
fev 20 09:42:09 pve1 ceph-mon[25962]:  15: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7e7432e4624a]
fev 20 09:42:09 pve1 ceph-mon[25962]:  14: main()
fev 20 09:42:09 pve1 ceph-mon[25962]:  13: (Monitor::preinit()+0x9a4) [0x5c50fc50be14]
fev 20 09:42:09 pve1 ceph-mon[25962]:  12: (Monitor::refresh_from_paxos(bool*)+0x10c) [0x5c50fc4df7fc]
fev 20 09:42:09 pve1 ceph-mon[25962]:  11: (LogMonitor::update_from_paxos(bool*)+0x50) [0x5c50fc56ba40]
fev 20 09:42:09 pve1 ceph-mon[25962]:  10: (LogMonitor::log_external_backlog()+0xf42) [0x5c50fc56acd2]
fev 20 09:42:09 pve1 ceph-mon[25962]:  9: (std::__throw_invalid_argument(char const*)+0x40) [0x7e7432ca0192]
fev 20 09:42:09 pve1 ceph-mon[25962]:  8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa90d8) [0x7e7432ca90d8]
fev 20 09:42:09 pve1 ceph-mon[25962]:  7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e85) [0x7e7432ca8e85]
fev 20 09:42:09 pve1 ceph-mon[25962]:  6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e1a) [0x7e7432ca8e1a]
fev 20 09:42:09 pve1 ceph-mon[25962]:  5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9d919) [0x7e7432c9d919]
fev 20 09:42:09 pve1 ceph-mon[25962]:  4: abort()
fev 20 09:42:09 pve1 ceph-mon[25962]:  3: gsignal()
fev 20 09:42:09 pve1 ceph-mon[25962]:  2: /lib/x86_64-linux-gnu/libc.so.6(+0x8aebc) [0x7e7432ea9ebc]
fev 20 09:42:09 pve1 ceph-mon[25962]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7e7432e5b050]
fev 20 09:42:09 pve1 ceph-mon[25962]:  ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
fev 20 09:42:09 pve1 ceph-mon[25962]:  in thread 7e7431a83d40 thread_name:ceph-mon
fev 20 09:42:09 pve1 ceph-mon[25962]:      0> 2025-02-20T09:42:09.065-0300 7e7431a83d40 -1 *** Caught signal (Aborted) **
fev 20 09:42:09 pve1 ceph-mon[25962]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
fev 20 09:42:09 pve1 ceph-mon[25962]:  17: _start()
fev 20 09:42:09 pve1 ceph-mon[25962]:  16: __libc_start_main()
fev 20 09:42:09 pve1 ceph-mon[25962]:  15: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7e7432e4624a]
fev 20 09:42:09 pve1 ceph-mon[25962]:  14: main()
fev 20 09:42:09 pve1 ceph-mon[25962]:  13: (Monitor::preinit()+0x9a4) [0x5c50fc50be14]
fev 20 09:42:09 pve1 ceph-mon[25962]:  12: (Monitor::refresh_from_paxos(bool*)+0x10c) [0x5c50fc4df7fc]
fev 20 09:42:09 pve1 ceph-mon[25962]:  11: (LogMonitor::update_from_paxos(bool*)+0x50) [0x5c50fc56ba40]
fev 20 09:42:09 pve1 ceph-mon[25962]:  10: (LogMonitor::log_external_backlog()+0xf42) [0x5c50fc56acd2]
fev 20 09:42:09 pve1 ceph-mon[25962]:  9: (std::__throw_invalid_argument(char const*)+0x40) [0x7e7432ca0192]
fev 20 09:42:09 pve1 ceph-mon[25962]:  8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa90d8) [0x7e7432ca90d8]
fev 20 09:42:09 pve1 ceph-mon[25962]:  7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e85) [0x7e7432ca8e85]
fev 20 09:42:09 pve1 ceph-mon[25962]:  6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e1a) [0x7e7432ca8e1a]
fev 20 09:42:09 pve1 ceph-mon[25962]:  5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9d919) [0x7e7432c9d919]
fev 20 09:42:09 pve1 ceph-mon[25962]:  4: abort()
fev 20 09:42:09 pve1 ceph-mon[25962]:  3: gsignal()
fev 20 09:42:09 pve1 ceph-mon[25962]:  2: /lib/x86_64-linux-gnu/libc.so.6(+0x8aebc) [0x7e7432ea9ebc]
fev 20 09:42:09 pve1 ceph-mon[25962]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7e7432e5b050]
fev 20 09:42:09 pve1 ceph-mon[25962]:  ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
fev 20 09:42:09 pve1 ceph-mon[25962]:  in thread 7e7431a83d40 thread_name:ceph-mon
fev 20 09:42:09 pve1 ceph-mon[25962]: 2025-02-20T09:42:09.065-0300 7e7431a83d40 -1 *** Caught signal (Aborted) **
fev 20 09:42:09 pve1 ceph-mon[25962]:  17: _start()
fev 20 09:42:09 pve1 ceph-mon[25962]:  16: __libc_start_main()
fev 20 09:42:09 pve1 ceph-mon[25962]:  15: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7e7432e4624a]
fev 20 09:42:09 pve1 ceph-mon[25962]:  14: main()
fev 20 09:42:09 pve1 ceph-mon[25962]:  13: (Monitor::preinit()+0x9a4) [0x5c50fc50be14]
fev 20 09:42:09 pve1 ceph-mon[25962]:  12: (Monitor::refresh_from_paxos(bool*)+0x10c) [0x5c50fc4df7fc]
fev 20 09:42:09 pve1 ceph-mon[25962]:  11: (LogMonitor::update_from_paxos(bool*)+0x50) [0x5c50fc56ba40]
fev 20 09:42:09 pve1 ceph-mon[25962]:  10: (LogMonitor::log_external_backlog()+0xf42) [0x5c50fc56acd2]
fev 20 09:42:09 pve1 ceph-mon[25962]:  9: (std::__throw_invalid_argument(char const*)+0x40) [0x7e7432ca0192]
fev 20 09:42:09 pve1 ceph-mon[25962]:  8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa90d8) [0x7e7432ca90d8]
fev 20 09:42:09 pve1 ceph-mon[25962]:  7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e85) [0x7e7432ca8e85]
fev 20 09:42:09 pve1 ceph-mon[25962]:  6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa8e1a) [0x7e7432ca8e1a]
fev 20 09:42:09 pve1 ceph-mon[25962]:  5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9d919) [0x7e7432c9d919]
fev 20 09:42:09 pve1 ceph-mon[25962]:  4: abort()
fev 20 09:42:09 pve1 ceph-mon[25962]:  3: gsignal()
fev 20 09:42:09 pve1 ceph-mon[25962]:  2: /lib/x86_64-linux-gnu/libc.so.6(+0x8aebc) [0x7e7432ea9ebc]
fev 20 09:42:09 pve1 ceph-mon[25962]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7e7432e5b050]
fev 20 09:42:09 pve1 ceph-mon[25962]:  ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
fev 20 09:42:09 pve1 ceph-mon[25962]:  in thread 7e7431a83d40 thread_name:ceph-mon
fev 20 09:42:09 pve1 ceph-mon[25962]: *** Caught signal (Aborted) **
fev 20 09:42:09 pve1 ceph-mon[25962]:   what():  stoull
fev 20 09:42:09 pve1 ceph-mon[25962]: terminate called after throwing an instance of 'std::invalid_argument'
fev 20 09:42:08 pve1 systemd[1]: Started ceph-mon@pve1.service - Ceph cluster monitor daemon.
fev 20 09:42:08 pve1 systemd[1]: Stopped ceph-mon@pve1.service - Ceph cluster monitor daemon.
fev 20 09:42:08 pve1 systemd[1]: ceph-mon@pve1.service: Scheduled restart job, restart counter is at 4.
 
Hello intellq! Could you please post the output of ceph status?

Background: the error occurs where the monitor reads a configuration file (which succeeds), then tries to parse a string as a number (stoull) and fails with a std::invalid_argument exception. My guess is that the monitor was writing the configuration file at the moment of the power outage, which corrupted it. You could try deleting the monitor and creating a new one, but I would need some more information (from ceph status) before suggesting that fix.
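For anyone wondering why a bad string takes the whole daemon down: below is a minimal sketch of how std::stoull behaves, not the actual Ceph code. The helper name parse_u64 is made up for illustration. std::stoull throws std::invalid_argument when the string does not start with a number; if nothing catches the exception, the C++ runtime calls terminate and the process aborts, which matches the "terminate called after throwing an instance of 'std::invalid_argument'" and "what():  stoull" lines in the log.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not from Ceph): tries to parse `text` as an
// unsigned 64-bit number. Returns true and stores the value in `out`
// on success, false if the string is not a usable number.
bool parse_u64(const std::string& text, unsigned long long& out) {
    try {
        // Throws std::invalid_argument if no conversion can be done,
        // e.g. for an empty or non-numeric string.
        out = std::stoull(text);
        return true;
    } catch (const std::invalid_argument&) {
        // Without this handler the exception would propagate, and an
        // uncaught exception ends in std::terminate -> abort (SIGABRT).
        return false;
    } catch (const std::out_of_range&) {
        // The number does not fit into unsigned long long.
        return false;
    }
}
```

With this sketch, parse_u64("12345", v) succeeds, while an empty string or "garbage" returns false instead of aborting. In the trace above the exception is apparently not caught, hence the SIGABRT.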
 
Thanks for the help. On one of the switches, a single port wasn't working. I rebooted the switch and everything went back to normal.

That port was connected to the network interface used by the Ceph "server" pve2. So no other Ceph node could reach it, including pve1, which made pve1 throw that error.
 