[SOLVED] PVE 5 and Ceph luminous: unable to create monitor

Waschbüsch

I saw some threads on similar problems, but none of them seemed to match the issue I have. Apologies if I overlooked a thread that already covers this.
Anyway:

I cannot create a new monitor on my cluster. Or rather, it gets created but never joins quorum.

I tried doing this via:

- Web UI
- pveceph via command line
- manually as described here: http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-mons/ (a rough sketch of those steps is shown below)
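
For reference, the manual route from those docs boils down to roughly the following; this is only a sketch, with the monitor ID (0) and paths taken from my setup:

Code:
# fetch the cluster's mon keyring and the current monmap
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
# build the new mon's data directory from them and start the daemon
mkdir -p /var/lib/ceph/mon/ceph-0
ceph-mon -i 0 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-0
systemctl start ceph-mon@0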

I always get the following situation:

pveceph and the web UI both show:

Code:
# pveceph createmon -id 0
Created symlink /etc/systemd/system/ceph-mon.target.wants/ceph-mon@0.service -> /lib/systemd/system/ceph-mon@.service.
admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
INFO:ceph-create-keys:ceph-mon admin socket not ready yet.
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'synchronizing'

without ever ending.

And the log file for the monitor will have something similar to this:

Code:
# tail -f /var/log/ceph/ceph-mon.0.log
2017-09-03 09:28:44.269377 7f5cdebe4f80 4 rocksdb: [/home/builder/source/ceph-12.1.2/src/rocksdb/db/version_set.cc:2395] Creating manifest 26

2017-09-03 09:28:44.270652 7f5cdebe4f80 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1504430924270649, "job": 1, "event": "recovery_finished"}
2017-09-03 09:28:44.273735 7f5cdebe4f80 4 rocksdb: [/home/builder/source/ceph-12.1.2/src/rocksdb/db/db_impl_open.cc:1063] DB pointer 0x5645b299a000
2017-09-03 09:28:44.275134 7f5cdebe4f80 0 starting mon.0 rank 0 at public addr 192.168.1.1:6789/0 at bind addr 192.168.1.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 20b519f3-4988-4ac5-ac3c-7cd352431ebb
2017-09-03 09:28:44.275294 7f5cdebe4f80 0 starting mon.0 rank 0 at 192.168.1.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 20b519f3-4988-4ac5-ac3c-7cd352431ebb
2017-09-03 09:28:44.722781 7f5cdebe4f80 0 mon.0@-1(probing) e1 my rank is now 0 (was -1)
2017-09-03 09:28:44.725226 7f5cd17de700 0 -- 192.168.1.1:6789/0 >> 192.168.1.2:6789/0 conn(0x5645b55ad000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=13618560 cs=1 l=0).process missed message? skipped from seq 0 to 124134995
2017-09-03 09:28:44.725381 7f5cd47e4700 0 mon.0@0(probing) e64 my rank is now -1 (was 0)
2017-09-03 09:28:44.727080 7f5cd17de700 0 -- 192.168.1.1:6789/0 >> 192.168.1.2:6789/0 conn(0x5645b5dd9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=13618561 cs=1 l=0).process missed message? skipped from seq 0 to 313340281

My main issue right now is that I only have one monitor up, which, test cluster or not, I want to change asap. ;-)
All four nodes can reach each other, and port 6789 is not blocked by the firewall.
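
For reference, a quick way to check reachability of the mon port from each node (nc is netcat; the IP is the existing monitor's public address):

Code:
# should report the port as open/succeeded on every node
nc -zv 192.168.1.2 6789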

Does anyone have any ideas how to fix this?

Version and config info:

Code:
# pveversion --verbose
proxmox-ve: 5.0-20 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-30 (running version: 5.0-30/5ab26bc)
pve-kernel-4.10.17-2-pve: 4.10.17-20
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-4
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.11-pve17~bpo90
openvswitch-switch: 2.7.0-2
ceph: 12.1.2-pve1

Code:
# cat /etc/pve/ceph.conf
[global]
auth client required = none
auth cluster required = none
auth service required = none
auth supported = cephx
cluster network = 192.168.2.0/24
fsid = 20b519f3-4988-4ac5-ac3c-7cd352431ebb
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 10240
public network = 192.168.1.0/24

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.1]
host = srv02
mon addr = 192.168.1.2:6789
 
what does
Code:
ceph status
say?
 
Well, if I try to add a monitor, all I ever get is a timeout.
Otherwise:
Code:
# ceph status
  cluster:
    id:     20b519f3-4988-4ac5-ac3c-7cd352431ebb
    health: HEALTH_OK
 
  services:
    mon: 1 daemons, quorum 1
    mgr: 1(active), standbys: 0, 2, 3
    osd: 12 osds: 12 up, 12 in
 
  data:
    pools:   4 pools, 2048 pgs
    objects: 1500k objects, 2396 GB
    usage:   7455 GB used, 20257 GB / 27712 GB avail
    pgs:     2048 active+clean
 
  io:
    client:   955 kB/s rd, 5911 kB/s wr, 214 op/s rd, 218 op/s wr
 
what does the log of mon.1 say?
 
during the attempted creation via
Code:
pveceph createmon
I get:

Code:
2017-09-05 06:34:48.283245 7f1a35f4b700  0 mon.1@0(leader) e82 handle_command mon_command({"format":"plain","prefix":"mon getmap"} v 0) v1
2017-09-05 06:34:48.283295 7f1a35f4b700  0 log_channel(audit) log [DBG] : from='client.? 192.168.1.1:0/3896201511' entity='client.admin' cmd=[{"format":"plain","prefix":"mon getmap"}]: dispatch
2017-09-05 06:34:48.667232 7f1a35f4b700  1 mon.1@0(leader) e82  adding peer 192.168.1.1:6789/0 to list of hints
2017-09-05 06:34:48.667853 7f1a35f4b700  1 mon.1@0(leader) e82  adding peer 192.168.1.1:6789/0 to list of hints
2017-09-05 06:34:49.959705 7f1a32744700  1 mon.1@0(leader).log v49898403 check_sub sending message to client.76084147 192.168.1.2:0/632939495 with 0 entries (version 49898403)
2017-09-05 06:35:40.240509 7f1a38750700  0 mon.1@0(leader).data_health(13721) update_stats avail 59% total 7935 MB, used 2785 MB, avail 4724 MB

I probably need to increase the debug level? Which settings do you recommend?
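
For reference, mon-side debugging can be raised at runtime; something like the following should do (10 is just a guess at a useful level, the defaults are 1/5):

Code:
# bump monitor and paxos debug output on the existing mon
ceph tell mon.1 injectargs '--debug-mon 10 --debug-paxos 10'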
 
mon status:
Code:
{
    "name": "0",
    "rank": -1,
    "state": "synchronizing",
    "election_epoch": 0,
    "quorum": [],
    "features": {
        "required_con": "144115188077969408",
        "required_mon": [
            "kraken",
            "luminous"
        ],
        "quorum_con": "0",
        "quorum_mon": []
    },
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "sync": {
        "sync_provider": "mon.0 192.168.1.2:6789/0",
        "sync_cookie": 230267289601,
        "sync_start_version": 103105711
    },
    "monmap": {
        "epoch": 83,
        "fsid": "20b519f3-4988-4ac5-ac3c-7cd352431ebb",
        "modified": "2017-09-03 10:34:27.803779",
        "created": "2015-07-15 17:40:37.481052",
        "features": {
            "persistent": [
                "kraken",
                "luminous"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "1",
                "addr": "192.168.1.2:6789/0",
                "public_addr": "192.168.1.2:6789/0"
            }
        ]
    },
    "feature_map": {
        "mon": {
            "group": {
                "features": 2305244844532236283,
                "release": "luminous",
                "num": 2
            }
        },
        "client": {
            "group": {
                "features": 2305244844532236283,
                "release": "luminous",
                "num": 4
            }
        }
    }
}
 
does the existing monitor have a big mon store? if so, try increasing the timeout for the systemd unit, maybe it is able to successfully sync then:

Code:
mkdir /etc/systemd/system/ceph-mon@.service.d
echo -e '[Service]\nTimeoutStartSec=infinity\n' > /etc/systemd/system/ceph-mon@.service.d/wait-longer.conf
systemctl daemon-reload

removing the wait-longer.conf file reverts to the default timeout again. you can verify whether it is enabled with "systemctl cat ceph-mon@"
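
To double-check (the drop-in's path should be listed as a comment after the main unit file):

Code:
systemctl cat ceph-mon@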
 
The setting (applied on both the existing node and the node to be added) did not change the behavior. :-(

Btw, the store does not look overly large to me. ;-)

Code:
# du -sh /var/lib/ceph/mon/ceph-1
98M    ceph-1
 
could you provide the journal output around the time of the failed creation as well?
 
Code:
journalctl -u "ceph*" -u "pve*" --since "-10m"
(adapt the time if necessary)
 
okay, and once more:
Code:
journalctl --since "11:00" --until "11:01"
 
Code:
-- Logs begin at Sun 2017-09-03 03:23:10 UTC, end at Tue 2017-09-05 11:29:31 UTC. --
Sep 05 11:00:00 srv01 systemd[1]: Starting Proxmox VE replication runner...
Sep 05 11:00:01 srv01 systemd[1]: Started Proxmox VE replication runner.
Sep 05 11:00:14 srv01 pveceph[482]: <root@pam> starting task UPID:srv01:000001E4:013187AF:59AE83BE:ceph
Sep 05 11:00:15 srv01 systemd[1]: Started Ceph cluster monitor daemon.
Sep 05 11:00:15 srv01 systemd[1]: Reloading.
Sep 05 11:00:15 srv01 systemd[1]: apt-daily.timer: Adding 3h 52min 8.583306s random time.
Sep 05 11:00:15 srv01 systemd[1]: apt-daily-upgrade.timer: Adding 24min 8.778454s random time.
Sep 05 11:00:42 srv01 pveceph[484]: received interrupt
Sep 05 11:00:42 srv01 pveceph[520]: command 'ceph-create-keys -i 0' failed: received interrupt
Sep 05 11:00:42 srv01 pveceph[482]: <root@pam> end task UPID:srv01:000001E4:013187AF:59AE83BE:cephcreat
Sep 05 11:00:46 srv01 systemd[1]: Stopping Ceph cluster monitor daemon...
Sep 05 11:00:46 srv01 ceph-mon[523]: 2017-09-05 11:00:46.599548 7ffafb697700 -1 received  signal: Termi
Sep 05 11:00:46 srv01 ceph-mon[523]: 2017-09-05 11:00:46.599572 7ffafb697700 -1 mon.0@-1(synchronizing)
Sep 05 11:00:46 srv01 systemd[1]: Stopped Ceph cluster monitor daemon.

11:00:42 was when I stopped the attempt.
And it seems it was still trying to do 'ceph-create-keys -i 0' when I did.
The weird thing is that other procedures that presumably also need to create keys, such as 'pveceph createmgr', work without issues.
 
sorry if this was not clear - please keep the creation attempt running for at least a couple of minutes. the new monitor needs to sync its store with the existing one, and depending on your network and the size of the store this can take a while..

also, you can try with the 12.2.0 packages available in the test repo ;)
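
in case it helps, that would be the ceph-luminous test component; a sketch assuming the standard PVE 5 / Debian stretch repository layout:

Code:
# /etc/apt/sources.list.d/ceph.list
deb http://download.proxmox.com/debian/ceph-luminous stretch test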
 
When I added monitors before, it was near-instant, but I'll give it more time.
As for 12.2.0: I'll update and let you know if it makes a difference.
 
Twice, I waited for the pveceph command to time out on its own:
Code:
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
ceph-mon was not able to join quorum within 0 seconds
TASK ERROR: command 'ceph-create-keys -i 0' failed: exit code 1

Then I followed the surviving mon's log during an attempt and found this repeatedly:

Code:
2017-09-05 13:40:52.783440 7f9362301700  1 mon.1@1(peon).paxos(paxos recovering c 103116574..103117208) lease_timeout -- calling new election
2017-09-05 13:40:52.783899 7f935fafc700  1 mon.1@1(probing) e93 handle_timecheck drop unexpected msg

Apparently the continuous re-elections caused the syncing of the new mon to take much longer.

Anyway, why the timeouts? In the end, it was clock skew. I never thought to check for it, and

Code:
ceph status

could not warn me about it as long as I only had the one surviving monitor.
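
For anyone hitting the same thing: the skew can be spotted by hand; for example (the node names are from my setup):

Code:
# compare the clocks on all nodes in one go
for h in srv01 srv02 srv03 srv04; do echo -n "$h: "; ssh $h date +%s.%N; done
# and check that time synchronization is actually active on each node
timedatectl | grep -i synchronized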

Thanks, Fabian, for your patience in helping me with this!