Hello.
I don't know if the two things stated in the thread title are related, so I'll explain my problem.
I have a 3-node PVE 5 cluster and I'm testing Ceph on it. I had a healthy Ceph cluster for a week: each node has a ceph mon + ceph mgr (created with pveceph createmon) and 2 or 3 OSDs (created from the web interface). Ceph runs on its own dedicated network together with corosync (I know that's not optimal, but for now it's just to test Ceph and there isn't much load on it).
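For completeness, this is roughly how the Ceph layer was set up on each node (just a sketch from memory; the OSDs were actually created from the web interface, and /dev/sdb is only a placeholder device name):
Code:
# initialise Ceph on the dedicated 10.10.10.0/24 network
pveceph init --network 10.10.10.0/24
# create the monitor (on Luminous this also creates the manager)
pveceph createmon
# CLI equivalent of creating an OSD from the web interface (placeholder device)
pveceph createosd /dev/sdb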
My pveversion on all three nodes:
Code:
pve-hs-main[0]:~$ pveversion -v
proxmox-ve: 5.0-23 (running kernel: 4.10.17-3-pve)
pve-manager: 5.0-32 (running version: 5.0-32/2560e073)
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.10.17-3-pve: 4.10.17-23
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-14
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-18
libpve-guest-common-perl: 2.0-12
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-15
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.1-1
pve-container: 2.0-16
pve-firewall: 3.0-3
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.11-pve17~bpo90
ceph: 12.2.0-pve1
Yesterday, as a test, I gracefully removed the node pve-hs-3 / cluster-3 (migrated every VM/CT, set every OSD out and stopped it, destroyed every OSD, destroyed the mon on that node), powered the node off and removed it from the PVE cluster as per the instructions with pvecm delnode pve-hs-3.
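Roughly the commands involved in that removal, in case it matters (a sketch; the OSD IDs are placeholders for the ones that lived on that node):
Code:
# for each OSD that was on pve-hs-3 (IDs are placeholders)
ceph osd out <osd-id>
systemctl stop ceph-osd@<osd-id>
pveceph destroyosd <osd-id>
# remove the monitor on that node
pveceph destroymon pve-hs-3
# after powering the node off, from one of the remaining nodes:
pvecm delnode pve-hs-3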
I then reinstalled the node from scratch with the Proxmox ISO. I used the same IPs and the same hostname, but the installation is completely new. I joined the cluster with
Code:
pvecm add 10.10.10.251 -ring0_addr cluster-3
My /etc/hosts on each node
Code:
pve-hs-main[0]:~$ cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.2.251 pve-hs-main.local pve-hs-main pvelocalhost
192.168.2.252 pve-hs-2.local pve-hs-2
192.168.2.253 pve-hs-3.local pve-hs-3
10.10.10.251 cluster-main.local cluster-main
10.10.10.252 cluster-2.local cluster-2
10.10.10.253 cluster-3.local cluster-3
-cut-
pvelocalhost is on the right IP on each node.
My current pvecm status:
Code:
pve-hs-main[0]:~$ pvecm status
Quorum information
------------------
Date: Tue Oct 3 08:14:54 2017
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1/272
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.10.251 (local)
0x00000002 1 10.10.10.252
0x00000003 1 10.10.10.253
Code:
pve-hs-main[0]:~$ pvecm nodes
Membership information
----------------------
Nodeid Votes Name
1 1 cluster-main (local)
2 1 cluster-2
3 1 cluster-3
I then proceeded to install Ceph on the reinstalled node with the usual pveceph install --version luminous. Everything went fine, but I noticed errors when I used the ceph -s command. I discovered that the symlink from /etc/ceph/ceph.conf to /etc/pve/ceph.conf had not been created automatically, which is strange, but I created it by hand with ln -s. After that, all ceph commands started to work, so I created the mon + mgr on this host with pveceph createmon (the exact commands I ran are shown right after the config below). In ceph.conf everything seems right:
Code:
[global]
auth client required = none
auth cluster required = none
auth service required = none
bluestore_block_db_size = 64424509440
cluster network = 10.10.10.0/24
fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 10.10.10.0/24
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mon.pve-hs-main]
host = pve-hs-main
mon addr = 10.10.10.251:6789
[mon.pve-hs-3]
host = pve-hs-3
mon addr = 10.10.10.253:6789
[mon.pve-hs-2]
host = pve-hs-2
mon addr = 10.10.10.252:6789
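For reference, these are the commands I ran on the reinstalled node to fix the missing symlink and create the mon again (the ln -s line is spelled out from the paths mentioned above):
Code:
# recreate the missing symlink to the cluster-wide config
ln -s /etc/pve/ceph.conf /etc/ceph/ceph.conf
# then create the monitor + manager on this node
pveceph createmon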
After the monitor is created and Ceph starts using it, the process randomly crashes with these messages in syslog:
https://pastebin.com/vebhShDH
The mon then keeps crashing and restarting until I remove it with pveceph destroymon pve-hs-3.
I am now running the Ceph cluster with 2 mons (on the other 2 nodes). I added 3 OSDs on the node with the mon problem and everything is working fine; it only has problems if I create a mon on that node. Before the reinstall, that node ran a ceph mon without any problems.
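In case it helps, this is how I'm checking the mon situation at the moment (nothing special, just the standard ceph CLI):
Code:
# overall cluster health, including which mons are in quorum
ceph -s
# monitor map summary only
ceph mon stat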
Any ideas? I can reinstall the node from scratch again if needed.