New 3 node cluster, unable to install Ceph

gk_emmo

Member
Oct 24, 2020
I am trying to build a new cluster. This will be my 3rd one in the last 3 years, so I'd say I'm not a complete beginner. There are 3 nodes as a start. All nodes are connected via switches, with VLANs and bonding (2x10G) for the Ceph networks.

Corosync and management are on separate physical interfaces. Over the bond I intended to carry the two Ceph networks and the VM network.

I have this network config in all nodes:

Code:
auto lo
iface lo inet loopback


iface eno4 inet manual
#MGMT


auto eno1
iface eno1 inet manual
#BOND-1


auto eno2
iface eno2 inet manual
#BOND-2


auto eno3
iface eno3 inet static
        address 10.25.20.3/24
#COROSYNC


auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode balance-xor
        bond-xmit-hash-policy layer3+4
        mtu 9582
#BOND0


auto vmbr0
iface vmbr0 inet static
        address 10.231.10.3/24
        gateway 10.231.10.1
        bridge-ports eno4
        bridge-stp off
        bridge-fd 0
#MGMT-BR


auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        mtu 9582
#BOND-BR


auto vmbr2050
iface vmbr2050 inet static
        address 10.205.20.3/24
        bridge-ports bond0.2050
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        mtu 9582
#CLUSTER


auto vmbr2060
iface vmbr2060 inet static
        address 10.206.20.3/24
        bridge-ports bond0.2060
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        mtu 9582
#PUBLIC


auto vmbr100
iface vmbr100 inet manual
        bridge-ports bond0.1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
#LOSONCZI_LAN


source /etc/network/interfaces.d/*

I do the setup the usual way: install the nodes, update them, create the whole network config and verify it. All nodes can ping each other via all of their interfaces.

After I create the cluster and verify connectivity, I start to install Ceph via the GUI on the first node. I use the no-subscription repo with Reef (I tried Quincy as well): after the package install I chose the two networks, clicked next, done. On this node MON and MGR show up green, everything looks OK.
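
The GUI steps should be equivalent to roughly this on the CLI (the subnets are the ones from my config above):

Code:
pveceph install --repository no-subscription
pveceph init --network 10.206.20.0/24 --cluster-network 10.205.20.0/24
pveceph mon create
pveceph mgr create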

When I go to the second node and do the same, installing via the GUI, it closes the shell window by itself and says "Got timeout (500)". It never gets to the last screen like it should. After this, when I open the Ceph screen on the 2nd node, it also states "Got timeout (500)".
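
I assume the first things to check on node 2 are whether the PVE cluster itself is quorate and whether anything useful shows up behind that 500:

Code:
pvecm status                                                   # is the PVE cluster quorate?
journalctl -u pvedaemon -u pveproxy --since "10 minutes ago"   # errors behind the timeout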

pveversion -v is identical on all nodes:
Code:
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph: 18.2.2-pve1
ceph-fuse: 18.2.2-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.2
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.3-1
proxmox-backup-file-restore: 3.2.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2

ceph -s gives me this on the 1st node:
Code:
  cluster:
    id:     74a57cd3-5525-412e-bfa9-c20781d6e98c
    health: HEALTH_WARN
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 1 daemons, quorum pve1 (age 14m)
    mgr: pve1(active, since 14m)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:


On the 2nd node:
Code:
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
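
As far as I understand, that error just means the Ceph client on node 2 cannot read a ceph.conf at all; on PVE that file is normally a symlink into /etc/pve. Something like this should show whether it is actually there (this is my assumption about the expected layout):

Code:
ls -l /etc/ceph/ceph.conf   # normally a symlink to /etc/pve/ceph.conf
cat /etc/pve/ceph.conf      # only present if pmxcfs replicated it to this node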

I have re-installed the cluster 3 times now, and I don't understand what goes wrong with a fresh setup like this. Has anybody run into an issue similar to this?

I would appreciate any help.

Thank You
 
I kind of sorted the problem out partly by myself. The reason I used bonding is that I was lacking an extra 10Gbit interface for the VM network, and for file services it would be useful for clients to move data over it if needed.

I thought that if I can ping with large MTUs, and even telnet the monitor on node one, it means that the network is fine.
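
For reference, those checks were roughly the following (9554 is the 9582 MTU minus 28 bytes of IP/ICMP headers; 3300 and 6789 are the monitor ports; the .3 addresses are the ones from the config above):

Code:
ping -M do -s 9554 -c 3 10.205.20.3   # Ceph cluster network, DF bit set
ping -M do -s 9554 -c 3 10.206.20.3   # Ceph public network, DF bit set
telnet 10.206.20.3 3300               # monitor msgr2 port
telnet 10.206.20.3 6789               # monitor msgr1 legacy port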

It was somehow not...

I purged the above config, used the two 10G interfaces without any bonding for the Ceph cluster and public networks, and everything started working all of a sudden...
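
Roughly, the working layout is now just the two 10G ports addressed directly in the same subnets (a simplified sketch, not my literal file):

Code:
auto eno1
iface eno1 inet static
        address 10.205.20.3/24
#CEPH-CLUSTER (no bond, no VLAN)

auto eno2
iface eno2 inet static
        address 10.206.20.3/24
#CEPH-PUBLIC (no bond, no VLAN)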

If any experienced members see this and have run into or seen an issue like this before, please enlighten me as to what I did wrong.

Thanks in advance.
 