We have an older 7-node Proxmox 3.4 test lab (Ceph Hammer 0.94.9 on 4 of the nodes, VMs only on the other 3) which we wanted to patch up today. After rebooting, the OSDs won't start; it seems the ceph CLI can't connect to the cluster. Any idea why that might be?
Previous version before patching:
root@node2:~# pveversion -verbose
proxmox-ve-2.6.32: 3.4-177 (running kernel: 2.6.32-46-pve)
pve-manager: 3.4-15 (running version: 3.4-15/e1daa307)
pve-kernel-2.6.32-45-pve: 2.6.32-174
pve-kernel-2.6.32-46-pve: 2.6.32-177
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-3
pve-cluster: 3.0-20
qemu-server: 3.4-9
pve-firmware: 1.1-5
libpve-common-perl: 3.0-27
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-35
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-27
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1
Version after patching:
root@node1:~# pveversion
pve-manager/3.4-16/40ccc11c (running kernel: 2.6.32-48-pve)
root@node1:~# pveversion -verbose
proxmox-ve-2.6.32: 3.4-187 (running kernel: 2.6.32-48-pve)
pve-manager: 3.4-16 (running version: 3.4-16/40ccc11c)
pve-kernel-2.6.32-48-pve: 2.6.32-187
pve-kernel-2.6.32-46-pve: 2.6.32-177
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-3
pve-cluster: 3.0-20
qemu-server: 3.4-9
pve-firmware: 1.1-6
libpve-common-perl: 3.0-27
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-35
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-28
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1
Ceph status: the monitor starts, but none of the OSDs do
root@node1:~# /etc/init.d/ceph status
=== osd.7 ===
osd.7: not running.
=== osd.4 ===
osd.4: not running.
=== osd.16 ===
osd.16: not running.
=== osd.5 ===
osd.5: not running.
=== osd.17 ===
osd.17: not running.
=== osd.6 ===
osd.6: not running.
=== mon.2 ===
mon.2: running {"version":"0.94.9"}
Attempt to start OSDs fails due to timeout
root@node1:~# /etc/init.d/ceph start osd.4
=== osd.4 ===
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.4 --keyring=/var/lib/ceph/osd/ceph-4/keyring osd crush create-or-move -- 4 0.13 host=node1 root=default'
It looks like the ceph CLI simply can't connect to the cluster. But what exactly does ceph do to connect, open a Unix or TCP socket, and to what/where?
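For reference, my understanding is that the ceph CLI does not use a local Unix socket for this; it reads the monitor addresses from /etc/ceph/ceph.conf (mon_host / mon addr) and opens TCP connections to the monitors, which on Hammer listen on port 6789 by default. A rough sketch of how one might probe that path (the 10.0.0.1 address is a placeholder, substitute a monitor IP from your own ceph.conf):

```shell
# The init script wraps the ceph call in `timeout 30`; timeout exits with
# status 124 when the wrapped command hits the limit, which matches the
# failure seen when starting osd.4:
timeout 2 sleep 5; echo "exit=$?"   # prints exit=124

# The monitor addresses the client will try come from ceph.conf:
grep -E 'mon[ _]host|mon[ _]addr' /etc/ceph/ceph.conf

# 10.0.0.1 is a placeholder monitor IP; Hammer mons listen on TCP 6789
# by default, so a raw reachability check could look like:
nc -z -w 2 10.0.0.1 6789 && echo "mon reachable" || echo "mon unreachable"

# Client-side messenger debugging shows which mon addresses the CLI is
# actually trying and where the connection stalls:
timeout 30 ceph --debug-ms 1 -s
```

The grep/nc/ceph lines obviously depend on the local cluster config; the point is just that if TCP 6789 to every monitor is unreachable (firewall change after the patch, wrong network after reboot, etc.), every ceph command will hang until `timeout` kills it, exactly as in the osd.4 start attempt above.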