patched from 3.4.15 to 3.4.16, now ceph 0.94.9 fails to start

stefws

Got an older 7-node 3.4 testlab (running Ceph Hammer 0.94.9 on 4 of the nodes and only VMs on the other 3), which we wanted to patch up today, but after rebooting our OSDs won't start; it seems ceph can't connect to the ceph cluster. Wondering why that might be?

Previous version before patching:
root@node2:~# pveversion -verbose
proxmox-ve-2.6.32: 3.4-177 (running kernel: 2.6.32-46-pve)
pve-manager: 3.4-15 (running version: 3.4-15/e1daa307)
pve-kernel-2.6.32-45-pve: 2.6.32-174
pve-kernel-2.6.32-46-pve: 2.6.32-177
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-3
pve-cluster: 3.0-20
qemu-server: 3.4-9
pve-firmware: 1.1-5
libpve-common-perl: 3.0-27
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-35
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-27
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Version after patching:
root@node1:~# pveversion
pve-manager/3.4-16/40ccc11c (running kernel: 2.6.32-48-pve)
root@node1:~# pveversion -verbose
proxmox-ve-2.6.32: 3.4-187 (running kernel: 2.6.32-48-pve)
pve-manager: 3.4-16 (running version: 3.4-16/40ccc11c)
pve-kernel-2.6.32-48-pve: 2.6.32-187
pve-kernel-2.6.32-46-pve: 2.6.32-177
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-3
pve-cluster: 3.0-20
qemu-server: 3.4-9
pve-firmware: 1.1-6
libpve-common-perl: 3.0-27
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-35
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-28
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Ceph status: the monitor starts, but none of the OSDs do
root@node1:~# /etc/init.d/ceph status
=== osd.7 ===
osd.7: not running.
=== osd.4 ===
osd.4: not running.
=== osd.16 ===
osd.16: not running.
=== osd.5 ===
osd.5: not running.
=== osd.17 ===
osd.17: not running.
=== osd.6 ===
osd.6: not running.
=== mon.2 ===
mon.2: running {"version":"0.94.9"}
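
Since the monitor reports as running, one sanity check (a sketch; the socket path below is just the Hammer default) is to query it over its local admin socket, which does not depend on cluster networking:

# ask mon.2 directly via its local unix admin socket (default path, adjust if needed)
ceph --admin-daemon /var/run/ceph/ceph-mon.2.asok mon_status
ceph --admin-daemon /var/run/ceph/ceph-mon.2.asok quorum_status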

Attempts to start the OSDs fail with a timeout
root@node1:~# /etc/init.d/ceph start osd.4
=== osd.4 ===
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.4 --keyring=/var/lib/ceph/osd/ceph-4/keyring osd crush create-or-move -- 4 0.13 host=node1 root=default'

It looks like the ceph CLI can't connect to the cluster, but what does ceph actually do to connect to the cluster: does it open a unix or TCP socket, and to what/where?
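
For reference, the ceph CLI opens TCP connections to the monitor addresses listed in /etc/ceph/ceph.conf (default port 6789); the unix sockets under /var/run/ceph are only the daemons' local admin sockets. A quick way to check basic reachability of a monitor (the address below is only an example; substitute a real monitor address from ceph.conf):

# which monitor addresses will the CLI try? (the key may be "mon host" or per-mon "mon addr")
grep -E 'mon[ _](host|addr)' /etc/ceph/ceph.conf
# can this node open a TCP connection to a monitor at all? (example address, default port)
nc -zv 10.0.3.2 6789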
 
It seems Debian now loads a kernel module named vxlan, used by openvswitch, and the patched node's various VLANs aren't working; digging into this...
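
A quick, illustrative way to see which OVS-related modules the new kernel actually loaded:

# which openvswitch-related modules are loaded under the running kernel?
lsmod | grep -E 'openvswitch|vxlan'
# what does the openvswitch module declare as its dependencies?
modinfo -F depends openvswitch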
 
In /etc/network/interfaces we have always had this VLAN on top of an openvswitch bond:
# Ceph cluster communication vlan (jumbo frames)
allow-vmbr1 vlan3
iface vlan3 inet static
ovs_type OVSIntPort
ovs_bridge vmbr1
#ovs_options vlan_mode=access
ovs_options tag=3
ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
address 10.0.3.1
netmask 255.255.255.0
mtu 9000
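
After ifup, one can check that this stanza actually produced the expected OVS internal port, tag and MTU (a sketch; port and bridge names as in the stanza above):

# is the internal port present on vmbr1 and tagged with VLAN 3?
ovs-vsctl show
ovs-vsctl get port vlan3 tag
# did the interface come up with the jumbo MTU and the right address?
ip link show vlan3
ip addr show vlan3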

Wondering what changed and why it causes our VLANs to fail now:

root@node1:/etc# ifdown vlan3
root@node1:/etc# ifup vlan3
Set name-type for VLAN subsystem. Should be visible in /proc/net/vlan/config
root@node1:/etc# ping 10.0.3.2
PING 10.0.3.2 (10.0.3.2) 56(84) bytes of data.
^C
--- 10.0.3.2 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 3999ms
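
Since even plain ICMP fails on the VLAN, watching whether the packets leave the internal port and the bond at all might help (a sketch; the member NIC name below is only an example):

# are the ICMP requests at least handed to the OVS internal port?
tcpdump -ni vlan3 icmp
# and do they show up tagged on an underlying bond member? (replace eth0 with a real member NIC)
tcpdump -ni eth0 -e vlan 3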
 
Booting back into the previous kernel 2.6.32-46 and starting networking manually, the VLANs work again.
(This should probably go into the networking forum instead...)
Wondering what changed in the kernel that causes the VLANs not to function. Any hints?

root@node1:~# ls -l /lib/modules/2.6.32-4?-pve/kernel/net/openvswitch/
/lib/modules/2.6.32-46-pve/kernel/net/openvswitch/:
total 172
-rw-r--r-- 1 root root 15952 Jun 28 2016 brcompat.ko
-rw-r--r-- 1 root root 152512 Jun 28 2016 openvswitch.ko

/lib/modules/2.6.32-48-pve/kernel/net/openvswitch/:
total 128
-rw-r--r-- 1 root root 15952 Mar 8 21:12 brcompat.ko
-rw-r--r-- 1 root root 109256 Mar 8 21:12 openvswitch.ko
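
Given the size difference between the two openvswitch.ko builds, comparing what each module was built with might give a hint (illustrative):

# compare the module builds shipped with the old and the new kernel
modinfo /lib/modules/2.6.32-46-pve/kernel/net/openvswitch/openvswitch.ko | grep -E '^(version|vermagic|depends)'
modinfo /lib/modules/2.6.32-48-pve/kernel/net/openvswitch/openvswitch.ko | grep -E '^(version|vermagic|depends)'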
 
