ceph out of quorum - ping is ok but monitor not

Harald Treis · Jun 13, 2018

Hi,
I have 3 Proxmox servers with redundant network interfaces. All servers are connected to 2 different switches, to be prepared if a switch (or just a link) fails. Bonding is configured on both sides (server and switch) with LACP. (OSDs are not defined at the moment.)

If one link fails (e.g. I cut the connection to the switch), it takes a couple of seconds and the server is reachable via ping again. But the Ceph cluster never returns to full quorum.

Why does failover work at the operating system level (tested with ping), while Ceph never becomes healthy again?

My Configuration:
ceph.conf

Code:

[global]
    auth client required = cephx
    auth cluster required = cephx
    auth service required = cephx
    cluster network = 192.168.17.0/24
    fsid = 5070e036-8f6c-4795-a34d-9035472a628d
    keyring = /etc/pve/priv/$cluster.$name.keyring
    mon allow pool delete = true
    osd journal size = 5120
    osd pool default min size = 2
    osd pool default size = 3
    public network = 192.168.17.0/24

[osd]
    keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.ariel1]
    host = ariel1
    mon addr = 192.168.17.31:6789

[mon.ariel4]
    host = ariel4
    mon addr = 192.168.17.34:6789

[mon.ariel2]
    host = ariel2
    mon addr = 192.168.17.32:6789

/etc/network/interfaces (of ariel1; the IPs of ariel2 end in 32, those of ariel4 in 34)
eth0, eth2 and eth4 are connected to switch-1
eth1, eth3 and eth5 are connected to switch-2

Code:

auto lo
iface lo inet loopback
iface eth0 inet manual
iface eth1 inet manual
iface eth2 inet manual
iface eth3 inet manual
iface eth4 inet manual
iface eth5 inet manual

auto bond0
iface bond0 inet manual
    slaves eth0 eth1
    bond_miimon 100
    bond_mode 802.3ad
    bond_xmit_hash_policy layer3+4
#frontside

auto bond1
iface bond1 inet static
    address  192.168.16.31
    netmask  255.255.255.0
    slaves eth2 eth3
    bond_miimon 100
    bond_mode 802.3ad
    bond_xmit_hash_policy layer3+4
    pre-up (ifconfig eth2 mtu 8996 && ifconfig eth3 mtu 8996)
    mtu 8996
#corosync

auto bond2
iface bond2 inet static
    address  192.168.17.31
    netmask  255.255.255.0
    slaves eth4 eth5
    bond_miimon 100
    bond_mode 802.3ad
    bond_xmit_hash_policy layer3+4
    pre-up (ifconfig eth4 mtu 8996 && ifconfig eth5 mtu 8996)
    mtu 8996
#ceph

auto vmbr0
iface vmbr0 inet static
    address  192.168.19.31
    netmask  255.255.255.0
    gateway  192.168.19.1
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0

Pings to all IPs in network 192.168.17.0/24 (.31, .32, .34) from all servers are OK.
ceph status

Code:

  cluster:
    id:     5070e036-8f6c-4795-a34d-9035472a628d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ariel1,ariel2,ariel4
    mgr: ariel2(active), standbys: ariel4
    osd: 0 osds: 0 up, 0 in

Now I pull out eth4 on ariel4. After a couple of seconds, ping works again without any errors.
But the Ceph cluster fails:

Code:

root@ariel1:~# ceph status
  cluster:
    id:     5070e036-8f6c-4795-a34d-9035472a628d
    health: HEALTH_WARN
            1/3 mons down, quorum ariel1,ariel2

  services:
    mon: 3 daemons, quorum ariel1,ariel2, out of quorum: ariel4
    mgr: ariel2(active), standbys: ariel4
    osd: 0 osds: 0 up, 0 in

Is any configuration missing or is this a bug?
Please help.

Kind regards,
Harry

Alwin · Jun 13, 2018

Are your switches configured with MLAG? Otherwise LACP across two switches doesn't really work; better try an active-backup bond.
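
For reference: with bond_mode 802.3ad, the Linux bonding driver groups the slave ports into aggregators, and only the ports of the one active aggregator carry traffic. When the two switches are not MLAG peers, eth4 and eth5 would typically end up in different aggregators. A minimal sketch for checking this on a node follows. The helper name `slave_aggregators` is made up for illustration, and the parsing assumes the usual layout of /proc/net/bonding/<bond>.

```shell
#!/bin/sh
# Print "<slave> <aggregator id>" for every slave of an 802.3ad bond.
# $1: bonding status file, e.g. /proc/net/bonding/bond2
slave_aggregators() {
    awk '
        /^Slave Interface:/ { slave = $3 }
        # The per-slave "Aggregator ID" lines start at column 0; the one
        # under "Active Aggregator Info" is indented and is skipped here.
        /^Aggregator ID:/ && slave != "" { print slave, $3 }
    ' "$1"
}

# Only query the live system if the bond exists on this machine.
if [ -r /proc/net/bonding/bond2 ]; then
    slave_aggregators /proc/net/bonding/bond2
fi
```

If the two slaves report different aggregator IDs, only one of them belongs to the active aggregator at any time.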

dietmar · Jun 13, 2018

You still have quorum, so what exactly is the problem?

Harald Treis · Jun 13, 2018

dietmar said:
You still have quorum, so what exactly is the problem?

ceph status says that ariel4's monitor is down, but the server is reachable via ping.

Code:

mon: 3 daemons, quorum ariel1,ariel2, out of quorum: ariel4

log from ariel1

Code:

2018-06-13 10:51:43.078991 mon.ariel1 mon.0 192.168.17.31:6789/0 151 : cluster [WRN] Health check failed: 1/3 mons down, quorum ariel1,ariel2 (MON_DOWN)
2018-06-13 10:51:43.238338 mon.ariel1 mon.0 192.168.17.31:6789/0 152 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel1,ariel2
2018-06-13 11:00:00.000120 mon.ariel1 mon.0 192.168.17.31:6789/0 162 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel1,ariel2
2018-06-13 12:00:00.000116 mon.ariel1 mon.0 192.168.17.31:6789/0 186 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel1,ariel2
2018-06-13 13:00:00.000107 mon.ariel1 mon.0 192.168.17.31:6789/0 211 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel1,ariel2

The monitor service on ariel4 is still running.
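
A running daemon is not necessarily a monitor that can reach its peers. One way to see what the monitor itself reports is its admin socket, queried locally on ariel4 with `ceph daemon mon.ariel4 mon_status`. Below is a small sketch that pulls the state out of that JSON. The `mon_state` helper is hypothetical, and the sed pattern assumes the pretty-printed output contains a `"state": "..."` line.

```shell
#!/bin/sh
# Extract the first "state" value from mon_status JSON read on stdin.
mon_state() {
    sed -n 's/.*"state": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Run on the affected node; skipped when the ceph CLI is not installed.
if command -v ceph >/dev/null 2>&1; then
    ceph daemon mon.ariel4 mon_status | mon_state
fi
```

A monitor that is cut off from its peers typically stays in `probing` or `electing` instead of reaching `leader` or `peon`.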

Alwin · Jun 13, 2018

Harald Treis said:
mon: 3 daemons, quorum ariel1,ariel2, out of quorum: ariel4

One out of three MONs has no quorum, but your cluster still does.

Are your switches configured with MLAG? Otherwise LACP across two switches doesn't really work; better try an active-backup bond.

Harald Treis · Jun 13, 2018

Alwin said:
One out of three MONs has no quorum, but your cluster still does.

Are your switches configured with MLAG? Otherwise LACP across two switches doesn't really work; better try an active-backup bond.

Thank you, Alwin.
It looks like our Netgear XS728T switches do not have this feature.

Harald Treis · Jun 15, 2018

Alwin said:
One out of three MONs has no quorum, but your cluster still does.

Are your switches configured with MLAG? Otherwise LACP across two switches doesn't really work; better try an active-backup bond.

Even "active-backup" does not work.

I tested with a small VM and disabled one port on the first Netgear switch (as before):

Code:

root@ariel4:/var/log/ceph# ceph status
  cluster:
    id:     5070e036-8f6c-4795-a34d-9035472a628d
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 13983/37132 objects degraded (37.658%), 96 pgs degraded
            1/3 mons down, quorum ariel2,ariel4
  services:
    mon: 3 daemons, quorum ariel2,ariel4, out of quorum: ariel1
    mgr: ariel4(active), standbys: ariel2, ariel1
    osd: 3 osds: 2 up, 3 in
  data:
    pools:   1 pools, 128 pgs
    objects: 18566 objects, 73458 MB
    usage:   143 GB used, 5443 GB / 5587 GB avail
    pgs:     75.000% pgs not active
             13983/37132 objects degraded (37.658%)
             96 undersized+degraded+peered
             32 active+clean

Why is ceph not able to switch over to the backup port?

Code:

cat /etc/network/interfaces
auto lo
iface lo inet loopback
iface eth0 inet manual
iface eth1 inet manual
iface eth2 inet manual
iface eth3 inet manual
iface eth4 inet manual
iface eth5 inet manual
auto bond0
iface bond0 inet manual
   slaves eth0 eth1
   bond_miimon 100
   bond_mode active-backup
#frontside

auto bond1
iface bond1 inet static
   address  192.168.16.34
   netmask  255.255.255.0
   slaves eth2 eth3
   bond_miimon 100
   bond_mode active-backup
   pre-up (ifconfig eth2 mtu 8996 && ifconfig eth3 mtu 8996)
   mtu 8996
#corosync

auto bond2
iface bond2 inet static
   address  192.168.17.34
   netmask  255.255.255.0
   slaves eth4 eth5
   bond_miimon 100
   bond_mode active-backup
   pre-up (ifconfig eth4 mtu 8996 && ifconfig eth5 mtu 8996)
   mtu 8996
#ceph

auto vmbr0
iface vmbr0 inet static
   address  192.168.19.34
   netmask  255.255.255.0
   gateway  192.168.19.1
   bridge_ports bond0
   bridge_stp off
   bridge_fd 0

Thanks for your help.

Alwin · Jun 15, 2018

Did you trunk the two switches together? If one node switches the primary interface of its bond, it still needs to reach the other working members, but they are connected to the other switch. With active-backup, the NIC only listens on one port of the bond.
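
To see this on a running system, it can help to check on each node which port a bond is currently using. In active-backup mode the bonding driver reports it as "Currently Active Slave" in /proc/net/bonding/<bond>. A minimal sketch follows; `active_slave` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the currently active slave of an active-backup bond.
# $1: bonding status file, e.g. /proc/net/bonding/bond2
active_slave() {
    sed -n 's/^Currently Active Slave: //p' "$1"
}

for bond in bond0 bond1 bond2; do
    f=/proc/net/bonding/$bond
    # Skip bonds that do not exist on this machine.
    if [ -r "$f" ]; then
        echo "$bond: $(active_slave "$f")"
    fi
done
```

If one node reports eth5 (switch-2) for bond2 while the others still report eth4 (switch-1), traffic between them has to cross a link between the two switches.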

Harald Treis · Jun 18, 2018

Alwin said:
Did you trunk the two switches together? If one node switches the primary interface of its bond, it still needs to reach the other working members, but they are connected to the other switch. With active-backup, the NIC only listens on one port of the bond.

Hey Alwin,

There is something I do not understand (testing bond2, net 192.168.17.0/24):

I have 3 nodes: ariel1, ariel2, ariel4.
All nodes have the same interface configuration; only the last digit of the IP differs.
eth0, eth2 and eth4 are connected to switch-1; eth1, eth3 and eth5 are connected to switch-2.
The switches do not support MLAG.
All 6 links are up.
ceph status says HEALTH_OK.
Pings to all servers are OK.

When I cut a link, e.g. eth4 on ariel1, the operating system is able to reconnect,
but Ceph is not. Why?

ping ariel1 -> ariel2/4:

Code:

arp -n | grep 192.168.17
192.168.17.32            ether   a0:36:9f:f7:8f:64   C                     bond2
192.168.17.34            ether   a0:36:9f:27:ba:2c   C                     bond2

PING 192.168.17.32 (192.168.17.32) 56(84) bytes of data.
64 bytes from 192.168.17.32: icmp_seq=1 ttl=64 time=0.074 ms

PING 192.168.17.34 (192.168.17.34) 56(84) bytes of data.
64 bytes from 192.168.17.34: icmp_seq=1 ttl=64 time=0.127 ms

ping ariel2 -> ariel1/4

Code:

arp -n | grep 192.168.17
192.168.17.34            ether   a0:36:9f:27:ba:2c   C                     bond2
192.168.17.31            ether   a0:36:9f:27:ba:18   C                     bond2

PING 192.168.17.31 (192.168.17.31) 56(84) bytes of data.
64 bytes from 192.168.17.31: icmp_seq=1 ttl=64 time=0.135 ms

PING 192.168.17.34 (192.168.17.34) 56(84) bytes of data.
64 bytes from 192.168.17.34: icmp_seq=1 ttl=64 time=0.116 ms

ping ariel4 -> ariel1/2

Code:

arp -n | grep 192.168.17
192.168.17.31            ether   a0:36:9f:27:ba:18   C                     bond2
192.168.17.32            ether   a0:36:9f:f7:8f:64   C                     bond2

PING 192.168.17.31 (192.168.17.31) 56(84) bytes of data.
64 bytes from 192.168.17.31: icmp_seq=1 ttl=64 time=0.082 ms

PING 192.168.17.32 (192.168.17.32) 56(84) bytes of data.
64 bytes from 192.168.17.32: icmp_seq=1 ttl=64 time=0.133 ms

ceph status

Code:

  cluster:
    id:     5070e036-8f6c-4795-a34d-9035472a628d
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum ariel1,ariel2,ariel4
    mgr: ariel4(active), standbys: ariel2, ariel1
    osd: 3 osds: 3 up, 3 in
  data:
    pools:   1 pools, 128 pgs
    objects: 18537 objects, 73211 MB
    usage:   142 GB used, 5444 GB / 5587 GB avail
    pgs:     128 active+clean

cat /etc/network/interfaces

Code:

auto lo
iface lo inet loopback
iface eth0 inet manual
iface eth1 inet manual
iface eth2 inet manual
iface eth3 inet manual
iface eth4 inet manual
iface eth5 inet manual

auto bond0
iface bond0 inet manual
   slaves eth0 eth1
   bond_miimon 100
   bond_mode active-backup
#frontside

auto bond1
iface bond1 inet static
   address  192.168.16.31
   netmask  255.255.255.0
   slaves eth2 eth3
   bond_miimon 100
   bond_mode active-backup
   pre-up (ifconfig eth2 mtu 8996 && ifconfig eth3 mtu 8996)
   mtu 8996
#corosync

auto bond2
iface bond2 inet static
   address  192.168.17.31
   netmask  255.255.255.0
   slaves eth4 eth5
   bond_miimon 100
   bond_mode active-backup
   pre-up (ifconfig eth4 mtu 8996 && ifconfig eth5 mtu 8996)
   mtu 8996
#ceph

auto vmbr0
iface vmbr0 inet static
   address  192.168.19.31
   netmask  255.255.255.0
   gateway  192.168.19.1
   bridge_ports bond0
   bridge_stp off
   bridge_fd 0

Now I pull the cable for ariel1 (eth4) out of switch-1,
so nothing can be received on this interface any more:

ping ariel1 -> ariel2/4:

Code:

arp -n | grep 192.168.17
192.168.17.32            ether   a0:36:9f:f7:8f:64   C                     bond2
192.168.17.34            ether   a0:36:9f:27:ba:2c   C                     bond2

PING 192.168.17.32 (192.168.17.32) 56(84) bytes of data.
64 bytes from 192.168.17.32: icmp_seq=1 ttl=64 time=0.147 ms

PING 192.168.17.34 (192.168.17.34) 56(84) bytes of data.
64 bytes from 192.168.17.34: icmp_seq=1 ttl=64 time=0.143 ms

ping ariel2 -> ariel1/4

Code:

arp -n | grep 192.168.17
192.168.17.34            ether   a0:36:9f:27:ba:2c   C                     bond2
192.168.17.31            ether   a0:36:9f:27:ba:18   C                     bond2

PING 192.168.17.31 (192.168.17.31) 56(84) bytes of data.
64 bytes from 192.168.17.31: icmp_seq=1 ttl=64 time=0.135 ms

PING 192.168.17.34 (192.168.17.34) 56(84) bytes of data.
64 bytes from 192.168.17.34: icmp_seq=1 ttl=64 time=0.098 ms

ping ariel4 -> ariel1/2

Code:

arp -n | grep 192.168.17
192.168.17.31            ether   a0:36:9f:27:ba:18   C                     bond2
192.168.17.32            ether   a0:36:9f:f7:8f:64   C                     bond2

PING 192.168.17.31 (192.168.17.31) 56(84) bytes of data.
64 bytes from 192.168.17.31: icmp_seq=1 ttl=64 time=0.104 ms

PING 192.168.17.32 (192.168.17.32) 56(84) bytes of data.
64 bytes from 192.168.17.32: icmp_seq=1 ttl=64 time=0.106 ms

ceph status

Code:

  cluster:
    id:     5070e036-8f6c-4795-a34d-9035472a628d
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down
            Reduced data availability: 96 pgs inactive
            Degraded data redundancy: 13967/37074 objects degraded (37.673%), 96 pgs degraded, 96 pgs undersized
            1/3 mons down, quorum ariel2,ariel4
  services:
    mon: 3 daemons, quorum ariel2,ariel4, out of quorum: ariel1
    mgr: ariel4(active), standbys: ariel2, ariel1
    osd: 3 osds: 2 up, 3 in
  data:
    pools:   1 pools, 128 pgs
    objects: 18537 objects, 73211 MB
    usage:   142 GB used, 5444 GB / 5587 GB avail
    pgs:     75.000% pgs not active
             13967/37074 objects degraded (37.673%)
             96 undersized+degraded+peered
             32 active+clean

After a while, Proxmox reduces the available space from 5587 GB to 3724 GB and runs a recovery.
One monitor and one OSD are still missing.

ceph status

Code:

  cluster:
    id:     5070e036-8f6c-4795-a34d-9035472a628d
    health: HEALTH_WARN
            1/3 mons down, quorum ariel2,ariel4
  services:
    mon: 3 daemons, quorum ariel2,ariel4, out of quorum: ariel1
    mgr: ariel4(active), standbys: ariel2, ariel1
    osd: 3 osds: 2 up, 2 in
  data:
    pools:   1 pools, 128 pgs
    objects: 18541 objects, 73226 MB
    usage:   141 GB used, 3583 GB / 3724 GB avail
    pgs:     128 active+clean

After reconnecting eth4, ariel1 rejoins the quorum and ceph status is HEALTH_OK again, with the original space:

Code:

  cluster:
    id:     5070e036-8f6c-4795-a34d-9035472a628d
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum ariel1,ariel2,ariel4
    mgr: ariel4(active), standbys: ariel2, ariel1
    osd: 3 osds: 3 up, 3 in
  data:
    pools:   1 pools, 128 pgs
    objects: 18541 objects, 73226 MB
    usage:   142 GB used, 5444 GB / 5587 GB avail
    pgs:     128 active+clean

Alwin · Jun 18, 2018

Check your MTU size on all interfaces and the switches.
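
The pings shown earlier use the default 56-byte payload, so they succeed even on a path that drops jumbo frames. With mtu 8996 on bond2, the largest ICMP payload that fits in one IPv4 frame is 8996 - 20 (IPv4 header) - 8 (ICMP header) = 8968 bytes. Here is a sketch of such a test, assuming the iputils `ping` with `-M do` (prohibit fragmentation); the function names are made up for illustration.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one IPv4 frame of the given MTU:
# MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send non-fragmentable pings of that size to each peer and report the result.
# $1: MTU, remaining arguments: peer addresses
check_jumbo_path() {
    mtu=$1
    shift
    size=$(icmp_payload_for_mtu "$mtu")
    for peer in "$@"; do
        if ping -M do -c 3 -s "$size" "$peer" >/dev/null 2>&1; then
            echo "$peer: $size-byte payload passes"
        else
            echo "$peer: $size-byte payload FAILS"
        fi
    done
}

# Example, run on ariel1 once with all links up and once with eth4 unplugged:
# check_jumbo_path 8996 192.168.17.32 192.168.17.34
```

If the large pings pass with all links up but fail after eth4 is unplugged, the backup path (eth5, switch-2, or a link between the switches) is probably not passing jumbo frames. That would fit small pings working while Ceph traffic does not.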
