Ceph Crashing - Have to reboot nodes?

vispa

Well-Known Member
Feb 20, 2016
Hi All,

I have a cluster of 7 nodes in an HP C7000 chassis, and Ceph runs over a 10G switch.

The Ceph cluster has been running perfectly for almost a year, but all of a sudden something is causing the nodes to lose connectivity: all nodes show as unavailable and the CTs crash.

I have found the only way to fix the issue is to reboot all the nodes.

I've read that this could be a problem with multicast; however, I have run a ten-minute test and the results seem OK:

-------
10.10.11.48 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.040/0.169/0.837/0.072

10.10.11.48 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.044/0.179/0.838/0.073

10.10.11.49 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.037/0.165/0.652/0.068

10.10.11.49 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.042/0.178/0.640/0.069

10.10.11.51 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.041/0.154/0.617/0.058

10.10.11.51 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.046/0.163/0.616/0.058

10.10.11.52 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.037/0.146/1.240/0.077

10.10.11.52 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.040/0.159/1.236/0.078

10.10.11.53 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.039/0.154/0.423/0.059

10.10.11.53 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.041/0.164/0.435/0.061

10.10.11.54 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.040/0.142/0.352/0.047

10.10.11.54 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.043/0.150/0.378/0.047
--------
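
(For reference, that was a ten-minute omping run started on every node at roughly the same time; something along the lines of the sketch below, though my exact host list and options may have differed slightly.)

Code:
# ~600 packets at a 1 s interval = roughly 10 minutes, quiet summary output
omping -c 600 -i 1 -q 10.10.11.48 10.10.11.49 10.10.11.51 10.10.11.52 10.10.11.53 10.10.11.54 10.10.11.55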

Can anyone suggest how to fix this problem?
 
Please describe your setup further: what are your PVE/Ceph versions (pveversion -v / ceph versions), how are your cluster nodes connected to each other (e.g. is corosync traffic on the same network as Ceph?), and please post some logs. ;)
 
Hi,

Which log files will be helpful? I will grab a copy when the system crashes again.

I did look through some logs today but I’m not entirely sure what exactly has crashed.

Regards,

James
 
Every log file, from corosync to syslog, as the cause could be anything from a hardware failure to too much network load (e.g. corosync & Ceph on the same network).
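
For example (just a rough sketch of where to look, exact paths can vary), collect these from around the time of the next crash:

Code:
# corosync / cluster filesystem messages around the crash
grep -iE 'corosync|pmxcfs|totem' /var/log/syslog
journalctl -u corosync -u pve-cluster --since "1 hour ago"
# kernel messages (hardware errors, NIC resets, watchdog)
dmesg -T | tail -n 200
# Ceph MON/OSD logs
ls -l /var/log/ceph/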
 
Hi,

Looking at the Ceph & corosync configs, they are on separate networks.

Last time it crashed, I noticed three of the nodes seemed to have rebooted as the uptime was very low.

I'm just in the process of updating all the packages to see if it makes any difference. If and when it crashes again, I will provide the logs.

Code:
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = 10.10.11.0/24
     filestore xattr use omap = true
     fsid = c49cb41f-41fa-41a8-9dce-aa5e202ed84e
     keyring = /etc/pve/priv/$cluster.$name.keyring
     osd journal size = 5120
     osd pool default min size = 1
     public network = 10.10.11.0/24

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring
     osd max backfills = 1
     osd recovery max active = 1

[mon.1]
     host = cloud2
     mon addr = 10.10.11.52:6789

[mon.5]
     host = cloud5
     mon addr = 10.10.11.55:6789

[mon.6]
     host = storage2
     mon addr = 10.10.11.49:6789

[mon.4]
     host = cloud4
     mon addr = 10.10.11.54:6789

[mon.2]
     host = cloud3
     mon addr = 10.10.11.53:6789

[mon.3]
     host = storage1
     mon addr = 10.10.11.48:6789

[mon.0]
     host = cloud1
     mon addr = 10.10.11.51:6789


Corosync

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: cloud5
    nodeid: 6
    quorum_votes: 1
    ring0_addr: cloud5
  }

  node {
    name: cloud4
    nodeid: 5
    quorum_votes: 1
    ring0_addr: cloud4
  }

  node {
    name: storage1
    nodeid: 4
    quorum_votes: 1
    ring0_addr: storage1
  }

  node {
    name: storage2
    nodeid: 7
    quorum_votes: 1
    ring0_addr: storage2
  }

  node {
    name: cloud1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: cloud1
  }

  node {
    name: cloud2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: cloud2
  }
  node {
    name: cloud3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: cloud3
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: vispa
  config_version: 13
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 83.21x.x.x
    ringnumber: 0
  }

}
 
OK, so I had just been updating the packages on node Cloud2 when, halfway through, nodes Cloud3 & Cloud5 rebooted. This has now caused everything to crash.

Code:
root@cloud3:~# uptime
 11:43:22 up 0 min,  1 user,  load average: 0.17, 0.05, 0.01
root@cloud3:~# cd /ceph stat^C
root@cloud3:~# ceph status
    cluster c49cb41f-41fa-41a8-9dce-aa5e202ed84e
     health HEALTH_WARN
            250 pgs degraded
            1 pgs recovering
            69 pgs recovery_wait
            173 pgs stuck unclean
            184 pgs undersized
            recovery 124692/1034946 objects degraded (12.048%)
            1/14 in osds are down
            1 mons down, quorum 0,1,2,4,5,6 3,6,0,2,4,5
     monmap e9: 7 mons at {0=10.10.11.51:6789/0,1=10.10.11.52:6789/0,2=10.10.11.53:6789/0,3=10.10.11.48:6789/0,4=10.10.11.54:6789/0,5=10.10.11.55:6789/0,6=10.10.11.49:6789/0}
            election epoch 366, quorum 0,1,2,4,5,6 3,6,0,2,4,5
     osdmap e11219: 14 osds: 13 up, 14 in
      pgmap v16926332: 512 pgs, 2 pools, 1305 GB data, 336 kobjects
            4014 GB used, 5280 GB / 9295 GB avail
            124692/1034946 objects degraded (12.048%)
                 261 active+clean
                 173 active+undersized+degraded
                  59 active+recovery_wait+degraded
                  10 active+recovery_wait+undersized+degraded
                   7 active+degraded
                   1 active+clean+scrubbing+deep
                   1 active+recovering+undersized+degraded
recovery io 42303 kB/s, 13 objects/s
  client io 161 kB/s rd, 4345 kB/s wr, 755 op/s
root@cloud3:~# pvecm status
Quorum information
------------------
Date:             Tue Oct 31 11:45:32 2017
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000003
Ring ID:          4/397652
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      6
Quorum:           4 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 83.x.x.48
0x00000007          1 83.x.x.49
0x00000001          1 83.x.x.51
0x00000003          1 83.x.x.53 (local)
0x00000005          1 83.x.x.54
0x00000006          1 83.x.x.55
 
You have 7 MONs configured (only 6 currently in quorum). That is more than you need: Ceph only suggests going up to 5 MONs once a cluster has on the order of a thousand nodes/clients. Beyond that, extra MONs can also reduce performance, as all of them have to be kept in sync and every MON host needs enough spare resources for its MON to respond quickly. I recommend reducing the number of MONs (to a minimum of 3, and always an odd number, since quorum needs a simple majority) and lowering the load on the remaining MON nodes to keep latency down.
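
Roughly like this, one MON at a time and only while the cluster is otherwise healthy (a sketch; depending on your PVE version there is also pveceph destroymon / pveceph mon destroy for the same job):

Code:
# example for mon.5 on cloud5
systemctl stop ceph-mon@5
ceph mon remove 5
# then delete the [mon.5] section from /etc/pve/ceph.conf
# and, once the cluster is healthy again, the data dir /var/lib/ceph/mon/ceph-5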

You are using public IPs for your corosync traffic, and I presume CT/VM traffic goes through that interface too. That increases latency for corosync and will kill the cluster if the token does not arrive within the specified timeout. For higher stability it is recommended to give corosync a separate network and two rings through which it can receive its token.

Code:
Oct 31 11:37:50 cloud3 corosync[2176]:  [TOTEM ] A new membership (83.217.161.48:396776) was formed. Members left: 2
Oct 31 11:37:50 cloud3 corosync[2176]:  [QUORUM] Members[6]: 4 7 1 3 5 6
Oct 31 11:37:50 cloud3 corosync[2176]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 31 11:37:50 cloud3 pmxcfs[1334]: [dcdb] notice: cpg_send_message retried 1 times
Oct 31 11:37:51 cloud3 corosync[2176]:  [TOTEM ] A processor failed, forming new configuration.
Oct 31 11:37:53 cloud3 corosync[2176]:  [TOTEM ] Retransmit List: a b c e f 10
Oct 31 11:37:55 cloud3 corosync[2176]:  [TOTEM ] Retransmit List: a b c d f 10 11
Oct 31 11:37:58 cloud3 corosync[2176]:  [TOTEM ] Retransmit List: a b c d f 10 11
Oct 31 11:37:58 cloud3 pvestatd[1784]: modified cpu set for lxc/107: 12-12,7
At this stage your cluster fell apart as corosync couldn't receive its token anymore.
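
If you want to see what corosync is actually working with there, you can check the effective token timeout and the current membership (a quick sketch):

Code:
# effective totem token timeout in ms, as corosync is running it
corosync-cmapctl -g runtime.config.totem.token
# current quorum/membership state
corosync-quorumtool -s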

Code:
Oct 31 11:38:15 cloud3 kernel: [93431.428158] traps: asterisk[19467] trap invalid opcode ip:440dff sp:7ffd0730d350 error:0 in asterisk[400000+227000]
Oct 31 11:38:31 cloud3 kernel: [93447.485987] traps: asterisk[19728] trap invalid opcode ip:440dff sp:7ffec593bf10 error:0 in asterisk[400000+227000]
Oct 31 11:38:35 cloud3 kernel: [93451.500211] traps: asterisk[19762] trap invalid opcode ip:440dff sp:7ffc060ecef0 error:0 in asterisk[400000+227000]
Asterisk also seems a little broken, but I don't think it is the cause of the cluster failure.
http://lists.digium.com/pipermail/asterisk-users/2013-November/281242.html
 
Hi Alwin,

I will take your advice and reduce to 3 monitors.

Would corosync losing its token cause the nodes to reboot? I am having major problems, as 4-5 nodes out of 6 will reset at the same time.

For corosync, how would I go about separating the networks/IPs?

Regards,

James
 
Hi,

To clarify, I'm running an HP C7000 blade chassis. It has two switches, 1Gbps & 10Gbps.

On the 10Gbps network (eth1 on my nodes), I'm using 10.10.11.51-56 for Ceph.
On the 1Gbps network, I have public IPs for the nodes & containers.

Are you suggesting I add a third network specifically for corosync, so each machine would have three NICs?

Or can I simply add a secondary range, such as 10.10.12.x, to the 10Gbps network that is also running Ceph?

Regards,

James
 
A second ring will raise the reliability of corosync, but if there is too much traffic on both rings, your cluster might still fail. My suggestion is a third NIC, or at least a separate port, so corosync has its own network and unpredictable client traffic won't interfere with it.
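
Purely as a sketch of what that could look like in /etc/pve/corosync.conf on corosync 2.x (the 10.10.13.x addresses are made-up placeholders for a dedicated corosync network; bump config_version on every edit and make sure the new addresses resolve on all nodes):

Code:
totem {
  ...
  config_version: 14        # increase on every edit
  rrp_mode: passive         # needed for two rings on corosync 2.x
  interface {
    ringnumber: 0
    bindnetaddr: 10.10.13.0   # dedicated corosync network (placeholder)
  }
  interface {
    ringnumber: 1
    bindnetaddr: 83.21x.x.x   # existing network as fallback ring
  }
}

nodelist {
  node {
    name: cloud1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.13.51   # address on the dedicated corosync network
    ring1_addr: cloud1        # existing hostname/IP as the second ring
  }
  # ... same pattern for the remaining nodes ...
}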