Hello, I have been having a very strange problem with some of my Proxmox nodes the past few days. This problem has seemingly started suddenly after having the current configuration running for at least three months.
Some Proxmox nodes are suddenly no longer able to communicate with one another. The affected nodes become isolated from the cluster, the cluster loses quorum, and Ceph stops functioning.
Symptoms:
Trying to view an affected node from another node results in:
Error: Connection Error 595: Connection refused
Trying to SSH from one PVE node to another immediately results in:
port 22: Connection refused
In the WebUI, these nodes show up with a red cross.
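For reference, this is roughly how I would check on an affected node (from its local console) whether sshd and the WebUI proxy are even listening, and whether a firewall rule is getting in the way; adjust to your own setup:
systemctl status ssh pveproxy          # are the daemons running at all?
ss -tlnp | grep -E ':22|:8006'         # are they bound to port 22 (SSH) and 8006 (WebUI)?
pve-firewall status                    # is the PVE firewall active?
iptables -L -n | grep -i reject        # any REJECT rules that would explain "connection refused"?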
The Ceph logs give me the most cause for concern:
2019-06-12 11:24:36.689357 mon.pve2 mon.1 192.168.1.102:6789/0 354 : cluster [INF] mon.pve2 calling monitor election
2019-06-12 11:24:41.716770 mon.pve2 mon.1 192.168.1.102:6789/0 355 : cluster [INF] mon.pve2 is new leader, mons pve2,pve3,pve4,pve5,pve6 in quorum (ranks 1,2,3,4,5)
A few seconds later:
[DBG] mgrmap e182: pve6(active), standbys: pve3, pve5, pve2, pve1
2019-06-12 11:24:41.786249 mon.pve2 mon.1 192.168.1.102:6789/0 360 : cluster [WRN] Health check failed: 1/6 mons down, quorum pve2,pve3,pve4,pve5,pve6 (MON_DOWN)
2019-06-12 11:24:41.796560 mon.pve2 mon.1 192.168.1.102:6789/0 361 : cluster [DBG] osd.12 192.168.1.101:6803/22176 reported immediately failed by osd.21 192.168.1.102:6808/49879
2019-06-12 11:24:41.796727 mon.pve2 mon.1 192.168.1.102:6789/0 362 : cluster [INF] osd.12 failed (root=default,host=pve1) (connection refused reported by osd.21)
2019-06-12 11:24:41.796972 mon.pve2 mon.1 192.168.1.102:6789/0 363 : cluster [DBG] osd.16 192.168.1.101:6805/22281 reported immediately failed by osd.27 192.168.1.102:6814/50219
2019-06-12 11:24:41.797079 mon.pve2 mon.1 192.168.1.102:6789/0 364 : cluster [INF] osd.16 failed (root=default,host=pve1) (connection refused reported by osd.27)
This goes on for a while, and then
2019-06-12 11:25:12.597138 osd.32 osd.32 192.168.1.101:6821/23169 2 : cluster [DBG] map e22911 wrongly marked me down at e22910
This seems to cause the Ceph cluster to see all the missing OSDs again:
2019-06-12 11:25:37.532316 mon.pve2 mon.1 192.168.1.102:6789/0 1182 : cluster [INF] mon.pve2 calling monitor election
2019-06-12 11:26:07.714656 mon.pve5 mon.4 192.168.1.105:6789/0 2571 : cluster [INF] mon.pve5 calling monitor election
2019-06-12 11:26:07.717357 mon.pve4 mon.3 192.168.1.104:6789/0 2806 : cluster [INF] mon.pve4 calling monitor election
2019-06-12 11:26:07.717463 mon.pve3 mon.2 192.168.1.103:6789/0 48373 : cluster [INF] mon.pve3 calling monitor election
2019-06-12 11:26:07.727553 mon.pve2 mon.1 192.168.1.102:6789/0 1183 : cluster [INF] mon.pve2 calling monitor election
2019-06-12 11:26:07.739388 mon.pve1 mon.0 192.168.1.101:6789/0 1642 : cluster [INF] mon.pve1 calling monitor election
2019-06-12 11:26:08.718094 mon.pve1 mon.0 192.168.1.101:6789/0 1643 : cluster [INF] mon.pve1 is new leader, mons pve1,pve2,pve3,pve4,pve5,pve6 in quorum (ranks 0,1,2,3,4,5)
2019-06-12 11:26:08.735386 mon.pve1 mon.0 192.168.1.101:6789/0 1644 : cluster [DBG] monmap e18: 6 mons at {pve1=192.168.1.101:6789/0,pve2=192.168.1.102:6789/0,pve3=192.168.1.103:6789/0,pve4=192.168.1.104:6789/0,pve5=192.168.1.105:6789/0,pve6=192.168.1.106:6789/0}
2019-06-12 11:26:08.735483 mon.pve1 mon.0 192.168.1.101:6789/0 1645 : cluster [DBG] fsmap cephfs-1/1/1 up {0=pve6=up:active}, 3 up:standby
2019-06-12 11:26:08.735570 mon.pve1 mon.0 192.168.1.101:6789/0 1646 : cluster [DBG] osdmap e22911: 38 total, 26 up, 38 in
2019-06-12 11:26:08.735739 mon.pve1 mon.0 192.168.1.101:6789/0 1647 : cluster [DBG] mgrmap e184: pve6(active), standbys: pve3, pve5, pve2, pve1
2019-06-12 11:26:08.736144 mon.pve1 mon.0 192.168.1.101:6789/0 1648 : cluster [INF] Health check cleared: MON_DOWN (was: 2/6 mons down, quorum pve1,pve3,pve4,pve5)
2019-06-12 11:26:08.738152 mon.pve1 mon.0 192.168.1.101:6789/0 1649 : cluster [DBG] osd.9 192.168.1.106:6805/3434 failure report canceled by osd.12 192.168.1.101:6803/22176
2019-06-12 11:26:08.738488 mon.pve1 mon.0 192.168.1.101:6789/0 1650 : cluster [DBG] osd.24 192.168.1.102:6806/49768 failure report canceled by osd.12 192.168.1.101:6803/22176
2019-06-12 11:26:08.738817 mon.pve1 mon.0 192.168.1.101:6789/0 1651 : cluster [DBG] osd.22 192.168.1.102:6812/50099 failure report canceled by osd.12 192.168.1.101:6803/22176
2019-06-12 11:26:08.738974 mon.pve1 mon.0 192.168.1.101:6789/0 1652 : cluster [DBG] osd.27 192.168.1.102:6814/50219 failure report canceled by osd.12 192.168.1.101:6803/22176
2019-06-12 11:26:08.739482 mon.pve1 mon.0 192.168.1.101:6789/0 1653 : cluster [DBG] osd.25 192.168.1.102:6804/49658 failure report canceled by osd.12 192.168.1.101:6803/22176
And so on for all the missing OSDs.
After some time, 15-30 minutes or so, OSDs start dropping out again.
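In case it helps to reproduce what I am seeing, the Ceph side of such an episode can be followed with the standard status commands (nothing specific to my setup):
ceph -s               # overall health, mon quorum and the up/in OSD counts
ceph osd tree down    # only the OSDs currently marked down, grouped by host
ceph -w               # follow the cluster log live, same messages as quoted above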
I believe this is because the PVE nodes are unable to communicate with each other.
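Since I suspect the cluster communication itself, this is a sketch of the checks I am focusing on for the PVE/corosync side; the omping test only applies if your corosync setup still uses multicast, and the node names are of course mine:
pvecm status                                                 # quorum info as seen from this node
systemctl status corosync pve-cluster                        # the services behind quorum and /etc/pve
journalctl -u corosync -u pve-cluster --since "1 hour ago"   # recent membership changes / retransmit errors
omping -c 600 -i 1 -q pve1 pve2 pve3 pve4 pve5 pve6          # multicast test, run on all nodes at the same time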
Things I have tried to remedy this:
On each affected node:
pvecm updatecerts
This made the affected machines reachable over SSH again for a little while.
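I am not sure whether the WebUI/API daemons also need to be restarted to pick up the regenerated certificates; in case it matters, this is what I would run afterwards (the --force flag re-creates the certificates even if they look valid):
pvecm updatecerts --force              # regenerate and redistribute the node certificates
systemctl restart pveproxy pvedaemon   # restart the WebUI/API daemons so they reload the certs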
Removing the SSH known_hosts files (user and system-wide), because I thought that perhaps the host keys no longer matched:
rm ~/.ssh/known_hosts && rm /etc/ssh/ssh_known_hosts
Removing these does not seem to have changed much.
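As far as I understand, /etc/ssh/ssh_known_hosts on a PVE node is normally a link to the cluster-wide /etc/pve/priv/known_hosts, and pvecm updatecerts is supposed to recreate it, so this is roughly how I would repopulate things after deleting the files:
ls -l /etc/ssh/ssh_known_hosts    # should point at /etc/pve/priv/known_hosts again
pvecm updatecerts                 # merges the host keys back into the cluster-wide file
ssh root@192.168.1.102 hostname   # reconnect once per node to confirm the host key is accepted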
pvecm add 192.168.1.103 -force
This seemed to help for a bit longer (a few hours rather than fifteen minutes).
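Afterwards, whether the node actually stays joined can be checked from both the re-added node and a healthy one:
pvecm nodes             # the cluster members this node currently sees
corosync-cfgtool -s     # local corosync ring status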
Currently I am at a loss as to how to reach stability again.
Does anyone have an idea what might be the root problem behind these symptoms?