Hello, I have been having a very strange problem with some of my Proxmox nodes the past few days. This problem has seemingly started suddenly after having the current configuration running for at least three months.
Some Proxmox nodes are suddenly no longer able to communicate with one another. The affected nodes become isolated from the cluster, the cluster loses quorum, and Ceph stops functioning.
Symptoms:
Trying to view an affected node from another node results in:
Error: Connection Error 595: Connection refused
Trying to SSH from one PVE node to another immediately results in:
port 22: Connection refused
In the WebUI, these nodes show up with a red cross.
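For reference, this is roughly how I would check on an affected node (from its local console) whether sshd and the WebUI proxy are even listening, and whether a firewall rule is getting in the way; adjust to your own setup:
systemctl status ssh pveproxy          # are the daemons running at all?
ss -tlnp | grep -E ':22|:8006'         # are they bound to port 22 (SSH) and 8006 (WebUI)?
pve-firewall status                    # is the PVE firewall active?
iptables -L -n | grep -i reject        # any REJECT rules that would explain "connection refused"?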
The Ceph logs give me the most cause for concern:
2019-06-12 11:24:36.689357 mon.pve2 mon.1 192.168.1.102:6789/0 354 : cluster [INF] mon.pve2 calling monitor election
2019-06-12 11:24:41.716770 mon.pve2 mon.1 192.168.1.102:6789/0 355 : cluster [INF] mon.pve2 is new leader, mons pve2,pve3,pve4,pve5,pve6 in quorum (ranks 1,2,3,4,5)
A few seconds later:
[DBG] mgrmap e182: pve6(active), standbys: pve3, pve5, pve2, pve1
2019-06-12 11:24:41.786249 mon.pve2 mon.1 192.168.1.102:6789/0 360 : cluster [WRN] Health check failed: 1/6 mons down, quorum pve2,pve3,pve4,pve5,pve6 (MON_DOWN)
2019-06-12 11:24:41.796560 mon.pve2 mon.1 192.168.1.102:6789/0 361 : cluster [DBG] osd.12 192.168.1.101:6803/22176 reported immediately failed by osd.21 192.168.1.102:6808/49879
2019-06-12 11:24:41.796727 mon.pve2 mon.1 192.168.1.102:6789/0 362 : cluster [INF] osd.12 failed (root=default,host=pve1) (connection refused reported by osd.21)
2019-06-12 11:24:41.796972 mon.pve2 mon.1 192.168.1.102:6789/0 363 : cluster [DBG] osd.16 192.168.1.101:6805/22281 reported immediately failed by osd.27 192.168.1.102:6814/50219
2019-06-12 11:24:41.797079 mon.pve2 mon.1 192.168.1.102:6789/0 364 : cluster [INF] osd.16 failed (root=default,host=pve1) (connection refused reported by osd.27)
This goes on for a while, and then
2019-06-12 11:25:12.597138 osd.32 osd.32 192.168.1.101:6821/23169 2 : cluster [DBG] map e22911 wrongly marked me down at e22910
This seems to cause the Ceph cluster to see all the missing OSDs again:
2019-06-12 11:25:37.532316 mon.pve2 mon.1 192.168.1.102:6789/0 1182 : cluster [INF] mon.pve2 calling monitor election
2019-06-12 11:26:07.714656 mon.pve5 mon.4 192.168.1.105:6789/0 2571 : cluster [INF] mon.pve5 calling monitor election
2019-06-12 11:26:07.717357 mon.pve4 mon.3 192.168.1.104:6789/0 2806 : cluster [INF] mon.pve4 calling monitor election
2019-06-12 11:26:07.717463 mon.pve3 mon.2 192.168.1.103:6789/0 48373 : cluster [INF] mon.pve3 calling monitor election
2019-06-12 11:26:07.727553 mon.pve2 mon.1 192.168.1.102:6789/0 1183 : cluster [INF] mon.pve2 calling monitor election
2019-06-12 11:26:07.739388 mon.pve1 mon.0 192.168.1.101:6789/0 1642 : cluster [INF] mon.pve1 calling monitor election
2019-06-12 11:26:08.718094 mon.pve1 mon.0 192.168.1.101:6789/0 1643 : cluster [INF] mon.pve1 is new leader, mons pve1,pve2,pve3,pve4,pve5,pve6 in quorum (ranks 0,1,2,3,4,5)
2019-06-12 11:26:08.735386 mon.pve1 mon.0 192.168.1.101:6789/0 1644 : cluster [DBG] monmap e18: 6 mons at {pve1=192.168.1.101:6789/0,pve2=192.168.1.102:6789/0,pve3=192.168.1.103:6789/0,pve4=192.168.1.104:6789/0,pve5=192.168.1.105:6789/0,pve6=192.168.1.106:6789/0}
2019-06-12 11:26:08.735483 mon.pve1 mon.0 192.168.1.101:6789/0 1645 : cluster [DBG] fsmap cephfs-1/1/1 up {0=pve6=up:active}, 3 up:standby
2019-06-12 11:26:08.735570 mon.pve1 mon.0 192.168.1.101:6789/0 1646 : cluster [DBG] osdmap e22911: 38 total, 26 up, 38 in
2019-06-12 11:26:08.735739 mon.pve1 mon.0 192.168.1.101:6789/0 1647 : cluster [DBG] mgrmap e184: pve6(active), standbys: pve3, pve5, pve2, pve1
2019-06-12 11:26:08.736144 mon.pve1 mon.0 192.168.1.101:6789/0 1648 : cluster [INF] Health check cleared: MON_DOWN (was: 2/6 mons down, quorum pve1,pve3,pve4,pve5)
2019-06-12 11:26:08.738152 mon.pve1 mon.0 192.168.1.101:6789/0 1649 : cluster [DBG] osd.9 192.168.1.106:6805/3434 failure report canceled by osd.12 192.168.1.101:6803/22176
2019-06-12 11:26:08.738488 mon.pve1 mon.0 192.168.1.101:6789/0 1650 : cluster [DBG] osd.24 192.168.1.102:6806/49768 failure report canceled by osd.12 192.168.1.101:6803/22176
2019-06-12 11:26:08.738817 mon.pve1 mon.0 192.168.1.101:6789/0 1651 : cluster [DBG] osd.22 192.168.1.102:6812/50099 failure report canceled by osd.12 192.168.1.101:6803/22176
2019-06-12 11:26:08.738974 mon.pve1 mon.0 192.168.1.101:6789/0 1652 : cluster [DBG] osd.27 192.168.1.102:6814/50219 failure report canceled by osd.12 192.168.1.101:6803/22176
2019-06-12 11:26:08.739482 mon.pve1 mon.0 192.168.1.101:6789/0 1653 : cluster [DBG] osd.25 192.168.1.102:6804/49658 failure report canceled by osd.12 192.168.1.101:6803/22176
And so on for all the missing OSDs.
After some time, 15-30 minutes or so, OSDs start dropping out again.
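In case it helps to reproduce what I am seeing, the Ceph side of such an episode can be followed with the standard status commands (nothing specific to my setup):
ceph -s               # overall health, mon quorum and the up/in OSD counts
ceph osd tree down    # only the OSDs currently marked down, grouped by host
ceph -w               # follow the cluster log live, same messages as quoted above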
I believe this is because the PVE nodes are unable to communicate with each other.
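Since I suspect the cluster communication itself, this is a sketch of the checks I am focusing on for the PVE/corosync side; the omping test only applies if your corosync setup still uses multicast, and the node names are of course mine:
pvecm status                                                 # quorum info as seen from this node
systemctl status corosync pve-cluster                        # the services behind quorum and /etc/pve
journalctl -u corosync -u pve-cluster --since "1 hour ago"   # recent membership changes / retransmit errors
omping -c 600 -i 1 -q pve1 pve2 pve3 pve4 pve5 pve6          # multicast test, run on all nodes at the same time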
Things I have tried to remedy this:
On each affected node:
pvecm updatecerts
This made the affected machines reachable over SSH again for a little while.
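I am not sure whether the WebUI/API daemons also need to be restarted to pick up the regenerated certificates; in case it matters, this is what I would run afterwards (the --force flag re-creates the certificates even if they look valid):
pvecm updatecerts --force              # regenerate and redistribute the node certificates
systemctl restart pveproxy pvedaemon   # restart the WebUI/API daemons so they reload the certs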
Removing the SSH known_hosts files (user and system-wide), because I thought that perhaps the host keys no longer matched:
rm ~/.ssh/known_hosts && rm /etc/ssh/ssh_known_hosts
Removing these does not seem to have changed much.
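As far as I understand, /etc/ssh/ssh_known_hosts on a PVE node is normally a link to the cluster-wide /etc/pve/priv/known_hosts, and pvecm updatecerts is supposed to recreate it, so this is roughly how I would repopulate things after deleting the files:
ls -l /etc/ssh/ssh_known_hosts    # should point at /etc/pve/priv/known_hosts again
pvecm updatecerts                 # merges the host keys back into the cluster-wide file
ssh root@192.168.1.102 hostname   # reconnect once per node to confirm the host key is accepted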
pvecm add 192.168.1.103 -force
This seemed to help for a bit longer (a few hours rather than fifteen minutes).
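Afterwards, whether the node actually stays joined can be checked from both the re-added node and a healthy one:
pvecm nodes             # the cluster members this node currently sees
corosync-cfgtool -s     # local corosync ring status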
Currently I am at a loss as to how to reach stability again.
Does anyone have an idea what might be the root problem behind these symptoms?