Hello,
We are trying to simulate a disaster in our Proxmox environment, to see what we can do.
We have 4 Ceph nodes that only run Ceph, plus some additional nodes that only act as hypervisors.
We declared 2 datacenters: 2 Ceph servers belong to datacenter A, and the other 2 Ceph servers belong to datacenter B.
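Roughly, we built the datacenter layout with commands like these (the bucket names dc-a and dc-b are just examples, and this is reconstructed from memory; the same can be done by editing the decompiled CRUSH map):
ceph osd crush add-bucket dc-a datacenter
ceph osd crush add-bucket dc-b datacenter
ceph osd crush move dc-a root=default
ceph osd crush move dc-b root=default
ceph osd crush move ceph1 datacenter=dc-a
ceph osd crush move ceph2 datacenter=dc-a
ceph osd crush move ceph3 datacenter=dc-b
ceph osd crush move ceph4 datacenter=dc-b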
We changed the CRUSH map, declared the datacenters, and also changed the replication rule so that the copies are spread across both datacenters and across distinct hosts (see the rule sketch below).
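In the decompiled CRUSH map, the rule looks roughly like this (rule name and id are just examples; our pools use size 4, so 2 copies per datacenter):
rule replicated_dc {
	id 1
	type replicated
	min_size 2
	max_size 4
	step take default
	# pick both datacenter buckets under the default root
	step choose firstn 2 type datacenter
	# then 2 distinct hosts (one OSD each) inside each datacenter
	step chooseleaf firstn 2 type host
	step emit
}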
That all works fine, but as soon as we actually shut down the two Ceph servers belonging to one datacenter, everything stops working and the VMs start to freeze.
How can we keep the cluster working after some nodes, or a whole datacenter, go down?
We cannot even run ceph commands now when two Ceph servers are shut down (e.g. ceph3 and ceph4, while ceph1 and ceph2 are still online):
root@ceph1:~# ceph osd tree
2018-06-27 17:54:24.333963 7f572ae99700 0 monclient(hunting): authenticate timed out after 300
2018-06-27 17:54:24.333992 7f572ae99700 0 librados: client.admin authentication error (110) Connection timed out
root@ceph1:~# ceph status
2018-06-27 18:09:29.116223 7fd00104b700 0 monclient(hunting): authenticate timed out after 300
2018-06-27 18:09:29.116245 7fd00104b700 0 librados: client.admin authentication error (110) Connection timed out
[errno 110] error connecting to the cluster
Even in the UI, the status is green with a tick:
Cluster: CephCluster, Quorate: Yes
But it is actually not working now.
How can we solve this problem?
Thanks