Hello,
We operate a cluster of 10+ nodes, one of which serves as a SAN running Ceph. It holds 20+ disks (OSDs), with one monitor and one manager installed on it.
The rest of the nodes are data nodes in 4-bay chassis, each with 2 disks installed in ZFS RAID1 mode.
We have scheduled maintenance and updates for the SAN node, with a planned downtime of at least 2 hours.
I need to plan how to keep services on all nodes uninterrupted during the SAN downtime. Right now I see two options:
(1) Install a temporary external NFS server, add it to the cluster, move all drives on the nodes from Ceph to NFS, perform the maintenance, and then move the drives back from NFS to Ceph;
(2) Install temporary drives in the two free bays on each data node, install 2 additional monitors and 1 additional manager on selected nodes, add the new disks as OSDs, migrate all the data to them, remove the OSDs on the SAN, and then shut the SAN down. Perform the maintenance, then bring all the SAN OSDs back in and remove the OSDs from the data nodes.
Option (2) looks complicated, so I would need a very detailed step-by-step plan to carry it out. I am not very familiar with Ceph or with the noout option, although my research so far suggests that I may need to use it. I am also not experienced in moving OSDs out of a cluster: do I need to remove them, or only set them as "out" for some time?
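To make the "out" versus "remove" question concrete, here is a rough Python sketch of how I currently picture the drain step in option (2). It is only my untested understanding, not a verified procedure. The function names and the OSD IDs are made up for illustration; the only real commands are `ceph osd out`, `ceph osd in`, `ceph -s` and `ceph osd safe-to-destroy`. As far as I understand, `ceph osd set noout` does something different: it only stops Ceph from automatically marking stopped OSDs as out, so it would not move any data off the SAN by itself.

```python
import shlex
import subprocess


def build_drain_plan(san_osd_ids):
    """Return the ceph commands (as argument lists) to drain the SAN OSDs.

    Assumption: the temporary OSDs on the data nodes are already up and in,
    and the CRUSH rules allow data to be placed on them.
    """
    plan = []
    for osd_id in san_osd_ids:
        # Mark the OSD "out": the daemon keeps running, but Ceph starts
        # moving its data to the remaining "in" OSDs.
        plan.append(["ceph", "osd", "out", str(osd_id)])
    # Check cluster status; the SAN should stay up until all PGs are active+clean.
    plan.append(["ceph", "-s"])
    for osd_id in san_osd_ids:
        # Ask Ceph whether each OSD no longer holds data that is needed.
        plan.append(["ceph", "osd", "safe-to-destroy", f"osd.{osd_id}"])
    return plan


def build_return_plan(san_osd_ids):
    """Return the commands to mark the SAN OSDs "in" again after maintenance."""
    return [["ceph", "osd", "in", str(osd_id)] for osd_id in san_osd_ids]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical OSD IDs; dry run only, nothing is executed.
    run_plan(build_drain_plan([0, 1, 2]))
```

If this picture is right, the SAN OSDs would only be marked "out" and later "in" again, never removed from the cluster, but that is exactly the part I would like someone to confirm.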
I would appreciate advice on whether there is a third option that I have not thought of yet. Opinions on (1) vs (2), and a detailed step-by-step guide on how to perform (2), would be of great help.
We intend to keep the existing structure with 1 SAN after the maintenance is complete.
Thank you!