Hello,
I am having trouble with failover / redundancy, and I am hoping someone in the community can help me understand it.
I have a four-node cluster, and I am working to ensure high availability of the VMs and containers it manages.
This is a hobby cluster in my garage, so nothing mission critical, and as such, I am not averse to intentionally taking nodes offline as in the test below.
The issue appears to be a cascade of failures!
See the screenshot below of the state of my cluster 5 minutes after intentionally killing a single node (dl380g7):
For testing, I have the Ceph "noout" global flag enabled, to avoid shuffling data around. This is why all OSDs are "in".
It appears that the failure of dl380g7 causes a handful of OSDs on other machines to fail as well, which in turn causes everything to lock up once I drop below the min_size of my pools.
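To make the min_size part concrete, here is a rough sketch of how I understand it. This is a toy model only, not how Ceph actually computes anything: the PG ids, OSD numbers, and the size 3 / min_size 2 pool are made-up assumptions for illustration, and `blocked_pgs` is a hypothetical helper. Real placement is decided by CRUSH.

```python
from typing import Dict, List, Set


def blocked_pgs(pg_acting: Dict[str, List[int]],
                down_osds: Set[int],
                min_size: int) -> List[str]:
    """Return the PG ids that have fewer than min_size replicas still up.

    pg_acting maps a (hypothetical) PG id to the OSD ids holding its replicas.
    """
    blocked = []
    for pg_id, osds in pg_acting.items():
        up = [osd for osd in osds if osd not in down_osds]
        if len(up) < min_size:
            blocked.append(pg_id)
    return sorted(blocked)


if __name__ == "__main__":
    # Assumed layout: a size=3 pool, with OSDs 0 and 1 living on the killed node.
    pgs = {"1.0": [0, 4, 8], "1.1": [1, 5, 9], "1.2": [2, 4, 9]}

    # Only the killed node's OSDs are down: every PG still has >= 2 replicas.
    print(blocked_pgs(pgs, down_osds={0, 1}, min_size=2))

    # One extra OSD on another machine also goes down: PG 1.0 drops to 1 replica.
    print(blocked_pgs(pgs, down_osds={0, 1, 4}, min_size=2))
```

In this model, losing one node on its own blocks nothing, but a single additional OSD failure elsewhere is enough to block I/O on some PGs, which matches the lock-up I am seeing.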
This then appears to crash pvestatd, although the service still shows as active.
From there, everything is a gong show of restarting services to get it all back online.
Can anyone shed some light on what might be happening here, and on how best to go about debugging it?