Hello,
We have been experiencing strange issues since we installed our cluster.
First, here is a quick overview of our hardware:
* 3 identical servers (same RAM, disks, CPUs),
* Connected to our main network on one side,
* Everything on UPS, switches included,
* Connected to a dedicated 10 Gb switch for the private Ceph network on the other side,
* A RAID card (Avago MegaRaid SAS MFI 3108) that is not used for RAID but configured as "pass-through", so the OS sees every disk individually (maybe not the best part of the design, we probably should have removed it completely), no battery,
* 2x 1 TB Samsung SSDs (I have checked that this SSD model is compatible with the RAID card),
* 2x 2 TB spinning disks.
Configuration at a glance:
* Ceph uses the 2 individual spinning disks on each server as OSDs,
* The Ceph journals are on the first SSD (layout shown below),
* The remaining space on the first SSD is a local volume for VMs,
* The second SSD is another local volume for VMs,
* HA is configured on top of Ceph, so when a server fails its VMs are automatically migrated to a healthy node (tests OK).
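If it helps, this is roughly what we run on a node to look at that layout (both commands are read-only):

# Ceph OSD layout per host
ceph osd tree
# block device / partition layout (the journal partitions sit on the first SSD)
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT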
Please let me know if you need specific logs or details that could help with the diagnosis.
Issue description:
Randomly, and far too often (about 5 times in a month), we lose one SSD or the other, on one server or another, sometimes 2 at the same time: the RAID card flags the SSD as failed and removes it from the system. When it is the SSD holding the Ceph journal, we partly lose Ceph redundancy and some VMs die (their local disks disappear); when it is the other SSD, we "only" lose VMs.
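If it is useful, I can pull SMART data for the SSDs through the controller with something like the following and post the output (the megaraid drive number N and the /dev/sdX device are placeholders that depend on the slot):

# SMART health and full attributes of a drive behind the MegaRAID controller
smartctl -H -d megaraid,N /dev/sdX
smartctl -a -d megaraid,N /dev/sdX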
We can repair this state by rebooting the server, marking the SSD as good again and importing the previous (now "foreign") RAID configuration, but we could face a total disaster if we are unlucky and lose all the Ceph journal SSDs at once.
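In MegaCli terms, that repair corresponds roughly to the following (the enclosure:slot value E:S and the adapter number are placeholders for the affected drive):

# mark the drive the controller flagged as failed as good again
MegaCli -PDMakeGood -PhysDrv [E:S] -a0
# the old configuration then shows up as foreign; scan and import it
MegaCli -CfgForeign -Scan -a0
MegaCli -CfgForeign -Import -a0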
Note that we have never seen the spinning disks flagged as faulty.
Because of this problem, we are seriously considering moving back to VMware. Of course, I'm not sure this is 100% related to Proxmox; I'm just looking for any suggestion, recommendation, or similar experience with this issue.
Thank you, and best regards,
-- Nicolas