Hey all,
I am writing today about some issues I'm having with some OSDs. To give some background data:
- we have 2x separate clusters
- 1x cluster with 5x nodes
- another 1x cluster with 10x nodes (currently configured with only 9x for the time being)
- all 15x nodes have 1x SAS drive with Proxmox installed, 1x SSD, 4x HDD
- both clusters have an HDD pool and an SSD pool
- the HDDs are 5400 RPM
- each cluster is running Ceph, configured with the values from the Ceph calculator (see the sketch after this list)
- there are separate HDD vs SSD rules to manage both sets of pools
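For reference, here is a minimal sketch of the kind of calculation I mean by "the Ceph calculator" (I'm assuming the usual PG-count rule of thumb of roughly 100 PGs per OSD for a replicated pool); the OSD count and replica size below are illustrative, not necessarily our exact values:

```python
# Minimal sketch of the usual Ceph PG-count rule of thumb:
# (target PGs per OSD * OSD count * data fraction) / replica size,
# rounded to a power of two. Values are illustrative only.

def suggested_pg_count(osd_count, replica_size, data_fraction=1.0, target_pgs_per_osd=100):
    raw = (target_pgs_per_osd * osd_count * data_fraction) / replica_size
    # Round down to a power of two, then step up if that power is more
    # than ~25% below the raw value (roughly what the PG calculator does).
    power = 1
    while power * 2 <= raw:
        power *= 2
    if raw - power > 0.25 * raw:
        power *= 2
    return power

# Example: 9 nodes x 4 HDDs = 36 OSDs in an HDD pool, replica size 3
print(suggested_pg_count(osd_count=36, replica_size=3))  # -> 1024
```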
After doing many preliminary tests on both clusters (creating test VMs, testing migration and HA, etc.), we decided to get some production VMs up and running. Since then, we've faced quite a few errors. The VMs were created and launched on one cluster, and then we started getting a lot of "FailedOpenDrive" SMART results from that cluster's OSDs. That led us to believe the disks were bad, but after we had them swapped out we still faced issues with that storage pool. This was all on our 5x node cluster; I am still working to understand what's going on there and gathering additional detail.
We then moved the VMs to our other cluster, the 9x node one, to mitigate risk, and started getting similar SMART logs there. Not to make things confusing, but these production VMs had actually been running on the 9x node cluster initially after testing; due to some colo matters, we temporarily migrated them to the 5x node cluster. For the brief time the VMs were on the 9x node cluster, there were no issues. It wasn't a long time, but after the temporary migration to the 5x node cluster we got hit with many SMART logs. I did quite a bit of troubleshooting (some may recognize the posts I've made regarding that), but the errors just kept stacking up, even after replacing the disks.
That brings us to where we are now. We got our colo matters sorted, got the 9x node cluster back up and running, and ran some tests, all with no issues. Great, but now that we've migrated the production VMs back to the 9x node cluster, we've started getting similar logs. Granted, nowhere near as many as on the 5x node cluster, but I am just lost as to what to do.
I checked the syslogs as suggested in the notification email; see the attached logs. I also went through the /var/log/ceph/ceph-osd files and captured the corresponding logs for each of the problem nodes (nodes 2 and 9) on our 9x node cluster. Each upload is specific to that node, with both relevant log directories inside.
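For anyone curious, this is roughly how I pulled the relevant entries together; a minimal sketch, assuming the default Proxmox/Ceph log locations and that the interesting lines mention the SMART failures or generic I/O errors (the paths and keywords are my assumptions, adjust as needed):

```python
# Rough sketch of how the attached logs were gathered: scan syslog and the
# ceph-osd logs for lines that look related to the disk/SMART errors.
# Paths and keywords are assumptions; adjust for your own setup.
import glob

KEYWORDS = ("FailedOpenDrive", "smart", "I/O error", "error")

def collect(paths, out_path):
    with open(out_path, "w") as out:
        for path in paths:
            with open(path, errors="replace") as f:
                for line in f:
                    if any(k.lower() in line.lower() for k in KEYWORDS):
                        out.write(f"{path}: {line}")

collect(["/var/log/syslog"] + glob.glob("/var/log/ceph/ceph-osd.*.log"),
        "problem-node-logs.txt")
```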
As for the issue at hand, is it possible the VMs are simply too I/O heavy for these 5400 RPM disks? Could it be anything else?
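To frame that question, here is a minimal sketch of the kind of check I could run on one of the OSD nodes to see how busy the HDDs actually are; it assumes the psutil package is available, and mapping device names back to specific OSDs is left as an exercise:

```python
# Minimal sketch: sample per-disk I/O counters over a few seconds to see
# which drives are saturated. Assumes psutil is installed; the device-to-OSD
# mapping is not handled here.
import time
import psutil

INTERVAL = 5  # seconds

before = psutil.disk_io_counters(perdisk=True)
time.sleep(INTERVAL)
after = psutil.disk_io_counters(perdisk=True)

for dev, b in before.items():
    a = after[dev]
    busy_ms = getattr(a, "busy_time", 0) - getattr(b, "busy_time", 0)
    read_mb = (a.read_bytes - b.read_bytes) / 1e6
    write_mb = (a.write_bytes - b.write_bytes) / 1e6
    print(f"{dev}: {busy_ms / (INTERVAL * 10):.0f}% busy, "
          f"{read_mb:.1f} MB read, {write_mb:.1f} MB written")
```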
How can I dig into what the issue is here? Those VMs are sluggish and unresponsive. I'm going to see if there's any other data I can get from the stakeholder, but this is all I have at the moment.
I can currently create new VMs on that pool and launch them, but the existing VMs (meaning the production VMs we migrated back over here) won't stop or respond to similar commands.
Let me know if there is some additional data that I can provide.
To be clear, I am requesting support for our 9x node cluster. We obviously want to get to the bottom of what's going on, as we definitely don't want the larger cluster to run into the same issues we had, and are still having, with our smaller cluster.