Ceph OSD Issues

emilhozan

Member
Aug 27, 2019
Hey all,

I am writing today about some issues I'm having with some OSDs. To give some background data:
- we have 2x separate clusters
- 1x cluster with 5x nodes
- another 1x cluster with 10x nodes (currently configured with only 9x for the time being)
- all 15x nodes have 1x SAS drive with Proxmox installed, 1x SSD, 4x HDD
- both clusters have an HDD pool and an SSD pool
- the HDDs are 5400 RPM
- each cluster is running Ceph, configured with the results from the Ceph calculator
- there are separate HDD vs SSD rules to manage both sets of pools

After many preliminary tests (creating test VMs, testing migration and HA, etc.) on both clusters, we decided to get some production VMs up and running. Since then, we've faced quite a few errors. The VMs were created and launched on one cluster, and we then started getting a lot of "FailedOpenDrive" SMART results from that cluster's OSDs. This led us to believe the disks were bad, but after swapping them out we still faced issues with that storage pool. This was all on our 5x node cluster. I am still working to understand what's going on with that cluster and gathering additional detail.

Well, we moved the VMs to our other 9x node cluster to mitigate risk and started getting similar SMART logs. Not to make things confusing, but these production VMs initially ran on the 9x node cluster after testing; due to some colo matters, we temporarily migrated them to the 5x node cluster. For the brief time the VMs were on the 9x node cluster, there were no issues. It wasn't a long time, but after the temporary migration to the 5x node cluster we got hit with many SMART logs. We did quite a bit of troubleshooting (some may recognize the posts I've made about that), but the errors just kept stacking up, even after replacing our disks.

Finally, we are where we are now. We got our colo matters sorted, got the 9x node cluster back up and running, and ran some tests, all with no issues. Cool, but once we migrated the production VMs back to the 9x node cluster, we started getting similar logs. Granted, we didn't get nearly as many as on our 5x node cluster, but I am lost as to what to do.

I checked the syslogs as suggested in the notification email; see the attached logs. Further, I found the /var/log/ceph/ceph-osd files and captured the corresponding logs for each of the problem nodes (nodes 2 and 9) on our 9x node cluster. Each upload is specific to that node and contains both relevant log directories.


As for the issues at hand, is it possible the VMs are too I/O-heavy for the disks? Could it be anything else?
How can I dig into what the issue is here? Those VMs are sluggish and unresponsive. I'm going to see if there's any other data I can get from the stakeholder, but this is all I have ATM.

I can currently create and launch new VMs on that pool, but the existing production VMs that we migrated back don't stop or respond to similar commands.

Let me know if there is some additional data that I can provide.

To be clear here, I am requesting support for our 9x node cluster. We obviously want to get to the bottom of what's going on, as we definitely don't want the larger cluster to face the issues we had and are having with our smaller cluster.
 

Attachments

  • SMART_Error-Node_2.txt
    2.1 KB · Views: 2
  • SMART_Error-Node_9.txt
    1.3 KB · Views: 1
This issue is related to this one, isn't it? Please don't double post (spread), it makes it harder to help. Thanks.
https://forum.proxmox.com/threads/odd-ceph-issues.61630/#post-286517

2020-01-03 22:19:29.603957 7f60ed61be00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-6: (5) Input/output error
As multiple disks are involved, it sounds more like a controller issue. What controller are the disks connected to?
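
To check whether the same error really shows up across several OSDs, the ceph-osd logs can be tallied per OSD. The following is only a minimal sketch: the function name `superblock_errors` and the regular expressions are illustrative assumptions built around the single log line quoted above, and other Ceph versions may word the message differently. The `(5)` in that line is the errno, which on Linux is EIO.

```python
import re
from collections import Counter

# Raw logs may carry ANSI colour codes, either as a real ESC byte or as the
# literal "^[" shown in the pasted line; strip both forms before matching.
ANSI = re.compile(r"(?:\x1b|\^\[)\[[0-9;]*m")

# Pattern modelled on the one log line quoted in this thread (an assumption,
# not an official Ceph log grammar).
SUPERBLOCK = re.compile(
    r"unable to open OSD superblock on (?P<path>\S+/ceph-(?P<osd>\d+)): "
    r"\((?P<errno>\d+)\) (?P<msg>.+)"
)


def superblock_errors(lines):
    """Count superblock-open failures per OSD id in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = SUPERBLOCK.search(ANSI.sub("", line))
        if match:
            counts[int(match.group("osd"))] += 1
    return counts


sample = [
    "2020-01-03 22:19:29.603957 7f60ed61be00 -1 ^[[0;31m ** ERROR: unable to open "
    "OSD superblock on /var/lib/ceph/osd/ceph-6: (5) Input/output error^[[0m",
]
print(superblock_errors(sample))
```

Feeding it the lines of each node's /var/log/ceph/ceph-osd files would show whether the failures cluster on one disk or spread over all disks behind the same controller.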
 
@Alwin,

Thank you for the follow up.

No, this post is regarding a second cluster I have. The link you provided is for the first cluster.
What further sets this post apart from the other post is that these VMs were previously running fine on the cluster this post is regarding (let's call it Cluster2). These VMs were running for weeks just fine on Cluster2. Something came up with the service provider, so we temporarily moved the VMs to Cluster1 (the cluster from the previous post).

All these VMs were now migrated back to Cluster2 after dealing with the service provider, and these issues arose.
 
 
What controller are the disks connected to?
H200, for Dell R610

As multiple disks are involved, it sounds more like a controller issue
As coincidental as this may sound, I don't suspect it to be the backplane. Upon moving the VMs back onto this cluster, issues started arising once HA was enabled on critical VMs. I am currently testing by removing those resources from HA to see if it helps.


Another note of interest.
Some VMs use just the SSD pool and others have their system installed on SSD, but also a data store on the HDD pool. Even when no data is being accessed on the HDD pool, the VMs are exhibiting latency and issues.
 
H200, for Dell R610
Is this RAID controller running as an HBA? RAID controllers can cause all sorts of symptoms in combination with Ceph/ZFS.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

Are you using more than one controller? The H200 has only 4 ports; if more than 4 disks are connected to it, it needs to multiplex, which introduces latency.

- the HDDs are 5400 RPM
These are slow drives, and not only for Ceph. KVM uses one thread for all disks of a VM, so the slowest storage will introduce latency to the guest. With the IO thread option (iothread), each disk of a VM gets its own thread. This may help reduce the latency to the OS disk of the VM.
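
As far as I know, the per-disk threading described above corresponds to the `iothread=1` disk option in Proxmox VE, which for SCSI disks is used together with the `virtio-scsi-single` controller type. Below is a rough sketch that only builds the matching `qm set` command without running it. The helper name `iothread_command`, the VM ID `100`, and the volume `ssd-pool:vm-100-disk-0` are hypothetical placeholders, not values from this thread.

```python
import shlex


def iothread_command(vmid, disk_slot, volume):
    """Build (but do not run) a qm command giving one disk its own IO thread."""
    return [
        "qm", "set", str(vmid),
        # One controller per disk, so each disk can get a dedicated thread.
        "--scsihw", "virtio-scsi-single",
        # Re-specify the disk with the iothread flag appended.
        f"--{disk_slot}", f"{volume},iothread=1",
    ]


# Hypothetical VM ID and volume name, for illustration only.
cmd = iothread_command(100, "scsi0", "ssd-pool:vm-100-disk-0")
print(shlex.join(cmd))
```

Re-specifying a disk this way replaces its option string, so any options already set on that disk (cache mode, for example) would need to be carried over. The change typically takes effect only after the VM is fully stopped and started again.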
 
