[SOLVED] Faulty Disks / Missing OSDs

emilhozan

Member
Aug 27, 2019
Hey all,

I received an email notification regarding "SMART error (FailedOpenDevice) detected on host:". After looking this up a bit, I wasn't able to find anything relating to the issues I am experiencing, hence this post. The notification is attached as "host3_sdd.txt".

Currently 4 of the 5 nodes in this cluster are experiencing this issue. The affected hosts all have VMs running; the 5th node does NOT have any VMs - perhaps there's a correlation there?

The cluster is 5 nodes, each with 6 drives:
1x SAS holds the OS
1x SSD for fast I/O requirements
4x HDDs for larger storage requirements


An example scenario: host 3
- /dev/sdc & /dev/sdd are down
- relevant time of issue: Thu Oct 31 03:59:19 2019 PDT
- the GUI no longer lists the drives
- ls -l /dev/sd* no longer shows the two drives
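
In case it helps others hitting the same thing, a quick way to confirm whether the kernel still sees the disks at all (device names are from the scenario above; output will vary):

# list all block devices the kernel currently knows about, with model/serial
lsblk -o NAME,SIZE,MODEL,SERIAL
# persistent by-id links point at the sdX names; missing entries mean the device is gone
ls -l /dev/disk/by-id/ | grep -E 'sdc|sdd'
# recent kernel messages for the devices, with readable timestamps
dmesg -T | grep -iE 'sdc|sdd' | tail -n 50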

That said, I started with the syslog as suggested and worked my way backwards to find the last time the drive was good and to identify where the error messages began. Below is an excerpt from the syslog:

(drive was being detected beforehand and SMART was reporting back)
Oct 31 03:56:42 k1n3 kernel: [914232.144604] sd 0:0:3:0: [sdd] tag#1 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
...
Oct 31 03:56:42 k1n3 kernel: [914232.144609] print_req_error: I/O error, dev sdd, sector 231254656
...
Oct 31 03:56:42 k1n3 kernel: [914232.144679] sd 0:0:3:0: [sdd] killing request
...
Oct 31 03:59:19 k1n3 smartd[1087]: Device: /dev/sdd [SAT], open() failed: No such device
...
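
For reference, the kind of search that gets you that window (timestamps and the SCSI target 0:0:3:0 are from this case and will differ):

# grep the syslog for the failing device
grep -n 'sdd' /var/log/syslog | tail -n 100
# or ask the journal for kernel messages in the window around the failure
journalctl -k --since "2019-10-31 03:50" --until "2019-10-31 04:05" | grep -E 'sd 0:0:3:0|sdd'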


Please help!
 


I only titled this "Faulty Disks" because they're not being detected in the hosts themselves. These were brand new drives that went through many tests, and this issue never came up.

That said, what could be VM-related that would cause an issue like this? When setting up the cluster, I created many VMs, tested HA and live migration, loaded data, etc., and there were no issues. Now that we're in production, the disks are acting up.

What's important to note here is that these same VMs ran in a secondary DC with no issues like this. The two clusters are the same EXCEPT:
- the current cluster was upgraded with SSDs for faster I/O requirements
- the current cluster has 5x nodes, each with 6x drives
- the other cluster ONLY has HDDs (this is where we came to the realization of needing SSDs)
- the other cluster has 10x nodes
- both clusters have the same branded HDDs


Let me know what else I can do to help investigate this further. Currently our options are limited: we cannot migrate the VMs, and we have pretty much no access, or only intermittent access, to them. Things seem to be getting worse.
 
One last thing to add. The cluster was working fine through all testing phases. When we started migrating live VMs to this cluster (after upgrading with SSDs), the host running those VMs started getting intermittent issues. When I went on-site to troubleshoot, I wasn't able to find anything wrong. I rebooted the host a few times due to some messages about sync issues. Things then seemed to self-heal (a Ceph perk?).

However, the next day a staff member raised issues again. I tried creating a VM on that same host with no problems. But now hosts 1 - 4 are not responding to remote reboot requests at all.
 
Okay, so now only host 1 is unresponsive; the other nodes finally rebooted. Seeing that the issue got worse the last time I manually rebooted host 1, I'm not sure a manual reboot would even help now.

Are there any commands I can run to figure out what's up here?

I'll see if the drives appear after the servers start back up.
 
1x SSD for fast I/O requirements
4x HDDs for larger storage requirements
How are they connected (controller/HBA)?

I only titled this "Faulty Disks" because they're not being detected in the hosts themselves. These were brand new drives that went through many tests, and this issue never came up.
They still seem to be dead. Could be anything from bad firmware, cabling, to controller.

Are there any commands I can run to figure out what's up here?
Check the ceph logs (on all nodes) + journal/syslog. And how is the status of the cluster now, ceph -s?
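
For example, something along these lines on each node (log file names depend on which Ceph daemons run there):

ceph -s                              # overall cluster status
ceph health detail                   # which OSDs/PGs are affected
ceph osd tree                        # which OSDs are down, and on which host
journalctl -u ceph-osd@<id> -b       # journal of a specific OSD service, <id> = OSD number
less /var/log/ceph/ceph-osd.<id>.log # the OSD's own log
less /var/log/syslog                 # kernel/driver messages for the disk itself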
 
How are they connected (controller/HBA)?

The RAID card is a PERC H200, which is one of the supported cards and seemed most suitable per the PowerEdge R610 data sheet. It allows passing the drives through so Ceph can manage them directly.
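
For reference, roughly how the controller and pass-through can be checked from the OS side (the mpt2sas driver name is my assumption for the H200 and may differ):

lspci -nn | grep -iE 'sas|raid'           # confirm the controller is visible on the PCI bus
dmesg -T | grep -i mpt2sas | tail -n 50   # driver messages; driver name is an assumption
smartctl -a /dev/sdd                      # SMART should be reachable directly on passed-through drives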


Check the ceph logs (on all nodes) + journal/syslog. And how is the status of the cluster now, ceph -s?

For the first part:
- are you saying I should repeat what I did on that one server on the other servers as well?
I'm assuming the point is to check whether the logs differ between them. If that's the case, I've done two servers so far, but will make sure they all match as well.
- Can you confirm whether the syslog is the only place to look? I saw another directory mentioned on a different forum thread, but those logs weren't useful and I can't remember right now where they were. Can you confirm which directories to investigate for these types of issues, for me and future readers?

Regarding the latter part, see attached text document (ceph-s.txt).
- This was pulled from one of the five nodes, which does have 1 disk missing.



They still seem to be dead. Could be anything from bad firmware, cabling, to controller.

I plan on going on-site ASAP and will follow up with some physical testing.


Follow up question(s):
- are there preferred settings to use for the VM hardware configurations to ensure maximum performance and compatibility? For instance the "SCSI Controller" setting, or the "Machine"?
- is it possible for hard disks to "get fried" through this kind of use? I don't know if that's a silly question, but I think it's valid. I have experience with NAS solutions and have never seen something like this happen.
- One thing to note is that the disks are 5400 RPM. Obviously that's not ideal for performance, but could that be a problem in itself, i.e. they're simply not fast enough? I'm just trying to brainstorm and wrap my head around what's going on.
 


@Alwin

I went on-site last night, manually powered off the servers, and reseated all the storage devices, and they all came back online. Clearly the disks were not faulty; somehow Proxmox wasn't detecting them, even after a remote reboot. It took physically reseating them to fix this, so I'm not sure what's up with that.


Since I wasn't able to go on-site immediately, I removed the problem OSDs from the Ceph cluster to do some other remote testing. I didn't think this would be a big issue if the drives were indeed dead, and we needed the testing to try to identify the root cause. Prior to going into production, I ran through numerous tests and never hit the issues we are currently experiencing. I don't know if it's the VM load or what, but no issues arose when creating VMs, live migrating, or even running HA fail-over tests.
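
For future readers, taking a dead OSD out of the cluster typically looks roughly like this (a sketch; <id> is the OSD number):

ceph osd out <id>                           # stop new data from being mapped to it
systemctl stop ceph-osd@<id>                # stop the daemon on the node that owns it
ceph osd purge <id> --yes-i-really-mean-it  # remove it from the CRUSH map, OSD map and auth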

My remote testing turned up some additional odd behavior. I was having trouble creating VMs on the HDD pool. VM creation took so long I didn't even bother timing it. For example, I'd create a VM with its hard disk on the HDD pool; the VM entry would appear, but the hardware info wouldn't populate and didn't reflect the settings I had configured.

When creating a test VM on our SSD pool, the VM entry appeared immediately, as did the hardware configuration details. Before these issues started, creating a VM with HDD-backed disks was immediate as well. Do you have any ideas here?

Further, when I came back to the HDD VMs quite some time later, their entries had finally appeared. I went through the install and setup steps, and things mostly worked. I was able to get into my test VMs and do a few things. However, I was unable to live migrate a VM; I had to power it off in order to migrate it to another node.


Further, after removing the OSDs, I am now unsure which OSDs are registered to which physical device. I considered "zapping" the OSD drives I removed, but I don't know which one belongs to which! How can I determine the OSD-to-hard-drive mapping? I tried looking online but didn't find anything pertaining to what I was looking for.

To be clear about what I'm asking: how can I determine which /dev/sd{n} device corresponds to OSD.{n}?


I'm assuming what I've experienced isn't expected, so how can we figure out what happened and why?
 
Is it possible one of the HBA cards is bad and that causes Ceph to freak out as disks drop out and come back, or flap? Are there any disks not on an HBA that could be pooled and tested? Are the same drives consistently complaining? Can you replace the drives that are reported as bad, or move them around, to see if it's a controller issue?

Also, was the VM set up as HA? Otherwise I don't think it gets any special treatment unless it's converted to an HA VM under the Datacenter >> HA option, no?

C
 
- are there preferred settings to use for the VM hardware configurations to ensure maximum performance and compatibility? For instance the "SCSI Controller" setting, or the "Machine"?
The defaults should be fine for well-performing VMs. The rest is up to testing, as it depends greatly on the workload.

- is it possible for hard disks to "get fried" through this kind of use? I don't know if that's a silly question, but I think it's valid. I have experience with NAS solutions and have never seen something like this happen.
Yes, even with NAS/DAS/SAN systems. ;)

- One thing to note is that the disks are 5400 RPM. Obviously that's not ideal for performance, but could that be a problem in itself, i.e. they're simply not fast enough? I'm just trying to brainstorm and wrap my head around what's going on.
Well, those are really not the type of drives you want for hosting VMs/CTs. Run a rados -p <pool> bench 60 write and compare the performance to our Ceph Benchmark Paper [0] and the forum thread [1].
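
For example (pool name is a placeholder; --no-cleanup keeps the benchmark objects so a read test can follow):

rados bench -p <pool> 60 write --no-cleanup   # 60s write benchmark
rados bench -p <pool> 60 seq                  # sequential read of the objects written above
rados -p <pool> cleanup                       # remove the benchmark objects afterwards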

To be clear about what I'm asking: how can I determine which /dev/sd{n} device corresponds to OSD.{n}?
Ceph Nautilus uses ceph-volume, and it can list the OSD-to-disk mapping: ceph-volume lvm list.

[0] https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
[1] https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
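
Expanding on the mapping question quoted above, a sketch for ceph-volume based (Nautilus-style) OSDs; the metadata query is an alternative view and assumes BlueStore:

ceph-volume lvm list                              # per-OSD listing, including the backing [block] device
ceph osd metadata <id> | grep -E 'devices|bdev'   # reports the device node(s) behind osd.<id>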
 
@Alwin
Thanks for the update here.

I wasn't able to get to this until now. However, the provided command, "ceph-volume lvm list", doesn't help me as is; the result is "No valid Ceph devices found".

I tried messing around with the command a bit and might have figured it out. Is there an alternative command that will show such a mapping?
I ran "ceph-volume simple scan" and checked the resulting block device entries (if that's the right terminology). In short, I was able to see which OSD went with which sd[N], but it wasn't straightforward.

From there, I used "ceph-volume lvm zap /dev/sd[X] --destroy" to work my way through the "faulty" disks.
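
For future readers, roughly the workflow (the scan writes one JSON file per ceph-disk OSD under /etc/ceph/osd/, which is where the data device shows up; the field name is from memory and may differ slightly):

ceph-volume simple scan                 # describe the existing ceph-disk OSDs as JSON
ls /etc/ceph/osd/                       # one <id>-<fsid>.json file per OSD
grep -H '"path"' /etc/ceph/osd/*.json   # the block device backing each OSD
ceph-volume lvm zap /dev/sdX --destroy  # wipe a device once you're sure of the mapping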
 
From there, I used "ceph-volume lvm zap /dev/sd[X] --destroy" to work my way through the "faulty" disks.
So, you replaced all the old OSDs with new ones (same disks). Then the above command should work now (that was my assumption).
 
So, you replaced all the old OSDs with new ones (same disks). Then the above command should work now (that was my assumption).

Correct, I did replace the OSDs with the same physical disks. However, the command still does not work:
(screenshot attached: 1574285596292.png)
 
As mentioned above, ceph-volume is for Ceph Nautilus, while ceph-disk is for Ceph Luminous. PVE 5.x uses Luminous, and OSD creation with pveceph uses ceph-disk underneath. The above command will only work on PVE 6.x, or when the OSDs were manually created with ceph-volume.
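
For completeness, on a Luminous / ceph-disk based node the mapping can be read with something like:

ceph-disk list                                   # shows each partition as e.g. "ceph data, active, ..., osd.N"
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT | grep ceph   # ceph-disk OSDs mount under /var/lib/ceph/osd/ceph-<id>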
 
