Hi,
We have suffered a severe SAN failure, possibly resulting in the loss of ALL Windows VM data (and some CTs). Basically I'm looking for comments on whether this should even be able to happen, and maybe some explanations for the oddities we're now seeing.
Here's what's happened:
About 3 weeks ago, one of the disks in the RAID6 that apparently hosts the volume group used as KVM shared storage (and another VG hosting the containers) failed. Simultaneously the hot spare died / was already broken... in any case, the hot spare did not jump in. The SAN did not report anything, because technically a RAID6 with one broken disk is still fully functional. I can already see the SAN controller's software being at fault here for not reporting this problem at all, but I don't think that's the actual problem. Furthermore, a RAID6 producing errors after a single disk failure tells me this is a horrible, horrible controller, right?
The actual problem is that, even though the VG should still be functional, there apparently are bad sectors on it now. This has become apparent because:
- Some containers had issues. Luckily, checking an LV with fsck is easy, and it found and corrected some errors.
- Now here's what I don't fully understand yet: pretty much all the KVM disks are apparently broken in the same way. We only discovered the problem because one of the Windows machines suddenly became unresponsive, and it went straight downhill from there, because none of the Windows VMs boots anymore. ALL of them fail to boot. Windows complains that the last boot failed and offers to either start a repair tool or start normally. Attempting to start normally results in a hard reset (of the VM) with no error message given whatsoever. Attempting to start the repair tool also fails, with:
Error 17 - Ramdisk device creation failed due to insufficient memory
However, the VM has 2 GB of RAM assigned, of which only 20 MB are used by this pre-boot environment. So it's fairly obvious that it's not actually a lack of RAM but something else. Has anybody seen this before? Any ideas on what to do about it? The VM is using the default CPU setting of qemu64, if that makes any difference.
Oh, and the host system shows a RAM usage of 14/24 GB, so it really isn't even close to being out of memory by any stretch of the imagination.
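For the record, here's roughly the fsck routine I've been using on the container LVs. The LV path in the comments is just an example from my setup, and I always do a read-only pass first to see the damage before letting it change anything:

```shell
# check_fs: read-only e2fsck pass first, then an automatic repair only
# if errors were reported. Works on any ext2/3/4 block device or image.
# In my case the target was a container LV, e.g. /dev/raidvg/vm-101-disk-1
# (example path -- substitute your own naming scheme).
check_fs() {
    dev="$1"
    # -f: force a check even if the filesystem is marked clean
    # -n: answer "no" to all questions, i.e. report only, change nothing
    if e2fsck -fn "$dev" >/dev/null 2>&1; then
        echo "clean"
    else
        # -p: "preen" mode, automatically fixes whatever is safe to fix
        e2fsck -fp "$dev" >/dev/null 2>&1
        echo "repaired"
    fi
}
```

Running this against an LVM snapshot instead of the live device would obviously be safer, but with the containers stopped anyway it hasn't mattered much here.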
PS: I'm assuming there are bad sectors on the VG hosting the shared storage for the VMs, but how would I check this? I guess I could run fsck against the LVM device backing one of the VM disks, but I would much rather check the underlying VG as a whole. Sadly, the SAN doesn't offer anything for that, so I guess I'm stuck with tools on the Proxmox host.
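What I'm considering as a poor man's surface scan from the Proxmox host is simply reading the whole underlying device sequentially and counting read errors -- something like the sketch below. The /dev/mapper name in the comments is made up; the real target would be whatever multipath device backs the VG:

```shell
# surface_scan: sequentially read an entire block device (or image file)
# and report how many read errors dd ran into. Strictly read-only.
# On the host, the target would be something like /dev/mapper/san_lun
# (hypothetical name -- substitute the multipath device backing the VG).
surface_scan() {
    target="$1"
    # conv=noerror keeps dd going past unreadable blocks instead of
    # aborting; on a real device, adding iflag=direct would bypass the
    # page cache so every block is actually fetched from the SAN.
    errs=$(dd if="$target" of=/dev/null bs=1M conv=noerror 2>&1 \
        | grep -ci 'error' || true)
    echo "${errs:-0}"
}
```

Anything other than 0 would at least tell me which LUN is damaged, even if it won't map the bad sectors back to individual VM disks.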
Overall, the whole storage concept is a mess (because it's a SAN... so it's messy by design), involving LVM, multipathing, fibre channel switches and, lastly, an apparently terrible SAN controller.
PPS: You might be asking why I haven't mentioned backups yet. The reason is that the tape library only keeps the VMs for a two-week window, so all the available backups are faulty / non-booting too.