Looking for feedback on a past devastating SAN failure

mo_

Hi,

we have suffered a severe SAN failure, possibly resulting in the loss of ALL Windows VM data (and some CTs), and basically I'm looking for comments on whether that should be able to happen, and maybe some explanations of the oddities we are now experiencing.

Here's what happened:

About 3 weeks ago, one of the disks of the RAID6 that apparently hosts the volume group used as KVM shared storage (and another VG hosting the containers) failed. At the same time the hot spare disk died or was already broken; either way, the hot spare did not jump in. The SAN did not report anything because technically a RAID6 with 1 broken disk is still fully functional. I can already see the SAN controller's software being at fault here for not reporting this problem at all, but I don't think that's the actual problem. Furthermore, a RAID6 producing errors after 1 disk failure tells me this is a horrible, horrible controller, right?

The actual problem is that, even though the VG should still be functional, apparently there are bad sectors on it now. This became apparent because:

- Some containers had issues. Luckily, checking the LV with fsck is easy, and it found and corrected some errors.

- Now here's what I don't fully understand yet: pretty much all the KVM disks apparently are broken in the same way. We only discovered the problem because one of the Windows machines suddenly became unresponsive. It went straight downhill from there, because none of the Windows VMs boots anymore. ALL of them fail to boot. Windows complains that the last boot failed and offers to either start a repair tool or start normally. Attempting to start normally results in a hard reset (of the VM) with no error message whatsoever. Attempting to start the repair tool also fails with

Error 17 - Ramdisk device creation failed due to insufficient memory

However, the VM has 2 GB of RAM assigned, 20 MB of which are being used by this pre-boot environment. It's fairly obvious that it's not actually a lack of RAM but something else. Has anybody seen this before? Any ideas on what to do about it? The VM is using the default CPU setting of qemu64, if that makes any difference.

Oh, also the host system shows a RAM usage of 14/24 GB, so it really is not even close to being out of memory.


PS: I'm assuming there are bad sectors on the VG hosting the shared storage for the VMs, but how would I be able to check this? I guess I could run fsck against the LVM device that is one of the VM disks, but I would much rather check the underlying VG as a whole. Sadly the SAN doesn't offer anything for that, so I guess I'm stuck with tools on the Proxmox host.
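One host-side way to look for unreadable regions is a plain read-only pass over the whole device, noting every offset where the read fails. The following is only a minimal sketch under some assumptions: the function name `scan_readable` and the 1 MiB chunk size are made up for illustration, the path you pass in would be whatever physical volume or LV device you want to test, and the reads go through the page cache. It only detects sectors that return an I/O error; data that reads back fine but is silently corrupted will not show up here.

```python
import os
import sys

CHUNK = 1024 * 1024  # read size in bytes; an arbitrary choice for this sketch


def scan_readable(path, chunk=CHUNK):
    """Read `path` from start to end, read-only.

    Returns (bytes_read_ok, bad_offsets), where bad_offsets lists the start
    offset of every chunk whose read raised an I/O error.
    """
    bad_offsets = []
    bytes_ok = 0
    fd = os.open(path, os.O_RDONLY)  # never opened for writing
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # works for files and Linux block devices
        offset = 0
        while offset < size:
            want = min(chunk, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                # Unreadable region: remember where it starts and skip past it.
                bad_offsets.append(offset)
                offset += want
                continue
            if not data:
                break  # unexpected end of device
            bytes_ok += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_ok, bad_offsets


if __name__ == "__main__" and len(sys.argv) > 1:
    ok, bad = scan_readable(sys.argv[1])
    print(f"{ok} bytes read without error, {len(bad)} unreadable chunk(s)")
    for off in bad:
        print(f"  read error in chunk starting at byte offset {off}")
```

Run it as root with the device path as the only argument. Because it never writes, it should be safe to run against a device that is in use, though it will add read load to the SAN while it runs.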

Overall the whole storage concept is a mess (because it's a SAN ... so it's messy by design), involving LVM, multipathing, Fibre Channel switches and, lastly, an apparently terrible SAN controller.


PPS: You might be asking why I haven't mentioned backups yet. The reason is that the tape library only keeps the VMs for a two-week window, so all the available backups are faulty / not booting too.
 
Hi,
this looks like a very bad RAID controller. I guess more disks are faulty. (Which vendor is the RAID controller from?)
Are the errors still present with replaced (perhaps a different type of) disks? Some RAID controllers have trouble with specific disks (firmware).
Does the RAID controller show errors with other disks?

Normally you can run a rebuild or a checksum check on a RAID controller. Have you run such a check?

Udo
 
It's an Infortrend SAN, and it was filled with fully certified disks at the time of the failure. Once we had noticed the error and asked for replacement disks, they sent non-certified disks that were bigger than the original ones, because for whatever reason they didn't have any certified disks for the SAN anymore. They later replaced them with certified disks again. I'm mainly telling this story to illustrate why this is the last SAN of that brand we're going to use.

I shall try to find out whether that thing has such a check/rebuild mechanism.

Anything about that Windows error though? Or is that really just one of the weird errors that pop up whenever there are disk errors?
 
