Hi,
We have suffered a severe SAN failure, possibly resulting in the loss of ALL Windows VM data (and some CTs). Basically I'm looking for comments on whether this should even be able to happen, and maybe some explanations for the oddities we're now seeing.
Here's what's happened:
About 3 weeks ago, one of the disks in the RAID6 that apparently hosts the volume group used as KVM shared storage (and another VG hosting the containers) failed. Simultaneously the hot spare died / was already broken... in any case, the hot spare did not jump in. The SAN did not report anything, because technically a RAID6 with one broken disk is still fully functional. I can already see the SAN controller's software being at fault here for not reporting this problem at all, but I don't think that's the actual problem. Furthermore, a RAID6 producing errors after a single disk failure tells me this is a horrible, horrible controller, right?
The actual problem is that, even though the VG should still be functional, there apparently are bad sectors on it now. This has become apparent because:
- Some containers had issues. Luckily, checking an LV with fsck is easy, and it found and corrected some errors.
- Now here's what I don't fully understand yet: pretty much all the KVM disks are apparently broken in the same way. We only discovered the problem because one of the Windows machines suddenly became unresponsive, and it went straight downhill from there, because none of the Windows VMs boots anymore. ALL of them fail to boot. Windows complains that the last boot failed and offers to either start a repair tool or start normally. Attempting to start normally results in a hard reset (of the VM) with no error message given whatsoever. Attempting to start the repair tool also fails, with:
Error 17 - Ramdisk device creation failed due to insufficient memory
However, the VM has 2 GB of RAM assigned, of which only 20 MB are used by this pre-boot environment. So it's fairly obvious that it's not actually a lack of RAM but something else. Has anybody seen this before? Any ideas on what to do about it? The VM is using the default CPU setting of qemu64, if that makes any difference.
Oh, and the host system shows a RAM usage of 14/24 GB, so it really isn't even close to being out of memory by any stretch of the imagination.
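For the record, here's roughly the fsck routine I've been using on the container LVs. The LV path in the comments is just an example from my setup, and I always do a read-only pass first to see the damage before letting it change anything:

```shell
# check_fs: read-only e2fsck pass first, then an automatic repair only
# if errors were reported. Works on any ext2/3/4 block device or image.
# In my case the target was a container LV, e.g. /dev/raidvg/vm-101-disk-1
# (example path -- substitute your own naming scheme).
check_fs() {
    dev="$1"
    # -f: force a check even if the filesystem is marked clean
    # -n: answer "no" to all questions, i.e. report only, change nothing
    if e2fsck -fn "$dev" >/dev/null 2>&1; then
        echo "clean"
    else
        # -p: "preen" mode, automatically fixes whatever is safe to fix
        e2fsck -fp "$dev" >/dev/null 2>&1
        echo "repaired"
    fi
}
```

Running this against an LVM snapshot instead of the live device would obviously be safer, but with the containers stopped anyway it hasn't mattered much here.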
PS: I'm assuming there are bad sectors on the VG hosting the shared storage for the VMs, but how would I check this? I guess I could run fsck against the LVM device backing one of the VM disks, but I would much rather check the underlying VG as a whole. Sadly, the SAN doesn't offer anything for that, so I guess I'm stuck with tools on the Proxmox host.
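What I'm considering as a poor man's surface scan from the Proxmox host is simply reading the whole underlying device sequentially and counting read errors -- something like the sketch below. The /dev/mapper name in the comments is made up; the real target would be whatever multipath device backs the VG:

```shell
# surface_scan: sequentially read an entire block device (or image file)
# and report how many read errors dd ran into. Strictly read-only.
# On the host, the target would be something like /dev/mapper/san_lun
# (hypothetical name -- substitute the multipath device backing the VG).
surface_scan() {
    target="$1"
    # conv=noerror keeps dd going past unreadable blocks instead of
    # aborting; on a real device, adding iflag=direct would bypass the
    # page cache so every block is actually fetched from the SAN.
    errs=$(dd if="$target" of=/dev/null bs=1M conv=noerror 2>&1 \
        | grep -ci 'error' || true)
    echo "${errs:-0}"
}
```

Anything other than 0 would at least tell me which LUN is damaged, even if it won't map the bad sectors back to individual VM disks.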
Overall, the whole storage concept is a mess (because it's a SAN... so it's messy by design), involving LVM, multipathing, fibre channel switches and, lastly, an apparently terrible SAN controller.
PPS: You might be asking why I haven't mentioned backups yet. The reason is that the tape library only keeps the VMs for a two-week window, so all the available backups are faulty / non-booting too.