Significant Filesystem Corruption with LSI 2308

Silvaire

Nov 4, 2015
I've been troubleshooting a considerable meltdown of my NAS just after I switched to Proxmox from ESXi. Under ESXi, my NAS was a VM with the LSI 2308 (in IT mode) passed through, and I never had a problem.

I set up Proxmox with a KVM VM for the NAS, with IOMMU enabled and the controller passed through via "hostpci0: 2:00.0". Everything was working. After about a week on Proxmox, I tried to delete a file on the NAS and it crashed. On reboot, every single hard drive on the controller failed fsck and complained about missing superblocks. Due to my own lack of knowledge in dealing with the situation, I lost 2 of my 6 drives; I salvaged their contents with PhotoRec onto another hard drive I bought for the purpose. The other 4 seemed OK, but as a precaution I re-partitioned and reformatted all of the drives with ext4 again.
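
For context, here is a minimal sketch of the relevant part of the VM config file (the VM ID and the non-hostpci lines are placeholders I'm describing from memory, not exact values):

Code:
# /etc/pve/qemu-server/<vmid>.conf (sketch)
# host booted with IOMMU enabled (e.g. intel_iommu=on on the kernel command line)
memory: 8192
cores: 4
hostpci0: 2:00.0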

I set the NAS back up and experienced a similar event while writing to one of the hard drives: the drives all dropped out of their mounts, and on reboot all of them had filesystem corruption again. In Parted Magic I can access most of the essential files (I have other backups) and am copying them out to my desktop in the meantime for easy access.

I updated the BIOS for my motherboard, a Supermicro X10SL7-F, and updated the firmware for the LSI 2308 to the latest version. I ran fsck on all of the hard drives and they passed except one, which reported a bunch of multiply-claimed blocks; from what I can tell, the files I had been copying to it somehow occupied the same blocks as other files (even though I had wiped the filesystem beforehand). fsck did complete on that drive.

I did some research and changed my VM config from "hostpci0: 2:00.0" to "hostpci0: 2:00.0,pcie=1,driver=vfio" with "machine: q35", as mentioned in the wiki for PCIe passthrough. My card shows up as PCIe, so I thought maybe that was the problem.
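
In other words, the relevant lines in the VM config went from the plain hostpci0 entry to something like this (sketch; all other lines omitted):

Code:
machine: q35
hostpci0: 2:00.0,pcie=1,driver=vfio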

However, upon boot, I tried to copy some files off the drives and it all crapped out again -- the drives disappeared from Samba, and on reboot all of them complain of missing superblocks and require a manual fsck. I'm scared to run these manual fscks, because that is how I lost a drive earlier.
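
The least destructive thing I can do for now seems to be a read-only check, plus listing where the backup superblocks should be, before letting anything write to the disks (a sketch; /dev/sdb1 is a placeholder for whichever partition is being checked):

Code:
# read-only check, makes no changes to the filesystem
e2fsck -n /dev/sdb1
# dry run of mke2fs: prints where the backup superblocks would live, writes nothing
mke2fs -n /dev/sdb1
# only after imaging the drive: repair against a backup superblock, e.g.
# e2fsck -b 32768 /dev/sdb1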

I'm at my wit's end here, as this goes beyond my knowledge and Googling skill, so I'm here to ask for advice. In my searching, this is the only reference I could find: https://peterkieser.com/2013/08/07/...ontroller-to-a-guest-causing-data-corruption/

The timing of this issue (right after switching from ESXi) makes me strongly believe that this is something KVM-related. All of my HDDs pass SMART tests, and each time the filesystem corruption happens, it hits every drive on the controller and none of the other drives.

Is there any way to salvage Proxmox with this, or should I pack it in? Has anyone else had similar issues? Thanks in advance!
 
I've never tried PCI passthrough, but I'm using an LSI 9260 (2208 chip) with the Proxmox kernel's LSI driver handling the card and drives.
The firmware is in standard LSI RAID mode.
I then export a RAID device to the VM as a 'scsi' device (for example: scsi0: /dev/sdc), as in the sketch below.
This works quite well. Perhaps things are a bit different because you are using an OEM card; I'm not sure.
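
A rough sketch of what that looks like in the VM config (the device path follows the example above; everything else is omitted, and you would substitute whatever block device your RAID volume appears as):

Code:
# /etc/pve/qemu-server/<vmid>.conf (sketch)
# hand the kernel-managed RAID block device to the guest as a SCSI disk
scsi0: /dev/sdc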
 
I am currently running tests on a bare-metal setup and it seems to be working fine. My working hypothesis is that my earlier lost HDDs could have been saved had I checked them outside the passthrough environment. So while I can throw the same hard drives into passthrough and produce all kinds of ext4 superblock errors, and can't do even moderate I/O without them dropping out, I have (as of the last 4 hours) been working with them fine on bare metal with no fsck complaints.
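
By "checked them outside the passthrough environment" I mean something along these lines on the bare-metal host (a sketch; the device and mount point are placeholders):

Code:
# mount a suspect drive read-only on the bare-metal host and verify the data is intact
mkdir -p /mnt/check
mount -o ro /dev/sdb1 /mnt/check
ls /mnt/check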
 
The same thing happened to me, on the same board. I was able to recover all of the drives, but there were times when I thought I was doomed. As soon as I went back to using virtio in the VM's .conf file to pass the disks through via the kernel, everything started working normally.
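
For reference, the kernel-mediated approach looks roughly like this in the VM config (a sketch; the by-id paths are placeholders for your actual disks):

Code:
# /etc/pve/qemu-server/<vmid>.conf (sketch)
# whole-disk passthrough handled by the host kernel instead of PCI passthrough
virtio0: /dev/disk/by-id/ata-EXAMPLE_DISK_SERIAL_1
virtio1: /dev/disk/by-id/ata-EXAMPLE_DISK_SERIAL_2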

It's really a shame, because the whole reason I bought this board was to be able to pass the LSI controller directly to my NAS software. That said, using virtio instead has not caused any performance issues.

If someone does have a solution for this issue, please post. I am very interested in finding out what I'm doing wrong. Like Silvaire, I couldn't find anything on the web.
 
I feel moderately more relieved knowing that it was not just me; I've had a lot of sleepless nights in the past 2 weeks. :P I wish I had realized from the beginning that it was the controller passthrough, but this level of sysadmin troubleshooting was very much learn-as-you-go for me.

Not that it's any consolation to you, tomfowler, but the passthrough does work 100% in ESXi. I switched to Proxmox mainly for the web UI and access to LXC containers. Before that I ran ESXi with the PCI passthrough for approximately 1.5 years with no problems.
 
According to the folks over in the FreeNAS forums, it is discouraged anyway. I've run Proxmox for about 18 months and it has been super stable using virtio. This was my first attempt at passthrough. Alas, no big deal; it works fine having the kernel do the work.


 
I've been troubleshooting a considerable meltdown of my NAS just after I switched to Proxmox from ESXi. Under ESXi, my NAS was a VM with the LSI 2308 (in IT mode) passed through, and I never had a problem.
<snip>
Hi!

Silvaire, have you resolved this issue?
It seems that I have the same trouble.

Code:
[    0.000000] DMI: Supermicro SYS-1018D-73MTF/X10SL7-F, BIOS 2.00 04/24/2014
Code:
proxmox-ve: 4.1-34 (running kernel: 4.2.6-1-pve) 
pve-manager: 4.1-5 (running version: 4.1-5/f910ef5c) 
pve-kernel-4.2.6-1-pve: 4.2.6-34

I created a KVM machine with a pair of VirtIO drives on the built-in LSI RAID1, and it worked fine. But at some point the filesystems were remounted read-only due to filesystem corruption. I didn't use passthrough, just plain VirtIO drives with write-through caching.
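
The disk lines looked something like this (a sketch; the storage name and sizes are placeholders, the relevant part is cache=writethrough):

Code:
# /etc/pve/qemu-server/<vmid>.conf (sketch)
virtio0: local-lvm:vm-100-disk-1,cache=writethrough,size=32G
virtio1: local-lvm:vm-100-disk-2,cache=writethrough,size=500G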
 
I had the issue on both P19 and P20 firmware. For the record, I was never able to resolve it, and because I didn't know what was going on until the end, I caused myself data loss by running the filesystem recovery within the virtual environment. The takeaway is that if you look at the data outside the virtual environment, it is fine. It is something to do with KVM PCI passthrough and that controller (which is one of the more popular controllers out there, so it's puzzling it hasn't affected more people). I ended up abandoning Proxmox and KVM and am currently just running Debian.

I don't know if your problem is the same as mine, though, since you are not using PCI passthrough. It's definitely a KVM issue either way -- the whole setup worked fine for years in ESXi.
 
