TrueNAS CORE crashes my entire PVE server.

pico1180

New Member
Jul 12, 2023
Hello everyone,

To say I am new at this is a pretty big understatement. And by "new" I mean I started 3 weeks ago. I had some very limited knowledge of Linux, which is to say that I installed Ubuntu a few times. But other than that, I would say my experience in this arena is zero.

I was hoping just to have a discussion about the issue I am having. If it can be resolved, that would be great, but really all I'm looking to do is hear people's thoughts on this issue. This is basically a "cool story, bro" moment.

The dramatis personae here are a Gigabyte GA-6PXSV4 and a Highpoint RocketRAID 2720A. I am confident both pieces of hardware and all other secondary hardware in the server are working fine.

I was running TrueNAS CORE fine for the last few weeks. It was set up with five (5) 6TB drives, all individually mounted to TrueNAS. That particular setup was a Chinese X79 board, and all the drives were plugged into the onboard SATA connectors. There were no obvious glitches or hiccups.

I found bifurcation and PCI passthrough to be a little... unpredictable on the Chinese motherboard. However, I was new to it and may not have completely understood what I was doing. So I swapped the board out for the aforementioned GA-6PXSV4. In the interest of full disclosure, this board could be faulty as well. Two (2) of the four (4) LAN ports don't seem to function, and even though there is clearly SAS on the board, nothing plugged into it is detected, there is no reference to it in the BIOS, and no OS I have tried acknowledges anything is there. So that could very well be dead.

With that out of the way...

I gave the GA-6PXSV4 a fresh PVE install and restored all my VMs from backup. I put in the 2720, set up PCI passthrough, blacklisted it from PVE, and handed the card over to TrueNAS. This is where the fun began.

I tried to run the 2720 in RAID mode. I know this is not recommended, but I didn't realize it was impossible. TrueNAS hung on repeated attempts to boot. Specifically, TrueNAS hung on the 2720's hard drive detection BIOS screen. Again, I know it wasn't recommended, but I just wanted to see what would happen. Apparently hard locking the VM was what would happen.

So I flashed the 2720 to "IT" mode. It doesn't actually call it that, but I think that is the colloquial term for it. Highpoint calls it "rapid boot" or something.

So I flashed it to IT mode and BOOM, TrueNAS booted up, detecting all the drives on the 2720 with no issue.
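For anyone checking a passthrough setup like this one, a first sanity check is to see which IOMMU group the card lands in on the PVE host. Below is a minimal Python sketch that reads the standard Linux sysfs tree at /sys/kernel/iommu_groups. The address 0000:04:00.0 is an assumption: it is the device named in the DMAR fault posted later in this thread with the usual 0000 PCI domain prefix added, and it has not been confirmed to be the 2720.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # No such directory: IOMMU is disabled or this is not a Linux host.
        return groups
    for group in base.iterdir():
        devices_dir = group / "devices"
        if not group.name.isdigit() or not devices_dir.is_dir():
            continue
        groups[int(group.name)] = sorted(d.name for d in devices_dir.iterdir())
    return groups


def group_of(address, groups):
    """Return the group number holding `address`, or None if it is not listed."""
    for number, devices in groups.items():
        if address in devices:
            return number
    return None


if __name__ == "__main__":
    groups = iommu_groups()
    for number in sorted(groups):
        print(number, " ".join(groups[number]))
    # Assumed address (see the note above); replace with the card's real address.
    print("group of 0000:04:00.0:", group_of("0000:04:00.0", groups))
```

If the card shares its group with other devices, those devices generally have to be passed through together with it, which is worth ruling out before blaming the board or the card.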

I had two datasets: one on the five (5) 6TB drives and a new one I had just created from three (3) 18TB drives. I got the shares going, logged into a VM, and was just going to copy the data from one share to the other. I was hoping to leverage the speed of the hard drives and not worry about network bandwidth. But the data was transferring at around 20 to 50MB/s. I thought that was weird but lacked the experience to understand how or why that might be the case.

I tried copying files to and from the shares from a bare metal machine. The transfer rates were maxing out the 2.5Gb network. I had a backup of the 5x6TB data that I wanted to move to the 3x18TB dataset, so I decided to do it from bare metal over the network to the TrueNAS VM. I started with 1.7GB of random data consisting of small files ranging from just a few MB to the largest at around 9GB. This took the better part of 10 hours. However, it was over the network, from a 2.5-inch drive attached via USB3, so I dismissed that time as completely normal considering the circumstances.

After that was done, I started on much larger files: multiple terabytes of large files (20GB+) coming off a USB3 3.5" Barracuda. The transfer rates started at around 160 to 180MB/s, which I accepted as fine, but dropped rapidly to about 20 to 50MB/s. I thought the cause was a lack of memory assigned to TrueNAS. It had 16GB. I gave it 32GB and tried to restart TrueNAS. It locked the entire server. The PVE boot screen started reporting something like core 1 locked up, core 2 locked up, core 3 locked up, and so on. Or it was words to that effect.

I soft booted the server, but the RocketRAID didn't come up during boot. I powered down the server, turned it back on, and the RocketRAID was there.

Got it all booted back up and back into TrueNAS with its 32GB of RAM. I started the aforementioned transfer of large files again and the speeds were the same. The speed did seem to be fluctuating, though. It was all over the place, from 120MB/s down to 20 or so.

I let the transfer go and came back 6 hours later, only to find that the speed had settled in at around 20MB/s. I tried to reboot TrueNAS again and it crashed the server like before, reporting that the cores had locked up.
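To put those transfer rates in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1 TB = 1,000,000 MB) and ignores protocol overhead, and the 1 TB size is only an illustrative figure, not the actual amount of data being moved.

```python
def link_ceiling_mb_s(gigabits_per_s):
    """Theoretical ceiling of a network link in MB/s (decimal units, no overhead)."""
    return gigabits_per_s * 1000 / 8


def hours_to_copy(terabytes, mb_per_s):
    """Hours needed to move `terabytes` of data at a sustained `mb_per_s`."""
    return terabytes * 1_000_000 / mb_per_s / 3600


if __name__ == "__main__":
    print("2.5Gb link ceiling:", link_ceiling_mb_s(2.5), "MB/s")
    # 1 TB is an assumed, illustrative size.
    for rate in (20, 50, 160):
        print(f"{rate} MB/s -> {hours_to_copy(1, rate):.1f} hours per TB")
```

Under these assumptions, every terabyte takes roughly 14 hours at 20MB/s versus under 2 hours at 160MB/s, so the drop makes a multi-terabyte copy take days instead of hours.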

So that is where we are at.

It's hard for me to believe a server motherboard like the GA-6PXSV4 isn't handling PCI passthrough correctly. But there are some indications the board is failing in other areas so who knows.

Maybe the 2720 isn't compatible with PCI passthrough? I don't know if anyone has any data on that.

Maybe going from mounting hard drives to the TrueNAS VM on the X79 Chinese board to SAS controller passthrough on the GA-6PXSV4 broke something in TrueNAS?

My next step moving forward was going to be a fresh install of TrueNAS. But if that doesn't fix it...

I don't want to mount drives as I don't get telemetry off the drives when I do that.

I will throw money at this problem if the community thinks the board is faulty or if the 2720 isn't compatible with PCI passthrough.

I guess my next step should be a fresh setup of TrueNAS and go from there?
 
I am suspecting this is the culprit. Any suggestions?

[ 1640.329616] kvm [12899]: ignored rdmsr: 0xc0011029 data 0x0
[ 1640.901482] DMAR: DRHD: handling fault status reg 2
[ 1640.901506] DMAR: [DMA Write NO_PASID] Request device [04:00.0] fault addr 0xffffa000 [fault reason 0x05] PTE Write access is not set
[ 1664.297829] irq 29: nobody cared (try booting with the "irqpoll" option)

and there was this:

[ 1664.298232] handlers:
[ 1664.298251] [<00000000725317b0>] vfio_intx_handler [vfio_pci_core]
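When the kernel log is long, it can help to pull out just the passthrough-related lines. This is a minimal Python sketch, not an official tool: the regular expressions are written against the exact lines pasted above, and other kernel versions may word these messages differently. In practice the input lines would come from the host's kernel log, for example saved `dmesg` output.

```python
import re

# Patterns modeled on the log lines quoted in this thread (an assumption:
# other kernel versions may format these messages differently).
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)[^\]]*\] "
    r"Request device \[(?P<device>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>(?:0x)?[0-9a-fA-F]+) "
    r"\[fault reason (?P<reason>(?:0x)?[0-9a-fA-F]+)\] (?P<text>.*)"
)
IRQ_NOBODY = re.compile(r"irq (?P<irq>\d+): nobody cared")


def summarize(lines):
    """Return (DMAR fault dicts, unhandled IRQ numbers) found in `lines`."""
    faults, irqs = [], []
    for line in lines:
        match = DMAR_FAULT.search(line)
        if match:
            faults.append(match.groupdict())
            continue
        match = IRQ_NOBODY.search(line)
        if match:
            irqs.append(int(match.group("irq")))
    return faults, irqs


SAMPLE = [
    "[ 1640.901506] DMAR: [DMA Write NO_PASID] Request device [04:00.0] "
    "fault addr 0xffffa000 [fault reason 0x05] PTE Write access is not set",
    '[ 1664.297829] irq 29: nobody cared (try booting with the "irqpoll" option)',
]

if __name__ == "__main__":
    faults, irqs = summarize(SAMPLE)
    for fault in faults:
        print(fault["device"], fault["op"], fault["addr"], fault["reason"], fault["text"])
    print("unhandled IRQs:", irqs)
```

Counting how often the same device address shows up in the faults gives a quick indication of whether one passed-through card is responsible for all of them.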
 
I am suspecting this is the culprit. Any suggestions?

[ 1640.901506] DMAR: [DMA Write NO_PASID] Request device [04:00.0] fault addr 0xffffa000 [fault reason 0x05] PTE Write access is not set
I saw this kind of error many years ago with PCI (not PCIe) TV tuners that would just refuse to work with passthrough, and I've used USB ever since. My suggestion would be to use a (very) different drive controller, one that is known to work well with passthrough and Linux, and maybe switch to TrueNAS SCALE for more driver support.
 
SCALE didn't help. I went with an LSI 9207 based on the literature provided by TrueNAS. I will try the RocketRAID on bare metal under TrueNAS and see how that goes while I await the delivery of the 9207.

Side note: mounting the drives in the VM seemed to work well, but TrueNAS couldn't pull telemetry off the drives. It can pull telemetry now that I have passed the RocketRAID through to it, though.
 
