PVE 6.4 (and 6.3) installation to ZFS mirror crashes

ZPrime
Apr 29, 2021
Proxmox gurus - subject says most of the problem.

I have a Protectli "FW6D" unit (Core i5-8250U, passively cooled with a massive heatsink case) that has two internal SSDs: one mSATA (not M.2, an actual mSATA module), the other a standard cabled SATA device. The mSATA is a Kingston SSD, nominal "240GB", and the cabled drive is a Samsung 840 Pro "256GB". The system has 2x 8GB DDR4 modules installed, and they test OK with the built-in memory test utility.

When installing, I have the CSM entirely disabled in the BIOS, so it is booting from the install USB in UEFI mode.

I am configuring the zpool to only be 200GB, so even though the drives are slightly mismatched in size, I'm not trying to fill them.

Everything else in the ZFS config is at default (other than setting hdsize to 200GB). I'm able to proceed through the installation; it appears to create the zpool correctly and even begins copying and extracting packages onto it... and then the screen goes black and I'm staring at the UEFI/POST screen again. The installation never completes; it seems to die around 50-60% of the way through. I've repeated this multiple times and it always dies mid-install with a system reset.
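For context, here's my rough understanding of what the installer sets up for a ZFS mirror target. This is illustrative only (device names are placeholders, and partition numbers/type codes are my assumptions, not taken from the installer source):

```shell
# Sketch only -- roughly what a UEFI ZFS-mirror install lays out.
# /dev/sda and /dev/sdb stand in for the two SSDs.
sgdisk -n1:1M:+512M -t1:EF00 /dev/sda   # ESP for UEFI boot
sgdisk -n2:0:+200G  -t2:BF01 /dev/sda   # ZFS partition, capped by hdsize
sgdisk -R /dev/sdb /dev/sda             # replicate the layout to disk 2
sgdisk -G /dev/sdb                      # randomize GUIDs on the copy

# Mirror vdev from the two ZFS partitions; rpool is the PVE default name.
zpool create -f -o ashift=12 rpool mirror /dev/sda2 /dev/sdb2
```

Since a mirror vdev is limited to its smallest member anyway, the 200GB hdsize cap should make the slight 240GB/256GB mismatch irrelevant.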

I then went back and installed from the same stick, to the mSATA SSD in EXT4, and that installation completes without any problem (although I did not actually attempt to boot the result). Something is going wrong related to either ZFS, or the secondary (cabled) SATA disk (which was known-functional and working fine in an external USB enclosure previously).

I saw that PVE 6.4 was just released, so I'm going to build a USB stick for that and see if everything Just Works there... if not, I will update.
If anybody has any thoughts on why 6.3 is crashing in the middle of the install, or what I can do to help track down the root cause, I would be very grateful (and would be happy to provide debug information if it will help improve the product / fix the problem). :)

[edit] Same problem happening with 6.4. Help?
 
Hi,

I just tried to (virtually) recreate your problem with one 240GB disk, one 256GB disk and UEFI configured, but it always worked:
- RAID 0 with hdsize 200 worked
- RAID 1 with hdsize 200 worked
- RAID 1 with hdsize 240 (default) worked

Could you please start the installation in Debug mode, change to another tty with Ctrl + Alt + F2, and see if something interesting gets logged?
Does ZFS RAID 0 with only a single disk (especially the cabled SATA disk) work?
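For example, from the debug shell something like the following might catch the relevant output (the log path is from memory and may differ between installer versions):

```shell
# On tty2 (Ctrl + Alt + F2) during a Debug-mode install.
# /tmp/install.log is believed to be where the installer writes its log;
# the exact path may vary between releases.
tail -f /tmp/install.log

# Or follow the kernel ring buffer for hardware errors
# (MCEs, SATA link resets, etc.) while the install runs:
dmesg --follow
```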
 
I did actually try it in debug mode, but I didn't realize it was printing more info to another tty so I didn't think to swap over there and try to catch it. The problem is that it bombs out and restarts, so unless I hold a phone and video the screen, I may not be able to note what it says when it's dying. The fact that it's fully rebooting kind of makes me suspect some sort of hardware problem though? I would think that even a kernel crash would leave me staring at a screenful of crash output rather than fully rebooting the box.

Is there a way to get debug mode to log output to a file on the USB stick so I can do a post-mortem if I can't see or video anything interesting?
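One approach I could try myself, assuming the installer environment has standard tools and I plug in a second stick (the device name below is a placeholder):

```shell
# Sketch: persist the kernel log to a second USB stick so it survives
# the reset. /dev/sdc1 stands in for the spare stick's partition.
mkdir -p /mnt/log
mount /dev/sdc1 /mnt/log
dmesg --follow > /mnt/log/dmesg.txt &

# sync periodically so the data actually hits the stick before the crash
while sleep 5; do sync; done
```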

I didn't realize zfs-raid0 would allow me to do a single disk. I tried zfs-raid1 and only using one disk and it did not allow that. I will try zfs-raid0 on each single disk, along with just an EXT4 install to the cabled SATA disk, and report back.

Thankfully, this system is going to be a replacement for my home router, and the router I'm using for now is working well enough that I'm not in a rush to get the new one running. So, I've got time to troubleshoot this. (I'm planning to virtualize the router with NIC passthrough and just use PVE to make it easier to try different router OS distributions, like OPNsense vs. VyOS vs. ClearOS, without having to mess around with long internet outages at home while I tinker with the OS or swap cables).
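For the curious, the passthrough part of that plan should just be something like this on the PVE side (the VMID and PCI address are placeholders; IOMMU has to be enabled first, e.g. intel_iommu=on on the kernel command line):

```shell
# Find the NIC's PCI address:
lspci -nn | grep -i ethernet

# Hand it to the router VM (VMID 100 and 0000:02:00.0 are examples):
qm set 100 -hostpci0 0000:02:00.0
```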
 
Just finished some more testing.

Installed successfully to the SATA SSD using EXT4 (so I don't think it's anything with the SATA interface or that drive at this point).

With ZFS raid0 single drive on the internal mSATA, same crash and instant reboot. :(
Even worse, it seems like whenever I switch away from the X tty, it pauses the install process? So I can't be watching the tty2 console while allowing the install to continue in background on tty4/X, which means catching the actual failure is going to be impossible without some sort of lightning reflexes.

I'm open to further ideas to catch what's going wrong!

I'm also going to reach out to the hardware vendor (Protectli) to see if they have any thoughts. Their BIOS / UEFI is far more extensive than even the most open "overclocker/gaming" systems I've ever seen - it has config options I've never even heard of. I left things at defaults, but I'm wondering if maybe something should be tweaked... although I don't know what or why, when I can install with EXT4 to either drive alone without it dying.

I kind of wonder if it's an insufficient power supply problem, like apt/dpkg spiking the CPU (which I assume is happening as it's trying to extract the .debs to the system) along with ZFS checksumming and hard write load to both SSDs simultaneously are managing to overdraw the external power brick...? Just a wild guess though. It's a 60W (5A @ 12VDC) supply and I don't think I'm anywhere near that level given the components in the system...
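Napkin math on my power theory (every number here is a rough guess, not a measurement of this exact hardware):

```shell
# Back-of-the-envelope power budget for the FW6D under install load.
CPU=15        # i5-8250U TDP in watts (can exceed this briefly on turbo)
RAM=6         # ~3 W per DDR4 SO-DIMM, 2 sticks
SSD_MSATA=4   # SATA SSD worst-case sustained write
SSD_SATA=4
BOARD=10      # chipset, NICs, conversion losses -- pure guess

TOTAL=$((CPU + RAM + SSD_MSATA + SSD_SATA + BOARD))
echo "Estimated peak draw: ${TOTAL} W of a 60 W brick"
```

Even with pessimistic guesses that lands around 39 W, well under the 60 W rating, so the brick would have to be pretty far out of spec for this theory to hold.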
 
Oho! Ran memtest again, this time from the GRUB menu of the freshly installed system... and it threw an error, around 168.8MB. I'm re-running that test some more now to see if I can get more errors. Next step is to try yanking that stick of RAM and moving the other one to the same slot (and running with 8GB instead of 16) to see if things work. These are brand new sticks of Crucial DDR4-2400 (system uses SO-DIMMs) from Microcenter, so I had no reason to suspect they'd be bad, but I guess anything can happen. :)

Unfortunately the system doesn't support ECC RAM (I know it's "preferred" with ZFS but I figure ZFS with non-ECC is better than nothing at all).

[edit]
Something fishy is going on here. I put Passmark's MemTest86 (different from the Memtest86+ included with PVE) on a USB stick, and the system rebooted itself in the middle of a test with that too. No memory errors that I noticed, but the system did the same power-cycle behavior (the power LED even turns off briefly before it lights up again and shows BIOS output).

An Ubuntu 20.04 live USB is also exhibiting the same behavior.

Don't think this is Proxmox's fault at this point. ;)
 
After a bunch more troubleshooting, I think it's a CPU problem, and more specifically the integrated GPU (and possibly some of the SIMD/vector units) in the i5-8250U in this system.

The Ubuntu LiveUSB always dies/resets right when it is trying to fire up the graphical interface. If I use "safe graphics" mode, it seems to get further, and actually displays a mouse cursor and a blank desktop... but then crashes/resets the same way (so definitely not a Proxmox-exclusive problem).

Obviously, the PVE installer is in a graphical mode and seems to be fine, but I would guess those graphics are far less accelerated than even "safe mode" Ubuntu Desktop graphics. I'm guessing that maybe something in the ZFS codepath (maybe checksumming or compression?) is hitting the AVX registers, or something else "graphics-adjacent," and crashing the system. This would explain why installing to EXT4 is fine but ZFS isn't. I didn't actually try firing up VMs inside the EXT4 install of PVE, but I suspect that anything beyond DOS or a text-only server install might also cause a crash.
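If I get the EXT4 install booting, one way I could test the SIMD theory is via the OpenZFS module parameters (these paths are from the OpenZFS docs as I understand them and may vary by version):

```shell
# With the zfs modules loaded on the working EXT4 install.
# Show which fletcher-4 (checksum) implementation ZFS picked; the
# "fastest" default usually resolves to an AVX/SSE variant on this CPU.
cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl

# Force the scalar (non-SIMD) implementation to test the theory:
echo scalar > /sys/module/zcommon/parameters/zfs_fletcher_4_impl

# Per-implementation benchmark results, if exposed by this version:
cat /proc/spl/kstat/zfs/fletcher_4_bench
```

If a ZFS pool survives heavy writes with the scalar implementation but crashes with the SIMD one, that would point squarely at the CPU.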

The vendor is sending me a replacement, hopefully that shows up this week. I'm expecting it to be fine and solve everything. :)
 
Protectli replaced my "FW6D" and the problem is solved, I have PVE installed on the new one with a ZFS mirror pool. :)

Something was definitely wrong with the old machine. Comparatively, the new one has far fewer BIOS settings available, which tells me something really weird was going on with the BIOS on the old one. (It almost looked like an engineering / development BIOS, it had all sorts of options I've legitimately never seen in a BIOS before, and I've been tinkering with PCs since the 486 days.)

It was also running ridiculously hot. The system is passively cooled; the entire case is the heatsink (and it's pretty large). The "bad" system was reading over 140°F (60°C) on the heatsink with a cheap IR temp gun, and that was just from running memtest86. It was hot enough to burn if you touched it for more than a few seconds; I had a hard time handling it to remove the RAM after shutting it down.

The new system, during the PVE installation to ZFS, was barely warm to the touch. I mean, I expect a passively cooled 8th gen i5 to be fairly warm if it's being stressed hard, but memtest86 is not a CPU burn test.
 
