Monthly server lockups

opndiv

New Member
Mar 19, 2018
Hi all,

Ever since we upgraded to ZFS, we have been experiencing monthly server lockups. Everything runs fine until, at some point, everything stalls and we have to hard-reset the server. Unfortunately, I can't provide any logs, as nothing is written to disk when the server locks up. We're using a hardware RAID underneath, in a Supermicro server.

Proxmox packages:

Code:
ii  pve-cluster                          5.0-19                         amd64        Cluster Infrastructure for Proxmox Virtual Environment
ii  pve-container                        2.0-17                         all          Proxmox VE Container management tool
ii  pve-docs                             5.1-12                         all          Proxmox VE Documentation
ii  pve-firewall                         3.0-5                          amd64        Proxmox VE Firewall
ii  pve-firmware                         2.0-3                          all          Binary firmware code for the pve-kernel
ii  pve-ha-manager                       2.0-4                          amd64        Proxmox VE HA Manager
ii  pve-kernel-4.13.4-1-pve              4.13.4-26                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-4.13.8-3-pve              4.13.8-30                      amd64        The Proxmox PVE Kernel Image
ii  pve-libspice-server1                 0.12.8-3                       amd64        SPICE remote display system server library
ii  pve-manager                          5.1-38                         amd64        Proxmox Virtual Environment Management Tools
ii  pve-qemu-kvm                         2.9.1-3                        amd64        Full virtualization on x86 hardware
Kernel: Linux 4.13.8-3-pve #1 SMP PVE 4.13.8-30 (Tue, 5 Dec 2017 13:06:48 +0100) x86_64 GNU/Linux
Hardware: 192GB RAM, 12TB HDD RAID, 4TB SSD RAID and 750GB RAID for boot

Is anybody else experiencing these kinds of issues? We have a clone of that server in the same rack. It isn't running any VMs and is only used for ZFS replication, and it has never had any problems. So is this problem related to load?

Should I just run an update for all system packages and reboot the server on the weekend?

Kind regards,
Oliver
 
Hi Oliver,
Of course you should update PVE to the current version (apt dist-upgrade).

Have you limited zfs_arc_min/zfs_arc_max in /etc/modprobe.d/zfs.conf?
How much RAM is used by the VMs?
BTW, years ago I also had a Supermicro server that rebooted monthly (or every 2-4 weeks), until I updated the BIOS!

Udo
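To answer the "how much RAM is used by the VMs" question, a rough sketch along these lines could add up the memory assigned to running VMs. It assumes the usual `qm list` column layout (VMID, NAME, STATUS, MEM(MB), ...); the helper name `sum_vm_mem_mb` is made up for this example, and containers are not counted.

```shell
#!/bin/sh
# Sum the MEM(MB) column of `qm list` for VMs whose STATUS is "running".
# Assumed column order: VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
sum_vm_mem_mb() {
    awk 'NR > 1 && $3 == "running" { total += $4 } END { print total + 0 }'
}

# Only query qm when it is available (i.e. on a Proxmox host).
if command -v qm >/dev/null 2>&1; then
    echo "RAM assigned to running VMs: $(qm list | sum_vm_mem_mb) MB"
fi
```

Comparing that total plus the ARC limit against the installed RAM gives a first idea of whether the host is overcommitted.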
 
It should not matter if you run ZFS on top of hardware RAID. You just lose a few ZFS features.
It is more likely a matter of memory pressure from ZFS caching and the like, or maybe swap on a ZVOL.

Please search the forums and the wiki. You should find plenty of info on tuning the system for stable ZFS operation.
But mainly: limit the ZFS ARC memory and reduce swapping.
 
It *does* matter whether you run ZFS on HW RAID or not! And it is not just about some missing features. ZFS was designed from scratch with the intention of having direct, unfiltered access to the underlying hardware (HDD/SSD). Running ZFS on top of HW RAID means that, in the worst case, you might lose data.

This has been discussed many times, so I'm not going to put it all here again. Just read this explanation. I believe, openzfs-devs know what they are talking about...
 
I read the "explanation" linked above and I stand by what I wrote.

Here is a copy-paste from that article:
"While ZFS will likely be more reliable than other filesystems on Hardware RAID, it will not be as reliable as it would be on its own."
In other words, it's as reliable as EXT*, XFS or ReiserFS, or better.

I challenge you to provide an example where using ZFS on top of HW RAID would lead to data loss, and would not lead to data loss when using other mainstream *nix filesystems like EXT4. Maybe you can find an example from the OpenZFS devs. I would believe them too. :)

As for your idea that the HW RAID controller's CPU might be the bottleneck: while it might be true, it's not very likely, as I doubt ZFS puts that much more stress on the controller's CPU than their previous solution on the same hardware (RAID) did.
 
I challenge you to provide an example where using ZFS on top of HW RAID would lead to data loss, and would not lead to data loss when using other mainstream *nix filesystems like EXT4.
Never claimed that. Check again what I wrote above...
 
Yes, correct. That's what I said, and what you can find on openzfs-dev web.

Now explain where and when I said: "...and would not lead to data loss when using other mainstream *nix filesystems like EXT4."
 
Eh..
You are correct: you (deliberately?) left that important part out and misled user(s) into thinking that using ZFS on top of HW RAID could lead to data loss. You should have stated that using ZFS on top of HW RAID is safer than most other mainstream file systems, instead of fearmongering.
I had to clarify that so people know that using ZFS on top of HW RAID is a perfectly acceptable option if one cannot use the disks directly, which would be the preferred method.
Even I do it sometimes, because I like the ZFS syncing and snapshotting functions, which, coincidentally, I use to guard against possible data loss for whatever reason (HW RAID fault, hacking, etc.).

We could continue to argue about semantics, but the fact, written in bold in this post above, remains.
 
OP did not say anything about other filesystems. Even you said nothing about "other mainstream file systems" in your first post in this thread. So, quite logically, I never compared ZFS with any other filesystem and could not have left anything important out. I just commented on the fact that OP is using ZFS on HW RAID.

It is you who (deliberately?) turned the "ZFS & HW RAID" topic into "ZFS vs. other file systems". If you want to discuss that, open a new thread instead of hijacking this one, because this one is not about it.

You claim:
using ZFS on top of HW RAID, is perfectly acceptable option

On the other hand, the ZFS devs say:
Hardware RAID controllers should not be used with ZFS.
With all due respect, I put more trust in what the ZFS devs have to say about it.
 
They also say:
"While ZFS will likely be more reliable than other filesystems on Hardware RAID, it will not be as reliable as it would be on its own."
Therefore:
using ZFS on top of HW RAID, is perfectly acceptable option.
 
Alright, coming back to the main topic about the sudden server stalls:

I will update the server to the latest and greatest version via apt-get dist-upgrade and reboot to the new kernel version on the weekend. Also I will check if there is a BIOS update available for that mainboard.

I have checked the memory pressure; usage is around 90%. In January I limited the ARC maximum size (arc_size_max) to 64GB, but that setting was discarded at the next reboot, and then there was another crash. So I can't yet tell whether adjusting this setting helps to solve the problem.
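To see how much of that memory is actually held by the ARC, and which limit is currently in effect, something like the following sketch can help. It assumes ZFS on Linux exposes its statistics in /proc/spl/kstat/zfs/arcstats with three columns (name, type, value); the helper name `arc_gib` is invented for this example.

```shell
#!/bin/sh
# Print one arcstats field in GiB.
# $1 = field name (e.g. size, c_max), $2 = path to the stats file
arc_gib() {
    awk -v f="$1" '$1 == f { printf "%.1f\n", $3 / 1073741824 }' "$2"
}

STATS=/proc/spl/kstat/zfs/arcstats
# Only report when the ZFS module is loaded and the file is readable.
if [ -r "$STATS" ]; then
    echo "ARC size:  $(arc_gib size "$STATS") GiB"
    echo "ARC limit: $(arc_gib c_max "$STATS") GiB"
fi
```

If the reported limit is not the 64GB that was configured, the setting did not survive the reboot.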
 
I have checked the memory pressure and it's around 90%. In January I have limited the arc_size_max to 64GB, but that was discarded after the next reboot and then there was another crash, so I don't think adjusting this setting helps solving this problem.
Hi,
why was your arc_size_max discarded after the reboot?
If you set it in /etc/modprobe.d/zfs.conf, it is applied at boot (6 GB minimum and 8 GB maximum in this example):
Code:
cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_min=6442450944
options zfs zfs_arc_max=8589934592
Udo
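The values in zfs.conf are plain byte counts. As a small sketch, the lines for another limit, such as the 64GB mentioned earlier in the thread, could be generated like this. The helper names `gib_to_bytes` and `arc_conf` are made up for illustration, and the 6 GiB minimum is simply carried over from the example above.

```shell
#!/bin/sh
# Convert GiB to bytes, the unit expected by zfs_arc_min / zfs_arc_max.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the two option lines for a given minimum and maximum (in GiB).
arc_conf() {
    printf 'options zfs zfs_arc_min=%s\n' "$(gib_to_bytes "$1")"
    printf 'options zfs zfs_arc_max=%s\n' "$(gib_to_bytes "$2")"
}

# Example: 6 GiB minimum, 64 GiB maximum.
arc_conf 6 64
```

The output would go into /etc/modprobe.d/zfs.conf as root. A value written only to /sys/module/zfs/parameters/zfs_arc_max at runtime is lost at the next reboot, which may be what happened here. If the root filesystem itself is on ZFS, the initramfs usually has to be refreshed (update-initramfs -u) before the new limit takes effect at boot.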
 
Hi guys, I have updated Proxmox via apt-get dist-upgrade and updated the BIOS to the latest version from this year. I will now see how this setup behaves. If it still stalls after a month or so, I will try reducing the ARC size even further; half of the RAM should be okay for now. Thanks for your help so far, I will let you know how it goes.
 
