ZFS on Proxmox could be better

BloodyIron
Renowned Member
Hi Folks,

First, I think it's awesome that ZFS has been in Proxmox for a while now. I love ZFS and really think it's one of the best filesystems out there. That being said, I've recently run into some important shortcomings in its implementation in Proxmox, and I really hope my concerns can get addressed in future development.


First, no alerts in the webGUI.

When I test-yanked a disk out of a ZFS mirror that is the OS pool (and local storage), the webGUI showed literally ZERO alerts or any indication that a disk was missing. This was very alarming; as an admin, I would have to go and manually check to discover that such an event had occurred. We really need alerts for ZFS problem scenarios. Even if the disk is okay but disconnected, this kind of thing _NEEDS_ to be presented as an alarm in the webGUI, by email, or both.
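For anyone else wanting to do that manual check, this is what I mean, from a shell on the host (rpool being the installer's default pool name):

Code:
# Prints "all pools are healthy" or lists only the pools with problems:
zpool status -x

# Full detail for one pool, showing exactly which device is missing or degraded:
zpool status -v rpool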

Second, no hot spare options.

When making zpools, be it for booting the OS or for other local storage, having hot spares would help. I've found no installer or GUI way to set up spares, and I think adding one would improve the quality of life for ZFS usage in the Proxmox VE ecosystem.

Third, no live action for re-insertion of disks.

In my mirror-yanking scenario, I plugged the disk back in, and... nothing. zpool status showed that the pool did not see the disk as reconnected. The Linux OS did present the /dev/ device, yet ZFS had not reconnected the disk to the pool, despite it being a member. This is a problem for intermittent interfaces, or if you accidentally yank a disk and put it straight back in. My "solution" was rebooting the Proxmox host. This _should not_ be the solution. My expectation is that ZFS should see the device as present and put it back in the pool without user interaction.

Fourth, no webGUI tools for interacting with ZFS zpools.

This part is a pretty big deal. Putting aside the lack of alerts, being unable to replace a disk in the webGUI makes ZFS unattractive to admins, and it is also potentially problematic. A webGUI method would not only make ZFS convenient to administer, it would also ensure the replacement disk is correctly partitioned. I know why the partitioning matters, but many other admins may not. Furthermore, webGUI functions for checking the status of pools would be handy for a quick health check. These tools should also cover things like adding L2ARC/ZIL devices, hot spares, etc. I don't think this _necessarily_ needs to be as robust as, say, FreeNAS' toolset, but it should at least be capable of replacing disks.

Fifth, no scrubbing, snapshot, or other scheduled task abilities at present.

Scrubbing a ZFS zpool is important, even if it's only done once a month, because a scrub can catch data corruption that normal activity never touches. Over a long period, not scrubbing your pool can have compounding effects. At the bare minimum, the webGUI should have a way to schedule scrubs and view the results of the most recent one. If the webGUI could later be extended to do snazzy snapshotting stuff, I think that would make ZFS super awesome in the Proxmox VE environment (snapshots would rely on cron or other scheduled-task machinery).


Despite these areas that I think would really benefit from dev love, I like how far ZFS has come in Proxmox VE. It's dead simple to set up in the installer, and I don't see any immediate failures in it. However, the current implementation leaves me wanting when it comes to dealing with failure scenarios. I don't think most of what I'm asking for here is too much, and I think a fair amount of it would be generally appreciated by the Proxmox VE community.

That being said, I would love to hear your thoughts. If there's anything more I can do to help with this development, apart from writing it myself, please let me know. I work with ZFS as part of my business, so I've studied it heavily.

Thanks peeps!
 
BloodyIron said:
First, no alerts in the webGUI. [...]

There are no alerts for any other degraded states for any other storage either - except for the Ceph dashboard, which is a pretty recent addition to our interface. While I agree this would be nice, it is not something that just happens automatically overnight, but a lot of work. For mail notifications, you can already use zfs-zed, which comes with some premade scripts for various failure scenarios.
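For reference, a minimal sketch of wiring up those mail notifications - zed reads /etc/zfs/zed.d/zed.rc, and the address below is just a placeholder:

Code:
# /etc/zfs/zed.d/zed.rc -- set a recipient for event mails:
ZED_EMAIL_ADDR="admin@example.com"

# Optional: also send mail for events on healthy pools (e.g. a clean scrub finishing):
ZED_NOTIFY_VERBOSE=1

# Then restart the daemon so it rereads the file:
systemctl restart zfs-zed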

BloodyIron said:
Second, no hot spare options. [...]

The installer only offers those options which cannot be easily changed after the initial setup; this is intentional. If you want to set up hot spares or other special features, you will need to roll up your sleeves and do it yourself ;) A certain level of proficiency is a prerequisite for administering systems.
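For reference, a rough sketch of adding a spare by hand (the disk path is a placeholder, and rpool is just the installer's default pool name):

Code:
# Attach an unused disk to the pool as a hot spare
# (prefer stable /dev/disk/by-id paths over /dev/sdX names):
zpool add rpool spare /dev/disk/by-id/ata-EXAMPLE_SERIAL

# Verify: the disk now shows up under a "spares" section.
zpool status rpool

Note that actually pulling in the spare when a member faults is handled by zed's agents, so setting up zed as described above is worth doing anyway.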

BloodyIron said:
Third, no live action for re-insertion of disks. [...]

This should work automatically - but I have to admit I haven't tested it in a while. If I can reproduce it, I will see where the issue is.
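Until that is tracked down, a less drastic workaround than rebooting should be to bring the device back by hand (the device path is a placeholder):

Code:
# Tell the pool the device is available again; ZFS resilvers it if needed:
zpool online rpool /dev/disk/by-id/ata-EXAMPLE_SERIAL

# Once the pool is healthy again, clear the old error counters:
zpool clear rpool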

BloodyIron said:
Fourth, no webGUI tools for interacting with ZFS zpools. [...]

Would be nice, but it is a lot of work and thus has not been implemented yet. It is also not very easy to safely handle all the potential custom configurations that users have.
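For the disk-replacement case specifically, the manual procedure on a bootable mirror looks roughly like this sketch (device names and the partition number are placeholders; copying the partition table only matters for pools the system boots from):

Code:
# Copy the partition table from the surviving mirror member to the new disk,
# then randomize the GUIDs so the two disks don't clash:
sgdisk /dev/sdHEALTHY -R /dev/sdNEW
sgdisk -G /dev/sdNEW

# Replace the failed member (name it as shown by "zpool status") with the
# matching ZFS partition on the new disk; the partition number depends on your layout:
zpool replace -f rpool /dev/sdFAILED2 /dev/sdNEW2

# For a boot pool, also reinstall the bootloader on the new disk afterwards.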

BloodyIron said:
Fifth, no scrubbing, snapshot, or other scheduled task abilities at present. [...]

Something like a snapshot scheduler would be nice; maybe it will be possible to add that during the 5.x release cycle, once the storage replication patches are finalized and have settled in a bit. Regular scrubbing is already configured via a cron job by the ZFS packages:

Code:
/etc/cron.d/zfsutils-linux
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Scrub the second Sunday of every month.
24 0 8-14 * * root [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ] && /usr/lib/zfs-linux/scrub

This is a conffile, so if you want to change the interval you can just edit it, and it won't be overwritten on upgrades unless you explicitly allow that.
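And until there is GUI support, a home-grown snapshot cron job can fill the gap. A minimal sketch - the file name, dataset and schedule are all placeholders, and note the escaped % signs, which cron requires (as in the scrub job above):

Code:
# /etc/cron.d/zfs-autosnap -- nightly recursive snapshot of the VM dataset at 01:00.
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

0 1 * * * root zfs snapshot -r rpool/data@auto-$(date +\%Y-\%m-\%d)

Nothing in that sketch prunes old snapshots; they have to be destroyed periodically with zfs destroy, either by hand or via a second job.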

BloodyIron said:
That being said, I would love to hear your thoughts. If there's anything more I can do to help with this development, apart from writing it myself, please let me know. [...]

Feedback is always appreciated, but keep in mind that dev time is limited, so big changes need time ;) Also, ZFS is not the only storage we (have to) support.

The two big things that I think would be nice to have are:
  • snapshot scheduler (this is not ZFS specific ;))
  • some sort of minimal ZFS management interface, similar to pveceph and the Ceph dashboard (I don't think supporting all the possible scenarios makes sense, but things like replacing a faulty disk with a new blank one, triggering a manual scrub, creating a new zpool from blank disks, or adding/removing log or cache devices would make sense - rough CLI equivalents are sketched below).
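For context, a sketch of the CLI equivalents of those operations (pool and device names are placeholders):

Code:
# Trigger a manual scrub:
zpool scrub rpool

# Create a new mirrored pool from two blank disks:
zpool create tank mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B

# Add a log (ZIL) or cache (L2ARC) device; both kinds can be removed again:
zpool add tank log /dev/disk/by-id/nvme-LOG_DEV
zpool add tank cache /dev/disk/by-id/nvme-CACHE_DEV
zpool remove tank /dev/disk/by-id/nvme-LOG_DEV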
 
Just to be clear, I was posting this to pass along my experience. I "expect" it to be "done"... when it's done ;)

As in, I understand that these things take time, and exactly when they land isn't dire to me. Hopefully this info can help guide future development, whenever that happens. :)
 
