Preview, Feedback wanted - Cluster Dashboard

dcsapak · Oct 19, 2016

Hi all,

i am currently working on a cluster dashboard, which shows some cluster-wide information.
Here is a Screenshot of the current state:

i already posted a slightly earlier version to the user and devel mailing lists, and already got some good feedback (for example, i am thinking about how to integrate a ceph status here, but this has some difficulties).

what do you think? something important missing? too much?

i look forward to your comments

dmora · Oct 19, 2016

dcsapak said:
Hi all,

i am currently working on a cluster dashboard, which shows some cluster-wide information.
Here is a Screenshot of the current state:

i already posted a slightly earlier version to the user and devel mailing lists, and already got some good feedback (for example, i am thinking about how to integrate a ceph status here, but this has some difficulties).

what do you think? something important missing? too much?

i look forward to your comments

I'd include node communication stats (latency between nodes). I'll be running a 35+ node cluster, and from what I've seen so far, the only way to tell whether the cluster is in a healthy state is cephcm status. I'd also include the services that are critical to cluster health, such as whether corosync is running, plus disk health such as the OS ZFS RAID array or SMART data, etc. Basic stuff.
A notification or warning section that tells you a node is acting poorly. When you're managing 35+ nodes, having to sift through each one's logs is terrible.

Oh, IOPS would be nice

I'm sure I'll think of more.

dcsapak · Oct 20, 2016

dmora said:
I'd include node communication stats (latency between nodes). I'll be running a 35+ node cluster, and from what I've seen so far, the only way to tell whether the cluster is in a healthy state is cephcm status. I'd also include the services that are critical to cluster health, such as whether corosync is running

I guess you mean pvecm status?
Since we do not collect data such as latency between nodes, i currently cannot display it (but i'll think about whether it is worth integrating)

also, the cluster output on top comes from the same data as pvecm status, so this is already there
(when corosync is dead on one node, it cannot communicate with the cluster anyway, so it already displays as offline)

showing the services does not make sense here, because it would mean n api calls (one per node), each of which may take up to 30 seconds if the node is not reachable, which makes this impractical
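
To put rough numbers on the cost described above, here is a minimal sketch. The function name and the 50 ms latency for a healthy node are illustrative assumptions; only the 30-second worst case per unreachable node comes from the post. It is back-of-the-envelope arithmetic, not a description of how the Proxmox API server actually schedules its calls.

```python
def worst_case_poll_seconds(node_count, unreachable, timeout_s=30.0,
                            ok_latency_s=0.05, sequential=True):
    """Estimate the wall-clock time needed to query every node once.

    node_count   -- total nodes in the cluster
    unreachable  -- nodes that never answer and run into the timeout
    timeout_s    -- per-call timeout (30 s is the figure from the post)
    ok_latency_s -- assumed latency of a healthy node (hypothetical value)
    sequential   -- True: calls run one after another; False: all at once
    """
    if not 0 <= unreachable <= node_count:
        raise ValueError("unreachable must be between 0 and node_count")
    reachable = node_count - unreachable
    if sequential:
        # Every call is waited for in turn, so timeouts add up.
        return reachable * ok_latency_s + unreachable * timeout_s
    # In parallel the slowest call determines the total.
    if unreachable:
        return timeout_s
    return ok_latency_s if reachable else 0.0


if __name__ == "__main__":
    print(worst_case_poll_seconds(35, 2))                    # sequential sweep
    print(worst_case_poll_seconds(35, 2, sequential=False))  # parallel sweep
```

With 35 nodes and two of them unreachable, a sequential sweep spends at least 60 seconds waiting on timeouts, and even a fully parallel sweep would leave the dashboard incomplete for up to 30 seconds.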

dmora said:
disk health such as the OS ZFS RAID array or SMART data, etc. Basic stuff.

i guess this could be done, but would be non-trivial to add (we would have to periodically collect these into the cluster filesystem), so i'll keep this on my todo list for later

dmora said:
A notification or warning section that tells you a node is acting poorly. When you're managing 35+ nodes, having to sift through each one's logs is terrible.

i agree that there should be an "important messages" area, but i think it is better kept separate from the dashboard.
also, under what circumstances would you consider a node to be "acting poorly"? (this is very vague.) a failing communication with the cluster is already shown (in the tree, for instance)

dmora said:
Oh, IOPS would be nice

Ok, but iops of what? all vms? the local storage? shared storage? combined?

i do not think a single "IOPS: YYY" value would be very helpful?

hybrid512 · Oct 20, 2016

Definitely love it!
It would need some more information like Ceph status, cluster load, ... but this is definitely the way to go!
Proxmox is great, but it still lacks a more abstract view: it is still too low level, and it is hard to get a global view of the cluster state.
One related thing that is also missing is some sort of capacity planning: being able to move or provision VMs automatically (or not) based on the resources available on each node. When you have more than 10 nodes, it becomes hard to pick the best node to receive a moved or newly created VM. It is still too manual; a little automation would be welcome.

hybrid512 · Oct 20, 2016

For example, when you want to take a node out for maintenance, you need to migrate its VMs to the remaining nodes. Right now, you have to either move them manually one by one or move all of them to a single target node, which is sometimes not possible.
I try to keep my nodes' memory usage under 60% and try not to overload CPU resources either ... but when you have around 10 running VMs on a node that is already at 50~60% memory, you can't just move them all to another node that is at 40% memory usage; they won't fit.
Being able to dispatch your VMs automatically to nodes with available resources would be just great!
That's what I meant by "Capacity Planning", but I understand I'm a bit off topic right now ... sorry.
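
The kind of automatic dispatch described above could be sketched as a simple greedy placement. This is a minimal illustration only: `plan_evacuation`, the node and VM names, and the data layout are all hypothetical, memory is the only resource considered, and the 60% cap is taken from the post. It is not an existing Proxmox feature.

```python
def plan_evacuation(vms, nodes, max_mem_fraction=0.6):
    """Assign each VM to a target node without exceeding a memory cap.

    vms   -- dict: VM name -> memory needed (same unit as the node values)
    nodes -- dict: node name -> (memory used, memory total)
    Returns a dict VM name -> node name.
    Raises ValueError if some VM does not fit on any node.
    """
    used = {name: u for name, (u, _) in nodes.items()}
    total = {name: t for name, (_, t) in nodes.items()}
    plan = {}
    # Place the largest VMs first; they are the hardest to fit.
    for vm, mem in sorted(vms.items(), key=lambda kv: kv[1], reverse=True):
        candidates = [n for n in nodes
                      if used[n] + mem <= max_mem_fraction * total[n]]
        if not candidates:
            raise ValueError(f"no node can take {vm} under the memory cap")
        # Pick the node that stays least loaded after receiving the VM.
        target = min(candidates, key=lambda n: (used[n] + mem) / total[n])
        used[target] += mem
        plan[vm] = target
    return plan


if __name__ == "__main__":
    remaining = {"pve2": (40, 100), "pve3": (30, 100)}  # GiB used, GiB total
    to_move = {"vm101": 16, "vm102": 8, "vm103": 8}     # GiB per VM
    print(plan_evacuation(to_move, remaining))
```

In this example the VMs are spread over both remaining nodes instead of being sent to a single target, which is exactly the case where a one-target "move everything" operation would not fit.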

udo · Oct 20, 2016

dcsapak said:
Hi all,

i am currently working on a cluster dashboard, which shows some cluster-wide information.
Here is a Screenshot of the current state:

Hi,
which storages does the cluster-wide storage figure include? All storages together? Local/shared/distributed?

This could be difficult if some storages are only available on some nodes...

Udo

dmora · Oct 20, 2016

Ok, but iops of what? all vms? the local storage? shared storage? combined?

i do not think a single "IOPS: YYY" value would be very helpful?

Make options for all of them that you can toggle via a setting. This is great for capacity planning large clusters. For example, on our current KVM environment all hosts are generating a total of ~2k write IOPS across all local disks. This helps me capacity plan my Ceph storage cluster. It doesn't sound like Proxmox has seen any big deployments... everyone here seems to only run 3-4 boxes. I'm going to run 30+. We'll see if this system is ready for real production. If not... well, I'll be sure to let you know.

dmora · Oct 20, 2016

hybrid512 said:
Definitely love it!
It would need some more information like Ceph status, cluster load, ... but this is definitely the way to go!
Proxmox is great, but it still lacks a more abstract view: it is still too low level, and it is hard to get a global view of the cluster state.
One related thing that is also missing is some sort of capacity planning: being able to move or provision VMs automatically (or not) based on the resources available on each node. When you have more than 10 nodes, it becomes hard to pick the best node to receive a moved or newly created VM. It is still too manual; a little automation would be welcome.

v4 gives you a move-all option.

But yes, I agree it lacks visibility at the cluster level. Luckily you can get around this by running Zabbix, Zenoss, or any other monitoring platform.

alexskysilk · Oct 20, 2016

I like having the cluster-wide utilization graphs, but in addition it would be useful to see any nodes that are individually tripping resource thresholds (this is most relevant to CPU or RAM utilization). This is important as long as we don't have DRS-like functionality.
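
A per-node threshold check of this kind could look like the sketch below. The function name, the 80% limits, and the field names (`cpu` as a 0..1 fraction, `mem` and `maxmem` in bytes) are assumptions made for illustration; the thread does not say which data format the dashboard uses.

```python
def nodes_over_threshold(nodes, cpu_limit=0.8, mem_limit=0.8):
    """Return the nodes whose CPU or memory usage exceeds a limit.

    nodes -- dict: node name -> {"cpu": fraction 0..1,
                                 "mem": bytes used, "maxmem": bytes total}
    Returns a list of (node name, [tripped resources]) sorted by name.
    """
    flagged = []
    for name, stats in sorted(nodes.items()):
        reasons = []
        if stats["cpu"] > cpu_limit:
            reasons.append("cpu")
        if stats["mem"] / stats["maxmem"] > mem_limit:
            reasons.append("mem")
        if reasons:
            flagged.append((name, reasons))
    return flagged


if __name__ == "__main__":
    GIB = 1024 ** 3
    cluster = {
        "pve1": {"cpu": 0.35, "mem": 20 * GIB, "maxmem": 64 * GIB},
        "pve2": {"cpu": 0.92, "mem": 30 * GIB, "maxmem": 64 * GIB},
        "pve3": {"cpu": 0.10, "mem": 60 * GIB, "maxmem": 64 * GIB},
    }
    for node, reasons in nodes_over_threshold(cluster):
        print(node, "over limit:", ", ".join(reasons))
```

A cluster-wide average would hide both pve2 and pve3 here, which is why a per-node check is useful alongside the combined graphs.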

dcsapak · Oct 24, 2016

hi,

i have worked on this a bit and have some improvements to show

for now, i do not think i will incorporate iops in the cluster dashboard
(one reason is that currently we do not even measure them, so we cannot display them)

i added the ceph status of the node you connected to with the webgui

also, to identify individual resource limits of nodes, i added cpu and memory usage columns with colored progress bars to the node grid. of course they are sortable, so you can easily identify which nodes have cpu/memory to spare (without clicking the individual nodes).

the panel which holds the nodes can be resized (currently with the up/down buttons in the panel header)

udo said:
which storages does the cluster-wide storage figure include? All storages together? Local/shared/distributed?

this includes all storages combined (i know this is not really optimal, especially in heterogeneous environments. do you have a suggestion to improve this? i really do not want to add many buttons/comboboxes/switches, as this clutters the ui and makes it harder to use for the basic use case)

edit: forgot the screenshot

Ashley · Oct 24, 2016

Looks great. Re the storage count:

Maybe have just a small option drop-down menu to select or deselect the storages to be counted, allowing people with shared storage to select only the shared storage instance to monitor. Obviously with saved state, so it does not require changing on each load.

That would save having extra UI elements and having to switch to the preferred setup on each load.

udo · Oct 24, 2016

dcsapak said:
hi,

...
this includes all storages combined (i know this is not really optimal, especially in heterogeneous environments. do you have a suggestion to improve this? i really do not want to add many buttons/comboboxes/switches, as this clutters the ui and makes it harder to use for the basic use case)

Hi,
perhaps use a subtitle with the storage name and cycle every 3-5 sec through all defined storages on the cluster?
What to do with local storage? Take the storage from the node with the highest fill level?

Udo

dcsapak · Oct 24, 2016

Ashley said:
Maybe have just a small option drop-down menu to select or deselect the storages to be counted, allowing people with shared storage to select only the shared storage instance to monitor. Obviously with saved state, so it does not require changing on each load.

That would save having extra UI elements and having to switch to the preferred setup on each load.

i also have (already sent) patches for a "client settings" area, where you can reset the local storage of the browser for the gui,
this enables us to save all column width/order etc.

i can imagine that we put such a selection in that client area

udo said:
Hi,
perhaps use a subtitle with the storage name and cycle every 3-5 sec through all defined storages on the cluster?
What to do with local storage? Take the storage from the node with the highest fill level?

no, i count all local storages... mhmm, maybe we really need a way to select them ...
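
The selection idea discussed above could be sketched roughly as follows. Everything here is a hypothetical illustration: the function name, the record layout (`storage`, `node`, `shared`, `used`, `total`), and in particular the choice to count a shared storage only once even though every node reports it. The thread does not say how the dashboard actually handles shared storages.

```python
def combined_usage(storages, selected=None):
    """Sum used and total space over per-node storage records.

    storages -- list of dicts with keys "storage", "node", "shared",
                "used", "total"
    selected -- optional set of storage names to include (None = all)
    Returns (used, total).
    """
    seen_shared = set()
    used = total = 0
    for entry in storages:
        name = entry["storage"]
        if selected is not None and name not in selected:
            continue
        if entry["shared"]:
            # Assumption: a shared storage is the same pool on every node,
            # so only its first record is counted.
            if name in seen_shared:
                continue
            seen_shared.add(name)
        used += entry["used"]
        total += entry["total"]
    return used, total


if __name__ == "__main__":
    records = [
        {"storage": "local", "node": "pve1", "shared": False, "used": 10, "total": 100},
        {"storage": "local", "node": "pve2", "shared": False, "used": 20, "total": 100},
        {"storage": "ceph", "node": "pve1", "shared": True, "used": 300, "total": 1000},
        {"storage": "ceph", "node": "pve2", "shared": True, "used": 300, "total": 1000},
    ]
    print(combined_usage(records))                     # everything combined
    print(combined_usage(records, selected={"ceph"}))  # shared storage only
```

A saved selection like `{"ceph"}` is the kind of value that could live in the "client settings" area mentioned above, so the gauge shows the preferred storages on each load.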

shantanu · Oct 25, 2016

Let me just say that this idea is awesome.

I have been tracking cumulative CPU usage and running VMs with a script that logs into each machine and harvests information: CPU, disk, memory, etc.
A dashboard is most welcome.

I have been AFK for a few weeks, so I won't be able to suggest what *more* information could be useful here.
Other than that, great work.
Regards,
Shantanu Gadgil

marsian · Oct 25, 2016

Great idea! +1

Regarding "what to display": I'd like to see per-node information on the current PVE/software version, as well as whether patches are available, if possible distinguished as "regular maintenance" or "urgent security issue" (maybe by color, etc.). I'd also like some indication of errors or alerts on a node that need immediate attention (a little exclamation mark?), as well as the uptime of each node.

Besides that, I would rearrange the sorting in the vertical menu so "Summary" is always the first item instead of "Search".

Thanks!

riptide_wave · Oct 25, 2016

Looking good so far!

My only request is that the information stored on this page can be queried via API. This would make it easy to automate auto-balancing and scaling across an environment, as well as monitoring your cluster at a high level.

lilCDNnrg · Oct 25, 2016

Really like the dashboard. I normally have to switch back and forth between my nodes to see how each is running compared to the others and to make sure everything is okay on the VMs; this would give me a great snapshot of how everything is going really quickly. As for storage monitoring, I like the idea of configuring what is counted on the gauge, but what about also having a drop-down similar to the node details to show the details of each storage that is selected for that list?

aychprox · Oct 25, 2016

Looks nice!
It would be great if you could incorporate info like ceph-dash shows (Ceph cluster placement group status) too.

Just a small column or value is sufficient. Even though this can be monitored from the Proxmox GUI, we have relied a lot on ceph-dash to see the PG status, especially while rebalancing is taking place.

hybrid512 · Oct 25, 2016

I would personally remove the "folder view" and "storage view", which are pretty redundant and useless to me ... I have never used them, because "server view" and "pool view" are by far more useful.

valeech · Oct 25, 2016

+1 on marsian's suggestions. Show node software versions and updates. Also, show node uptime. It would be nice to see who in the cluster has been up and for how long. Can you add a field for Fence events? Somehow track how often a node gets fenced.

+1 on Ceph dashboard view too.
