[SOLVED] Is Prometheus an optimal option to monitor Proxmox?

PythonTrader

New Member
Sep 25, 2023
I have a Prometheus + Grafana setup in my home lab that is working well.

I have a Proxmox cluster that consists of 3 nodes, and I'd like to monitor it.

I notice that Proxmox has no built-in support for Prometheus:

[Screenshot: Datacenter → Metric Server options in the Proxmox web UI]


Considering that I already have a working Prometheus instance, should I use it for Proxmox, or use whatever monitoring Proxmox comes with?
 
I have the same setup and am also looking for a good solution.

There is a Prometheus exporter (https://github.com/prometheus-pve/prometheus-pve-exporter), but this does not feel right.

There are already some existing dashboards for Proxmox and InfluxDB available (https://grafana.com/grafana/dashboards/?dataSource=influxdb&search=proxmox), so I am exploring the InfluxDB solution.

Initial testing looks good (low IO, some working dashboards), but now I need to "understand" the influxdb-server config and make it secure.


One warning for others: I used an existing "graphite" VM for one week for my 6 Proxmox nodes. During that week the graphite VM was suffering from high IO.
Additionally, when the graphite VM was down, the Proxmox web UI became unresponsive (showing only '?' and no VM names anymore). Starting the graphite VM or disabling the graphite metrics resolved the problem.
 


Thank you for sharing your experience.

It is possible to run an instance of InfluxDB in a Docker Compose stack, so it does not need much setup; we just need to provide volumes for storage and config.
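
Roughly something like this, as a minimal sketch (the org/bucket/token values, port mapping and volume paths are placeholders, not anything from this thread):

YAML:
services:
  influxdb:
    image: influxdb:2.7
    container_name: influxdb
    restart: unless-stopped
    ports:
      - "8086:8086"
    environment:
      # first-run bootstrap of the official InfluxDB 2.x image
      - DOCKER_INFLUXDB_INIT_MODE=setup
      - DOCKER_INFLUXDB_INIT_USERNAME=admin
      - DOCKER_INFLUXDB_INIT_PASSWORD=changeme
      - DOCKER_INFLUXDB_INIT_ORG=homelab
      - DOCKER_INFLUXDB_INIT_BUCKET=proxmox
      - DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=changeme-token
    volumes:
      # the two volumes mentioned above: storage and config
      - ./influxdb/data:/var/lib/influxdb2
      - ./influxdb/config:/etc/influxdb2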


I was hoping not to add yet another storage backend to what I already have (Loki and Prometheus).
 

BTW, agree that prometheus-pve-exporter doesn't feel right.

I learned that it can run on a separate machine, but I'm not sure about the repercussions.
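
From what I understand, running it elsewhere just means Prometheus scrapes the exporter host while the exporter talks to the PVE API. A rough sketch following the exporter's README pattern (hostnames, the API token, and the exporter address are placeholders):

YAML:
# pve.yml on the exporter host (API token auth; values are placeholders)
default:
  user: prometheus@pve
  token_name: monitoring
  token_value: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  verify_ssl: false

# prometheus.yml scrape job: the target is the PVE node to query,
# the relabeling rewrites __address__ to wherever the exporter runs
scrape_configs:
  - job_name: pve
    metrics_path: /pve
    params:
      module: [default]
    static_configs:
      - targets: ['pve-node1:8006']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: exporter-host:9221   # default pve-exporter port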
 
I was searching for the same thing and found this thread.

Went and looked at pve-exporter - it's fine, you can deploy it as a container easily, but you're making API calls you don't need since PVE already has metric export built in, and a full scrape is multiple API calls. That seems a bit unnecessary when you can push from PVE.

Then I found this: https://github.com/prometheus/graphite_exporter

Honestly, for a true-blue big-cash-money production environment you're probably best off with InfluxDB, but if you absolutely must squeeze your Proxmox stats onto the rest of your Prometheus/Mimir setup, it works fine; you just need to set up some pattern matching/regex to add labels to the data.

This is what I'm running. I have a very minimal setup: 1 node, not in a cluster, and it only runs VMs, no LXCs:

YAML:
mappings:
  # Matches Nodes.NICs
  - match: 'proxmox\.nodes\.([^\.]+)\.nics\.(.+)\.(receive|transmit)'
    match_type: regex
    name: 'proxmox_nodes_nics_${3}'
    labels:
      node: ${1}
      nic: ${2}
  # Node Uptime
  - match: proxmox.nodes.*.uptime
    name: proxmox_nodes_uptime
    labels:
      node: $1
  # All other node stats
  - match: 'proxmox.nodes.*.*.*'
    name: 'proxmox_nodes_${2}_${3}'
    labels:
      node: $1
  # Cluster storage
  - match: 'proxmox.storages.*.*.*'
    name: 'proxmox_storages_${3}'
    labels:
      node: $1
      id: $2
  # I don't need a stat called vmid with the vmid in the name
  # and the vmid in the value
  - match: proxmox.qemu.*.vmid
    action: drop
    name: "dropped"
  # VM block device
  - match: 'proxmox\.qemu\.([0-9]+)\.blockstat\.([^\.]+)\.(.*)'
    match_type: regex
    name: 'proxmox_qemu_blockstat_${3}'
    labels:
      vmid: ${1}
      device: ${2}
  # VM NICs
  - match: 'proxmox\.qemu\.([0-9]+)\.nics\.([^\.]+)\.(.*)'
    match_type: regex
    name: 'proxmox_qemu_nics_${3}'
    labels:
      vmid: ${1}
      nic: ${2}
  # VM Support
  - match: 'proxmox\.qemu\.([0-9]+)\.proxmox-support\.(.*)'
    match_type: regex
    name: 'proxmox_qemu_support_${2}'
    labels:
      vmid: ${1}
  # All other VM stats
  - match: 'proxmox\.qemu\.([0-9]+)\.(.*)'
    match_type: regex
    name: 'proxmox_qemu_${2}'
    labels:
      vmid: ${1}

My queries are very simple, so I keep labeling to a bare minimum. I could probably drop the node labels altogether, since I don't cluster.

That's my config.yaml and I deploy it with compose:

YAML:
services:

  graphite_exporter:
    image: prom/graphite-exporter
    container_name: graphite_exporter
    restart: unless-stopped
    networks:
      - external
    ports:
      # /metrics endpoint for debugging
      #- "${PUBLIC_IP}:9108:9108"
      # Graphite receiver
      - "${PUBLIC_IP}:9109:9109/udp"
    environment:
      - "TS=${TIMEZONE}"
    command:
      - --graphite.mapping-config=/etc/prometheus/mapping-config.yaml
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - "${STACK_PATH}/config.yaml:/etc/prometheus/mapping-config.yaml:ro"

networks:
  # Shared with Alloy
  external:
    external: true

I scrape it with Alloy, add a label with the deployment name to all metrics, then push it to local Mimir and Grafana Cloud as its own tenant.

The metrics look something like this, taken from Alloy's debug view:

JSON:
{__name__="proxmox_storages_total", id="iso-storage", instance="graphite_exporter:9108", job="prometheus.scrape.graphite", node="ms-01"}
{__name__="proxmox_qemu_nics_netout", instance="graphite_exporter:9108", job="prometheus.scrape.graphite", nic="tap102i0", vmid="102"}
{__name__="proxmox_qemu_maxmem", instance="graphite_exporter:9108", job="prometheus.scrape.graphite", vmid="101"}
{__name__="proxmox_qemu_blockstat_flush_operations", device="scsi0", instance="graphite_exporter:9108", job="prometheus.scrape.graphite", vmid="102"}
{__name__="proxmox_nodes_nics_transmit", instance="graphite_exporter:9108", job="prometheus.scrape.graphite", nic="vmbr1v4", node="ms-01"}

It's a little more laborious than using a prebuilt dashboard and Influx, or pve-exporter, but if you absolutely want to save on deploying Yet Another Time-Series Database, it seems like the way to go.
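
If you're not running Alloy, the same scrape + static deployment label + remote_write can be approximated with plain Prometheus. A rough sketch (the Mimir URL, tenant header, and label value are placeholders):

YAML:
scrape_configs:
  - job_name: graphite
    static_configs:
      - targets: ['graphite_exporter:9108']
        labels:
          deployment: homelab            # same idea as the label Alloy adds
remote_write:
  - url: http://mimir:9009/api/v1/push   # local Mimir push endpoint (placeholder)
    headers:
      X-Scope-OrgID: proxmox             # tenant ID, if multi-tenancy is enabled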