CPU usage metrics in external metrics server not reported

effects_studio

New Member
Dec 26, 2019
I've followed the guide at
https://pve.proxmox.com/wiki/External_Metric_Server
and I can see overall metrics such as CPU usage:
Code:
SELECT mean("cpu") FROM "cpustat" GROUP BY time(1m) fill(null)
However, the "system" measurement seems incomplete.
E.g.

Code:
SELECT last("cpu") FROM "system" GROUP BY time(1m), "host" fill(null)
returns an empty result set.
Does metric reporting for guests need to be enabled somewhere?
 
Code:
> show field keys FROM system
name: system
fieldKey      fieldType
--------      ---------
active        float
avail         float
content       string
enabled       float
load1         float
load15        float
load5         float
n_cpus        integer
n_users       integer
shared        float
total         float
type          string
uptime        integer
uptime_format string
used          float
The cpu field seems to be missing entirely.
 
Maybe a red herring, but
Code:
root@pve:~# tcpdump -i vmbr0 udp port 8089
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmbr0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:51:54.645951 IP localladdr.44223 > metrics-srv.8089: UDP, bad length 4132 > 1472
17:51:54.653836 IP localladdr.55759 > metrics-srv.8089: UDP, bad length 5558 > 1472
17:51:54.691771 IP localladdr.44544 > metrics-srv.8089: UDP, bad length 2367 > 1472
17:51:54.720723 IP localladdr.43296 > metrics-srv.8089: UDP, length 1026

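It seems that metric submission disregards the network MTU: three of the four datagrams above exceed the 1472 bytes of UDP payload that fit in a 1500-byte frame, and thus those packets aren't getting through.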
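The 1472 in the capture is what remains of a 1500-byte MTU after 20 bytes of IPv4 header and 8 bytes of UDP header. A minimal sketch, in Python rather than the Perl that pvestatd is written in, of how a sender could group line-protocol lines into datagrams that stay under that limit. `batch_lines` and `MAX_PAYLOAD` are hypothetical names for illustration only, not Proxmox code, and the 1500-byte MTU is an assumption:

```python
MAX_PAYLOAD = 1472  # assumed: 1500-byte MTU - 20 (IPv4 header) - 8 (UDP header)


def batch_lines(lines, max_payload=MAX_PAYLOAD):
    """Group line-protocol lines into newline-joined payloads <= max_payload bytes."""
    batches = []
    current = b""
    for line in lines:
        data = line.encode("utf-8")
        if len(data) > max_payload:
            raise ValueError("single line exceeds the payload limit")
        # The +1 accounts for the newline separator between two lines.
        if current and len(current) + 1 + len(data) > max_payload:
            batches.append(current)  # flush the full batch, start a new one
            current = data
        elif current:
            current += b"\n" + data
        else:
            current = data
    if current:
        batches.append(current)
    return batches
```

Each returned payload could then be handed to a UDP socket as one datagram, so no datagram needs IP fragmentation on a standard Ethernet link.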
 
OK, the story continues: metrics are now sent in UDP packets, but the guest drops the packets because they have checksum errors.
 
More errors in this area
Code:
ts=2020-01-09T02:55:44.131034Z lvl=info msg="Failed to write point batch to database" log_id=0KE3J33G000 service=udp db_instance=proxmox error="partial write: field type conflict: input field \"uptime\" on measurement \"system\" is type float, already exists as type integer dropped=20"

It seems that Proxmox sends a metric value of `0.0` as `0` and thus creates the wrong implicit schema.
 
So I have a "solution" for this:
run "drop series from system" in influx repeatedly until you win the Russian roulette and the packet with a non-integer uptime arrives first.

This is a pretty horrible solution, and it should be easy to fix on the Proxmox side: make whatever submits the metrics not report `0.0` as `0`.
 
@effects_studio

Hi,
I wrote the original InfluxDB plugin, and I really didn't know about this.
Currently we simply push values from pvestatd to InfluxDB.

Looking at my system table, I indeed only have float values for the numeric fields:

Code:
> show field keys FROM system;
name: system
fieldKey    fieldType
--------    ---------
active      float
avail       float
balloon     float
balloon_min float
content     string
cpu         float
cpus        float
disk        float
diskread    float
diskwrite   float
enabled     float
freemem     float
lock        string
maxdisk     float
maxmem      float
mem         float
name        string
netin       float
netout      float
pid         float
qmpstatus   string
serial      float
shared      float
shares      float
status      string
template    float
total       float
type        string
uptime      float
used        float
vmid        float

I see that you have two integer values:

n_cpus integer
n_users integer

I don't see them in the Proxmox code. Are they custom values of yours?

I would like to know whether I can make a patch that treats all numeric values as floats.

It could be fixed in
/usr/share/perl5/PVE/Status/InfluxDB.pm

Code:
sub prepare_value {
    my ($value) = @_;

    if (looks_like_number($value)) {
        if (isnan($value) || isinf($value)) {
            # we cannot send influxdb NaN or Inf
            return undef;
        }

        # influxdb also accepts 1.0e+10, etc.
        $value = printf '%.02f',$value;    # add this to convert to float
        return $value;
    }
 
Also, what is your InfluxDB version?

I ask because I'm running 1.7.9, and I'm seeing a lot of 0 values in float fields.

(I would like to know whether you are already using InfluxDB 2.0, as I know that a lot of changes have occurred.)
 
@effects_studio

I just ran into this issue, and working from @spirit's comment I think I might be close to a solution.

The error we're seeing (the write fails because uptime arrives as a float but the field is an integer) comes from the fact that InfluxDB treats numeric values as floats by default. If you want an integer, you append an 'i' to the value. See the line protocol documentation.

As @spirit noted, pvestatd just takes the value and passes it along to InfluxDB. That means a value that looks like an integer (say 1234) is treated as a float by InfluxDB.

However, if you have other inputs submitting data, for instance Telegraf clients that also send system information, they could be sending values as integers with the 'i' suffix. If a shard receives the Telegraf integer uptime data first, the field type is set to integer and you will see the noted type-mismatch error: pvestatd sends a "float" and InfluxDB refuses to write that to an integer field.
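To make the typing rule concrete, here is a simplified Python model of it. This is not InfluxDB or Proxmox code: `field_type`, `write_point` and `FieldTypeConflict` are hypothetical names, the schema is a plain dict standing in for one shard, and edge cases of the real parser are ignored.

```python
BOOLEANS = {"t", "T", "true", "True", "TRUE", "f", "F", "false", "False", "FALSE"}


def field_type(raw):
    """Classify a raw line-protocol field value (simplified model)."""
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        return "string"
    if raw in BOOLEANS:
        return "boolean"
    if raw.endswith("i"):
        int(raw[:-1])  # raises ValueError if the digits are not an integer
        return "integer"
    float(raw)  # raises ValueError if the value is not numeric
    return "float"  # a bare number is a float, even if it looks like 1234


class FieldTypeConflict(Exception):
    pass


def write_point(schema, measurement, field, raw):
    """Fix the field's type on first write; reject later writes of another type."""
    new_type = field_type(raw)
    known = schema.setdefault((measurement, field), new_type)
    if known != new_type:
        raise FieldTypeConflict(
            f'input field "{field}" on measurement "{measurement}" '
            f"is type {new_type}, already exists as type {known}"
        )


schema = {}
write_point(schema, "system", "uptime", "1234i")  # Telegraf-style integer arrives first
try:
    write_point(schema, "system", "uptime", "1234")  # pvestatd-style bare number
except FieldTypeConflict as err:
    print(err)
```

Swapping the order of the two writes makes the integer write fail instead, which matches the "whoever writes first wins" behaviour described in this thread.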

My first hacky solution was to append the 'i' when a number looks like an integer with:

$value = "${value}i" if ($value =~ /^[0-9]+$/);

but of course that just shifts the issue to conflicts with other measurements. After that change I started receiving system data in InfluxDB/Grafana, but my blockstat measurements failed with the reverse issue and this error:

error="partial write: field type conflict: input field \"bavail\" on measurement \"blockstat\" is type integer, already exists as type float dropped=109"

So I'm still working out a fix. It seems we might need to add a conversion lookup table to match up measurement/field types.
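A rough sketch of that lookup-table idea, written in Python for illustration rather than as a patch to InfluxDB.pm. The table entries are assumptions taken only from the two errors quoted in this thread (`uptime` on `system` already exists as integer, `bavail` on `blockstat` already exists as float), and `FIELD_TYPES` and `format_value` are hypothetical names:

```python
import math

# (measurement, field) -> type the field already has in the database.
# Assumed entries, based only on the errors quoted in this thread.
FIELD_TYPES = {
    ("system", "uptime"): "integer",
    ("blockstat", "bavail"): "float",
}


def format_value(measurement, field, value, default="float"):
    """Render a numeric value for line protocol using the looked-up type."""
    number = float(value)
    if math.isnan(number) or math.isinf(number):
        return None  # NaN and Inf cannot be sent to InfluxDB
    wanted = FIELD_TYPES.get((measurement, field), default)
    if wanted == "integer":
        return f"{int(number)}i"  # truncates any fractional part
    return repr(number)  # always carries a decimal point or an exponent
```

With this table, `format_value("system", "uptime", "1234")` yields `1234i` while `format_value("blockstat", "bavail", 1234)` yields `1234.0`. The obvious drawback is that the table has to be kept in sync with whatever else writes to the same database.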

Hope this helps a bit.
 
I have been hitting this issue every few days or so.
Restarting the database used to fix it, which was annoying but worked, until it failed 3-4 times in a row, so I dug deeper and found this thread.

As @effects_studio said, if you drop the measurement enough times you can fluke it, but... ugh.

@spirit, your code changes the value to a float, but I had issues with all the values. I have never used Perl before, but "printf '%.02f'" seemed to be doing unexpected things (printf prints the formatted string and returns a success flag, whereas sprintf returns the string itself).

After dropping the measurements a bunch of times since, this seems to have fixed it for me:


Perl:
sub prepare_value {
    my ($value) = @_;

    if (looks_like_number($value)) {
        if (isnan($value) || isinf($value)) {
            # we cannot send influxdb NaN or Inf
            return undef;
        }

        # influxdb also accepts 1.0e+10, etc.
        $value = sprintf '%.2f',$value;    # add this to convert to float
        return $value;
    }
 
I tried with
Code:
$value = sprintf '%.2g',$value;
in the hope that it would remove the trailing '.0' from the value when the value is an integer, but no go.
The problem here is semantic: you should not push a float for uptime. It is an integer. I'm logging many hosts in the system table, not only Proxmox, but Proxmox is the only one sending a float for this field.
I think the conversion algorithm used is too simple and needs to account for different data types.
It's a bug, plain and simple.
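On the `%.2g` attempt above: in printf-style formatting the precision of `%g` is the number of significant digits, not the number of decimal places, so it rounds larger values instead of merely trimming a trailing `.0`. Python's `%` operator follows the same C conventions as Perl's sprintf for these two conversions, so the effect can be checked with a quick Python sketch (the sample values are made up):

```python
samples = [1234, 12.0, 0.5]

for value in samples:
    fixed = "%.2f" % value    # two decimal places
    general = "%.2g" % value  # two significant digits
    print(value, fixed, general)
```

For 1234 this prints `1234.00` and `1.2e+03`: the `%g` form has silently changed the value, which is presumably not what an uptime counter should do.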
 
I was using Telegraf to receive metrics from Proxmox and send them to InfluxDB and got the same error. Worked around the issue by making Telegraf convert the value to an integer.

Code:
# error="partial write: field type conflict: input field \"uptime\" on measurement \"system\" is type float, already exists as type integer dropped=15"
[[processors.converter]]
  [processors.converter.fields]
    integer = ["uptime"]
 
Where exactly did you put this in your config?
I pasted it at the end and the error still persists.

EDIT: Although this thread is nearly 2 years old now, I encountered this bug only recently. I can't tell exactly when, because I don't check the syslog on the logging machine every day, but it can only have been a few weeks. Maybe Proxmox changed something in recent code.
 