pvestatd: storage 'wxyz' is not online

upnort · May 16, 2018

Proxmox 5.1.

I found an older thread addressing this issue.

I see these error messages a lot in the Proxmox /var/log/syslog. I am using an NFS storage location.

First item is an enhancement request --- could the actual error message please be updated? Consider revising the text string "is not online" to "did not respond to a showmount request"? The simple change would help much with debugging the problem.

I ask that the text string be revised because the storage device is indeed online. Something in the Proxmox scripts either times out (2 seconds), does not receive a showmount reply, or something else. I do not suspect a lack of a showmount reply because I can run the showmount command all day and the remote storage device responds.

Second item is seeking help to debug this log message. Try as I might I cannot debug the root cause of this error message.

The systems are using the standard 8 NFS server threads. Running the nfsstat -r command shows no errors, retrans, or bad calls. Likewise, the ifconfig command shows no dropped packets. Hardware includes 1 Gbps connections and the systems use 10000 rpm RAID SAS drives.

I appreciate that the messages are "harmless," but the problem causes a lot of log spew. I see in the referenced thread that the developers do not see this problem in their lab. I will help test if Proxmox developers want data or information.

Thanks again.

fabian · May 17, 2018

upnort said:
Proxmox 5.1.

I found an older thread addressing this issue.

I see these error messages a lot in the Proxmox /var/log/syslog. I am using an NFS storage location.

First item is an enhancement request --- could the actual error message please be updated? Consider revising the text string "is not online" to "did not respond to a showmount request"? The simple change would help much with debugging the problem.

I ask that the text string be revised because the storage device is indeed online. Something in the Proxmox scripts either times out (2 seconds), does not receive a showmount reply, or something else. I do not suspect a lack of a showmount reply because I can run the showmount command all day and the remote storage device responds.

Second item is seeking help to debug this log message. Try as I might I cannot debug the root cause of this error message.

The systems are using the standard 8 NFS server threads. Running the nfsstat -r command shows no errors, retrans, or bad calls. Likewise, the ifconfig command shows no dropped packets. Hardware includes 1 Gbps connections and the systems use 10000 rpm RAID SAS drives.

I appreciate that the messages are "harmless," but the problem causes a lot of log spew. I see in the referenced thread that the developers do not see this problem in their lab. I will help test if Proxmox developers want data or information.

Thanks again.

in most cases it is simply a slow NFS export running into pvestatd's 2 second timeout. in most of the other cases it is an NFS server where showmounts does not work. the rest are really offline.

upnort · May 17, 2018

in most cases it is simply a slow NFS export running into pvestatd's 2 second timeout. in most of the other cases it is an NFS server where showmounts does not work. the rest are really offline.

Thank you.

Perhaps the 2 second timeout could be configurable? Or the script revised to try more than once before logging something? And the text string error in the logs revised to be more technically correct?

I accept that the scripts are written in a way that creates the perception that a storage device is unavailable or not responding. Done that to myself many times with scripts.

I know the devices are available and responding, which is why this log error is perplexing.

fabian · May 18, 2018

upnort said:
Thank you.

Perhaps the 2 second timeout could be configurable? Or the script revised to try more than once before logging something? And the text string error in the logs revised to be more technically correct?

the 2 seconds are not really extendable, since we want to finish a whole stat update cycle (all guests, storages, and the node itself) in 5 seconds, and schedule one every 10 seconds. trying twice is not really an option for the same reason. the log message is pretty harmless, it just tells you that your storage is slow and can help correlate other issues (you will also miss stat updates for that storage and cycle of course).

changing the log message might not be a good idea unless there is a very good reason - you never know who has matching in monitoring for such strings, and although we don't guarantee any of your output strings to be fixed forever, we also try to not change them just because we can

I accept that the scripts are written in a way that creates the perception that a storage device is unavailable or not responding. Done that to myself many times with scripts. I know the devices are available and responding, which is why this log error is perplexing.

refactoring the daemon that collects the statistics data (pvestatd) is very high on our TODO list, and the new design should handle slow storages much more gracefully (and allow us to bump the timeout by running checks in parallel instead of series)!

upnort · May 19, 2018

changing the log message might not be a good idea unless there is a very good reason - you never know who has matching in monitoring for such strings, and although we don't guarantee any of your output strings to be fixed forever, we also try to not change them just because we can

Thank you for replying Fabian.

I understand the explanation -- been there myself many times. Nonetheless, I believe changing the text string would be more "technically correct" and more helpful. If such a change was made, explicitly noting the new string in the change logs would alert admins who scan that specific text string in their logs.

refactoring the daemon that collects the statistics data (pvestatd) is very high on our TODO list, and the new design should handle slow storages much more gracefully (and allow us to bump the timeout by running checks in parallel instead of series)!

I look forward to the changes. Although the message is "harmless," the log spew is a nuisance.

I would like to constructively offer that, as I shared previously, I know the storage links are available. I can run the showmount command all day without an error. Something else is going on that triggers the log spew. Possibly we have run into a corner case issue that is not easily tested in the labs. I wish I knew perl scripting better because I would love to dig deeper into this.

Search

Search

pvestatd: storage 'wxyz' is not online

upnort

Member

fabian

Proxmox Staff Member

upnort

Member

fabian

Proxmox Staff Member

upnort

Member