Node with question mark

Jannoke · Apr 27, 2024

iSotopeOfAdmiralty said:
Just "service pvestatd start" did it for me, but it keeps happening over and over for some reason.

It means something intermittently takes too long for the pvestatd daemon to query (a mountpoint, or some other information shown in the GUI). The usual culprits are slow disks or an unreliable NFS mount.
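A rough way to spot a slow or hanging mountpoint is to time a filesystem query against it with a deadline. This is only a sketch: the function name `probe_path` and the 5-second default are made up for illustration, and this is not how pvestatd itself queries storage.

```python
import os
import threading
import time


def probe_path(path, timeout=5.0):
    """Time an os.statvfs() call on `path`, giving up after `timeout` seconds.

    Returns ("ok", elapsed_seconds), ("error", message) or ("timeout", None).
    The query runs in a daemon thread so a hung NFS mount cannot block the caller.
    """
    result = {}

    def worker():
        start = time.monotonic()
        try:
            os.statvfs(path)
            result["elapsed"] = time.monotonic() - start
        except OSError as exc:
            result["error"] = str(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)

    if thread.is_alive():
        # The call is still blocked, which is typical of an unresponsive NFS server.
        return ("timeout", None)
    if "error" in result:
        return ("error", result["error"])
    return ("ok", result["elapsed"])
```

Running it against each storage path on the node (for example the paths under /mnt/pve, if that is where your storages are mounted) and watching for "timeout" or unusually large elapsed times should point at the storage that stalls.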

iSotopeOfAdmiralty · Apr 27, 2024

yup. a metric server was down

bbgeek17 · Apr 27, 2024

iSotopeOfAdmiralty said:
yup. a metric server was down

I'd recommend either opening a new bug in https://bugzilla.proxmox.com , or adding to https://bugzilla.proxmox.com/show_bug.cgi?id=3259.
Although the cause is different in your case, the resulting state is the same.

Ideally, an offline external log/metric collector should not cause cluster heartburn in PVE.
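Until that is addressed, a quick reachability check of the configured metric server can confirm whether it is the trigger. A minimal sketch, assuming the collector listens on TCP (some setups use UDP, which this cannot test); the function name is hypothetical:

```python
import socket


def metric_server_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures and timeouts.
        return False


# Example call with a placeholder host and the Graphite plaintext port:
# metric_server_reachable("metrics.example.internal", 2003)
```

If this returns False while the node shows a question mark, the metric server is a likely suspect.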

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

navigator · Apr 28, 2024

Things that, over time, I have come to recognize as the real causes of the pvestatd service having problems and needing to be restarted:

NFS or another file system shared across the cluster that cannot be found, or that has latency problems or packet loss on the connection to the NFS server
Excessive RAM usage on the nodes, whether falsely reported because VMs do not have the QEMU agent installed, or because the node really does have high RAM usage (Ceph server, I/O with a lot of delay, or similar)

Failure to connect to a backup system
For example, PBS servers or other auxiliary backup systems such as NFS with physical connection difficulties, latency, or packet loss. Lately, though, I have not had these problems anymore.

iSotopeOfAdmiralty · Apr 30, 2024

bbgeek17 said:
I'd recommend either opening a new bug in https://bugzilla.proxmox.com , or adding to https://bugzilla.proxmox.com/show_bug.cgi?id=3259.
Although the cause is different in your case, the resulting state is the same.

Ideally, an offline external log/metric collector should not cause cluster heartburn in PVE.

I intentionally added a faulty metrics server to a Proxmox node in my testing cluster.
Sure enough, the exact same issue came back. I watched the node closely through Netdata (a working metrics server), and this alert seems to point to the root cause of the issue:

Alert: apps_group_file_descriptors_utilization
Chart: app.proxmox-ve_fds_open_limit
Context: app.fds_open_limit
Raised to warning, for 0 seconds

On Sat Apr 27 2024, 01:28:22 CDT
By: TEST01
Space: OpenServeonics TESTING
Rooms: All nodes, Rack 1
Global time: Sat Apr 27 2024, 06:28:22 UTC

Classification: Utilization
Role: sysadmin

iSotopeOfAdmiralty · Apr 30, 2024

It seems to me that pvestatd tried to open a file descriptor in order to connect to the nonexistent metrics server.
There is a limit on how many files pvestatd can open, probably to prevent it from using too many resources or even crashing a node. Since the target it tried to reach does not exist, it retried over and over, eventually hitting the limit.
And thus, pvestatd was killed by the system.
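One way to test this hypothesis without Netdata is to compare the number of file descriptors a process has open against its limit, both of which Linux exposes under /proc. The sketch below is not part of Proxmox; `fd_usage` is a made-up helper, and reading another process's /proc entries generally requires root.

```python
import os


def fd_usage(pid):
    """Return (open_fds, soft_limit, hard_limit) for a Linux process.

    A limit is None when /proc reports it as "unlimited".
    """
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))

    soft = hard = None
    with open(f"/proc/{pid}/limits") as fh:
        for line in fh:
            if line.startswith("Max open files"):
                # Line layout: "Max open files  <soft>  <hard>  files"
                fields = line.split()
                soft, hard = (
                    None if value == "unlimited" else int(value)
                    for value in fields[3:5]
                )
                break
    return open_fds, soft, hard
```

Calling `fd_usage()` with the PID of pvestatd (for example from `pgrep pvestatd`) every few seconds while the faulty metrics server is configured would show whether the open count really climbs toward the soft limit.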
