PVE node becomes unavailable after some time

Seqway

Good day,
I hope this is the right area to post.

I have a Proxmox cluster running with 3 nodes. It is running well and I have many LXC containers and VMs on it. So far so good.
With one node I have problems from time to time, for reasons I don't understand. At random times it is suddenly marked red in the GUI and is no longer reachable, neither through the GUI nor by SSH login from a terminal.
Only a hard reset brings it back, and then everything runs fine again.

My first question: what should I post here that is needed for an investigation? Please advise and I will post it.

Second, where should I look for log files to get an idea of what is happening with this node? I don't know where to start. Any advice would be appreciated!

1777717911958.png
 
Second, where should I look for log files to get an idea of what is happening with this node?

General troubleshooting for Proxmox is to obtain and review the system report and `journalctl` logs.

System Report
- CLI: `# pvereport > $(hostname)-pvereport-$(date -I).txt`
- GUI: `PVE Node -> Subscription -> System Report`

Journal
- CLI: `# journalctl --since "YYYY-MM-DD" --until "YYYY-MM-DD" | gzip > $(hostname)-journal.txt.gz`
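
If the full export turns out too large to attach, the journalctl command above can be narrowed to the hours around one incident. The following is only a rough sketch: `journal_window` is a hypothetical helper name, the timestamps are placeholders to fill in, and `journalctl -b -1` only returns anything if the journal is stored persistently on the node.

```shell
# Sketch: export only the journal around one incident instead of a whole month.
# Arguments are timestamps in a form journalctl accepts, e.g. "YYYY-MM-DD HH:MM".
journal_window() {
    journalctl --since "$1" --until "$2" --no-pager
}

# Usage (placeholders, fill in the date of the outage):
#   journal_window "YYYY-MM-DD 02:30" "YYYY-MM-DD 03:30" | gzip > "$(hostname)-journal.txt.gz"
#
# Only the messages from the boot before the hard reset:
#   journalctl -b -1 --no-pager | gzip > "$(hostname)-previous-boot.txt.gz"
```

An hour before and after the time the node went red is usually a much smaller file than a full month and still contains the lead-up to the failure.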

My first question: what should I post here that is needed for an investigation?

You can attach the collected logs to your post with the forum's Attach files button.
 
Hello,
thanks for the explanation.
Please find the system report attached.

The journal log file for 1 month is too big to upload. How many days should I include?

Addendum:
Maybe it has something to do with my backup, which starts at 3 am. But this does not always happen, only very rarely. When it does, I get these messages:
Code:
Apr 30 03:01:34 pve2 corosync[1150]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Apr 30 03:01:36 pve2 corosync[1150]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Apr 30 03:01:36 pve2 corosync[1150]:   [TOTEM ] Token has not been received in 63327 ms
Apr 30 03:01:37 pve2 corosync[1150]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Apr 30 03:01:39 pve2 corosync[1150]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Apr 30 03:01:40 pve2 sudo[3029673]: telegraf : PWD=/ ; USER=root ; COMMAND=/usr/sbin/smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sda
Apr 30 03:01:40 pve2 sudo[3029673]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=999)
Apr 30 03:01:40 pve2 sudo[3029674]: telegraf : PWD=/ ; USER=root ; COMMAND=/usr/sbin/smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sdb -d sat
Apr 30 03:01:40 pve2 sudo[3029674]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=999)
Apr 30 03:01:40 pve2 sudo[3029673]: pam_unix(sudo:session): session closed for user root
Apr 30 03:01:40 pve2 sudo[3029674]: pam_unix(sudo:session): session closed for user root
Apr 30 03:01:40 pve2 corosync[1150]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Apr 30 03:01:42 pve2 corosync[1150]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Apr 30 03:01:43 pve2 corosync[1150]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Apr 30 03:01:44 pve2 corosync[1150]:   [TOTEM ] Token has not been received in 71357 ms
Apr 30 03:01:45 pve2 corosync[1150]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Apr 30 03:01:46 pve2 corosync[1150]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Apr 30 03:01:48 pve2 corosync[1150]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Apr 30 03:01:49 pve2 corosync[1150]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Apr 30 03:01:50 pve2 sudo[3029681]: telegraf : PWD=/ ; USER=root ; COMMAND=/usr/sbin/smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sda
 

According to the journalctl logs, pve2 did not receive the Corosync token for more than a minute (63327 ms, later 71357 ms) and temporarily dropped out of the cluster.

Code:
Apr 30 03:01:34 pve2 corosync[1150]: [MAIN] Totem is unable to form a cluster ...
Apr 30 03:01:36 pve2 corosync[1150]: [TOTEM] Token has not been received in 63327 ms
...
Apr 30 03:01:44 pve2 corosync[1150]: [TOTEM] Token has not been received in 71357 ms
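
To see how often and for how long this happens, the token messages can be pulled out of a saved journal. This is a minimal sketch, assuming the log lines keep the syslog-style layout shown above; `token_gaps` is a hypothetical helper name, not a Proxmox or Corosync tool.

```shell
# Sketch: list corosync "Token has not been received" gaps from journal text.
# Reads journal lines on stdin; prints "<month> <day> <time> <milliseconds>".
token_gaps() {
    awk '/corosync\[[0-9]+\]:.*Token has not been received in [0-9]+ ms/ {
        # $1..$3 are the timestamp, e.g. "Apr 30 03:01:36";
        # the second-to-last field is the gap in milliseconds.
        print $1, $2, $3, $(NF-1)
    }'
}

# Usage on the node itself:
#   journalctl -u corosync --no-pager | token_gaps
# Usage on an exported file:
#   zcat pve2-journal.txt.gz | token_gaps
```

If the reported gaps cluster around the 3 am backup window, that would support the suspicion that backup traffic is starving Corosync on the shared NIC.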

When a node leaves the cluster, the other cluster nodes detect this, as shown in the initially shared image. It can also cause pve2 to restart, for example when HA is active and the node fences itself after losing quorum.
At the time this log was taken, the node had already recovered. However, because Corosync communication was interrupted for an extended period during the problem, it is highly likely that the node remained disconnected from the cluster.

The system report shows that there is only one NIC and that all traffic is concentrated on it.
Corosync communication is sensitive to latency and jitter, so a dedicated network is recommended.
Please check your network configuration against the following manual:

https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_network
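
Before changing the network, it can help to record the current cluster and link state on each node. This is a small sketch under the assumption that it is run as root on a cluster member; `cluster_link_report` is a hypothetical wrapper name, while `pvecm status` and `corosync-cfgtool -s` are the standard tools it calls.

```shell
# Sketch: print quorum/membership and corosync link state for one node.
# Run as root on each cluster node and compare the results.
cluster_link_report() {
    echo "== quorum and membership =="
    pvecm status
    echo "== corosync link status =="
    corosync-cfgtool -s
}

# Usage:
#   cluster_link_report > "$(hostname)-cluster-state.txt"
```

Running it once while the cluster is healthy and once during the backup window gives two snapshots that can be attached here for comparison.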