VM Status loss after a while

ithrasiel

New Member
Aug 30, 2024
Hello everyone,

I'm using two Proxmox nodes in my environment. I added a new physical connection to both hosts, which is used only as a bridge for my VMs.

Right after connecting, everything worked fine. But after a while the VM state changed to "unknown". The VMs are still running fine, but most management functions no longer work:
(screenshot attached: 1725005387898.png)

After a reboot the state goes back to normal and all other functions work again.

But this only lasts for a while, seemingly 30-60 minutes. Then the state changes back to "unknown".

Here is my current network configuration. It is identical on both hosts (except for the IP addresses):
Bash:
auto lo
iface lo inet loopback

iface enp86s0 inet manual

auto enp46s0
iface enp46s0 inet manual

source /etc/network/interfaces.d/*

auto vmbr0
iface vmbr0 inet static
        bridge-ports enp46s0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto vmbr0.5
iface vmbr0.5 inet static
        address 172.16.0.6/24
        gateway 172.16.0.1

auto vmbr0.90
iface vmbr0.90 inet static
        address 99.1.1.10/24

iface wlo1 inet manual

auto vmbr1
iface vmbr1 inet manual
        bridge-ports enp86s0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

Any idea where to troubleshoot here?
 
It seems like you are experiencing some connectivity issues between your hosts, and it would be interesting to know whether you have set up a separate cluster network for corosync. Could you check the status of your cluster with pvecm status when this happens and post the syslog from journalctl -b -u pvestatd -u pve-cluster?
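
For reference, these are the exact commands, to be run on the affected node while the issue is present:
Bash:
# cluster/quorum status as seen by this node
pvecm status

# syslog of the status daemon and the cluster filesystem for the current boot
journalctl -b -u pvestatd -u pve-cluster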

Also, just a friendly reminder: if you have a two-node cluster and haven't already done so, it is strongly encouraged to also set up a QDevice, so that your cluster stays quorate even when one node is down. This is recommended for any cluster with an even number of nodes.
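
A rough sketch of the usual QDevice setup, assuming an external host (placeholder IP 192.168.1.50) that is not part of the cluster:
Bash:
# on the external QDevice host
apt install corosync-qnetd

# on all cluster nodes
apt install corosync-qdevice

# on one cluster node: register the QDevice with the cluster
pvecm qdevice setup 192.168.1.50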
 
I have a dedicated management network where both hosts have an IP, and a quorum device also exists. Here is the output from pvecm status:
(screenshot attached: 1725006682340.png)

Here is also the output from my first node (I needed to cut out duplicated lines):
Bash:
Aug 30 08:43:19 kd-node01 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Aug 30 08:43:19 kd-node01 pmxcfs[1394]: [main] notice: resolved node name 'kd-node01' to '172.16.0.6' for default node IP address
Aug 30 08:43:19 kd-node01 pmxcfs[1394]: [main] notice: resolved node name 'kd-node01' to '172.16.0.6' for default node IP address
Aug 30 08:43:19 kd-node01 pmxcfs[1401]: [quorum] crit: quorum_initialize failed: 2
Aug 30 08:43:19 kd-node01 pmxcfs[1401]: [quorum] crit: can't initialize service
Aug 30 08:43:19 kd-node01 pmxcfs[1401]: [confdb] crit: cmap_initialize failed: 2
Aug 30 08:43:19 kd-node01 pmxcfs[1401]: [confdb] crit: can't initialize service
Aug 30 08:43:19 kd-node01 pmxcfs[1401]: [dcdb] crit: cpg_initialize failed: 2
Aug 30 08:43:19 kd-node01 pmxcfs[1401]: [dcdb] crit: can't initialize service
Aug 30 08:43:19 kd-node01 pmxcfs[1401]: [status] crit: cpg_initialize failed: 2
Aug 30 08:43:19 kd-node01 pmxcfs[1401]: [status] crit: can't initialize service
Aug 30 08:43:20 kd-node01 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 30 08:43:20 kd-node01 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Aug 30 08:43:20 kd-node01 pvestatd[1435]: starting server
Aug 30 08:43:20 kd-node01 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [status] notice: update cluster info (cluster name  KD-Cluster, version = 3)
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [status] notice: node has quorum
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [dcdb] notice: members: 1/1401, 2/1681
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [dcdb] notice: starting data syncronisation
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [status] notice: members: 1/1401, 2/1681
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [status] notice: starting data syncronisation
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [dcdb] notice: received sync request (epoch 1/1401/00000001)
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [status] notice: received sync request (epoch 1/1401/00000001)
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [dcdb] notice: received all states
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [dcdb] notice: leader is 2/1681
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [dcdb] notice: synced members: 2/1681
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [dcdb] notice: waiting for updates from leader
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [status] notice: received all states
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [status] notice: all data is up to date
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [dcdb] notice: update complete - trying to commit (got 2 inode updates)
Aug 30 08:43:25 kd-node01 pmxcfs[1401]: [dcdb] notice: all data is up to date
Aug 30 08:43:37 kd-node01 pvestatd[1435]: BKP-NAS: error fetching datastores - 500 Can't connect to kd-pbs:8007 (Temporary failure in name resolution)
Aug 30 08:43:37 kd-node01 pvestatd[1435]: status update time (6.799 seconds)
Aug 30 08:43:40 kd-node01 pvestatd[1435]: modified cpu set for lxc/104: 0
Aug 30 08:43:46 kd-node01 pvestatd[1435]: BKP-NAS: error fetching datastores - 500 Can't connect to kd-pbs:8007 (Temporary failure in name resolution)
Aug 30 08:43:46 kd-node01 pvestatd[1435]: status update time (6.314 seconds)
Aug 30 08:43:56 kd-node01 pvestatd[1435]: BKP-NAS: error fetching datastores - 500 Can't connect to kd-pbs:8007 (Temporary failure in name resolution)
Aug 30 08:43:56 kd-node01 pvestatd[1435]: status update time (6.328 seconds)
Aug 30 08:45:00 kd-node01 pmxcfs[1401]: [status] notice: received log
Aug 30 08:52:30 kd-node01 pvestatd[1435]: modified cpu set for lxc/103: 1
Aug 30 09:00:01 kd-node01 pmxcfs[1401]: [status] notice: received log
Aug 30 09:37:02 kd-node01 pmxcfs[1401]: [dcdb] notice: members: 1/1401
Aug 30 09:37:02 kd-node01 pmxcfs[1401]: [status] notice: members: 1/1401
Aug 30 09:37:03 kd-node01 pmxcfs[1401]: [dcdb] notice: members: 1/1401, 2/1681
Aug 30 09:37:03 kd-node01 pmxcfs[1401]: [dcdb] notice: starting data syncronisation
Aug 30 09:37:04 kd-node01 pmxcfs[1401]: [dcdb] notice: cpg_send_message retry 10
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: cpg_send_message retry 10
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: cpg_send_message retried 11 times
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [dcdb] notice: cpg_send_message retried 12 times
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: members: 1/1401, 2/1681
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: starting data syncronisation
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [dcdb] notice: received sync request (epoch 1/1401/00000003)
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: received sync request (epoch 1/1401/00000003)
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [dcdb] notice: received all states
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [dcdb] notice: leader is 1/1401
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [dcdb] notice: synced members: 1/1401
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [dcdb] notice: start sending inode updates
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [dcdb] notice: sent all (2) updates
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [dcdb] notice: all data is up to date
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [dcdb] notice: dfsm_deliver_queue: queue length 1
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: received all states
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: all data is up to date
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: dfsm_deliver_queue: queue length 16
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: received log
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [main] notice: ignore insert of duplicate cluster log
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [dcdb] notice: dfsm_deliver_queue: queue length 1
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: received all states
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: all data is up to date
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: dfsm_deliver_queue: queue length 16
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [status] notice: received log
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [main] notice: ignore insert of duplicate cluster log
Aug 30 09:37:05 kd-node01 pmxcfs[1401]: [dcdb] notice: dfsm_deliver_queue: queue length 1
Aug 30 09:38:04 kd-node01 pmxcfs[1401]: [status] notice: received log
Aug 30 09:39:04 kd-node01 pmxcfs[1401]: [status] notice: received log
Aug 30 09:40:04 kd-node01 pmxcfs[1401]: [status] notice: received log
Aug 30 09:41:04 kd-node01 pmxcfs[1401]: [status] notice: received log
Aug 30 09:41:12 kd-node01 pvestatd[1435]: status update time (51.039 seconds)
Aug 30 09:43:04 kd-node01 pmxcfs[1401]: [status] notice: received log
Aug 30 09:43:19 kd-node01 pmxcfs[1401]: [dcdb] notice: data verification successful
Aug 30 09:44:04 kd-node01 pmxcfs[1401]: [status] notice: received log
Aug 30 09:47:28 kd-node01 pvestatd[1435]: status update time (215.585 seconds)
Aug 30 09:48:04 kd-node01 pmxcfs[1401]: [status] notice: received log
Aug 30 10:01:30 kd-node01 pmxcfs[1401]: [dcdb] notice: members: 1/1401
Aug 30 10:01:30 kd-node01 pmxcfs[1401]: [status] notice: members: 1/1401
Aug 30 10:04:57 kd-node01 pmxcfs[1401]: [dcdb] notice: members: 1/1401, 2/1310
Aug 30 10:04:57 kd-node01 pmxcfs[1401]: [dcdb] notice: starting data syncronisation
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [dcdb] notice: cpg_send_message retried 6 times
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [status] notice: members: 1/1401, 2/1310
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [status] notice: starting data syncronisation
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [dcdb] notice: received sync request (epoch 1/1401/00000005)
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [status] notice: received sync request (epoch 1/1401/00000005)
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [dcdb] notice: received all states
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [dcdb] notice: leader is 1/1401
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [dcdb] notice: synced members: 1/1401
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [dcdb] notice: start sending inode updates
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [dcdb] notice: sent all (2) updates
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [dcdb] notice: all data is up to date
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [status] notice: received all states
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [status] notice: all data is up to date
Aug 30 10:04:58 kd-node01 pmxcfs[1401]: [status] notice: dfsm_deliver_queue: queue length 14
Aug 30 10:34:06 kd-node01 pmxcfs[1401]: [status] notice: received log
Aug 30 10:35:06 kd-node01 pmxcfs[1401]: [status] notice: received log
 
Thank you for the information on the issue. Are you currently experiencing it? I couldn't tell from the logs, as your cluster is quorate.

The only thing that stands out is the "Temporary failure in name resolution", which appears while trying to reach your PBS server. That is usually the error when you're having trouble reaching your DNS server, and it could in turn be a sign of a lost Internet connection if you're using a public DNS.
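
A quick sketch of how you could check name resolution on the node while the issue is present (kd-pbs is taken from your log; resolve via the system resolver first, then ask the DNS server directly):
Bash:
# which DNS servers is the node using?
cat /etc/resolv.conf

# resolve the PBS hostname via the system resolver (also consults /etc/hosts)
getent hosts kd-pbs

# query the DNS server directly (dig is provided by the dnsutils package)
dig kd-pbs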

Have you found any other issues in the syslog when the problem arises (just use journalctl -b <boot-number>, where <boot-number> is 0 for the current boot, -1 for the one before, etc.)?
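
For example:
Bash:
# list all boots the journal knows about
journalctl --list-boots

# current boot
journalctl -b 0

# previous boot
journalctl -b -1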
 
As always, by the time I can show it, the cluster has already restored the health states (this time without a reboot).

I'll reply with an updated journalctl output the next time the error occurs. Maybe we can then dig deeper into the problem.

Thank you for the fast help :)
 
Also, if it is a DNS issue, have you added the IP/name of the other node to the /etc/hosts file?

In that file there should already be a line for the current host; copy that format to a new line for the other host (and vice versa on the other node), adding it right below the current host's line. See the sketch below.
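
A sketch of what that could look like on kd-node01; the second node's name (kd-node02) and IP (172.16.0.7) are placeholders, use your actual values:
Bash:
# /etc/hosts on kd-node01
127.0.0.1       localhost
172.16.0.6      kd-node01

# hypothetical entry for the second node - use its real management IP and name
172.16.0.7      kd-node02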
 
