One node in cluster not starting after backup failure

thex
Member · Mar 25, 2021
Hi,
tonight some of my backups failed with the following error:
Code:
102: 2021-04-22 04:28:21 ERROR: Backup of VM 102 failed - unable to open file '/etc/pve/nodes/proxmox/qemu-server/102.conf.tmp.3183' - Permission denied

So it seems to be some problem with the lock files.
However, my guess is that this is only a symptom.

After restarting node1 of the cluster, all VMs on that node run fine and none of them is locked anymore.
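For reference, a small sketch of how I check for leftover backup locks (an assumption on my part that the stock qm commands are the right tool here; 102 is just the VMID from the error above):

```shell
# A locked VM shows a "lock:" line in its config; qm unlock clears it.
# Only defined here -- run check_lock <vmid> on the node itself.
check_lock() {
  qm config "$1" | grep '^lock:' || echo "vm $1: not locked"
}
# clear a stale backup lock manually:  qm unlock 102
```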

However, node2 now seems to be stalled.
In the web UI I get:
Code:
VM6655:1 GET https://192.168.42.23:8006/api2/json/nodes/proxmix/time 401 (permission denied - invalid PVE ticket)

When logging into node2 via SSH, any command related to VMs hangs, for example qm list.
I currently assume it has something to do with corosync.

However, this looks fine at first sight:
Code:
root@proxmix:~# pvecm status
Cluster information
-------------------
Name:             Virtualizers
Config Version:   2
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Apr 22 10:12:35 2021
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.a85
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.42.24
0x00000002          1 192.168.42.23 (local)
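For completeness: "Quorate: Yes" only proves corosync membership; as the hanging qm list shows, the pmxcfs layer on top can still be stuck. A tiny helper of my own to pull the quorum flag out of that output:

```shell
# check_quorate reads `pvecm status`-style text on stdin and reports the
# Quorate flag; on a live node you would run:  pvecm status | check_quorate
check_quorate() {
  awk -F': *' '/^Quorate:/ { print (($2 == "Yes") ? "quorate" : "NOT quorate") }'
}

# demo with the output captured above:
check_quorate <<'EOF'
Quorate:          Yes
EOF
```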

It looks like the replication runner is not starting (I'm not using any replication jobs):
Code:
● pvesr.service - Proxmox VE replication runner
   Loaded: loaded (/lib/systemd/system/pvesr.service; static; vendor preset: enabled)
   Active: activating (start) since Thu 2021-04-22 09:56:00 CEST; 19min ago
Main PID: 1618 (pvesr)
    Tasks: 1 (limit: 4915)
   Memory: 72.8M
   CGroup: /system.slice/pvesr.service
           └─1618 /usr/bin/perl -T /usr/bin/pvesr run --mail 1

Apr 22 09:56:00 proxmix systemd[1]: Starting Proxmox VE replication runner...
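Side note: pvesr is started from pvesr.timer once a minute even when no replication jobs exist, and my assumption is it hangs here because it needs a lock inside /etc/pve. A sketch for inspecting it (assumes the stock unit names):

```shell
# Show the stuck replication runner and its timer; only defined here,
# run inspect_pvesr on the affected node.
inspect_pvesr() {
  systemctl status pvesr.service pvesr.timer --no-pager
  systemctl list-timers pvesr.timer --no-pager
}
```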

As a result, currently no VM starts on my second node and I can't manage it via the web UI of node1. In addition, I can't log into its own web UI: "Connection failure. Network error or Proxmox VE services not running?"

I also checked NTP: the time seems to be the same on both machines and I can't see any NTP-related errors.
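For reference, a minimal sketch of how I compare the clocks: collect `date -u +%s` on each node, then judge the difference (the 2-second threshold is an arbitrary choice of mine):

```shell
# clock_drift_ok takes two epoch timestamps (one per node) and reports
# whether they are within 2 seconds of each other.
clock_drift_ok() {
  a=$1; b=$2
  d=$((a - b)); [ "$d" -lt 0 ] && d=$((-d))
  [ "$d" -le 2 ] && echo "in sync" || echo "drift ${d}s"
}

clock_drift_ok 1619079155 1619079156   # prints "in sync"
```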

Any ideas?

Code:
root@proxmix:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.11.7-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.11: 7.0-0+3~bpo10
pve-kernel-5.4: 6.3-8
pve-kernel-helper: 6.3-8
pve-kernel-5.11.7-1-pve: 5.11.7-1~bpo10
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.12-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-9
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-5
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-10
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
 
I just found that it even hangs when I try to do
Code:
ls /etc/pve/nodes/proxmix/
where proxmix is my node2. I would interpret this as an additional clue that it is a corosync issue...
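Worth knowing: /etc/pve is pmxcfs, a FUSE filesystem backed by corosync, so when the cluster connection is broken, reads can block forever. Wrapping probes in `timeout` avoids hanging the shell (the 3-second limit is my own arbitrary choice):

```shell
# Probe a possibly-hung pmxcfs path without blocking the shell.
probe_pmxcfs() {
  if timeout 3 ls "${1:-/etc/pve}" >/dev/null 2>&1; then
    echo "responsive"
  else
    echo "hung or unreadable"
  fi
}
# on the stuck node:  probe_pmxcfs /etc/pve/nodes/proxmix
```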
 
There were changes in the network lately; however, the IPs all stayed the same, only DNS changed. Could that have caused a "desynchronization"?
- How can I verify corosync is running fine?
- How can I "re-initiate" the sync?
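As far as I understand it (an assumption on my part, not verified against the docs), pmxcfs resyncs itself from the quorate partition once its corosync connection is healthy again; there is no separate "resync" command. So restarting the two services on the stuck node should be enough; wrapped in a function here so nothing runs by accident:

```shell
# Assumption: restarting corosync and pve-cluster (the pmxcfs daemon)
# on the stuck node lets /etc/pve resync automatically; running guests
# are not touched, only the config filesystem. Run on that node.
recover_pmxcfs() {
  systemctl restart corosync
  systemctl restart pve-cluster          # restarts the pmxcfs daemon
  systemctl status corosync pve-cluster --no-pager
  journalctl -u pve-cluster -n 30 --no-pager
  pvecm status                           # should report Quorate: Yes again
}
```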
 
found some more info that might be relevant in the journal (read from bottom)
Code:
Apr 22 10:22:10 proxmix pvesr[1618]: trying to acquire cfs lock 'file-replication_cfg' ...
Apr 22 10:22:10 proxmix pmxcfs[1368]: [dcdb] crit: cpg_send_message failed: 9
Apr 22 10:22:10 proxmix pmxcfs[1368]: [dcdb] crit: cpg_send_message failed: 9
Apr 22 10:22:10 proxmix pve-ha-lrm[1553]: unable to write lrm status file - unable to open file '/etc/pve/nodes/proxmix/lrm_status.tmp.1553' - Device or resource busy
Apr 22 10:22:10 proxmix pmxcfs[1368]: [dcdb] crit: cpg_send_message failed: 9
Apr 22 10:22:10 proxmix pmxcfs[1368]: [dcdb] crit: cpg_send_message failed: 9
Apr 22 10:22:10 proxmix pmxcfs[1368]: [dcdb] crit: cpg_send_message failed: 9
Apr 22 10:22:10 proxmix pmxcfs[1368]: [dcdb] crit: cpg_send_message failed: 9
Apr 22 10:22:10 proxmix pmxcfs[1368]: [dcdb] crit: cpg_send_message failed: 9
Apr 22 10:22:10 proxmix pmxcfs[1368]: [dcdb] crit: cpg_send_message failed: 9
Apr 22 10:22:10 proxmix corosync[1494]:   [CPG   ] *** 0x56038f276680 can't mcast to group pve_dcdb_v1 state:1, error:12
Apr 22 10:22:10 proxmix pmxcfs[1368]: [dcdb] crit: can't initialize service
Apr 22 10:22:10 proxmix pmxcfs[1368]: [dcdb] crit: cpg_join failed: 14
Apr 22 10:22:10 proxmix pmxcfs[1368]: [dcdb] notice: start cluster connection
Apr 22 10:22:09 proxmix pmxcfs[1368]: [status] notice: received sync request (epoch 1/2283/00000005)
Apr 22 10:22:09 proxmix pmxcfs[1368]: [status] notice: node has quorum
Apr 22 10:22:09 proxmix corosync[1494]:   [MAIN  ] Completed service synchronization, ready to provide service.
Apr 22 10:22:09 proxmix corosync[1494]:   [QUORUM] Members[2]: 1 2
Apr 22 10:22:09 proxmix corosync[1494]:   [QUORUM] This node is within the primary component and will provide service.
Apr 22 10:22:09 proxmix pmxcfs[1368]: [status] notice: starting data syncronisation
Apr 22 10:22:09 proxmix pmxcfs[1368]: [status] notice: members: 1/2283, 2/1368
Apr 22 10:22:09 proxmix corosync[1494]:   [TOTEM ] A new membership (1.a8d) was formed. Members joined: 1
Apr 22 10:22:09 proxmix corosync[1494]:   [QUORUM] Sync joined[1]: 1
Apr 22 10:22:09 proxmix corosync[1494]:   [QUORUM] Sync members[2]: 1 2
Apr 22 10:22:09 proxmix systemd[1]: Started Update UTMP about System Runlevel Changes.
Apr 22 10:22:09 proxmix systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
Apr 22 10:22:09 proxmix systemd[1]: Starting Update UTMP about System Runlevel Changes...
Apr 22 10:22:09 proxmix systemd[1]: Reached target Graphical Interface.
Apr 22 10:22:09 proxmix systemd[1]: Reached target Multi-User System.
Apr 22 10:22:09 proxmix systemd[1]: Started PVE guests.

The cfs problem might be caused by the fact that one of the VMs that is not starting is hosting the shares.
 
Next finding that might help resolve it: the output from the "dead" node is missing an IP address in the members file.
Can I just edit this, or is it generated somehow / just a representation of the current state?

Code:
root@proxmox:~# cat /etc/pve/.members
{
"nodename": "proxmox",
"version": 18,
"cluster": { "name": "Virtualizers", "version": 2, "nodes": 2, "quorate": 1 },
"nodelist": {
  "proxmix": { "id": 2, "online": 1, "ip": "192.168.42.23"},
  "proxmox": { "id": 1, "online": 1, "ip": "192.168.42.24"}
  }
}

Code:
root@proxmix:~# cat /etc/pve/.members
{
"nodename": "proxmix",
"version": 16,
"cluster": { "name": "Virtualizers", "version": 2, "nodes": 2, "quorate": 1 },
"nodelist": {
  "proxmix": { "id": 2, "online": 1, "ip": "192.168.42.23"},
  "proxmox": { "id": 1, "online": 1}
  }
}
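The proxmox entry in node2's copy is missing its "ip". A quick way to spot that kind of drift is to compare just the nodelist lines of the two files (saved locally here for the demo; on the hosts the file is /etc/pve/.members, which lives on the pmxcfs mount, so I assume it is generated rather than hand-editable):

```shell
# Extract the per-node entries from a .members file.
nodelist() { sed -n -e 's/^ *//' -e '/"id":/p' "$1"; }

# demo with the two outputs pasted above:
cat > /tmp/members.proxmox <<'EOF'
{
"nodename": "proxmox",
"version": 18,
"cluster": { "name": "Virtualizers", "version": 2, "nodes": 2, "quorate": 1 },
"nodelist": {
  "proxmix": { "id": 2, "online": 1, "ip": "192.168.42.23"},
  "proxmox": { "id": 1, "online": 1, "ip": "192.168.42.24"}
  }
}
EOF
cat > /tmp/members.proxmix <<'EOF'
{
"nodename": "proxmix",
"version": 16,
"cluster": { "name": "Virtualizers", "version": 2, "nodes": 2, "quorate": 1 },
"nodelist": {
  "proxmix": { "id": 2, "online": 1, "ip": "192.168.42.23"},
  "proxmox": { "id": 1, "online": 1}
  }
}
EOF

nodelist /tmp/members.proxmox > /tmp/nl.proxmox
nodelist /tmp/members.proxmix > /tmp/nl.proxmix
diff /tmp/nl.proxmox /tmp/nl.proxmix || true   # shows the entry that lost its "ip"
```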
 
OK... it seems to be working again, but I have to continue diagnosing later.

The problem seems to be the network topology/configuration. There have been a few changes over the last week (new router, additional switch).

The nodes were on different switches (both of which should support IGMP snooping), connected via a router whose bridge also has IGMP snooping turned on, but something seems to go wrong there.

I have now connected both nodes to the same switch and it seems to work fine again.
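To verify the links after the rewiring (a sketch, assuming the stock corosync 3 tooling that ships with PVE): corosync-cfgtool -s prints the knet link status towards each peer as seen by the local node. Wrapped in a function; run it on both nodes:

```shell
# Show per-link knet status plus recent corosync membership changes.
check_cluster_links() {
  corosync-cfgtool -s
  journalctl -u corosync -n 20 --no-pager
}
```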
 
