[SOLVED] cluster stop working

piviul

Hi all, I have a 3-node PVE 6.4-15 cluster. All VMs (CT and QEMU) seem to be working, but the nodes no longer communicate with each other. The pvesr service doesn't seem to work:
# systemctl status pvesr.service
● pvesr.service - Proxmox VE replication runner
Loaded: loaded (/lib/systemd/system/pvesr.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Sat 2022-11-05 07:13:10 CET; 23s ago
Process: 2330254 ExecStart=/usr/bin/pvesr run --mail 1 (code=exited, status=13)
Main PID: 2330254 (code=exited, status=13)

Nov 05 07:13:04 pve02 pvesr[2330254]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 05 07:13:05 pve02 pvesr[2330254]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 05 07:13:06 pve02 pvesr[2330254]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 05 07:13:07 pve02 pvesr[2330254]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 05 07:13:08 pve02 pvesr[2330254]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 05 07:13:09 pve02 pvesr[2330254]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 05 07:13:10 pve02 pvesr[2330254]: cfs-lock 'file-replication_cfg' error: no quorum!
Nov 05 07:13:10 pve02 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Nov 05 07:13:10 pve02 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Nov 05 07:13:10 pve02 systemd[1]: Failed to start Proxmox VE replication runner.


This is the status of the cluster:

# pvecm status
Cluster information
-------------------
Name: CSA-cluster1
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Sat Nov 5 07:14:56 2022
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 2.914
Quorate: No

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 192.168.255.2 (local)

What do you suggest? Would restarting the nodes solve the problem?
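
For reference, the vote numbers in the `pvecm status` output above are consistent with a plain majority rule: with 3 expected votes the threshold is 2, and a node that only sees its own single vote is blocked. A minimal sketch of that arithmetic, assuming default votequorum behaviour with no special options; the function names `quorum_needed` and `has_quorum` are hypothetical and not part of any Proxmox tool:

```shell
#!/bin/sh
# Majority threshold: more than half of the expected votes
# (integer division, so 3 expected votes -> 2).
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Hypothetical helper: succeeds when the total votes seen ($1)
# reach the majority threshold for the expected votes ($2).
has_quorum() {
    [ "$1" -ge "$(quorum_needed "$2")" ]
}

# Values taken from the status output in this post:
# Total votes: 1, Expected votes: 3
if has_quorum 1 3; then
    echo "quorate"
else
    echo "activity blocked"
fi
```

With the values from this post the sketch takes the `else` branch, which matches `Quorate: No`.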
 
I can't recommend that you restart the nodes, but let me give you some advice. I have had issues with my cluster (4 nodes) about 5 times now. Each time I played around for hours and nothing worked (I am very careful with a prod cluster). Out of desperation I reached the point where I didn't care what happened to the cluster next. I rebooted: it still didn't work, but the outcome was better than before. I rebooted a second time: the cluster came up perfectly and everything worked great again. It was exactly the same all 5 times. Why? To be perfectly honest, I can't tell you. But now, if I face problems and can't solve them within 30 minutes, I reboot the whole cluster 2 or 3 times. (4 out of 5 times, one reboot didn't help.)

In your case:
cfs-lock 'file-replication_cfg' error: no quorum
I would check the var/ folder. I have often seen locks that can't be removed, and that turned out to be the whole issue. After I deleted the files, the service came up.

But be warned: never play around with prod servers, and always have a backup.
 
Yes, in syslog I can find a lot of:
cfs-lock 'file-replication_cfg' error: no quorum!

This is the content of /var/lock on each node:
NODE1
# ls -l /var/lock/
total 0
drwx------ 2 root root 40 Oct 24 09:32 lvm
drwxr-xr-x 2 root root 100 Oct 24 20:04 lxc
-rw-r--r-- 1 root root 0 Oct 24 09:32 pvedaemon.lck
-rw-r--r-- 1 root root 0 Oct 24 09:32 pvefw.lck
-rw-r--r-- 1 root root 0 Oct 24 09:32 pvefw-logger.lck
-rw-r--r-- 1 root root 0 Oct 24 09:38 pve-ports.lck
-rw-r--r-- 1 www-data www-data 0 Oct 24 09:32 pveproxy.lck
-rw-r--r-- 1 root root 0 Oct 24 09:33 pvesr.lck
drwxr-xr-x 2 root root 120 Oct 24 15:07 qemu-server
-rw-r--r-- 1 www-data www-data 0 Oct 24 09:32 spiceproxy.lck
drwxr-xr-x 2 root root 40 Oct 24 09:32 subsys

NODE2
# ls -l /var/lock/
total 0
drwx------ 2 root root 40 Nov 4 16:51 lvm
drwxr-xr-x 2 root root 160 Oct 26 15:21 lxc
-rw-r--r-- 1 root root 0 Oct 24 10:01 pvedaemon.lck
-rw-r--r-- 1 root root 0 Nov 4 16:51 pve-diskmanage.lck
-rw-r--r-- 1 root root 0 Oct 24 10:01 pvefw.lck
-rw-r--r-- 1 root root 0 Oct 24 10:01 pvefw-logger.lck
drwxr-xr-x 2 root root 60 Oct 24 15:07 pve-manager
-rw-r--r-- 1 root root 0 Oct 24 10:01 pve-ports.lck
-rw-r--r-- 1 www-data www-data 0 Oct 24 10:01 pveproxy.lck
-rw-r--r-- 1 root root 0 Oct 24 10:02 pvesr.lck
drwxr-xr-x 2 root root 240 Nov 4 14:57 qemu-server
-rw-r--r-- 1 www-data www-data 0 Oct 24 10:01 spiceproxy.lck
drwxr-xr-x 2 root root 40 Oct 24 10:01 subsys

NODE3
# ls -l /var/lock/
total 0
drwx------ 2 root root 40 Oct 24 08:30 lvm
drwxr-xr-x 2 root root 40 Oct 24 08:30 lxc
-rw-r--r-- 1 root root 0 Oct 24 08:30 pvedaemon.lck
-rw-r--r-- 1 root root 0 Oct 24 08:30 pvefw.lck
-rw-r--r-- 1 root root 0 Oct 24 08:30 pvefw-logger.lck
-rw-r--r-- 1 www-data www-data 0 Oct 24 08:30 pveproxy.lck
-rw-r--r-- 1 root root 0 Oct 24 08:31 pvesr.lck
drwxr-xr-x 2 root root 180 Oct 24 08:36 qemu-server
-rw-r--r-- 1 www-data www-data 0 Oct 24 08:30 spiceproxy.lck
drwxr-xr-x 2 root root 40 Oct 24 08:30 subsys

I don't see any strange lock files, do you?

Attached you can find the syslog entries covering about 20 minutes, starting shortly before the problem arose. The Proxmox VE replication runner (pvesr.service) can't start. Can you please help me find the problem?

Piviul
 

What about reducing the quorum to one node (pvecm expected 1) and then adding the nodes to the cluster again (pvecm add IP-ADDRESS-CLUSTER -link0 LOCAL-IP-ADDRESS-LINK0)?

Piviul
 
Problem solved: after switching the switch dedicated to PVE communication off and on again, the problem seems to be gone.

Have a great day

Piviul
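
To confirm that a fix like this really brought the cluster back, the `Quorate:` line of `pvecm status` can be checked on each node. A minimal sketch, assuming the output format shown earlier in this thread; the helper name `is_quorate` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: reads `pvecm status` output on stdin and
# succeeds only if a "Quorate:" line is present and says "Yes".
is_quorate() {
    awk '$1 == "Quorate:" { found = 1; ok = ($2 == "Yes") }
         END { exit ((found && ok) ? 0 : 1) }'
}

# Example with the two relevant lines from the failing node in this thread:
if printf 'Nodes: 1\nQuorate: No\n' | is_quorate; then
    echo "cluster is quorate"
else
    echo "no quorum"
fi
```

On a real node the input would come from the command itself, for example `pvecm status | is_quorate && echo "cluster is quorate"`.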
 
