Cluster down: cman restart doesn't work

Kaya

Really strange problem.
I have an Intel modular server with Proxmox on two blades.
Last night the backup failed with errors:
Code:
Detailed backup logs:

vzdump 104 111 105 101 --quiet 1 --mailto backup@comune.ledro.tn.it --mode snapshot --compress lzo --storage QNAP
101: Jun 26 23:01:02 INFO: Starting Backup of VM 101 (qemu) 
101: Jun 26 23:01:02 INFO: status = running 
101: Jun 26 23:02:42 INFO: unable to open file '/etc/pve/nodes/proxmox2/qemu-server/101.conf.tmp.459261' - Device or resource busy 
101: Jun 26 23:02:42 INFO: update VM 101: -lock backup 
101: Jun 26 23:02:42 ERROR: Backup of VM 101 failed - command 'qm set 101 --lock backup' failed: exit code 2

105: Jun 26 23:02:42 INFO: Starting Backup of VM 105 (qemu)
105: Jun 26 23:02:42 INFO: status = running
105: Jun 26 23:03:12 INFO: update VM 105: -lock backup 
105: Jun 26 23:03:12 INFO: unable to open file '/etc/pve/nodes/proxmox2/qemu-server/105.conf.tmp.459348' - Device or resource busy 
105: Jun 26 23:03:12 ERROR: Backup of VM 105 failed - command 'qm set 105 --lock backup' failed: exit code 2

111: Jun 26 23:03:12 INFO: Starting Backup of VM 111 (qemu)
111: Jun 26 23:03:12 INFO: status = running 
111: Jun 26 23:04:22 INFO: VM is locked (backup) 
111: Jun 26 23:04:22 ERROR: Backup of VM 111 failed - command 'qm set 111 --lock backup' failed: exit code 110
After that, access to the web interface was not possible. SSHing into the servers and running /etc/init.d/pve-cluster restart and /etc/init.d/pve-manager restart fixed the login problem.
But inside the web interface I can't see the other server's status.

I ran /etc/init.d/cman restart but it failed with errors:
Code:
Stopping cluster:
   Stopping dlm_controld...
[FAILED]

How can I restart the cman service, which seems to be the key to my problem?
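
For reference, here is a rough sketch of the stop/start order that is often suggested for this cman-based stack, assuming rgmanager and clvmd are also installed (skip the lines for services you don't run; this is illustrative, not an official procedure):
Code:
# see what is still using the DLM before touching cman
fence_tool ls        # fence domain members
dlm_tool ls          # active DLM lockspaces (clvmd, rgmanager, ...)

# stop DLM consumers first, then cman; start again in reverse order
/etc/init.d/rgmanager stop
/etc/init.d/clvm stop
/etc/init.d/cman stop

/etc/init.d/cman start
/etc/init.d/clvm start
/etc/init.d/rgmanager start

# then restart the cluster filesystem so pmxcfs rejoins corosync
/etc/init.d/pve-cluster restart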

On proxmox1 the versions are:
root@proxmox1:/etc/pve/qemu-server# pveversion -v
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-19-pve
proxmox-ve-2.6.32: 2.3-96
pve-kernel-2.6.32-13-pve: 2.6.32-72
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-14-pve: 2.6.32-74
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-20
pve-firmware: 1.0-21
libpve-common-perl: 1.0-49
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-7
vncterm: 1.0-4
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-10
ksm-control-daemon: 1.1-1


and on proxmox2:
root@proxmox2:/etc/pve/qemu-server# pveversion -v
proxmox-ve-2.6.32: 3.2-126 (running kernel: 2.6.32-29-pve)
pve-manager: 3.2-4 (running version: 3.2-4/e24a91c1)
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-29-pve: 2.6.32-126
pve-kernel-2.6.32-14-pve: 2.6.32-74
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-16
pve-firmware: 1.1-3
libpve-common-perl: 3.0-18
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve5
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1

(yeah yeah, I know the versions are different, but they worked without problems for a long time)

Thanks for any hints
 
Hi,
you lost quorum and therefore /etc/pve isn't writable.

Look with
Code:
pvecm status
pvecm nodes
on both servers.

Udo
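
A quick way to check both of those things at once (quorum state, and whether /etc/pve is actually writable) is something like this; the test filename is arbitrary:
Code:
pvecm status | grep -i quorum
touch /etc/pve/writetest && rm /etc/pve/writetest && echo "/etc/pve is writable"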
 
I think I lost the cluster. Isn't quorum cluster dependent?

Code:
root@proxmox1:/etc/pve/qemu-server# pvecm status
Version: 6.2.0
Config Version: 2
Cluster Name: ledrocluster
Cluster Id: 39348
Cluster Member: Yes
Cluster Generation: 196
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 6
Flags:
Ports Bound: 0
Node name: proxmox1
Node ID: 1
Multicast addresses: 239.192.153.78
Node addresses: 192.168.10.1
root@proxmox1:/etc/pve/qemu-server# pvecm nodes
Node Sts Inc Joined Name
1 M 164 2014-02-06 09:02:36 proxmox1
2 M 192 2014-05-22 16:14:40 proxmox2
root@proxmox2:/mnt/pve# pvecm status
Version: 6.2.0
Config Version: 2
Cluster Name: ledrocluster
Cluster Id: 39348
Cluster Member: Yes
Cluster Generation: 196
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 6
Flags:
Ports Bound: 0
Node name: proxmox2
Node ID: 2
Multicast addresses: 239.192.153.78
Node addresses: 192.168.10.2
root@proxmox2:/mnt/pve# pvecm nodes
Node Sts Inc Joined Name
1 M 192 2014-05-22 16:14:40 proxmox1
2 M 188 2014-05-22 16:14:40 proxmox2



Thanks
 
Hi,
that doesn't look bad - you have quorum and both nodes see each other as cluster members.
Try an
Code:
/etc/init.d/pve-cluster restart
on both nodes (one after another).

Udo
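
If the GUI still shows unknown status after that, it can also help to restart the PVE daemons that read from /etc/pve once pmxcfs is back. A sketch, assuming the standard init scripts of PVE 2.x/3.x (on 3.x there is additionally pveproxy):
Code:
/etc/init.d/pve-cluster restart
/etc/init.d/pvedaemon restart
/etc/init.d/pvestatd restart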
 
Done. First on px1 and then on px2.

Still the same error.

Code:
Jun 27 13:20:41 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:41 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:41 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:41 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:41 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:42 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 330
Jun 27 13:20:43 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 340
Jun 27 13:20:44 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 350
Jun 27 13:20:45 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 360
Jun 27 13:20:46 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 370
Jun 27 13:20:47 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 380
Jun 27 13:20:47 proxmox2 dlm_controld[4427]: daemon cpg_leave error retrying
Jun 27 13:20:48 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 390
Jun 27 13:20:49 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 400
Jun 27 13:20:50 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 410
Jun 27 13:20:51 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 420
Jun 27 13:20:51 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:51 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:51 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
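
When pmxcfs loops on cpg_join like this, the corosync/cman membership layer underneath is usually not healthy on that node, even though pvecm status looked fine earlier. A few read-only checks that can help narrow it down (they don't change any state):
Code:
cman_tool status        # quorum and membership as cman sees it
cman_tool nodes         # per-node membership state
corosync-cfgtool -s     # corosync ring status on this node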
 
The proper thing in this case would be to restart both nodes. If that is not possible, then run the following command on the first node:
# pvecm expected 1

Then try to log in to the web interface on that node and see if you can see the status of all VMs. If this works, it may mean your cluster lost quorum and for some reason was not able to reestablish it. You will still have to restart the nodes, but at least you will be able to see your VMs. The above command changes the expected votes so the node is not waiting for the other node.

Looks like you have a 2-node setup. Do you have a quorum disk set up?
 
So I decided to stop all VMs, upgrade node1, and restart both servers.

Now it seems to work.

Thanks for the help.

P.S. The above command couldn't work (I guess) because /etc/pve was unmounted, and every command that involves writing to it resulted in a write error message.
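
For what it's worth, whether pmxcfs is really mounted on /etc/pve can be checked directly, which makes the unmounted / "Device or resource busy" situation easier to spot:
Code:
grep /etc/pve /proc/mounts   # should show something like: /dev/fuse /etc/pve fuse rw,...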
 
Hi,

as symmcom said, you have a 2-node cluster. Do you have a shared disk to provide quorum?

If not, I think this problem will happen again soon.

(You should have a 3-node cluster at minimum, or a 2-node cluster + a shared quorum disk.)
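
For a two-node cman cluster without a quorum disk, the usual workaround is cman's two_node mode, which lets the surviving node keep quorum (it relies on working fencing). This is only a sketch; the attribute goes into cluster.conf, the config_version has to be bumped, and the file propagated the way you normally manage it:
Code:
# relevant cluster.conf fragment (hypothetical values):
#   <cman two_node="1" expected_votes="1"/>
# validate the edited configuration before activating it (if the tool is available):
ccs_config_validate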
 
