Cluster down: cman restart doesn't work

Kaya

Really strange problem.
I have an Intel modular server with Proxmox on two blades.
Last night the backup failed with errors:
Code:
Detailed backup logs:

vzdump 104 111 105 101 --quiet 1 --mailto backup@comune.ledro.tn.it --mode snapshot --compress lzo --storage QNAP
101: Jun 26 23:01:02 INFO: Starting Backup of VM 101 (qemu) 
101: Jun 26 23:01:02 INFO: status = running 
101: Jun 26 23:02:42 INFO: unable to open file '/etc/pve/nodes/proxmox2/qemu-server/101.conf.tmp.459261' - Device or resource busy 
101: Jun 26 23:02:42 INFO: update VM 101: -lock backup 
101: Jun 26 23:02:42 ERROR: Backup of VM 101 failed - command 'qm set 101 --lock backup' failed: exit code 2

105: Jun 26 23:02:42 INFO: Starting Backup of VM 105 (qemu)
105: Jun 26 23:02:42 INFO: status = running
105: Jun 26 23:03:12 INFO: update VM 105: -lock backup 
105: Jun 26 23:03:12 INFO: unable to open file '/etc/pve/nodes/proxmox2/qemu-server/105.conf.tmp.459348' - Device or resource busy 
105: Jun 26 23:03:12 ERROR: Backup of VM 105 failed - command 'qm set 105 --lock backup' failed: exit code 2

111: Jun 26 23:03:12 INFO: Starting Backup of VM 111 (qemu)
111: Jun 26 23:03:12 INFO: status = running 
111: Jun 26 23:04:22 INFO: VM is locked (backup) 
111: Jun 26 23:04:22 ERROR: Backup of VM 111 failed - command 'qm set 111 --lock backup' failed: exit code 110
After that, access to the web interface was not possible. SSHing into the servers and running /etc/init.d/pve-cluster restart and /etc/init.d/pve-manager restart fixed the login problem.
But inside the web interface I can't see the other server's status.

I ran /etc/init.d/cman restart but it failed with errors:
Code:
Stopping cluster:
   Stopping dlm_controld...
[FAILED]

How can I restart the cman service, which seems to be the key to my problem?
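
For reference, here is a rough sketch of the stop/start order that is often suggested for this cman-based stack, assuming rgmanager and clvmd are also installed (skip the lines for services you don't run; this is illustrative, not an official procedure):
Code:
# see what is still using the DLM before touching cman
fence_tool ls        # fence domain members
dlm_tool ls          # active DLM lockspaces (clvmd, rgmanager, ...)

# stop DLM consumers first, then cman; start again in reverse order
/etc/init.d/rgmanager stop
/etc/init.d/clvm stop
/etc/init.d/cman stop

/etc/init.d/cman start
/etc/init.d/clvm start
/etc/init.d/rgmanager start

# then restart the cluster filesystem so pmxcfs rejoins corosync
/etc/init.d/pve-cluster restart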

On proxmox1 the versions are:
root@proxmox1:/etc/pve/qemu-server# pveversion -v
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-19-pve
proxmox-ve-2.6.32: 2.3-96
pve-kernel-2.6.32-13-pve: 2.6.32-72
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-14-pve: 2.6.32-74
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-20
pve-firmware: 1.0-21
libpve-common-perl: 1.0-49
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-7
vncterm: 1.0-4
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-10
ksm-control-daemon: 1.1-1


and on proxmox2:
root@proxmox2:/etc/pve/qemu-server# pveversion -v
proxmox-ve-2.6.32: 3.2-126 (running kernel: 2.6.32-29-pve)
pve-manager: 3.2-4 (running version: 3.2-4/e24a91c1)
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-29-pve: 2.6.32-126
pve-kernel-2.6.32-14-pve: 2.6.32-74
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-16
pve-firmware: 1.1-3
libpve-common-perl: 3.0-18
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve5
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1

(yeah yeah, I know the versions are different, but they worked without problems for a long time)

Thanks for any hints
 
Hi,
you lost quorum and therefore /etc/pve isn't writable.

Look with
Code:
pvecm status
pvecm nodes
on both servers.

Udo
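
A quick way to check both of those things at once (quorum state, and whether /etc/pve is actually writable) is something like this; the test filename is arbitrary:
Code:
pvecm status | grep -i quorum
touch /etc/pve/writetest && rm /etc/pve/writetest && echo "/etc/pve is writable"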
 
I think I lost the cluster. Isn't quorum cluster dependent?

Code:
root@proxmox1:/etc/pve/qemu-server# pvecm status
Version: 6.2.0
Config Version: 2
Cluster Name: ledrocluster
Cluster Id: 39348
Cluster Member: Yes
Cluster Generation: 196
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 6
Flags:
Ports Bound: 0
Node name: proxmox1
Node ID: 1
Multicast addresses: 239.192.153.78
Node addresses: 192.168.10.1
root@proxmox1:/etc/pve/qemu-server# pvecm nodes
Node Sts Inc Joined Name
1 M 164 2014-02-06 09:02:36 proxmox1
2 M 192 2014-05-22 16:14:40 proxmox2
root@proxmox2:/mnt/pve# pvecm status
Version: 6.2.0
Config Version: 2
Cluster Name: ledrocluster
Cluster Id: 39348
Cluster Member: Yes
Cluster Generation: 196
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 6
Flags:
Ports Bound: 0
Node name: proxmox2
Node ID: 2
Multicast addresses: 239.192.153.78
Node addresses: 192.168.10.2
root@proxmox2:/mnt/pve# pvecm nodes
Node Sts Inc Joined Name
1 M 192 2014-05-22 16:14:40 proxmox1
2 M 188 2014-05-22 16:14:40 proxmox2



Thanks
 
Hi,
that doesn't look bad - you have quorum and both nodes see each other as cluster members.
Try an
Code:
/etc/init.d/pve-cluster restart
on both nodes (one after another).

Udo
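
If the GUI still shows unknown status after that, it can also help to restart the PVE daemons that read from /etc/pve once pmxcfs is back. A sketch, assuming the standard init scripts of PVE 2.x/3.x (on 3.x there is additionally pveproxy):
Code:
/etc/init.d/pve-cluster restart
/etc/init.d/pvedaemon restart
/etc/init.d/pvestatd restart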
 
Done. First on px1 and then on px2.

Still the same error.

Code:
Jun 27 13:20:41 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:41 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:41 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:41 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:41 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:42 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 330
Jun 27 13:20:43 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 340
Jun 27 13:20:44 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 350
Jun 27 13:20:45 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 360
Jun 27 13:20:46 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 370
Jun 27 13:20:47 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 380
Jun 27 13:20:47 proxmox2 dlm_controld[4427]: daemon cpg_leave error retrying
Jun 27 13:20:48 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 390
Jun 27 13:20:49 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 400
Jun 27 13:20:50 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 410
Jun 27 13:20:51 proxmox2 pmxcfs[512989]: [dcdb] notice: cpg_join retry 420
Jun 27 13:20:51 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:51 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
Jun 27 13:20:51 proxmox2 pmxcfs[512989]: [status] crit: cpg_send_message failed: 9
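
When pmxcfs loops on cpg_join like this, the corosync/cman membership layer underneath is usually not healthy on that node, even though pvecm status looked fine earlier. A few read-only checks that can help narrow it down (they don't change any state):
Code:
cman_tool status        # quorum and membership as cman sees it
cman_tool nodes         # per-node membership state
corosync-cfgtool -s     # corosync ring status on this node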
 
The proper thing in this case would be to restart both nodes. If that is not possible, then run the following command on the first node:
# pvecm expected 1

Then try to log in to the web interface on that node and see if you can see the status of all VMs. If this works, it may mean your cluster lost quorum and for some reason was not able to reestablish it. You will still have to restart the nodes, but at least you will be able to see your VMs. The above command changes the expected votes so the node is not waiting for the other node.

Looks like you have a 2-node setup. Do you have a quorum disk set up?
 
So I decided to stop all VMs, upgrade node1, and restart both servers.

Now it seems to work.

Thanks for the help.

P.S. The above command couldn't work (I guess) because /etc/pve was unmounted, and every command that involves writing to it resulted in a write error message.
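
For what it's worth, whether pmxcfs is really mounted on /etc/pve can be checked directly, which makes the unmounted / "Device or resource busy" situation easier to spot:
Code:
grep /etc/pve /proc/mounts   # should show something like: /dev/fuse /etc/pve fuse rw,...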
 
Hi,

as symmcom said, you have a 2-node cluster. Do you have a shared disk to provide quorum?

If not, I think this problem will happen again soon.

(You should have a 3-node cluster at minimum, or a 2-node cluster + a shared quorum disk.)
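
For a two-node cman cluster without a quorum disk, the usual workaround is cman's two_node mode, which lets the surviving node keep quorum (it relies on working fencing). This is only a sketch; the attribute goes into cluster.conf, the config_version has to be bumped, and the file propagated the way you normally manage it:
Code:
# relevant cluster.conf fragment (hypothetical values):
#   <cman two_node="1" expected_votes="1"/>
# validate the edited configuration before activating it (if the tool is available):
ccs_config_validate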
 
