/etc/pve not writeable

orangelemon

Hi,

I have a Proxmox 3.2 cluster with a dozen nodes. When I run backups of a couple of CTs on the weekend, the cluster falls apart. I tried to remove the backup cron, but /etc/pve is not writable.

Also, creating new containers is no longer possible.

Any suggestions on how to debug this?

Thanks in advance!
 
Hi,
that's easy: you lost quorum (meaning fewer than nodes/2+1 cluster members are healthy), so /etc/pve is not writable.
Look with
Code:
pvecm status
pvecm nodes
on the different nodes.

If networking is OK, you can restart cman with "/etc/init.d/cman restart".
If you have quorum, perhaps you need an "/etc/init.d/pve-cluster restart" to be able to write to /etc/pve again.
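
For reference, a rough per-node sequence would look something like this (just a sketch; only do the restarts once the network side looks healthy):
Code:
# check membership and quorum as this node sees them
pvecm status
pvecm nodes

# if the network is fine but membership is stuck, restart the cluster stack
/etc/init.d/cman restart

# once quorum is back, restart pmxcfs so /etc/pve becomes writable again
/etc/init.d/pve-cluster restart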

Udo
 
pvecm nodes shows all nodes online. But on one machine it says that only that machine is online.

Node Sts Inc Joined Name
1 M 20356 2015-01-21 08:36:51 srv2
2 M 20356 2015-01-21 08:36:51 srv23
3 M 20356 2015-01-21 08:36:51 srv8
4 M 20356 2015-01-21 08:36:51 srv7
5 M 20356 2015-01-21 08:36:51 srv32
6 M 20356 2015-01-21 08:36:51 srv33
7 M 20356 2015-01-21 08:36:51 srv5
8 M 20352 2015-01-21 08:36:51 fs01
9 M 20368 2015-01-21 08:43:27 srv34
10 M 20356 2015-01-21 08:36:51 srv1
11 M 20356 2015-01-21 08:36:51 srv36
12 M 20356 2015-01-21 08:36:51 srv57
13 M 20356 2015-01-21 08:36:51 srv60

Vs

Node Sts Inc Joined Name
1 X 0 srv2
2 X 0 srv23
3 X 0 srv8
4 X 0 srv7
5 X 0 srv32
6 X 0 srv33
7 X 0 srv5
8 X 0 fs01
9 X 0 srv34
10 M 20428 2015-01-21 12:44:56 srv1
11 X 0 srv36
12 X 0 srv57
13 X 0 srv60

So I tried restarting cman and pve-cluster, but cman won't stop... probably because it can't get a disconnect from srv1?

I've tried restarting cman on all servers at the same time, but since they can't stop, it doesn't work. There is no process called cman that I can kill to force it to stop. If I just start it without stopping it first, fencing and dlm get confused...
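
As far as I can tell, cman is just the init script around corosync, fenced and dlm_controld, so the underlying daemons and membership state can be checked with something like:
Code:
# cman has no process of its own; list the daemons it manages
ps aux | grep -E '[c]orosync|[f]enced|[d]lm_controld'

# membership and quorum state as cman sees it
cman_tool status
cman_tool nodes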

Any ideas?
 
I finally got it working... then an hour later one server's load went up and everything broke AGAIN! Getting quite annoyed with this.

The most stupid part is that on most of the servers clustat says that all servers are online, which is not the case; there is not one server in the cluster able to restart cman.

This makes no sense at all!

Any ideas anyone? Proxmox crew??
 
proxmox-ve-2.6.32: 3.2-126 (running kernel: 2.6.32-29-pve)
pve-manager: 3.2-4 (running version: 3.2-4/e24a91c1)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-29-pve: 2.6.32-126
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-15
qemu-server: 3.1-16
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-25
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1
 
On srv2, pvecm nodes:

Node Sts Inc Joined Name
1 M 20284 2015-01-21 08:35:42 srv2
2 M 20300 2015-01-21 08:35:42 srv23
3 M 20348 2015-01-21 08:36:50 srv8
4 M 20340 2015-01-21 08:36:38 srv7
5 M 20296 2015-01-21 08:35:42 srv32
6 M 20308 2015-01-21 08:35:45 srv33
7 M 20316 2015-01-21 08:35:57 srv5
8 M 20356 2015-01-21 08:36:56 fs01
9 M 20368 2015-01-21 08:43:33 srv34
10 M 20296 2015-01-21 08:35:42 srv1
11 M 20304 2015-01-21 08:35:43 srv36
12 M 20324 2015-01-21 08:36:09 srv57
13 M 20332 2015-01-21 08:36:21 srv60

While on srv1, pvecm nodes:

Node Sts Inc Joined Name
1 X 0 srv2
2 X 0 srv23
3 X 0 srv8
4 X 0 srv7
5 X 0 srv32
6 X 0 srv33
7 X 0 srv5
8 M 20836 2015-01-22 19:29:21 fs01
9 X 0 srv34
10 M 20832 2015-01-22 19:29:21 srv1
11 M 20848 2015-01-22 19:32:45 srv36
12 M 20836 2015-01-22 19:29:21 srv57
13 M 20836 2015-01-22 19:29:21 srv60

On srv2 cman won't restart.

service cman restart
Stopping cluster:
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... Timed-out waiting for cluster
[FAILED]
 
Hi Dietmar,

There are a lot of messages; I assume the important one is "crit: cpg_send_message failed", but how can I fix this?

A partial dump:

Jan 23 10:31:43 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:43 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:43 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:44 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669180
Jan 23 10:31:44 srv5 kernel: sr 5:0:0:0: [sr0] Unhandled error code
Jan 23 10:31:44 srv5 kernel: sr 5:0:0:0: [sr0] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 23 10:31:44 srv5 kernel: sr 5:0:0:0: [sr0] Sense Key : Medium Error [deferred]
Jan 23 10:31:44 srv5 kernel: sr 5:0:0:0: [sr0] Add. Sense: L-EC uncorrectable error
Jan 23 10:31:44 srv5 kernel: sr 5:0:0:0: [sr0] CDB: Read(10): 28 00 00 00 00 00 00 00 02 00
Jan 23 10:31:44 srv5 kernel: Buffer I/O error on device sr0, logical block 0
Jan 23 10:31:45 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669190
Jan 23 10:31:46 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669200
Jan 23 10:31:47 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669210
Jan 23 10:31:48 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669220
Jan 23 10:31:49 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669230
Jan 23 10:31:50 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669240
Jan 23 10:31:51 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669250
Jan 23 10:31:52 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669260
Jan 23 10:31:52 srv5 kernel: sr 5:0:0:0: [sr0] Unhandled error code
Jan 23 10:31:52 srv5 kernel: sr 5:0:0:0: [sr0] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 23 10:31:52 srv5 kernel: sr 5:0:0:0: [sr0] Sense Key : Medium Error [deferred]
Jan 23 10:31:52 srv5 kernel: sr 5:0:0:0: [sr0] Add. Sense: L-EC uncorrectable error
Jan 23 10:31:52 srv5 kernel: sr 5:0:0:0: [sr0] CDB: Read(10): 28 00 00 00 00 00 00 00 02 00
Jan 23 10:31:52 srv5 kernel: Buffer I/O error on device sr0, logical block 0
Jan 23 10:31:53 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669270
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:54 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669280
Jan 23 10:31:55 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669290
Jan 23 10:31:56 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669300
Jan 23 10:31:57 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669310
 
Hi Dietmar,

If you mean cman, it's not running I assume:

root@srv5:~# service cman restart
Stopping cluster:
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... Timed-out waiting for cluster
[FAILED]

I've tried starting cman without quitting it first, but then dlm and fencing seem to hang, since I can't restart it afterwards.
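
For reference, whether it is the fence domain or dlm that is stuck can be checked with something like this (fence_tool and dlm_tool ship with the redhat-cluster packages on PVE 3.x):
Code:
# current fence domain membership / wait state
fence_tool ls

# dlm lockspaces that are still open
dlm_tool ls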
 
Seems there is something wrong with multicast on the network. Have you already tested multicast with omping?
 
That's highly unlikely, because if that were the case the cluster could never have worked. I verified it, and omping works:

xxx.xxx.xxx.108 : unicast, seq=1, size=69 bytes, dist=0, time=0.123ms
xxx.xxx.xxx.108 : multicast, seq=1, size=69 bytes, dist=0, time=0.138ms
xxx.xxx.xxx.108 : unicast, seq=2, size=69 bytes, dist=0, time=0.150ms
xxx.xxx.xxx.108 : multicast, seq=2, size=69 bytes, dist=0, time=0.163ms
xxx.xxx.xxx.108 : unicast, seq=3, size=69 bytes, dist=0, time=0.198ms
xxx.xxx.xxx.108 : multicast, seq=3, size=69 bytes, dist=0, time=0.209ms
xxx.xxx.xxx.108 : unicast, seq=4, size=69 bytes, dist=0, time=0.241ms
xxx.xxx.xxx.108 : multicast, seq=4, size=69 bytes, dist=0, time=0.246ms
xxx.xxx.xxx.108 : unicast, seq=5, size=69 bytes, dist=0, time=0.124ms
xxx.xxx.xxx.108 : multicast, seq=5, size=69 bytes, dist=0, time=0.131ms
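
For reference, output like that comes from running omping concurrently on the nodes involved, roughly like this (flags and hostnames are only an example):
Code:
# run the same command on every node at the same time;
# each node then reports unicast and multicast replies from the others
omping -c 10 -i 1 srv1 srv2 srv5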
 
Do you run a firewall, or maybe you set an MTU size? I have seen such strange effects when packets get fragmented/dropped by a wrong MTU setting.
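
A quick way to rule that out is to send non-fragmentable packets just under the expected MTU between two nodes, for example (sizes assume a standard 1500-byte MTU, hostname is a placeholder):
Code:
# 1472 bytes of ICMP payload + 28 bytes of headers = 1500; -M do forbids fragmentation
ping -M do -s 1472 -c 3 srv1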
 
No, nothing special. All traffic runs over one switch within one VLAN. All quite basic.

What keeps cman from restarting? Because I think if it just loses its current sync completely, it can rejoin.
 
Well, I finally got a lot of the servers joined, but two aren't able to join. They have decided to join together instead of joining the rest.

So I've got a set of 2 and a set of 9 servers in the same cluster, in the same IP space, on the same switch... odd to say the least.

Syslog shows the following, so it would appear to be corosync related:

Jan 26 19:51:38 srv5 pmxcfs[24168]: [status] crit: cpg_send_message failed: 9
Jan 26 19:51:38 srv5 pmxcfs[24168]: [status] crit: cpg_send_message failed: 9
Jan 26 19:51:38 srv5 pmxcfs[24168]: [status] crit: cpg_send_message failed: 9

But corosync is running:

root@srv5:~# ps aux | grep coro
root 24719 0.1 0.1 182564 41368 ? S<Lsl 19:49 0:00 corosync -f
root 29327 0.0 0.0 7764 896 pts/4 S+ 19:55 0:00 grep coro

One of the two servers has issues gaining quorum; the other one gets it. I've rebooted this server, but it didn't make any difference.

root@srv5:~# service cman restart
Stopping cluster:
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown:[ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... Timed-out waiting for cluster
[FAILED]

All servers have the same Cluster ID, but different Cluster Generation IDs.
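
A quick way to compare that across nodes is something like:
Code:
# run on every node; members of the same partition should share the same generation
cman_tool status | grep -E 'Cluster Id|Cluster Generation'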

Any ideas?
 
Is it just me, or is a cluster of more than 8 servers a complete fantasy? Every time it somewhat works, it dies for no reason.
 
