/etc/pve not writeable

orangelemon

Hi,

I have a Proxmox 3.2 cluster with a dozen nodes. When I run backups of a couple of CTs on the weekend, the cluster falls apart. I tried to remove the backup cron, but /etc/pve is not writable.

Also, creating new containers is no longer possible.

Any suggestions on how to debug this?

Thanks in advance!
 
Hi,
that's easy: you lost quorum (meaning fewer than nodes/2+1 cluster members are healthy), so /etc/pve is not writable.
Look with
Code:
pvecm status
pvecm nodes
on the different nodes.

If networking is OK, you can restart cman with "/etc/init.d/cman restart".
If you have quorum, perhaps you need an "/etc/init.d/pve-cluster restart" to be able to write to /etc/pve again.
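
For reference, a rough per-node sequence would look something like this (just a sketch; only do the restarts once the network side looks healthy):
Code:
# check membership and quorum as this node sees them
pvecm status
pvecm nodes

# if the network is fine but membership is stuck, restart the cluster stack
/etc/init.d/cman restart

# once quorum is back, restart pmxcfs so /etc/pve becomes writable again
/etc/init.d/pve-cluster restart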

Udo
 
pvecm nodes shows all nodes online. But on one machine it says that only that machine is online.

Node Sts Inc Joined Name
1 M 20356 2015-01-21 08:36:51 srv2
2 M 20356 2015-01-21 08:36:51 srv23
3 M 20356 2015-01-21 08:36:51 srv8
4 M 20356 2015-01-21 08:36:51 srv7
5 M 20356 2015-01-21 08:36:51 srv32
6 M 20356 2015-01-21 08:36:51 srv33
7 M 20356 2015-01-21 08:36:51 srv5
8 M 20352 2015-01-21 08:36:51 fs01
9 M 20368 2015-01-21 08:43:27 srv34
10 M 20356 2015-01-21 08:36:51 srv1
11 M 20356 2015-01-21 08:36:51 srv36
12 M 20356 2015-01-21 08:36:51 srv57
13 M 20356 2015-01-21 08:36:51 srv60

Vs

Node Sts Inc Joined Name
1 X 0 srv2
2 X 0 srv23
3 X 0 srv8
4 X 0 srv7
5 X 0 srv32
6 X 0 srv33
7 X 0 srv5
8 X 0 fs01
9 X 0 srv34
10 M 20428 2015-01-21 12:44:56 srv1
11 X 0 srv36
12 X 0 srv57
13 X 0 srv60

So I tried restarting cman and pve-cluster, but cman won't stop... probably because it can't get a disconnect from srv1?

I've tried restarting cman on all servers at the same time, but since they can't stop, it doesn't work. There is no process called cman that I can kill to force it to stop. If I just start it without stopping it first, fencing and dlm get confused...
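
As far as I can tell, cman is just the init script around corosync, fenced and dlm_controld, so the underlying daemons and membership state can be checked with something like:
Code:
# cman has no process of its own; list the daemons it manages
ps aux | grep -E '[c]orosync|[f]enced|[d]lm_controld'

# membership and quorum state as cman sees it
cman_tool status
cman_tool nodes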

Any ideas?
 
I finally got it working... then an hour later one server's load went up and everything broke AGAIN! Getting quite annoyed with this.

The most stupid part is that on most of the servers clustat says that all servers are online, which is not the case; there is not one server in the cluster able to restart cman.

This makes no sense at all!

Any ideas anyone? Proxmox crew??
 
proxmox-ve-2.6.32: 3.2-126 (running kernel: 2.6.32-29-pve)
pve-manager: 3.2-4 (running version: 3.2-4/e24a91c1)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-29-pve: 2.6.32-126
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-15
qemu-server: 3.1-16
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-25
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1
 
On srv2, pvecm nodes:

Node Sts Inc Joined Name
1 M 20284 2015-01-21 08:35:42 srv2
2 M 20300 2015-01-21 08:35:42 srv23
3 M 20348 2015-01-21 08:36:50 srv8
4 M 20340 2015-01-21 08:36:38 srv7
5 M 20296 2015-01-21 08:35:42 srv32
6 M 20308 2015-01-21 08:35:45 srv33
7 M 20316 2015-01-21 08:35:57 srv5
8 M 20356 2015-01-21 08:36:56 fs01
9 M 20368 2015-01-21 08:43:33 srv34
10 M 20296 2015-01-21 08:35:42 srv1
11 M 20304 2015-01-21 08:35:43 srv36
12 M 20324 2015-01-21 08:36:09 srv57
13 M 20332 2015-01-21 08:36:21 srv60

While on srv1, pvecm nodes:

Node Sts Inc Joined Name
1 X 0 srv2
2 X 0 srv23
3 X 0 srv8
4 X 0 srv7
5 X 0 srv32
6 X 0 srv33
7 X 0 srv5
8 M 20836 2015-01-22 19:29:21 fs01
9 X 0 srv34
10 M 20832 2015-01-22 19:29:21 srv1
11 M 20848 2015-01-22 19:32:45 srv36
12 M 20836 2015-01-22 19:29:21 srv57
13 M 20836 2015-01-22 19:29:21 srv60

On srv2 cman won't restart.

service cman restart
Stopping cluster:
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... Timed-out waiting for cluster
[FAILED]
 
Hi Dietmar,

There are a lot of messages; I assume the important one is "crit: cpg_send_message failed", but how can I fix this?

A partial dump:

Jan 23 10:31:43 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:43 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:43 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:44 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669180
Jan 23 10:31:44 srv5 kernel: sr 5:0:0:0: [sr0] Unhandled error code
Jan 23 10:31:44 srv5 kernel: sr 5:0:0:0: [sr0] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 23 10:31:44 srv5 kernel: sr 5:0:0:0: [sr0] Sense Key : Medium Error [deferred]
Jan 23 10:31:44 srv5 kernel: sr 5:0:0:0: [sr0] Add. Sense: L-EC uncorrectable error
Jan 23 10:31:44 srv5 kernel: sr 5:0:0:0: [sr0] CDB: Read(10): 28 00 00 00 00 00 00 00 02 00
Jan 23 10:31:44 srv5 kernel: Buffer I/O error on device sr0, logical block 0
Jan 23 10:31:45 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669190
Jan 23 10:31:46 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669200
Jan 23 10:31:47 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669210
Jan 23 10:31:48 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669220
Jan 23 10:31:49 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669230
Jan 23 10:31:50 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669240
Jan 23 10:31:51 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669250
Jan 23 10:31:52 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669260
Jan 23 10:31:52 srv5 kernel: sr 5:0:0:0: [sr0] Unhandled error code
Jan 23 10:31:52 srv5 kernel: sr 5:0:0:0: [sr0] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 23 10:31:52 srv5 kernel: sr 5:0:0:0: [sr0] Sense Key : Medium Error [deferred]
Jan 23 10:31:52 srv5 kernel: sr 5:0:0:0: [sr0] Add. Sense: L-EC uncorrectable error
Jan 23 10:31:52 srv5 kernel: sr 5:0:0:0: [sr0] CDB: Read(10): 28 00 00 00 00 00 00 00 02 00
Jan 23 10:31:52 srv5 kernel: Buffer I/O error on device sr0, logical block 0
Jan 23 10:31:53 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669270
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:53 srv5 pmxcfs[558599]: [status] crit: cpg_send_message failed: 9
Jan 23 10:31:54 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669280
Jan 23 10:31:55 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669290
Jan 23 10:31:56 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669300
Jan 23 10:31:57 srv5 pmxcfs[558599]: [dcdb] notice: cpg_join retry 1669310
 
Hi Dietmar,

If you mean cman, it's not running I assume:

root@srv5:~# service cman restart
Stopping cluster:
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... Timed-out waiting for cluster
[FAILED]

I've tried starting cman without quitting it first, but then dlm and fencing seem to hang, since I can't restart it afterwards.
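
For reference, whether it is the fence domain or dlm that is stuck can be checked with something like this (fence_tool and dlm_tool ship with the redhat-cluster packages on PVE 3.x):
Code:
# current fence domain membership / wait state
fence_tool ls

# dlm lockspaces that are still open
dlm_tool ls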
 
Seems there is something wrong with multicast on the network. Have you already tested multicast with omping?
 
That's highly unlikely, because if that were the case the cluster could never have worked. I verified it, and omping works:

xxx.xxx.xxx.108 : unicast, seq=1, size=69 bytes, dist=0, time=0.123ms
xxx.xxx.xxx.108 : multicast, seq=1, size=69 bytes, dist=0, time=0.138ms
xxx.xxx.xxx.108 : unicast, seq=2, size=69 bytes, dist=0, time=0.150ms
xxx.xxx.xxx.108 : multicast, seq=2, size=69 bytes, dist=0, time=0.163ms
xxx.xxx.xxx.108 : unicast, seq=3, size=69 bytes, dist=0, time=0.198ms
xxx.xxx.xxx.108 : multicast, seq=3, size=69 bytes, dist=0, time=0.209ms
xxx.xxx.xxx.108 : unicast, seq=4, size=69 bytes, dist=0, time=0.241ms
xxx.xxx.xxx.108 : multicast, seq=4, size=69 bytes, dist=0, time=0.246ms
xxx.xxx.xxx.108 : unicast, seq=5, size=69 bytes, dist=0, time=0.124ms
xxx.xxx.xxx.108 : multicast, seq=5, size=69 bytes, dist=0, time=0.131ms
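
For reference, output like that comes from running omping concurrently on the nodes involved, roughly like this (flags and hostnames are only an example):
Code:
# run the same command on every node at the same time;
# each node then reports unicast and multicast replies from the others
omping -c 10 -i 1 srv1 srv2 srv5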
 
Do you run a firewall, or maybe you set an MTU size? I have seen such strange effects when packets get fragmented/dropped by a wrong MTU setting.
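
A quick way to rule that out is to send non-fragmentable packets just under the expected MTU between two nodes, for example (sizes assume a standard 1500-byte MTU, hostname is a placeholder):
Code:
# 1472 bytes of ICMP payload + 28 bytes of headers = 1500; -M do forbids fragmentation
ping -M do -s 1472 -c 3 srv1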
 
No, nothing special. All traffic runs over one switch within one VLAN. All quite basic.

What keeps cman from restarting? Because I think if it just loses its current sync completely, it can rejoin.
 
Well, I finally got a lot of the servers joined, but two aren't able to join. They have decided to join together instead of joining the rest.

So I've got a set of 2 and a set of 9 servers in the same cluster, in the same IP space, on the same switch... odd to say the least.

Syslog shows the following, so it would appear to be corosync related:

Jan 26 19:51:38 srv5 pmxcfs[24168]: [status] crit: cpg_send_message failed: 9
Jan 26 19:51:38 srv5 pmxcfs[24168]: [status] crit: cpg_send_message failed: 9
Jan 26 19:51:38 srv5 pmxcfs[24168]: [status] crit: cpg_send_message failed: 9

But corosync is running:

root@srv5:~# ps aux | grep coro
root 24719 0.1 0.1 182564 41368 ? S<Lsl 19:49 0:00 corosync -f
root 29327 0.0 0.0 7764 896 pts/4 S+ 19:55 0:00 grep coro

One of the two servers has issues gaining quorum; the other one gets it. I've rebooted this server, but it didn't make any difference.

root@srv5:~# service cman restart
Stopping cluster:
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown:[ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... Timed-out waiting for cluster
[FAILED]

All servers have the same Cluster ID, but different Cluster Generation IDs.
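
A quick way to compare that across nodes is something like:
Code:
# run on every node; members of the same partition should share the same generation
cman_tool status | grep -E 'Cluster Id|Cluster Generation'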

Any ideas?
 
Is it just me, or is a cluster of more than 8 servers a complete fantasy? Every time it somewhat works, it dies for no reason.
 
