Hello,
I just finished building a new 12-node cluster and ran into a big problem where the Proxmox cluster fails!
Some details…
The cluster consists of 12 servers, each with 128 GB of RAM, 16 cores, 10 Gbit networking, 2 drives in RAID 1 for the OS, and 8 x 4 TB drives for Ceph.
On each server I installed PVE 3.3 on the RAID 1 volume, created the cluster, and ran updates. All good at this point; everything was working nicely.
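For reference, the PM cluster itself was set up the standard way with pvecm, roughly like this (the cluster name and IP here are placeholders):
# pvecm create cluster1 (on the first node only)
# pvecm add 192.168.1.10 (on each of the other 11 nodes, pointing at the first node's IP)
# pvecm status (confirm all 12 nodes are members and the cluster is quorate)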
I then created a VM on the first PM node's local RAID 1 volume to act as the Ceph admin/deploy host. I like doing it this way so I can upgrade Ceph via Ceph's native tools (more details to follow).
I then installed Ubuntu 14.04 on this VM and applied all updates. On this node I followed Ceph's instructions for installing ceph-deploy. I deployed Firefly with the expectation of upgrading to Giant later.
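If I remember right, getting ceph-deploy onto the admin VM was just the steps from Ceph's quick-start docs, approximately this (the repo line is from memory, for trusty/firefly, so double-check it):
# wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | apt-key add -
# echo deb http://ceph.com/debian-firefly/ trusty main > /etc/apt/sources.list.d/ceph.list
# apt-get update && apt-get install ceph-deploy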
I then ran the deployment against all 12 PM nodes, using the 8 x 4 TB drives on each node as OSDs (~360 TB total). All went well and testing looked good.
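The deployment was essentially the standard ceph-deploy sequence, something like the following (pm01..pm12 and sdc are placeholder names, and I've abbreviated the node and disk lists):
# ceph-deploy new pm01 pm02 pm03 (initial monitor hosts)
# ceph-deploy install --release firefly pm01 pm02 ... pm12
# ceph-deploy mon create-initial
# ceph-deploy disk zap pm01:sdc (repeated for each of the 8 data drives on every node)
# ceph-deploy osd create pm01:sdc (likewise for every drive)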
So that PM could see the Ceph cluster in its web UI, I copied /etc/ceph/ceph.conf to /etc/pve/. I also copied the admin key to /etc/pve/priv/, along with a proper keyring for the rbd pool.
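Concretely, that was something along these lines (the storage ID "rbd" is just what I called it in storage.cfg; the keyring file name has to match the storage ID):
# cp /etc/ceph/ceph.conf /etc/pve/ceph.conf
# mkdir -p /etc/pve/priv/ceph
# cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/rbd.keyring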
PM was then able to see the cluster, and I could use it to view and create pools. Again, all working great!
I then changed the APT sources entry for Ceph from firefly to giant and ran apt-get update && apt-get dist-upgrade. Once upgraded, I restarted all Ceph services on all nodes, one node at a time.
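Concretely, on each node in turn (my Ceph repo entry lives in /etc/apt/sources.list.d/ceph.list; monitors first, then OSDs, as the upgrade notes recommend):
# sed -i 's/debian-firefly/debian-giant/' /etc/apt/sources.list.d/ceph.list
# apt-get update && apt-get dist-upgrade
# service ceph restart mon (then check ceph health before moving on)
# service ceph restart osd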
Still good, and I saw a nearly 100% increase in write throughput with the new RBD caching in Giant. At this point all testing looked great: live migration worked, sequential writes to disk from inside the VMs ran at about 700 MB/s, and after raising the block-device read-ahead to something like 8192 I got about 900 MB/s reads! So far, I was happy!
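For what it's worth, the read-ahead change was just blockdev inside the guest, along these lines (/dev/vda is a placeholder for whatever the VM's disk device is; the value is in 512-byte sectors, so 8192 works out to 4 MB):
# blockdev --setra 8192 /dev/vda
# blockdev --getra /dev/vda (verify the new setting)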
Now the failure: as part of some stress testing I rebooted the PM nodes, and after the reboot I was unable to reach the PM management console.
I saw that some of the PM services were failing, so I tried to restart them with:
# service cman stop; service pve-cluster restart; service cman start
Here is the output:
Stopping cluster:
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown:[ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Restarting pve cluster filesystem: pve-cluster.
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... /usr/share/cluster/cluster.rng:1002: element ref: Relax-NG parser error : Reference PVEVM has no matching definition
/usr/share/cluster/cluster.rng:1002: element ref: Relax-NG parser error : Internal found no define for ref PVEVM
Relax-NG schema /usr/share/cluster/cluster.rng failed to compile
[ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Unfencing self... [ OK ]
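After that restart attempt, this is roughly what I poked at to check cluster state (nothing exotic):
# pvecm status (quorum and membership)
# mount | grep pve (check that pmxcfs still has /etc/pve mounted)
# pveversion -v (package versions, in case a library mismatch is the culprit)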
Please note that I have another cluster similar to this one, still on Ceph Firefly, and it has been running fine. So my concern is that the upgrade to Giant is what's causing the PM cluster to fail here, maybe due to library updates. Now I'm very worried about upgrading my other cluster, since that could bring down all of my running VMs!
Any thoughts on what might be the issue would be appreciated!
-Glen