!! New Cluster build and crash with ceph !!

glena

Hello,

I just completed building a new cluster of 12 nodes and ran into a big problem where the Proxmox cluster fails!

Some details…

This cluster consists of 12 servers, each with 128 GB RAM, 16 cores, 10 Gbit networking, 2 drives in RAID1 for the OS, and 8 x 4 TB drives for Ceph.

On each server I installed PVE v3.3 on the RAID1 volume, created the cluster, and ran updates. All good at this point, and everything is working nicely.

I then created a VM on the first PM node's local RAID1 volume for Ceph admin and deploy. I like doing it this way so I can upgrade Ceph via Ceph's native tools (more details to follow).
I then installed Ubuntu 14.04 on this VM and applied all updates. On this node I followed Ceph's instructions for installing ceph-deploy. I used the firefly release with the expectation of upgrading to giant.
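For reference, installing ceph-deploy on the Ubuntu admin VM followed Ceph's firefly-era quick start; roughly this (the repo URLs are the ones from the old docs and may have moved since, so treat them as an example):

=============================================================
# add Ceph's release key and the firefly repo for Ubuntu 14.04 (trusty)
wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | apt-key add -
echo deb http://ceph.com/debian-firefly/ trusty main > /etc/apt/sources.list.d/ceph.list
apt-get update && apt-get install ceph-deploy
=============================================================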

I then ran the deployment to all 12 PM nodes, using the 8 x 4 TB drives as OSDs (~360 TB total). All went well and testing looked good.
So that PM can see the Ceph cluster in its web UI, I copied /etc/ceph/ceph.conf to /etc/pve/. I also copied the admin key to /etc/pve/priv/ along with a proper keyring for the rbd pool.
PM is then able to see the cluster and I can use it to view and create new pools. Again, all is working great!
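For anyone following along, the deployment and the copy step looked roughly like this (hostnames, disk letters and the keyring name are placeholders; the keyring has to match the RBD storage ID you define in PVE, and the install/osd lines are of course repeated for all 12 nodes and all 8 disks):

=============================================================
# from the ceph-deploy admin VM
ceph-deploy new pm01 pm02 pm03
ceph-deploy install --release firefly pm01 pm02 pm03
ceph-deploy mon create-initial
ceph-deploy osd create pm01:sdc pm01:sdd

# on one PVE node (/etc/pve is cluster-wide, so once is enough)
cp /etc/ceph/ceph.conf /etc/pve/ceph.conf
mkdir -p /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/mycephstorage.keyring
=============================================================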

I then change the sources file for Ceph from firefly to giant and do an apt-get update && apt-get dist-upgrade. Once upgraded, I restart all Ceph services on all nodes, one node at a time.
Still good, and I see a nearly 100% increase on writes with the new RBD caching in giant. At this point testing all looks great! Live migration works, my sequential writes to disk in VMs are about 700 MB/s, and after changing the blockdev read-ahead to something like 8192 I get about 900 MB/s reads! So far I'm happy!
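The repo switch and rolling restart looked roughly like this on each node (the ceph.list path and the sysvinit service arguments are what I'd expect on these boxes, so adjust as needed; restarting monitors before OSDs is the usual upgrade order):

=============================================================
sed -i 's/debian-firefly/debian-giant/' /etc/apt/sources.list.d/ceph.list
apt-get update && apt-get dist-upgrade

service ceph restart mon
service ceph restart osd
=============================================================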

Now the failure: I want to reboot the PM nodes as part of some stress testing, and after the reboot I'm unable to get to the PM management console.
I look and see that some of the PM services are failing, so I try to restart them via:

# service cman stop; service pve-cluster restart; service cman start

Here is the output:
Stopping cluster:
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown:[ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Restarting pve cluster filesystem: pve-cluster.
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... /usr/share/cluster/cluster.rng:1002: element ref: Relax-NG parser error : Reference PVEVM has no matching definition
/usr/share/cluster/cluster.rng:1002: element ref: Relax-NG parser error : Internal found no define for ref PVEVM
Relax-NG schema /usr/share/cluster/cluster.rng failed to compile
[ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Unfencing self... [ OK ]


Please note that I have another cluster similar to this one, but running Ceph firefly, and it has been working fine. So my concern is that the upgrade to giant might be what is causing the PM cluster to fail here, maybe due to library updates. I'm now very worried about upgrading my other cluster, since this might bring down all of my running VMs!

Any thoughts on what might be the issue would be appreciated!

-Glen
 
Looks like the pve-manager package is not correctly installed? What is the output of:

# pveversion -v
 
Looks like I found the issue: for some reason, the packages were uninstalled!

When I ran pveversion -v it said the program was not installed, so I looked into this and found a thread that said to install proxmox-ve-2.6.32. I did apt-get install proxmox-ve-2.6.32 and was then able to see the management console!
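For anyone who hits the same thing, the recovery on each node was basically the following (the 2.6.32 in the package name matches the kernel series shipped with PVE 3.3):

=============================================================
pveversion -v                       # this is what reported the missing install
apt-get update
apt-get install proxmox-ve-2.6.32
service pve-cluster restart
=============================================================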

I then did this on the remaining nodes, restarted the cluster, and all seems good.

I know I did not remove the package, so I wonder what happened...?

Anyway, thanks for the help! :)

-Glen
 
Did you uninstall your 10 Gbit NIC drivers? It happened to me when I installed Infiniband drivers supplied by the Mellanox site. When I uninstalled the drivers, they took a big chunk of the PVE packages with them.
 
I think this may have something to do with the repositories.

I setup the following repository:

# PVE pve-no-subscription repository provided by proxmox.com, NOT recommended for production use
deb http://download.proxmox.com/debian wheezy pve-no-subscription

so maybe the original install used the production (enterprise) repository, and when I commented out the production repository, the update decided to remove the current install. Not sure if this is it, but it's the only explanation I can think of.
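For context, on a stock PVE 3.x install the enterprise repo sits in its own list file, and commenting it out looks like this (file name and contents as I recall them, so double-check on your own nodes):

# /etc/apt/sources.list.d/pve-enterprise.list
# deb https://enterprise.proxmox.com/debian wheezy pve-enterprise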

Anyway, it's all working now... :)

-Glen
 
No, changing the repo will not remove anything. All default installations come with the Enterprise repo turned on; you have to manually enter the free repo, as you have already done.
Still odd that it lost the PVE packages. Hopefully it won't happen again.
 
Glen, your read performance is inside a VM? 900 MB/sec is really cool!
How did you measure that? With files big enough to bypass the cache? I get only around 100 MB/sec read and 500 MB/sec write inside...
How much did the speed change for you after setting the blockdev read-ahead to 8192?
 
To test sequential read/write performance I use the following script:

=============================================================
#!/bin/bash

cnt=1000

rm foo? >/dev/null 2>&1
sync
sleep 1

printf "Writing 10 files:\n"
for c in 0 1 2 3 4 5 6 7 8 9
do
dd if=/dev/zero bs=1M count=$cnt of=foo$c oflag=direct 2>&1 | grep copied # no fs caching
# dd if=/dev/zero bs=1M count=$cnt of=foo$c 2>&1 | grep copied # with fs caching
done

sync

printf "\nNow reading the 10 files:\n"
for c in 0 1 2 3 4 5 6 7 8 9
do
dd of=/dev/null bs=128k if=foo$c 2>&1 | grep copied
done

printf "\nDone...\n"

rm foo? >/dev/null 2>&1
=============================================================

With the blockdev read-ahead set to 8192, I get the following inside a VM:

=============================================================
# /root/perftest
Writing 10 files:
1048576000 bytes (1.0 GB) copied, 1.61296 s, 650 MB/s
1048576000 bytes (1.0 GB) copied, 1.77415 s, 591 MB/s
1048576000 bytes (1.0 GB) copied, 1.63696 s, 641 MB/s
1048576000 bytes (1.0 GB) copied, 1.74977 s, 599 MB/s
1048576000 bytes (1.0 GB) copied, 1.71328 s, 612 MB/s
1048576000 bytes (1.0 GB) copied, 1.56663 s, 669 MB/s
1048576000 bytes (1.0 GB) copied, 1.64793 s, 636 MB/s
1048576000 bytes (1.0 GB) copied, 1.70893 s, 614 MB/s
1048576000 bytes (1.0 GB) copied, 1.68809 s, 621 MB/s
1048576000 bytes (1.0 GB) copied, 1.63912 s, 640 MB/s

Now reading the 10 files:
1048576000 bytes (1.0 GB) copied, 1.62093 s, 647 MB/s
1048576000 bytes (1.0 GB) copied, 1.60234 s, 654 MB/s
1048576000 bytes (1.0 GB) copied, 1.60541 s, 653 MB/s
1048576000 bytes (1.0 GB) copied, 1.58073 s, 663 MB/s
1048576000 bytes (1.0 GB) copied, 1.58674 s, 661 MB/s
1048576000 bytes (1.0 GB) copied, 1.57833 s, 664 MB/s
1048576000 bytes (1.0 GB) copied, 1.58262 s, 663 MB/s
1048576000 bytes (1.0 GB) copied, 1.61791 s, 648 MB/s
1048576000 bytes (1.0 GB) copied, 1.59739 s, 656 MB/s
1048576000 bytes (1.0 GB) copied, 1.5855 s, 661 MB/s

Done...
=============================================================

If I set the blockdev read-ahead to something like 32768 I get ~950 MB/s on reads, but that will probably hurt smaller reads, so I like to leave it at 8192.
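For anyone wanting to try the read-ahead tweak, it is set inside the VM with blockdev; the value is in 512-byte sectors and does not survive a reboot (the device name here is just an example, use whatever your VM disk shows up as):

=============================================================
blockdev --getra /dev/vda        # show current read-ahead (in sectors)
blockdev --setra 8192 /dev/vda   # 8192 sectors = 4 MB read-ahead
=============================================================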

Note: This is on 12 nodes with 8 x 4 TB drives each (96 OSDs total) on a 10 Gbit network.

-Glen :)

 
Hi,
but your data is cached on the OSD nodes, so your read speeds are not really realistic!
If you drop the caches on all OSD nodes and on the Proxmox node (if you have RBD caching enabled) before reading, the result should be different.

BTW, do you know fio?

Udo
 
Udo,

You have an interesting point. To test this, I went to a host with an 11 GB dump file that was written over a week ago and untouched since, so it should not be in any buffers on the OSDs or elsewhere. I did a read test and got the following:

/d1/dump# dd of=/dev/null bs=1M if=vzdump-qemu-50214-2014_12_06-15_29_13.vma.lzo
10067+1 records in
10067+1 records out
10556042548 bytes (11 GB) copied, 25.8167 s, 409 MB/s

So ~400 MB/s is slower than the 600+ I was seeing, but still not too bad.

-Glen
 
Hi Glen,
just tested your script compared to fio... very strange - with fio I'm much faster on writes and only half as fast on reads (with dropped caches).
Code:
fio --max-jobs=1 --numjobs=1 --readwrite=write --blocksize=1M --size=1G --direct=1 --name=fiojob
- normally I use blocksize=4M and size=4G
Code:
echo 3 > /proc/sys/vm/drop_caches # on VM and all nodes (pve + osds)
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=1M --size=1G --direct=1 --name=fiojob
IOPS I test with:
Code:
fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=80 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --size=128m
Udo
 
