!! New Cluster build and crash with ceph !!

glena

Hello,

I just completed building a new cluster of 12 nodes and ran into a big problem where the Proxmox cluster fails!

Some details…

This cluster consists of 12 servers, each with 128 GB RAM, 16 cores, 10 Gbit networking, 2 drives in RAID1 for the OS, and 8 x 4 TB drives for Ceph.

On each server I installed PVE v3.3 on the RAID1 volume, created the cluster, and ran updates. All good at this point, and everything is working nicely.

I then created a VM on the first PM node's local RAID1 volume for Ceph admin and deploy. I like doing it this way so I can upgrade Ceph via Ceph's native tools (more details to follow).
I then installed Ubuntu 14.04 on this VM and applied all updates. On this node I followed Ceph's instructions for installing ceph-deploy. I used the firefly release with the expectation of upgrading to giant.
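For reference, installing ceph-deploy on the Ubuntu admin VM followed Ceph's firefly-era quick start; roughly this (the repo URLs are the ones from the old docs and may have moved since, so treat them as an example):

=============================================================
# add Ceph's release key and the firefly repo for Ubuntu 14.04 (trusty)
wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | apt-key add -
echo deb http://ceph.com/debian-firefly/ trusty main > /etc/apt/sources.list.d/ceph.list
apt-get update && apt-get install ceph-deploy
=============================================================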

I then ran the deployment to all 12 PM nodes, using the 8 x 4 TB drives as OSDs (~360 TB total). All went well and testing looked good.
So that PM can see the Ceph cluster in its web UI, I copied /etc/ceph/ceph.conf to /etc/pve/. I also copied the admin key to /etc/pve/priv/ along with a proper keyring for the rbd pool.
PM is then able to see the cluster and I can use it to view and create new pools. Again, all is working great!
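For anyone following along, the deployment and the copy step looked roughly like this (hostnames, disk letters and the keyring name are placeholders; the keyring has to match the RBD storage ID you define in PVE, and the install/osd lines are of course repeated for all 12 nodes and all 8 disks):

=============================================================
# from the ceph-deploy admin VM
ceph-deploy new pm01 pm02 pm03
ceph-deploy install --release firefly pm01 pm02 pm03
ceph-deploy mon create-initial
ceph-deploy osd create pm01:sdc pm01:sdd

# on one PVE node (/etc/pve is cluster-wide, so once is enough)
cp /etc/ceph/ceph.conf /etc/pve/ceph.conf
mkdir -p /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/mycephstorage.keyring
=============================================================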

I then change the sources file for Ceph from firefly to giant and do an apt-get update && apt-get dist-upgrade. Once upgraded, I restart all Ceph services on all nodes, one node at a time.
Still good, and I see a nearly 100% increase on writes with the new RBD caching in giant. At this point testing all looks great! Live migration works, my sequential writes to disk in VMs are about 700 MB/s, and after changing the blockdev read-ahead to something like 8192 I get about 900 MB/s reads! So far I'm happy!
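The repo switch and rolling restart looked roughly like this on each node (the ceph.list path and the sysvinit service arguments are what I'd expect on these boxes, so adjust as needed; restarting monitors before OSDs is the usual upgrade order):

=============================================================
sed -i 's/debian-firefly/debian-giant/' /etc/apt/sources.list.d/ceph.list
apt-get update && apt-get dist-upgrade

service ceph restart mon
service ceph restart osd
=============================================================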

Now the failure: I want to reboot the PM nodes as part of some stress testing, and after the reboot I'm unable to get to the PM management console.
I look and see that some of the PM services are failing, so I try to restart them via:

# service cman stop; service pve-cluster restart; service cman start

Here is the output:
Stopping cluster:
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown:[ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Restarting pve cluster filesystem: pve-cluster.
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... /usr/share/cluster/cluster.rng:1002: element ref: Relax-NG parser error : Reference PVEVM has no matching definition
/usr/share/cluster/cluster.rng:1002: element ref: Relax-NG parser error : Internal found no define for ref PVEVM
Relax-NG schema /usr/share/cluster/cluster.rng failed to compile
[ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Unfencing self... [ OK ]


Please note that I have another cluster similar to this one, but running Ceph firefly, and it has been working fine. So my concern is that the upgrade to giant might be what is causing the PM cluster to fail here, maybe due to library updates. I'm now very worried about upgrading my other cluster, since this might bring down all of my running VMs!

Any thoughts on what might be the issue would be appreciated!

-Glen
 
Looks like the pve-manager package is not correctly installed? What is the output of:

# pveversion -v
 
Looks like I found the issue: for some reason, the packages were uninstalled!

When I ran pveversion -v it said the program was not installed, so I looked into this and found a thread that said to install proxmox-ve-2.6.32. I did apt-get install proxmox-ve-2.6.32 and was then able to see the management console!
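For anyone who hits the same thing, the recovery on each node was basically the following (the 2.6.32 in the package name matches the kernel series shipped with PVE 3.3):

=============================================================
pveversion -v                       # this is what reported the missing install
apt-get update
apt-get install proxmox-ve-2.6.32
service pve-cluster restart
=============================================================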

I then did this on the remaining nodes, restarted the cluster, and all seems good.

I know I did not remove the package, so I wonder what happened...?

Anyway, thanks for the help! :)

-Glen
 
Did you uninstall your 10 Gbit NIC drivers? It happened to me when I installed Infiniband drivers supplied by the Mellanox site. When I uninstalled the drivers, they took a big chunk of the PVE packages with them.
 
I think this may have something to do with the repositories.

I setup the following repository:

# PVE pve-no-subscription repository provided by proxmox.com, NOT recommended for production use
deb http://download.proxmox.com/debian wheezy pve-no-subscription

so maybe the original install used the production (enterprise) repository, and when I commented out the production repository, the update decided to remove the current install. Not sure if this is it, but it's the only explanation I can think of.
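For context, on a stock PVE 3.x install the enterprise repo sits in its own list file, and commenting it out looks like this (file name and contents as I recall them, so double-check on your own nodes):

# /etc/apt/sources.list.d/pve-enterprise.list
# deb https://enterprise.proxmox.com/debian wheezy pve-enterprise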

Anyway, it's all working now... :)

-Glen
 
No, changing the repo will not remove anything. All default installations come with the Enterprise repo turned on; you have to manually enter the free repo, as you have already done.
Still odd that it lost the PVE packages. Hopefully it won't happen again.
 
Glen, your read performance is inside a VM? 900 MB/sec is really cool!
How did you measure that? With files big enough to bypass the cache? I get only around 100 MB/sec read and 500 MB/sec write inside...
How much did the speed change for you after setting the blockdev read-ahead to 8192?
 
To test sequential read/write performance I use the following script:

=============================================================
#!/bin/bash

cnt=1000

rm foo? >/dev/null 2>&1
sync
sleep 1

printf "Writing 10 files:\n"
for c in 0 1 2 3 4 5 6 7 8 9
do
dd if=/dev/zero bs=1M count=$cnt of=foo$c oflag=direct 2>&1 | grep copied # no fs caching
# dd if=/dev/zero bs=1M count=$cnt of=foo$c 2>&1 | grep copied # with fs caching
done

sync

printf "\nNow reading the 10 files:\n"
for c in 0 1 2 3 4 5 6 7 8 9
do
dd of=/dev/null bs=128k if=foo$c 2>&1 | grep copied
done

printf "\nDone...\n"

rm foo? >/dev/null 2>&1
=============================================================

With the blockdev read-ahead set to 8192, I get the following inside a VM:

=============================================================
# /root/perftest
Writing 10 files:
1048576000 bytes (1.0 GB) copied, 1.61296 s, 650 MB/s
1048576000 bytes (1.0 GB) copied, 1.77415 s, 591 MB/s
1048576000 bytes (1.0 GB) copied, 1.63696 s, 641 MB/s
1048576000 bytes (1.0 GB) copied, 1.74977 s, 599 MB/s
1048576000 bytes (1.0 GB) copied, 1.71328 s, 612 MB/s
1048576000 bytes (1.0 GB) copied, 1.56663 s, 669 MB/s
1048576000 bytes (1.0 GB) copied, 1.64793 s, 636 MB/s
1048576000 bytes (1.0 GB) copied, 1.70893 s, 614 MB/s
1048576000 bytes (1.0 GB) copied, 1.68809 s, 621 MB/s
1048576000 bytes (1.0 GB) copied, 1.63912 s, 640 MB/s

Now reading the 10 files:
1048576000 bytes (1.0 GB) copied, 1.62093 s, 647 MB/s
1048576000 bytes (1.0 GB) copied, 1.60234 s, 654 MB/s
1048576000 bytes (1.0 GB) copied, 1.60541 s, 653 MB/s
1048576000 bytes (1.0 GB) copied, 1.58073 s, 663 MB/s
1048576000 bytes (1.0 GB) copied, 1.58674 s, 661 MB/s
1048576000 bytes (1.0 GB) copied, 1.57833 s, 664 MB/s
1048576000 bytes (1.0 GB) copied, 1.58262 s, 663 MB/s
1048576000 bytes (1.0 GB) copied, 1.61791 s, 648 MB/s
1048576000 bytes (1.0 GB) copied, 1.59739 s, 656 MB/s
1048576000 bytes (1.0 GB) copied, 1.5855 s, 661 MB/s

Done...
=============================================================

If I set the blockdev read-ahead to something like 32768 I get ~950 MB/s on reads, but that will probably hurt smaller reads, so I like to leave it at 8192.
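For anyone wanting to try the read-ahead tweak, it is set inside the VM with blockdev; the value is in 512-byte sectors and does not survive a reboot (the device name here is just an example, use whatever your VM disk shows up as):

=============================================================
blockdev --getra /dev/vda        # show current read-ahead (in sectors)
blockdev --setra 8192 /dev/vda   # 8192 sectors = 4 MB read-ahead
=============================================================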

Note: This is on 12 nodes with 8 x 4 TB drives each (96 OSDs total) on a 10 Gbit network.

-Glen :)

 
Hi,
but your data is cached on the OSD nodes, so your read speeds are not really realistic!
If you drop the caches on all OSD nodes and on the Proxmox node (if you have RBD caching enabled) before reading, the result should be different.

BTW, do you know fio?

Udo
 
Udo,

You have an interesting point. To test this, I went to a host with an 11 GB dump file that was written over a week ago and untouched since, so it should not be in any buffers on the OSDs or elsewhere. I did a read test and got the following:

/d1/dump# dd of=/dev/null bs=1M if=vzdump-qemu-50214-2014_12_06-15_29_13.vma.lzo
10067+1 records in
10067+1 records out
10556042548 bytes (11 GB) copied, 25.8167 s, 409 MB/s

So ~400 MB/s is slower than the 600+ I was seeing, but still not too bad.

-Glen
 
Hi Glen,
just tested your script compared to fio... very strange - with fio I'm much faster on writes and only half as fast on reads (with dropped caches).
Code:
fio --max-jobs=1 --numjobs=1 --readwrite=write --blocksize=1M --size=1G --direct=1 --name=fiojob
- normally I use blocksize=4M and size=4G
Code:
echo 3 > /proc/sys/vm/drop_caches # on VM and all nodes (pve + osds)
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=1M --size=1G --direct=1 --name=fiojob
IOPS I test with:
Code:
fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=80 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --size=128m
Udo
 
