CEPH storage corrupting disks when a CEPH node goes down

Feb 2, 2016
Hey everybody,

I have a huge problem that I can't seem to fix with my CEPH cluster.

My configuration is like this: 3 Proxmox VE 4.1 servers acting as compute servers and 5 CEPH servers with 2 OSDs each, all running Proxmox VE 4.1.

The problem is that each time I stop ANY of the CEPH servers for maintenance or any other reason, the disks that I have on the CEPH storage get corrupted and I need to run fsck on each and every one.

I have about 100 virtual servers stored on CEPH, so you can imagine the workload.

Did I configure anything wrong? I followed the official documentation when setting up the storage, nothing special there.

Thanks for your help!
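
(For reference, the usual way to take a Ceph node offline for planned maintenance is to set the noout flag first, so the cluster does not mark the stopped OSDs out and start rebalancing while the node is down. This is only a minimal sketch of the standard ceph CLI steps, assuming admin access on one of the nodes, not a statement about what was or was not done here.)

Code:
# before shutting the node down: prevent OSDs from being marked out
ceph osd set noout
# ...do the maintenance and bring the node back up...
# check that the OSDs rejoined and PGs return to active+clean
ceph -s
# then remove the flag again
ceph osd unset noout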
 
 

Hi, I have never seen this with Ceph.
What is your pool configuration?
How many monitors do you have?
 

We have 5 monitors, each with 2 OSDs, for a total of 10 OSDs.
The pool configuration has pg_num set to 256.

Code:
### Start config #####

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.10.10.0/24
filestore xattr use omap = true
fsid = b959b08a-0827-4840-89b0-da9f40d6ff22
keyring = /etc/pve/priv/$cluster.$name.keyring
mon osd min down reporters = 3
mon osd min down reports = 6
mon osd report timeout = 1800
osd client op priority = 63
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
osd heartbeat grace = 40
osd journal size = 5120
osd pool default min size = 2
public network = 10.10.10.0/24

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
osd recovery max active = 1
osd recovery max single start = 1
osd max backfills = 1
osd recovery op priority = 1
max open files = 327680
osd op threads = 2
filestore op threads = 2

[mon.2]
host = ceph05
mon addr = 10.10.10.8:6789

[mon.1]
host = ceph02
mon addr = 10.10.10.2:6789

[mon.0]
host = ceph03
mon addr = 10.10.10.3:6789

[mon.4]
host = ceph06
mon addr = 10.10.10.9:6789

[mon.3]
host = ceph04
mon addr = 10.10.10.4:6789

## End config ###
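
(Side note: the per-pool values discussed in this thread, size, min_size and pg_num, live in the cluster rather than in ceph.conf; assuming the standard ceph CLI on one of the nodes and the pool name rbd, they can be read back like this:)

Code:
ceph osd pool get rbd size
ceph osd pool get rbd min_size
ceph osd pool get rbd pg_num
# commonly cited guideline: total PGs ~ (OSDs * 100) / size,
# rounded to a power of two, e.g. 10 * 100 / 2 = 500 -> 512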
 
What is the replication of the pool? x2? x3?

When you say that you need to run fsck, is it inside the VM, or on the OSDs' filesystem?

Do you have any special logs in Ceph?
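
(On a stock install the Ceph logs normally live under /var/log/ceph/ on each node; assuming that default layout, something along these lines shows the cluster log and the per-OSD logs:)

Code:
# cluster-wide log, written on the monitor nodes
tail -n 100 /var/log/ceph/ceph.log
# per-OSD logs on the OSD nodes
ls -l /var/log/ceph/ceph-osd.*.log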

No special logs; the replication is x2.

I need to run fsck on the VMs' disks.
 
I'm noticing this with up-to-date enterprise PVE 4.1, but with the Gluster file system. If I reboot a Gluster node, I tend to get a corrupted disk. It is not recoverable because the actual files are corrupted, not just the file structures. This has happened to me twice now, and not even on a very busy VM.


Have you got the virt group settings set? Can you post your gluster volume info and gluster version?

Code:
$ gluster --version
$ gluster volume info
 

Yes, I did that as per the Gluster wiki. I'm not 100% sure if the corruption comes from the actual KVM migration or from rebooting one of the nodes. I'll have to pay closer attention next time (and make sure I have a full VM snapshot before moving, so recovery can be faster, presuming the corruption doesn't happen on the "older" data).

Code:
root@pve0:~# gluster volume info

Volume Name: datastore
Type: Replicate
Volume ID: d8809597-c8c5-4b30-b585-633168889cbd
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: pve1-cluster:/tank/gluster/brick
Brick2: pve2-cluster:/tank/gluster/brick
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server

root@pve0:~# gluster --version
glusterfs 3.5.2 built on Jul 29 2015 18:55:57
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
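
(For reference, the "Options Reconfigured" list above matches the virt group profile; assuming the group file shipped with the gluster packages is present, it is normally applied in one step, with the volume name datastore taken from the output above:)

Code:
gluster volume set datastore group virt
# and verified afterwards with
gluster volume info datastore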
 
We changed to min_size=2, size=3, and after that we did not see any special HDD activity. After this modification, shouldn't the cluster rebuild some data allocation?
 

Did you change it with:

http://docs.ceph.com/docs/hammer/rados/operations/pools/
# ceph osd pool set {pool-name} {key} {value} ?

Changing min_size should do nothing, but changing size should replicate more objects in the cluster, so you'll see cluster activity.
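
(To make that concrete, assuming the pool in question is rbd, raising the replica count and watching the resulting data movement would look roughly like this:)

Code:
# raising size is what triggers extra copies being created
ceph osd pool set rbd size 3
# watch the cluster backfill/recover in real time
ceph -w
# min_size only changes the I/O threshold; it moves no data
ceph osd pool set rbd min_size 2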
 
Hey,

I issued this command:
ceph osd pool set rbd min_size 2

There was no cluster activity afterwards. My pool is now: size 3, min_size 2.

Should I try something else?
Should i try something else ??