CEPH storage corrupting disks when a CEPH node goes down

Feb 2, 2016
Hey everybody,

I have a huge problem that I can't seem to fix with my CEPH cluster.

My configuration is like this: 3 Proxmox VE 4.1 servers acting as compute servers and 5 CEPH servers with 2 OSDs each, all running Proxmox VE 4.1.

The problem is that each time I stop ANY of the CEPH servers for maintenance or any other reason, the disks that I have on the CEPH storage get corrupted and I need to run fsck on each and every one.

I have about 100 virtual servers stored on CEPH, so you can imagine the workload.

Did I configure anything wrong? I followed the official documentation when setting up the storage, nothing special there.

Thanks for your help!
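
(For reference, the usual way to take a Ceph node offline for planned maintenance is to set the noout flag first, so the cluster does not mark the stopped OSDs out and start rebalancing while the node is down. This is only a minimal sketch of the standard ceph CLI steps, assuming admin access on one of the nodes, not a statement about what was or was not done here.)

Code:
# before shutting the node down: prevent OSDs from being marked out
ceph osd set noout
# ...do the maintenance and bring the node back up...
# check that the OSDs rejoined and PGs return to active+clean
ceph -s
# then remove the flag again
ceph osd unset noout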
 
 

Hi, I have never seen this with Ceph.
What is your pool configuration?
How many monitors do you have?
 

We have 5 monitors, each with 2 OSDs, for a total of 10 OSDs.
The pool configuration has pg_num set to 256.

Code:
### Start config #####

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.10.10.0/24
filestore xattr use omap = true
fsid = b959b08a-0827-4840-89b0-da9f40d6ff22
keyring = /etc/pve/priv/$cluster.$name.keyring
mon osd min down reporters = 3
mon osd min down reports = 6
mon osd report timeout = 1800
osd client op priority = 63
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
osd heartbeat grace = 40
osd journal size = 5120
osd pool default min size = 2
public network = 10.10.10.0/24

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
osd recovery max active = 1
osd recovery max single start = 1
osd max backfills = 1
osd recovery op priority = 1
max open files = 327680
osd op threads = 2
filestore op threads = 2

[mon.2]
host = ceph05
mon addr = 10.10.10.8:6789

[mon.1]
host = ceph02
mon addr = 10.10.10.2:6789

[mon.0]
host = ceph03
mon addr = 10.10.10.3:6789

[mon.4]
host = ceph06
mon addr = 10.10.10.9:6789

[mon.3]
host = ceph04
mon addr = 10.10.10.4:6789

## End config ###
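
(Side note: the per-pool values discussed in this thread, size, min_size and pg_num, live in the cluster rather than in ceph.conf; assuming the standard ceph CLI on one of the nodes and the pool name rbd, they can be read back like this:)

Code:
ceph osd pool get rbd size
ceph osd pool get rbd min_size
ceph osd pool get rbd pg_num
# commonly cited guideline: total PGs ~ (OSDs * 100) / size,
# rounded to a power of two, e.g. 10 * 100 / 2 = 500 -> 512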
 
What is the replication of the pool? x2? x3?

When you say that you need to run fsck, is it inside the VM, or on the OSDs' filesystem?

Do you have any special logs in Ceph?
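
(On a stock install the Ceph logs normally live under /var/log/ceph/ on each node; assuming that default layout, something along these lines shows the cluster log and the per-OSD logs:)

Code:
# cluster-wide log, written on the monitor nodes
tail -n 100 /var/log/ceph/ceph.log
# per-OSD logs on the OSD nodes
ls -l /var/log/ceph/ceph-osd.*.log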

No special logs; the replication is x2.

I need to run fsck on the VMs' disks.
 
I'm noticing this with up-to-date enterprise PVE 4.1, but with the Gluster file system. If I reboot a Gluster node, I tend to get a corrupted disk. It is not recoverable because the actual files are corrupted, not just the file structures. This has happened to me twice now, and not even on a very busy VM.


Have you got the virt group settings set? Can you post your gluster volume info and gluster version?

Code:
$ gluster --version
$ gluster volume info
 

Yes, I did that as per the Gluster wiki. I'm not 100% sure if the corruption comes from the actual KVM migration or from rebooting one of the nodes. I'll have to pay closer attention next time (and make sure I have a full VM snapshot before moving, so recovery can be faster, presuming the corruption doesn't happen on the "older" data).

Code:
root@pve0:~# gluster volume info

Volume Name: datastore
Type: Replicate
Volume ID: d8809597-c8c5-4b30-b585-633168889cbd
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: pve1-cluster:/tank/gluster/brick
Brick2: pve2-cluster:/tank/gluster/brick
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server

root@pve0:~# gluster --version
glusterfs 3.5.2 built on Jul 29 2015 18:55:57
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
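
(For reference, the "Options Reconfigured" list above matches the virt group profile; assuming the group file shipped with the gluster packages is present, it is normally applied in one step, with the volume name datastore taken from the output above:)

Code:
gluster volume set datastore group virt
# and verified afterwards with
gluster volume info datastore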
 
We changed to min_size=2, size=3, and after that we did not see any special HDD activity. After this modification, shouldn't the cluster rebuild some data allocation?
 

Did you change it with:

http://docs.ceph.com/docs/hammer/rados/operations/pools/
# ceph osd pool set {pool-name} {key} {value} ?

Changing min_size should do nothing, but changing size should replicate more objects in the cluster, so you'll see cluster activity.
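
(To make that concrete, assuming the pool in question is rbd, raising the replica count and watching the resulting data movement would look roughly like this:)

Code:
# raising size is what triggers extra copies being created
ceph osd pool set rbd size 3
# watch the cluster backfill/recover in real time
ceph -w
# min_size only changes the I/O threshold; it moves no data
ceph osd pool set rbd min_size 2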
 
Hey,

I issued this command:
ceph osd pool set rbd min_size 2

There was no cluster activity afterwards. My pool is now: size 3, min_size 2.

Should I try something else?
Should i try something else ??