Ceph raw usage grows by itself

Ozz

New Member
Nov 29, 2017
Hi,

I have a new cluster of 4 nodes, 3 of them have ceph.
Code:
root@pve3:~# pveversion -v
proxmox-ve: 5.1-25 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-35 (running version: 5.1-35/722cc488)
pve-kernel-4.13.4-1-pve: 4.13.4-25
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.2-pve1~bpo90
openvswitch-switch: 2.7.0-2
ceph: 12.2.1-pve3
Code:
ceph -v
ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)
I'm using a single all-SSD pool.
BlueStore, no separate RocksDB or WAL device.
The "journal" (or whatever it's called now) is 100 MB per disk.
The Ceph cache is enabled.
Cache per VM is set to no-cache.

I transferred 4 VMs over from VMware vSphere and am testing them.
The machines are doing nothing. I mean, they do have CentOS 6 and Apache on them, but nobody communicates with them.
I'm doing automatic backup every night to an NFS share.

Now I've noticed that even though the machines are just sitting there, the raw usage of Ceph is constantly growing.
This is the output of "ceph df detail":
Code:
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED     OBJECTS
    8941G     8888G       54546M          0.60        3960
POOLS:
    NAME       ID     QUOTA OBJECTS     QUOTA BYTES     USED       %USED     MAX AVAIL     OBJECTS     DIRTY     READ     WRITE     RAW USED
    VMpool     1      N/A               N/A             14337M      0.17         2812G        3960      3960     490k      525k       43013M
So the pool usage with replica 3 is 43013 MB, which is fine, and it grows very slowly, i.e. by several MB a day.
But the RAW USED of 54546 MB in the GLOBAL section grows much faster - about 1 GB/day.

If I run fstrim on the VMs, it helps a little (5-20 MB in total).


So what's with the 11GB difference between the GLOBAL and POOL usage?
How is the GLOBAL usage calculated?
And the most important - why does it grow by itself?
If I transfer all of my 50 VMs over, and there are about 20 VMs with 100GB-800GB - what are the consequences?

Code:
ceph -s
  cluster:
    id:     a1ba7570-38aa-4410-9318-92f3788ef7ef
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum pve1,pve2,pve3
    mgr: pve3(active), standbys: pve2, pve1
    osd: 12 osds: 12 up, 12 in
 
  data:
    pools:   1 pools, 1024 pgs
    objects: 3960 objects, 14337 MB
    usage:   54546 MB used, 8888 GB / 8941 GB avail
    pgs:     1024 active+clean
 
  io:
    client:   1364 B/s wr, 0 op/s rd, 0 op/s wr
Code:
ceph -w
  cluster:
    id:     a1ba7570-38aa-4410-9318-92f3788ef7ef
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum pve1,pve2,pve3
    mgr: pve3(active), standbys: pve2, pve1
    osd: 12 osds: 12 up, 12 in
 
  data:
    pools:   1 pools, 1024 pgs
    objects: 3960 objects, 14337 MB
    usage:   54569 MB used, 8888 GB / 8941 GB avail
    pgs:     1024 active+clean
 
  io:
    client:   1023 B/s wr, 0 op/s rd, 0 op/s wr
Please assist, I must know what I'm getting into before I go on.

Thanks!
 

Alwin

Proxmox Staff Member
Aug 1, 2017
To see from all nodes the ceph version, do a 'ceph versions'.

BlueStore, no separate RocksDB or WAL device.
You still have a RocksDB and WAL, just not on a separate device.
http://ceph.com/community/new-luminous-bluestore/

The "journal" (or whatever it's called now) is 100 MB per disk.
That is an XFS partition that holds the metadata and links needed by the OSD.
http://ceph.com/community/new-luminous-bluestore/

The Ceph cache is enabled.
Cache per VM is set to no-cache.
The librbd cache is enabled by default. With the QEMU setting (cache=none/writeback/writethrough) you override the Ceph settings.
http://docs.ceph.com/docs/master/rbd/qemu-rbd/#qemu-cache-options
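For reference, a sketch of how the two layers relate (the exact values below are the Luminous librbd defaults as I understand them; check them against your own setup): the librbd cache is configured cluster-wide in ceph.conf, while the per-disk `cache=` mode chosen in the VM config overrides it per device.

```ini
; /etc/ceph/ceph.conf - librbd client-side cache settings (these are the
; defaults, shown explicitly; a per-disk QEMU cache= setting overrides them)
[client]
rbd cache = true
rbd cache size = 33554432                  ; 32 MiB per image
rbd cache writethrough until flush = true  ; safe until the guest issues a flush
```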

I transferred 4 VMs over from VMware vSphere and am testing them.
The machines are doing nothing. I mean, they do have CentOS 6 and Apache on them, but nobody communicates with them.
I'm doing automatic backup every night to an NFS share.

Now, I noticed that even though the machines are just sitting there - the raw usage of ceph is constantly growing.
Not true; they are certainly doing something, like writing log files, moving unused data to swap, and updating files (e.g. in /tmp).

So what's with the 11GB difference between the GLOBAL and POOL usage?
How is the GLOBAL usage calculated?
And the most important - why does it grow by itself?
It not only holds your RAW USED data, but also includes the DB+WAL; by default these are 1 GB + 512 MB, and the 1 GB for the DB is allocated on OSD creation. GLOBAL also reflects the whole cluster and doesn't need to correspond to the RAW AVAILABLE/USED of the pool. And as more data is added to the OSDs (objects + DB), it grows.
http://ceph.com/planet/understanding-bluestore-cephs-new-storage-backend/
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
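The per-OSD overhead described above can be turned into a rough upper bound on the GLOBAL-vs-POOL gap. This is only a sketch, assuming the 1 GB DB + 512 MB WAL defaults mentioned here plus the 100 MB metadata partition per OSD:

```python
# Rough upper bound on BlueStore overhead across the cluster.
# Assumed (not authoritative) per-OSD values: 1 GB RocksDB pre-allocated at
# OSD creation, 512 MB WAL, and a 100 MB XFS metadata partition.
OSDS = 12
DB_GB = 1.0    # RocksDB, allocated on OSD creation
WAL_GB = 0.5   # WAL, allocated as it is used
META_GB = 0.1  # small XFS partition with OSD metadata/links

overhead_gb = OSDS * (DB_GB + WAL_GB + META_GB)
print(f"max expected GLOBAL-vs-POOL gap: {overhead_gb:.1f} GB")
```

For this 12-OSD cluster that works out to about 19.2 GB, which is consistent with the ~11 GB gap observed so far, since the WAL portion only fills up over time.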

If I transfer all of my 50 VMs over, and there are about 20 VMs with 100GB-800GB - what are the consequences?
I guess now you can do the math.
To calculate how many PGs you might need for your pool: http://ceph.com/pgcalc/
https://pve.proxmox.com/pve-docs/chapter-pveceph.html
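The rule of thumb behind the pgcalc page can be sketched in a few lines (assuming the commonly cited target of ~100 PGs per OSD; the calculator itself lets you tune this):

```python
import math

def suggested_pg_count(num_osds, replica_size, target_pgs_per_osd=100):
    """Rule-of-thumb PG count: aim for roughly target_pgs_per_osd PGs on each
    OSD, divide by the replication factor (each PG lands on `replica_size`
    OSDs), and round up to the next power of two."""
    raw = num_osds * target_pgs_per_osd / replica_size
    return 2 ** math.ceil(math.log2(raw))

print(suggested_pg_count(12, 3))  # 512 for a 12-OSD pool with size 3
```

By this rule of thumb the 1024 PGs in the pool above are on the high side for 12 OSDs, which is worth keeping in mind since the PG count of a pool could not be decreased in Luminous.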

Please assist, I must know what I'm getting into before I go on.
As always, if everything works well it is straightforward, but if there is a disaster you need to be prepared. Please find the following links as a help to understand Ceph more deeply.

Our docs to Ceph: https://pve.proxmox.com/pve-docs/
If you are looking for a PVE support subscription: https://www.proxmox.com/en/proxmox-ve/pricing
Intro to Ceph: http://docs.ceph.com/docs/master/start/intro/
Hardware recommendations + useful tips: http://docs.ceph.com/docs/master/start/hardware-recommendations/
How to operate: http://docs.ceph.com/docs/master/rados/operations/
Architecture of Ceph, this goes in deep: http://docs.ceph.com/docs/master/architecture/
To get in touch with Ceph people, mailing lists, IRC: http://docs.ceph.com/docs/master/start/get-involved/

You can always ask questions here, in the PVE forum. But if they are very Ceph specific, you will find a wider audience on the Ceph mailing lists. ;)

High-level intro to Ceph: (embedded video)
 

Ozz

New Member
Nov 29, 2017
Hi,
Thanks a lot for such a detailed response.

I'd like to clarify some things though.
I couldn't find the numbers you mentioned anywhere - that the default size of the RocksDB is 1 GB and of the WAL 500 MB. Can you please point me to where these are documented?

If I have a total of 12 OSDs in the cluster, am I right to assume that the difference between the GLOBAL and POOL usage values will never be larger than 12*(1+0.5) + 12*0.1 = 19.2 GB?
(0.1 being the 100 MB XFS partition on each OSD.)

Also, can you please elaborate on the cache aspect? I was under the impression that I have to leave the default no-cache on each disk or VM (I don't remember exactly where it's set) but set the option to true in ceph.conf.
Am I wrong?

Thanks a lot!
 

Alwin

Proxmox Staff Member
Aug 1, 2017
I couldn't find the numbers you mentioned anywhere - that the default size of the RocksDB is 1 GB and of the WAL 500 MB. Can you please point me to where these are documented?
Sadly it is not in the docs. You can find the information on the mailing list and by checking the source code.

If I have a total of 12 OSDs in the cluster, am I right to assume that the difference between the GLOBAL and POOL usage values will never be larger than 12*(1+0.5) + 12*0.1 = 19.2 GB?
(0.1 being the 100 MB XFS partition on each OSD.)
Yes, but the WAL is allocated on use; AFAIK it is 100 MB in the beginning.

Also, can you please elaborate on the cache aspect? I was under the impression that I have to leave the default no-cache on each disk or VM (I don't remember exactly where it's set) but set the option to true in ceph.conf.
Am I wrong?
It is set per disk; further, see the section QEMU CACHE OPTIONS -> http://docs.ceph.com/docs/master/rbd/qemu-rbd/#qemu-cache-options
 
