Ceph raw usage grows by itself

Ozz

New Member
Nov 29, 2017
Hi,

I have a new cluster of 4 nodes, 3 of them have ceph.
Code:
root@pve3:~# pveversion -v
proxmox-ve: 5.1-25 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-35 (running version: 5.1-35/722cc488)
pve-kernel-4.13.4-1-pve: 4.13.4-25
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.2-pve1~bpo90
openvswitch-switch: 2.7.0-2
ceph: 12.2.1-pve3
Code:
ceph -v
ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)
I'm using a single all-SSD pool.
BlueStore, no separate RocksDB or WAL device.
The "journal" (or whatever it's called now) is 100 MB per disk.
The Ceph cache is enabled.
Cache per VM is set to no-cache.

I transferred 4 VMs over from VMware vSphere and am testing them.
The machines are doing nothing. I mean, they do have CentOS 6 and Apache on them, but nobody communicates with them.
I'm doing automatic backup every night to an NFS share.

Now I've noticed that even though the machines are just sitting there, the raw usage of Ceph is constantly growing.
This is the output of "ceph df detail":
Code:
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED     OBJECTS
    8941G     8888G       54546M          0.60        3960
POOLS:
    NAME       ID     QUOTA OBJECTS     QUOTA BYTES     USED       %USED     MAX AVAIL     OBJECTS     DIRTY     READ     WRITE     RAW USED
    VMpool     1      N/A               N/A             14337M      0.17         2812G        3960      3960     490k      525k       43013M
So the pool usage with replica 3 is 43013 MB, which is fine, and it grows very slowly, i.e. by several MB a day.
But the RAW USED of 54546 MB in the GLOBAL section grows much faster - about 1 GB/day.

If I run fstrim on the VMs, it helps a little (5-20 MB in total).


So what's with the 11GB difference between the GLOBAL and POOL usage?
How is the GLOBAL usage calculated?
And the most important - why does it grow by itself?
If I transfer all of my 50 VMs over, and there are about 20 VMs with 100GB-800GB - what are the consequences?

Code:
ceph -s
  cluster:
    id:     a1ba7570-38aa-4410-9318-92f3788ef7ef
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum pve1,pve2,pve3
    mgr: pve3(active), standbys: pve2, pve1
    osd: 12 osds: 12 up, 12 in
 
  data:
    pools:   1 pools, 1024 pgs
    objects: 3960 objects, 14337 MB
    usage:   54546 MB used, 8888 GB / 8941 GB avail
    pgs:     1024 active+clean
 
  io:
    client:   1364 B/s wr, 0 op/s rd, 0 op/s wr
Code:
ceph -w
  cluster:
    id:     a1ba7570-38aa-4410-9318-92f3788ef7ef
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum pve1,pve2,pve3
    mgr: pve3(active), standbys: pve2, pve1
    osd: 12 osds: 12 up, 12 in
 
  data:
    pools:   1 pools, 1024 pgs
    objects: 3960 objects, 14337 MB
    usage:   54569 MB used, 8888 GB / 8941 GB avail
    pgs:     1024 active+clean
 
  io:
    client:   1023 B/s wr, 0 op/s rd, 0 op/s wr
Please assist, I must know what I'm getting into before I go on.

Thanks!
 

Alwin

Proxmox Staff Member
Aug 1, 2017
To see from all nodes the ceph version, do a 'ceph versions'.

BlueStore, no separate RocksDB or WAL device.
You still have a RocksDB and WAL, just not on a separate device.
http://ceph.com/community/new-luminous-bluestore/

The "journal" (or whatever it's called now) is 100 MB per disk.
That is an XFS partition that holds the metadata and links needed by the OSD.
http://ceph.com/community/new-luminous-bluestore/

The Ceph cache is enabled.
Cache per VM is set to no-cache.
The librbd cache is enabled by default. With the QEMU setting (cache=none/writeback/writethrough) you override the Ceph settings.
http://docs.ceph.com/docs/master/rbd/qemu-rbd/#qemu-cache-options
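For reference, a sketch of how the two layers relate (the exact values below are the Luminous librbd defaults as I understand them; check them against your own setup): the librbd cache is configured cluster-wide in ceph.conf, while the per-disk `cache=` mode chosen in the VM config overrides it per device.

```ini
; /etc/ceph/ceph.conf - librbd client-side cache settings (these are the
; defaults, shown explicitly; a per-disk QEMU cache= setting overrides them)
[client]
rbd cache = true
rbd cache size = 33554432                  ; 32 MiB per image
rbd cache writethrough until flush = true  ; safe until the guest issues a flush
```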

I transferred 4 VMs over from VMware vSphere and am testing them.
The machines are doing nothing. I mean, they do have CentOS 6 and Apache on them, but nobody communicates with them.
I'm doing automatic backup every night to an NFS share.

Now, I noticed that even though the machines are just sitting there - the raw usage of ceph is constantly growing.
Not true; they are certainly doing something, like writing log files, moving unused data to swap, and updating files (e.g. in /tmp).

So what's with the 11GB difference between the GLOBAL and POOL usage?
How is the GLOBAL usage calculated?
And the most important - why does it grow by itself?
It not only holds your RAW USED data, but also includes the DB+WAL; by default these are 1 GB + 512 MB, and the 1 GB for the DB is allocated on OSD creation. GLOBAL also reflects the whole cluster and doesn't need to correspond to the RAW AVAILABLE/USED of the pool. And as more data is added to the OSDs (objects + DB), it grows.
http://ceph.com/planet/understanding-bluestore-cephs-new-storage-backend/
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
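The per-OSD overhead described above can be turned into a rough upper bound on the GLOBAL-vs-POOL gap. This is only a sketch, assuming the 1 GB DB + 512 MB WAL defaults mentioned here plus the 100 MB metadata partition per OSD:

```python
# Rough upper bound on BlueStore overhead across the cluster.
# Assumed (not authoritative) per-OSD values: 1 GB RocksDB pre-allocated at
# OSD creation, 512 MB WAL, and a 100 MB XFS metadata partition.
OSDS = 12
DB_GB = 1.0    # RocksDB, allocated on OSD creation
WAL_GB = 0.5   # WAL, allocated as it is used
META_GB = 0.1  # small XFS partition with OSD metadata/links

overhead_gb = OSDS * (DB_GB + WAL_GB + META_GB)
print(f"max expected GLOBAL-vs-POOL gap: {overhead_gb:.1f} GB")
```

For this 12-OSD cluster that works out to about 19.2 GB, which is consistent with the ~11 GB gap observed so far, since the WAL portion only fills up over time.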

If I transfer all of my 50 VMs over, and there are about 20 VMs with 100GB-800GB - what are the consequences?
I guess now you can do the math.
To calculate how many PGs you might need for your pool: http://ceph.com/pgcalc/
https://pve.proxmox.com/pve-docs/chapter-pveceph.html
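The rule of thumb behind the pgcalc page can be sketched in a few lines (assuming the commonly cited target of ~100 PGs per OSD; the calculator itself lets you tune this):

```python
import math

def suggested_pg_count(num_osds, replica_size, target_pgs_per_osd=100):
    """Rule-of-thumb PG count: aim for roughly target_pgs_per_osd PGs on each
    OSD, divide by the replication factor (each PG lands on `replica_size`
    OSDs), and round up to the next power of two."""
    raw = num_osds * target_pgs_per_osd / replica_size
    return 2 ** math.ceil(math.log2(raw))

print(suggested_pg_count(12, 3))  # 512 for a 12-OSD pool with size 3
```

By this rule of thumb the 1024 PGs in the pool above are on the high side for 12 OSDs, which is worth keeping in mind since the PG count of a pool could not be decreased in Luminous.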

Please assist, I must know what I'm getting into before I go on.
As always, if everything works well it is straightforward, but if there is a disaster you need to be prepared. Please find the following links as a help to understand Ceph more deeply.

Our docs to Ceph: https://pve.proxmox.com/pve-docs/
If you are looking for a PVE support subscription: https://www.proxmox.com/en/proxmox-ve/pricing
Intro to Ceph: http://docs.ceph.com/docs/master/start/intro/
Hardware recommendations + useful tips: http://docs.ceph.com/docs/master/start/hardware-recommendations/
How to operate: http://docs.ceph.com/docs/master/rados/operations/
Architecture of Ceph, this goes in deep: http://docs.ceph.com/docs/master/architecture/
To get in touch with Ceph people, mailing lists, IRC: http://docs.ceph.com/docs/master/start/get-involved/

You can always ask questions here, in the PVE forum. But if they are very Ceph specific, you will find a wider audience on the Ceph mailing lists. ;)

High-level intro to Ceph: (embedded video)
 

Ozz

New Member
Nov 29, 2017
Hi,
Thanks a lot for such a detailed response.

I'd like to clarify some things though.
I couldn't find the numbers you mentioned anywhere - that the default size of the RocksDB is 1 GB and of the WAL 500 MB. Can you please point me to where these are documented?

If I have a total of 12 OSDs in the cluster, am I right to assume that the difference between the GLOBAL and POOL usage values will never be larger than 12*(1+0.5) + 12*0.1 = 19.2 GB?
(0.1 being the 100 MB XFS partition on each OSD.)

Also, can you please elaborate on the cache aspect? I was under the impression that I have to leave the default no-cache on each disk or VM (I don't remember exactly where it's set) but set the option to true in ceph.conf.
Am I wrong?

Thanks a lot!
 

Alwin

Proxmox Staff Member
Aug 1, 2017
I couldn't find the numbers you mentioned anywhere - that the default size of the RocksDB is 1 GB and of the WAL 500 MB. Can you please point me to where these are documented?
Sadly it is not in the docs. You can find the information on the mailing list and by checking the source code.

If I have a total of 12 OSDs in the cluster, am I right to assume that the difference between the GLOBAL and POOL usage values will never be larger than 12*(1+0.5) + 12*0.1 = 19.2 GB?
(0.1 being the 100 MB XFS partition on each OSD.)
Yes, but the WAL is allocated on use; AFAIK it is 100 MB in the beginning.

Also, can you please elaborate on the cache aspect? I was under the impression that I have to leave the default no-cache on each disk or VM (I don't remember exactly where it's set) but set the option to true in ceph.conf.
Am I wrong?
It is set per disk; further, see the section QEMU CACHE OPTIONS -> http://docs.ceph.com/docs/master/rbd/qemu-rbd/#qemu-cache-options
 
