Hello,
tl;dr: We run a Ceph cluster with an SSD cache pool that keeps running full. The Ceph version we use, 0.80.7 (Proxmox 3.3), apparently has several cache pool bugs. When can we expect Proxmox to deliver Ceph 0.90?
Long version (please excuse my bad English):
We bought 3 servers as our new hypervisor hosts, each with an Intel Xeon E5-2620 v3, 64 GB RAM, 8x 4 TB Western Digital Red, 4x 0.25 TB Samsung 850 Pro, and dual 10-Gigabit Ethernet. With an out-of-the-box Windows Hyper-V setup (without clustering) we reach around 700 MByte/s read and 600 MByte/s write speed inside the VM guests, and we can copy around 600 MByte/s across our network.
Using Proxmox + Ceph (without an SSD cache pool), we achieved only about 50 MByte/s. Considering our hardware investment, that speed was nowhere near enough for us.
So we added a cache pool with 2 SSDs and finally reached 400 MByte/s read and 200 MByte/s write speed. That is not as good as Hyper-V, but acceptable given the clustering advantage.
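For context, attaching such a cache tier works roughly like this in Ceph (a minimal sketch of the standard commands, not our exact setup; the pool names "rbd-storage" and "ssd-cache" are placeholders):
Code:
# minimal sketch, placeholder pool names
ceph osd tier add rbd-storage ssd-cache          # attach the SSD pool as cache tier of the HDD pool
ceph osd tier cache-mode ssd-cache writeback     # cache both reads and writes
ceph osd tier set-overlay rbd-storage ssd-cache  # route client I/O through the cache pool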
The tests went well, but quite soon after go-live the Ceph health changed to "Error" and the SSD cache pools (and with them the entire storage) stopped working with "OSD full".
There seem to be several bugs in Ceph 0.80.7. Among others, target_max_bytes and/or cache_target_dirty_ratio/cache_target_full_ratio seem to be ignored, and cache flushing does not appear to work at all.
Others have these problems too, e.g.:
http://thread.gmane.org/gmane.comp.file-systems.ceph.user/13252
http://thread.gmane.org/gmane.comp.file-systems.ceph.user/16238
We checked our configuration against the documentation, and it actually matches.
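For reference, these are the kinds of settings we mean (a sketch with illustrative values and a placeholder pool name, not our exact configuration):
Code:
# illustrative values, placeholder pool name "ssd-cache"
ceph osd pool set ssd-cache target_max_bytes 400000000000   # cap the cache pool at ~400 GB
ceph osd pool set ssd-cache cache_target_dirty_ratio 0.4    # flush dirty objects from 40 % onwards
ceph osd pool set ssd-cache cache_target_full_ratio 0.8     # evict clean objects from 80 % onwards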
The newer Ceph versions contain several improvements to cache handling, so we hope the problems will disappear with a newer Ceph version. From the release notes:
Code:
V0.90 December 19th, 2014
osd: clean up internal ObjectStore interface (Sage Weil)
osd: flush snapshots from cache tier immediately (Sage Weil)
osd: fix object age eviction (Zhiqiang Wang)
V0.89 December 04th, 2014
osd: cache pool: ignore min flush age when cache is full (Xinze Chi)
V0.88 November 12th, 2014
mon: new ‘ceph pool ls [detail]’ command (Sage Weil)
osd: misc optimizations (Xinxin Shu, Zhiqiang Wang, Xinze Chi)
V0.87 GIANT October 29th, 2014
mon: fix set cache_target_full_ratio (#8440, Geoffrey Hartz)
V0.80.7 FIREFLY October 16th, 2014
osd: fix invalid memory reference in log trimming (#9731 Samuel Just)
osd: fix use-after-free in cache tiering code (#7588 Sage Weil)
What's also strange in Proxmox: although the SSD cache pool has 448 GB in use, it shows only 1.47% used. Apparently the percentage is calculated against the roughly 30300 GB of the backing pool (8x 4 TB) rather than against the size of the cache pool (2x 238 GB).
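A quick check with our (rounded) numbers suggests exactly that:
Code:
# percentage against the 8x 4 TB backing pool -- matches the 1.47 % shown in the GUI
echo "scale=2; 448 * 100 / 30300" | bc    # 1.47
# percentage against the 2x 238 GB cache pool -- what we would expect to see
echo "scale=2; 448 * 100 / 476" | bc      # 94.11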
Another strange thing in Proxmox is that only the 3 SSD cache pools are shown in the OSD display, and not the actual Ceph storage.
So our first question is: When can we expect Proxmox to deliver Ceph 0.90?
Thx,
kks