Ceph bug - patch inclusion request

Which version of ceph (ceph versions) and which version of PVE (pveversion -v) are you running on?
 
Latest from Proxmox and Proxmox Ceph Repo:
Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-1-pve)
pve-manager: 5.2-6 (running version: 5.2-6/bcd5f008)
pve-kernel-4.15: 5.2-4
pve-kernel-4.15.18-1-pve: 4.15.18-17
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
ceph: 12.2.7-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.12.12-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-37
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-24
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-2
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-29
pve-container: 2.0-24
pve-docs: 5.2-5
pve-firewall: 3.0-13
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-30
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
 
With 'ceph versions' you can see whether all nodes run the current Ceph code.
 
Unsurprisingly:
Code:
root@vub-host-01:~# ceph versions
{
    "mon": {
        "ceph version 12.2.7 (94ce186ac93bb28c3c444bccfefb8a31eb0748e4) luminous (stable)": 5
    },
    "mgr": {
        "ceph version 12.2.7 (94ce186ac93bb28c3c444bccfefb8a31eb0748e4) luminous (stable)": 5
    },
    "osd": {
        "ceph version 12.2.7 (94ce186ac93bb28c3c444bccfefb8a31eb0748e4) luminous (stable)": 14
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.7 (94ce186ac93bb28c3c444bccfefb8a31eb0748e4) luminous (stable)": 24
    }
}
 
Here is some more information (the bug report is quite long): on recent kernels, the kernel sometimes returns all zeros when a block is read from disk. The bug report states that this happens under memory pressure, but our hosts are not particularly loaded, yet we still see this bug. The bugfix/workaround is to retry the read (by default up to 3 times).

The bug has three indicators:
  • on the host using the KRBD device: this entry in dmesg:
    Code:
    libceph: get_reply osd7 tid 65355112 data 1966080 > preallocated 65536, skipping
    (the osd number and tid/data numbers vary)
  • on the host running the mentioned OSD, in the corresponding ceph-osd.xx.log:
    Code:
    2018-08-19 08:37:09.224534 7f2837594700 -1 bluestore(/var/lib/ceph/osd/ceph-14) _verify_csum bad crc32c/0x1000 checksum at blob offset 0xd000, got 0x6706be76, expected 0x9034ab13, device location [0x2b0e075000~1000], logical extent 0xcd000~1000, object #1:b267826c:::rbd_data.0b725f2ae8944a.0000000000007cbd:head#
    The important clue here is got 0x6706be76, which is the crc32c checksum of an all-zeros block
  • in the VM and on the host running the VM: 100% busy time on the RBD/block device, meaning that VM I/O never continues and that even a sync on the host will not complete, making it necessary to (a) force-stop the VM and (b) force/hard reboot the host (quick checks for each indicator are sketched below)
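A minimal sketch of how these three indicators can be checked on a suspect host (log paths and device names are the usual defaults, adjust to your setup):
Code:
# 1) KRBD client host: the "skipping" message in the kernel log
dmesg -T | grep 'get_reply'

# 2) OSD host: checksum failures that hit the all-zeros value
grep '_verify_csum' /var/log/ceph/ceph-osd.*.log | grep '0x6706be76'

# 3) VM host: watch the rbdX devices for a sustained 100% utilization
iostat -x 1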
Our environment is 5 hosts with 64GB RAM each (one has 128GB), 2 or 3 OSDs per host, all SSD, about 2TB per host, using a replicated Ceph pool and KRBD, with quite busy databases running in the VMs. The bug occurs randomly, maybe on average once every 2 weeks or so? I haven't really kept track.

I think the busyness and size of the VMs increase the chances of seeing this bug, since we mostly see it with our largest and busiest MySQL DB of about 300GB, which is almost always 100% I/O bound.
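If it helps with reproduction, one way to approximate this kind of I/O-bound database load in a test VM (or LXC container) on the same KRBD-backed storage would be fio; the file path and sizes below are just examples:
Code:
# Sustained 70/30 random read/write load with direct I/O, roughly database-like
fio --name=dbsim --filename=/root/fio-test.dat --size=20G \
    --rw=randrw --rwmixread=70 --bs=16k --ioengine=libaio \
    --iodepth=32 --numjobs=4 --direct=1 \
    --time_based --runtime=3600 --group_reporting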
 
The patch is written against the current master (Nautilus). It is still a pull request and has not been reviewed or confirmed to be complete. With those two points alone, there will be no upstream support for whatever side effects it may have, and a backport will be hard to do (and to maintain).

Further, I believe this makes reads slower: the proposed workaround merely retries a read up to N times, so in the worst case a read is N times slower on the BlueStore side.

IMHO, the real problem is that the VM stalls when it cannot read a couple of blocks; that is where the actual retry should happen. As you are using KRBD, this might be a kernel issue, maybe for a different reason than the one in Sage's statement.

As the above is no solution, maybe the points below can help to mitigate the issue (a rough config sketch follows after the list).
  • Did you try the suggestions from the tracker, i.e. reducing the memory usage of the OSDs or disabling swap? Buffers/caches using almost all available RAM can lead to memory pressure too.
  • We have seen similar issues when RAID controllers are used for OSDs. If you run them, try IT mode or plain HBAs.
  • Maybe you can switch the VMs to librbd, so that Qemu talks to the cluster through librbd instead of the kernel client, in case it is a kernel issue.
  • Increase logging, if you haven't done so already. Maybe only specific OSDs are involved.
    http://docs.ceph.com/docs/luminous/...g-and-debug/#subsystem-log-and-debug-settings
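A rough sketch of what these mitigations could look like on a PVE 5/Luminous setup (the cache value and storage change are examples only, adjust to your hardware and configuration):
Code:
# Reduce the BlueStore cache per OSD, e.g. in /etc/pve/ceph.conf:
#   [osd]
#   bluestore cache size ssd = 1073741824   # 1 GiB instead of the 3 GiB default
# then restart the OSDs one at a time: systemctl restart ceph-osd@<id>

# Disable swap on the hosts (comment the swap entry out of /etc/fstab to make it stick)
swapoff -a

# Switch a PVE RBD storage from KRBD to librbd: set "krbd 0" for that storage
# in /etc/pve/storage.cfg (or untick KRBD in the GUI), then power-cycle the VMs

# Temporarily raise OSD/BlueStore log levels on all OSDs
ceph tell osd.* injectargs '--debug_bluestore 10/10 --debug_osd 5/5'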
 
The patch seems very trivial to me: it merely retries reads a specified number of times in case of a checksum failure. It will only be "slower" when there actually is a checksum failure. How often do you think a checksum failure will/should occur?
  • Host memory usage is around 30-40%. How much lower should it get? More than 40GB RAM free? [EDIT]I see that you mention caches, which I consider "free", but there is also at least 1GB of actually free memory (MemFree in /proc/meminfo), and we have set vm.min_free_kbytes to 512MB.[/EDIT] (The exact checks are sketched after this list.)
  • We do not use RAID cards, only plain SATA or NVMe.
  • librbd is simply too slow. We run a serious workload and librbd is factors slower than KRBD, so that's not an option. Also, KRBD would be required for LXC. I can try to run a workload on LXC to reproduce the issue there too.
  • Not all OSDs have been affected, probably only because it hasn't occurred that often yet. However, those that have been affected vary across SATA/NVMe and Intel/AMD, so I can't see a specific pattern there.
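For reference, this is roughly how the memory figures above are checked and how the reserve was raised (524288 kB corresponds to the 512MB mentioned; persist the sysctl via /etc/sysctl.d/ to survive reboots):
Code:
# Actually free vs. reclaimable memory on the host
grep -E 'MemFree|MemAvailable' /proc/meminfo

# Current low-memory reserve kept free by the kernel
sysctl vm.min_free_kbytes

# Raise the reserve to 512 MB
sysctl -w vm.min_free_kbytes=524288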
I do think that it is a conceptual problem that Ceph/RBD blocks when a failure occurs; the Ceph bug tracker is full of "I/O stuck" reports. Note that this is only a general remark about Ceph, nothing to do with Proxmox. Still, as Proxmox advertises Ceph as a supported storage backend, it is somewhat relevant in my opinion.

(As another note, sometimes I think we are the only ones to run a serious workload on Proxmox, since often when I see bugs related to heavy usage, there seems to be little feedback and traction about resolving them.)
 
It happened again last night, with a less heavily used VM:
Code:
2018-08-21 22:28:37.298133 7f9fc6eae700 -1 bluestore(/var/lib/ceph/osd/ceph-1) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x3000, got 0x6706be76, expected 0x77722c59, device location [0x40a89bb000~1000], logical extent 0x233000~1000, object #1:d41fdeea:::rbd_data.6753fa2ae8944a.0000000000008b5c:head#
The terrible thing is that the next message is
Code:
2018-08-21 22:28:37.299809 7f9fc6eae700 -1 log_channel(cluster) log [ERR] : 1.2b missing primary copy of 1:d41fdeea:::rbd_data.6753fa2ae8944a.0000000000008b5c:head, will try copies on 12,14
which would be fine, if it actually worked. I do wonder why it says it will read from one of the remaining copies, yet I/O still stops.
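A possible way to double-check whether the on-disk copies of the affected placement group (1.2b in the log above) are actually intact would be to query and deep-scrub that PG; this is just a sketch, the PG id comes from the log line:
Code:
ceph pg 1.2b query        # state, acting set, last (deep-)scrub timestamps
ceph pg deep-scrub 1.2b   # re-read and re-checksum all copies of that PG
ceph health detail        # look for scrub/inconsistency errors afterwards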

MemFree was about 1.2GB around that time, as evidenced by our monitoring; MemAvailable was around 41.8GB.

And of course, if this alleged kernel bug were fixed, that would be good as well. Still, I think it is not unwise to prevent a single OSD's I/O failure from resulting in completely blocked I/O on the client.
 
This bug was attached to the 22464 one. Add your situation to the bug report, so Sage sees that more than one installation is affected (also on a different Ceph version): http://tracker.ceph.com/issues/25006

As of now, we will not backport this patch, as a) there is no upstream blessing and b) it is not easily done (it is two Ceph versions ahead and doesn't build).
 
