When trying to run a backup of containers on Ceph storage, the vzdump container backup just hangs until we kill it from the host.
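(For context, "kill it from the host" is nothing clever, just finding the stuck vzdump process on the node and terminating it; the lines below are only a rough example, and the PID is whatever ps actually reports:)

# find the hung backup task on the Proxmox node
ps aux | grep vzdump
# try a normal terminate first, SIGKILL only if it refuses to die
kill <vzdump-pid>
kill -9 <vzdump-pid>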
After some digging, I changed the dump location from CIFS to NFS and applied the patch discussed in this thread:
https://forum.proxmox.com/threads/lxc-backups-hang-via-nfs-and-cifs.46669/#post-224815
and the bug is logged here:
https://bugzilla.proxmox.com/show_bug.cgi?id=1911
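For reference, the dump-location change was just pointing vzdump at an NFS mount instead of the CIFS share, roughly as sketched below. The export path and mount point are placeholders rather than our real ones, and this sketch assumes the dump directory is set globally in /etc/vzdump.conf rather than through a Proxmox storage entry.

# /etc/fstab entry for the NFS backup target (placeholder server and export)
nfs-server:/export/pve-dumps  /mnt/pve/nfs-dump  nfs  defaults  0  0

# global dump target in /etc/vzdump.conf
dumpdir: /mnt/pve/nfs-dump

# one-off test backup of a single container (CTID 101 is just an example)
vzdump 101 --dumpdir /mnt/pve/nfs-dump --mode snapshot --compress lzo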
I'm now getting "_verify_csum bad crc32c/0x1000 checksum" errors, but only while a backup is running.
24th Host syslog.1
Oct 23 22:05:16 PM1 ceph-osd[7119]: 2018-10-23 22:05:16.764978 7f0c9a798700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x69000, got 0x6706be76, expected 0x8bb9d12a, device location [0xa181a29000~1000], logical extent 0x3e9000~1000, object #4:3f53234e:::rbd_data.3ab6c374b0dc51.0000000000000373:head#
Oct 23 22:05:16 PM1 ceph-osd[7119]: 2018-10-23 22:05:16.765098 7f0c9a798700 -1 log_channel(cluster) log [ERR] : 4.fc missing primary copy of 4:3f53234e:::rbd_data.3ab6c374b0dc51.0000000000000373:head, will try copies on 4
24th Host osd-ceph-5
2018-10-23 22:05:16.764978 7f0c9a798700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x69000, got 0x6706be76, expected 0x8bb9d12a, device location [0xa181a29000~1000], logical extent 0x3e9000~1000, object #4:3f53234e:::rbd_data.3ab6c374b0dc51.0000000000000373:head#
2018-10-23 22:05:16.765098 7f0c9a798700 -1 log_channel(cluster) log [ERR] : 4.fc missing primary copy of 4:3f53234e:::rbd_data.3ab6c374b0dc51.0000000000000373:head, will try copies on 4
26th Host syslog.1
Oct 26 04:01:23 PM1 ceph-osd[7903]: 2018-10-26 04:01:23.663112 7f5c972c8700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x1000, got 0x6706be76, expected 0xcd6157f, device location [0x7d422a1000~1000], logical extent 0x1000~1000, object #4:4b97b22c:::rbd_data.3ab7c374b0dc51.00000000000046ad:head#
Oct 26 04:01:23 PM1 ceph-osd[7903]: 2018-10-26 04:01:23.722940 7f5c972c8700 -1 log_channel(cluster) log [ERR] : 4.1d2 missing primary copy of 4:4b97b22c:::rbd_data.3ab7c374b0dc51.00000000000046ad:head, will try copies on 2
26th Host osd-ceph-0
2018-10-26 03:52:53.716689 7f5c97ac9700 0 log_channel(cluster) log [DBG] : 4.1e scrub starts
2018-10-26 03:52:54.115551 7f5c97ac9700 0 log_channel(cluster) log [DBG] : 4.1e scrub ok
2018-10-26 04:01:23.663112 7f5c972c8700 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x1000, got 0x6706be76, expected 0xcd6157f, device location [0x7d422a1000~1000], logical extent 0x1000~1000, object #4:4b97b22c:::rbd_data.3ab7c374b0dc51.00000000000046ad:head#
2018-10-26 04:01:23.722940 7f5c972c8700 -1 log_channel(cluster) log [ERR] : 4.1d2 missing primary copy of 4:4b97b22c:::rbd_data.3ab7c374b0dc51.00000000000046ad:head, will try copies on 2
2018-10-26 04:05:56.797317 7f5c97ac9700 0 log_channel(cluster) log [DBG] : 4.140 scrub starts
2018-10-26 04:05:57.092745 7f5c97ac9700 0 log_channel(cluster) log [DBG] : 4.140 scrub ok
2018-10-26 04:06:57.659367 7f5c9f2d8700 4 rocksdb: [/home/builder/source/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:684] reusing log 8443 from recycle list
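For what it's worth, this is roughly how the PGs named in those errors can be checked (4.fc and 4.1d2 are taken straight from the logs above; the commands are standard Luminous tooling, nothing custom):

# overall cluster state and any PGs flagged inconsistent
ceph health detail
ceph pg dump pgs_brief | grep -i inconsistent

# deep-scrub the PGs from the errors, then list what the scrub found
ceph pg deep-scrub 4.fc
ceph pg deep-scrub 4.1d2
rados list-inconsistent-obj 4.fc --format=json-pretty

# only once an inconsistency is confirmed: let Ceph repair from the surviving copies
ceph pg repair 4.fc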
Any ideas?
I tried talking to the powers above me about using ZFS, but they decided Ceph sounded better to them.