Hyperconverged cluster logging seemingly random CRC errors

lifeboy

Renowned Member
We have 4 nodes (dual Xeon CPUs, 256 GB RAM, 4 NVMe SSDs, 4 HDDs and dual Mellanox 25 Gb/s SFPs) in a cluster. I have started noticing seemingly random CRC errors in the OSD logs.

Node B, osd.6
2025-10-23T10:32:59.808+0200 7f22a75bf700 0 bad crc in data 3330350463 != exp 677417498 from v1:192.168.131.4:0/3121668685
192.168.131.4 is node D

Node B, osd.7
2025-10-23T09:35:12.995+0200 7fbcdbcd7700 0 bad crc in data 3922083958 != exp 3479198006 from v1:192.168.131.2:0/2732728486
192.168.131.2 is node B, which is the node osd.7 is on.

There are others on other nodes and OSDs as well. From what I understand, this means that data copied from some other OSD to the one logging the error fails the CRC check. However, as a test I took one of these SSDs out of the cluster and it tested just fine. I put it back and no CRC errors were logged for it.

Question: Can something else be causing this? A network connector? It seems pretty random to me, so how can I trace the source of this?
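If a network connector were the culprit, I assume the link-level error counters would be the first thing to check; something like this per node should show the per-OSD counts and whether the NICs report anything (the interface name is just a placeholder for the 25 Gb/s ports):
Code:
grep -c "bad crc" /var/log/ceph/ceph-osd.*.log     # per-OSD count of these errors on a node
ip -s link show enp65s0f0                          # RX/TX error and drop counters (placeholder interface name)
ethtool -S enp65s0f0 | grep -iE 'err|crc|drop'     # NIC hardware counters, if the driver exposes them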
 
Hello,

Are the errors always on the same OSD, or are they really random?
Do you have a hardware RAID controller in your servers? If so, check via your IPMI system that your disks are not behind the hardware RAID, because that can be a problem.
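A quick way to check from the shell on each node (the exact output will of course depend on your hardware):
Code:
lspci | grep -iE 'raid|sas'          # is a RAID/SAS controller present at all?
lsblk -d -o NAME,MODEL,TRAN,ROTA     # TRAN should show nvme/sata for direct-attached disks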
 
a hint, maybe:
Are there Windows VMs involved?
IIRC it can be observed when using krbd instead of librbd.
 
Hello,

Are the errors always on the same OSD, or are they really random?
Do you have a hardware RAID controller in your servers? If so, check via your IPMI system that your disks are not behind the hardware RAID, because that can be a problem.
All 8 SSDs record CRC errors, between 5 and 20 per day. There's no hardware RAID involved. The carrier card is the one supplied by Supermicro, not even a 3rd-party one.

We do use krbd on our Ceph storage pool for the improved performance it offers. There aren't many Windows VMs, but there are a few. @wigor, could you elaborate a little on what the issue is with Windows guests with regard to CRC errors in the logs, please?

Update: I disabled krbd on the storage volumes yesterday, but I see quite a few CRC errors again today, so I'm re-enabling it.
 
Hi, this is likely due to an interaction between KRBD and Windows VMs, see [1] for more information and a workaround. As mentioned by @wigor, using librbd instead of KRBD is a possible workaround.
Update: I disabled krbd on the storage volumes yesterday, but I see quite a few CRC errors again today, so I'm re-enabling it.
Did you restart or live-migrate all running VMs after changing the setting? Note that KRBD vs librbd is only effective on VM start (or live-migration). In other words, if you have KRBD enabled on your storage and a VM is already running, disabling KRBD on the storage will not have any effect on the running VM (so you might still encounter the errors) -- only when you stop and start (or live-migrate) the VM, it will not use KRBD anymore.

[1] https://forum.proxmox.com/threads/f...bd-storage-for-windows-vms.155741/post-714951
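For reference, a minimal sketch of the steps on the CLI (storage name, VMID and target node are placeholders; the same KRBD option can also be toggled when editing the storage in the GUI):
Code:
pvesm set <storage> --krbd 0               # disable KRBD on the RBD storage
qm migrate <vmid> <target-node> --online   # live-migrate so the VM re-attaches its disks via librbd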
 
Hi, this is likely due to an interaction between KRBD and Windows VMs, see [1] for more information and a workaround. As mentioned by @wigor, using librbd instead of KRBD is a possible workaround.

The error message I get looks somewhat different:
I see in ceph-osd.7.log:
2025-11-03T15:45:22.665+0200 7fbcdbcd7700 0 bad crc in data 1513571956 != exp 3330889006 from v1:192.168.131.3:0/3917894537

In that post the error indicates it's libceph:
Code:
[Thu Oct 10 13:23:42 2024] libceph: read_partial_message 00000000d9278a57 data crc 623151463 != exp. 3643241286
[Thu Oct 10 13:23:42 2024] libceph: osd39 (1)158.*.*.*:7049 bad crc/signature
[Thu Oct 10 13:23:42 2024] libceph: osd76 (1)158.*.*.*:6978 bad crc/signature

Is that the same error?

Did you restart or live-migrate all running VMs after changing the setting? Note that KRBD vs librbd is only effective on VM start (or live-migration). In other words, if you have KRBD enabled on your storage and a VM is already running, disabling KRBD on the storage will not have any effect on the running VM (so you might still encounter the errors) -- only when you stop and start (or live-migrate) the VM, it will not use KRBD anymore.

No, I did not restart/live-migrate. I don't have many Windows machines running, so I could do that later tonight to test it.

 
No, I did not restart/live-migrate. I don't have many Windows machines running, so I could do that later tonight to test it.

I live-migrated most of the Windows machines last night. I still see the CRC errors (235 in total over all the OSDs on one particular node). How can I check whether the Windows VMs are actually not using krbd anymore?

Update: I found out how: qm showcmd 143 outputs "... -drive 'file=rbd:speedy/vm-143-disk-0 ..." which indicates that it's using librbd now.
Likewise for the other Windows VMs.
 
Hi,
I live-migrated most of the Windows machines last night. I still see the CRC errors (235 in total over all the OSDs on one particular node). How can I check whether the Windows VMs are actually not using krbd anymore?

Update: I found out how: qm showcmd 143 outputs "... -drive 'file=rbd:speedy/vm-143-disk-0 ..." which indicates that it's using librbd now.
Likewise for the other Windows VMs.
Not quite: this approach only shows the command line Proxmox VE would use if you started the VM now; it does not tell you whether a running VM uses KRBD or librbd. For this, you can navigate in the GUI to the "Monitor" tab of your VM and run the following command:
Code:
info block -v
For each image, check its image: entry:
  • if its value is a path starting with /dev/rbd-pve/, the image is mapped with KRBD
  • if its value is a JSON object mentioning "driver": "rbd", the image is attached via librbd
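If you prefer the shell, the same monitor command can be sent from the node the VM is running on, e.g. for VMID 143:
Code:
qm monitor 143
# then at the monitor prompt:
info block -v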
The error message I get looks somewhat different:
I see in ceph-osd.7.log:
2025-11-03T15:45:22.665+0200 7fbcdbcd7700 0 bad crc in data 1513571956 != exp 3330889006 from v1:192.168.131.3:0/3917894537


In that post the error indicates it's libceph:
Code:
[Thu Oct 10 13:23:42 2024] libceph: read_partial_message 00000000d9278a57 data crc 623151463 != exp. 3643241286
[Thu Oct 10 13:23:42 2024] libceph: osd39 (1)158.*.*.*:7049 bad crc/signature
[Thu Oct 10 13:23:42 2024] libceph: osd76 (1)158.*.*.*:6978 bad crc/signature

Is that the same error?
Good point, it might be a different error; I'd need to double-check. Can you confirm that you do not see the libceph errors in your host journals?
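To check, grepping the kernel log on each node should be enough, for example:
Code:
journalctl -k | grep -iE 'libceph|bad crc'
dmesg -T | grep -i libceph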
FWIW, I found a (German) post with similar "bad crc in data" errors where the culprit was faulty RAM [1], though this seems somewhat less likely if the warnings appear on all nodes.

[1] https://forum.proxmox.com/threads/meldung-bad-crc-in-data-in-ceph-osd-log.77741/#post-346185