Hyperconverged cluster logging seemingly random CRC errors

lifeboy

Renowned Member
We have 4 nodes (dual Xeon CPUs, 256 GB RAM, 4 NVMe SSDs, 4 HDDs and dual Mellanox 25 Gb/s SFPs) in a cluster. I have started noticing seemingly random CRC errors in the OSD logs.

Node B, osd.6
2025-10-23T10:32:59.808+0200 7f22a75bf700 0 bad crc in data 3330350463 != exp 677417498 from v1:192.168.131.4:0/3121668685
192.168.131.4 is node D

Node B, osd.7
2025-10-23T09:35:12.995+0200 7fbcdbcd7700 0 bad crc in data 3922083958 != exp 3479198006 from v1:192.168.131.2:0/2732728486
192.168.131.2 is node B, which is the node osd.7 is on.

There are others on other nodes and OSDs as well. From what I understand, this means that data copied from some other OSD to the one logging the error fails the CRC check. However, as a test I took one of these SSDs out of the cluster and it tested just fine. I put it back and no CRC errors were logged for it.

Question: Can something else be causing this? A network connector? It seems pretty random to me, so how can I trace the source of this?
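If a network connector were the culprit, I assume the link-level error counters would be the first thing to check; something like this per node should show the per-OSD counts and whether the NICs report anything (the interface name is just a placeholder for the 25 Gb/s ports):
Code:
grep -c "bad crc" /var/log/ceph/ceph-osd.*.log     # per-OSD count of these errors on a node
ip -s link show enp65s0f0                          # RX/TX error and drop counters (placeholder interface name)
ethtool -S enp65s0f0 | grep -iE 'err|crc|drop'     # NIC hardware counters, if the driver exposes them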
 
Hello,

Are the errors always on the same OSD, or are they really random?
Do you have a hardware RAID controller in your servers? If so, check via your IPMI system that your disks are not behind the hardware RAID, because that can be a problem.
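A quick way to check from the shell on each node (the exact output will of course depend on your hardware):
Code:
lspci | grep -iE 'raid|sas'          # is a RAID/SAS controller present at all?
lsblk -d -o NAME,MODEL,TRAN,ROTA     # TRAN should show nvme/sata for direct-attached disks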
 
a hint, maybe:
Are there Windows VMs involved?
IIRC it can be observed when using krbd instead of librbd.
 
Hello,

Are the errors always on the same OSD, or are they really random?
Do you have a hardware RAID controller in your servers? If so, check via your IPMI system that your disks are not behind the hardware RAID, because that can be a problem.
All 8 SSDs record CRC errors, between 5 and 20 per day. There's no hardware RAID involved. The carrier card is the one supplied by Supermicro, not even a 3rd-party one.

We do use krbd on our Ceph storage pool for the improved performance it offers. There aren't many Windows VMs, but there are a few. @wigor, could you elaborate a little on what the issue is with Windows guests with regard to CRC errors in the logs, please?

Update: I disabled krbd on the storage volumes yesterday, but I see quite a few CRC errors again today, so I'm re-enabling it.
 
Hi, this is likely due to an interaction between KRBD and Windows VMs, see [1] for more information and a workaround. As mentioned by @wigor, using librbd instead of KRBD is a possible workaround.
Update: I disabled krbd on the storage volumes yesterday, but I see quite a few CRC errors again today, so I'm re-enabling it.
Did you restart or live-migrate all running VMs after changing the setting? Note that KRBD vs librbd is only effective on VM start (or live-migration). In other words, if you have KRBD enabled on your storage and a VM is already running, disabling KRBD on the storage will not have any effect on the running VM (so you might still encounter the errors) -- only when you stop and start (or live-migrate) the VM, it will not use KRBD anymore.

[1] https://forum.proxmox.com/threads/f...bd-storage-for-windows-vms.155741/post-714951
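For reference, a minimal sketch of the steps on the CLI (storage name, VMID and target node are placeholders; the same KRBD option can also be toggled when editing the storage in the GUI):
Code:
pvesm set <storage> --krbd 0               # disable KRBD on the RBD storage
qm migrate <vmid> <target-node> --online   # live-migrate so the VM re-attaches its disks via librbd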
 
Hi, this is likely due to an interaction between KRBD and Windows VMs, see [1] for more information and a workaround. As mentioned by @wigor, using librbd instead of KRBD is a possible workaround.

The error message I get looks somewhat different:
I see in ceph-osd.7.log:
2025-11-03T15:45:22.665+0200 7fbcdbcd7700 0 bad crc in data 1513571956 != exp 3330889006 from v1:192.168.131.3:0/3917894537

In that post the error indicates it's libceph:
Code:
[Thu Oct 10 13:23:42 2024] libceph: read_partial_message 00000000d9278a57 data crc 623151463 != exp. 3643241286
[Thu Oct 10 13:23:42 2024] libceph: osd39 (1)158.*.*.*:7049 bad crc/signature
[Thu Oct 10 13:23:42 2024] libceph: osd76 (1)158.*.*.*:6978 bad crc/signature

Is that the same error?

Did you restart or live-migrate all running VMs after changing the setting? Note that KRBD vs librbd is only effective on VM start (or live-migration). In other words, if you have KRBD enabled on your storage and a VM is already running, disabling KRBD on the storage will not have any effect on the running VM (so you might still encounter the errors) -- only when you stop and start (or live-migrate) the VM, it will not use KRBD anymore.

No, I did not restart/live-migrate. I don't have many Windows machines running, so I could do that later tonight to test it.

 
No, I did not restart/live-migrate. I don't have many Windows machines running, so I could do that later tonight to test it.

I live-migrated most of the Windows machines last night. I still see the CRC errors (235 in total over all the OSDs on one particular node). How can I check whether the Windows VMs are actually not using krbd anymore?

Update: I found out how: qm showcmd 143 outputs "... -drive 'file=rbd:speedy/vm-143-disk-0 ..." which indicates that it's using librbd now.
Likewise for the other Windows VMs.
 
Hi,
I live-migrated most of the Windows machines last night. I still see the CRC errors (235 in total over all the OSDs on one particular node). How can I check whether the Windows VMs are actually not using krbd anymore?

Update: I found out how: qm showcmd 143 outputs "... -drive 'file=rbd:speedy/vm-143-disk-0 ..." which indicates that it's using librbd now.
Likewise for the other Windows VMs.
Not quite: this approach only shows the command line Proxmox VE would use if you started the VM now; it does not tell you whether a running VM uses KRBD or librbd. For this, you can navigate in the GUI to the "Monitor" tab of your VM and run the following command:
Code:
info block -v
For each image, check its image: entry:
  • if its value is a path starting with /dev/rbd-pve/, the image is mapped with KRBD
  • if its value is a JSON object mentioning "driver": "rbd", the image is attached via librbd
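If you prefer the shell, the same monitor command can be sent from the node the VM is running on, e.g. for VMID 143:
Code:
qm monitor 143
# then at the monitor prompt:
info block -v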
The error message I get looks somewhat different:
I see in ceph-osd.7.log:
2025-11-03T15:45:22.665+0200 7fbcdbcd7700 0 bad crc in data 1513571956 != exp 3330889006 from v1:192.168.131.3:0/3917894537


In that post the error indicates it's libceph:
Code:
[Thu Oct 10 13:23:42 2024] libceph: read_partial_message 00000000d9278a57 data crc 623151463 != exp. 3643241286
[Thu Oct 10 13:23:42 2024] libceph: osd39 (1)158.*.*.*:7049 bad crc/signature
[Thu Oct 10 13:23:42 2024] libceph: osd76 (1)158.*.*.*:6978 bad crc/signature

Is that the same error?
Good point, it might be a different error; I'd need to double-check. Can you confirm that you do not see the libceph errors in your host journals?
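To check, grepping the kernel log on each node should be enough, for example:
Code:
journalctl -k | grep -iE 'libceph|bad crc'
dmesg -T | grep -i libceph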
FWIW, I found a (German) post with similar "bad crc in data" errors where the culprit was faulty RAM [1], though this seems somewhat less likely if the warnings appear on all nodes.

[1] https://forum.proxmox.com/threads/meldung-bad-crc-in-data-in-ceph-osd-log.77741/#post-346185