Recent content by lifeboy

  1.

    Hyperconverged cluster logging seemingly random crc errors

    I live-migrated most of the Windows machines last night. I still see the crc errors (235 in total over all the OSDs on one particular node). How can I check whether the Windows VMs are actually no longer using krbd? Update: I found how to: qm showcmd 143 outputs "...-drive...
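    A quick way to verify this, sketched below (VM ID 143 is just the example from the post; the pool and disk names will differ):

      # List the generated QEMU arguments and look at how the disks are attached.
      qm showcmd 143 --pretty | grep -E 'rbd|drive'
      # krbd in use:   the disk path is a kernel block device, e.g. /dev/rbd-pve/<fsid>/<pool>/vm-143-disk-0
      # librbd in use: the drive argument looks like file=rbd:<pool>/vm-143-disk-0:conf=...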
  2.

    Hyperconverged cluster logging seemingly random crc errors

    The error message I get looks somewhat different: I see in ceph-osd.7.log: 2025-11-03T15:45:22.665+0200 7fbcdbcd7700 0 bad crc in data 1513571956 != exp 3330889006 from v1:192.168.131.3:0/3917894537 In that post the error indicates it's libceph: [Thu Oct 10 13:23:42 2024] libceph...
  3.

    Hyperconverged cluster logging seemingly random crc errors

    All 8 SSDs record crc errors, between 5 and 20 per day. There's no hardware RAID involved. The carrier card is the one supplied by SuperMicro, not even a 3rd-party one. We do use krbd on our ceph storage pool for the improved performance it offers. There aren't many Windows VMs, but...
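    A rough way to count these per OSD and per day, assuming the default /var/log/ceph log location:

      # Count today's "bad crc in data" events in each OSD log (adjust the date as needed)
      for f in /var/log/ceph/ceph-osd.*.log; do
          printf '%s: ' "$f"
          grep -c "$(date +%F).*bad crc in data" "$f"
      done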
  4.

    Hyperconverged cluster logging seemingly random crc errors

    Doesn't anyone have an idea why this is happening? It doesn't happen much, but it's persistent across all nodes.
  5.

    Hyperconverged cluster logging seemingly random crc errors

    We have 4 nodes (dual Xeon CPUs, 256 GB RAM, 4 NVMe SSDs, 4 HDDs and dual Mellanox 25 Gb/s SFPs) in a cluster. Randomly I have started noticing crc errors in the OSD logs. Node B, osd.6 2025-10-23T10:32:59.808+0200 7f22a75bf700 0 bad crc in data 3330350463 != exp 677417498 from...
  6.

    Node went down - unclear why - log attached

    That's inconsequential. That Node was down and had started up, but PBS1 and InfluxDB were not started yet. I have attached all the records from journalctl between 12:00 and the completed shutdown of the Node. I don't see any reason why the Node shut down. It looks like an orderly shutdown, so...
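    A minimal sketch of pulling such a window out of the journal (the timestamps are placeholders for the day of the incident):

      # Export everything logged between 12:00 and the end of the shutdown window
      journalctl --since "2025-09-08 12:00" --until "2025-09-08 13:00" -o short-precise > node-shutdown.log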
  7.

    Node went down - unclear why - log attached

    We had a node go down two days ago and I'm at a loss figuring out why. I attached the log. This happened at 12:30. The other nodes simply show that the OSDs went down and feverishly started rebalancing the cluster. Is there any indication as to why? Sep 8 12:29:56 FT1-NodeA...
  8.

    NVMe OSD generates crc error. Failing drive?

    I have a relatively new Samsung enterprise NVMe drive in a node that is generating the following error: ... 2025-08-26T15:56:43.870+0200 7fe8ac968700 0 bad crc in data 3326000616 != exp 1246001655 from v1:192.168.131.4:0/1799093090 2025-08-26T16:03:54.757+0200 7fe8ad96a700 0 bad crc in data...
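    To see whether the drive itself reports trouble, something like the following can be checked (device names are assumptions; smartctl comes from smartmontools, nvme from nvme-cli):

      # SMART/health data for the NVMe; media or integrity errors would point at the drive
      smartctl -a /dev/nvme0n1 | grep -iE 'media|integrity|error'
      nvme smart-log /dev/nvme0 | grep -iE 'critical_warning|media_errors|num_err_log_entries'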
  9.

    enabling ceph image replication: how to set up host addresses?

    No, @birdflewza. I didn't pursue this any further, since the customer that requested it didn't want it anymore. It's on our list though, so we'll visit this again some time.
  10.

    CPU frequency

    Proxmox 7 also runs in "Performance mode", so unless you're on an older version this tweak should not be necessary. You can check this with the following: cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor performance performance ...
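    If a node did report something other than "performance", it could be switched like this (run as root; the change does not persist across reboots):

      echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor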
  11.

    Perplexing: When a node is turned off, the whole cluster loses its network

    Yes, we have checked that in great detail. Every VLAN on the Mellanox switch has all the active ports joined to it, so no matter where the VM runs, the VLANs are active there.
  12.

    Perplexing: When a node is turned off, the whole cluster loses its network

    Indeed, some VMs crashed. However, the 2 pfSense VMs are 100 and 101 and neither crashed. That was the first thing I checked for in the logs. The reason for taking the nodes down was exactly that: We doubled the RAM in each node.
  13.

    Perplexing: When a node is turned off, the whole cluster loses its network

    Logs of 10 Nov attached. The first node to be shut down was NodeC at about 12:40, then NodeD, then A, and last B.
  14.

    Perplexing: When a node is turned off, the whole cluster loses its network

    That's why I have two instances of pfSense that poll each other with CARP. If I shut one down, the other takes over within seconds. So it's not that. The VMs on the nodes stay on, but they don't communicate with the control plane anymore as far as I can tell. So if I check the logs on a...
  15.

    Perplexing: When a node is turned off, the whole cluster loses its network

    Hmm... that is the only clue I have been able to find about what happens. Or maybe it's unrelated then?