Search results

  1. J

    "Move Disk" data corruption on 4.3

    Something interesting that we discovered late yesterday... After we finished the Ceph upgrades and shuffled everything back into its proper place using Move Disk without the delete option, we gave it 24-48 hours to make sure all was working and then went through to purge all the unused disk...
  2. J

    "Move Disk" data corruption on 4.3

    We, too, have used it hundreds of times and have had only a handful of problems. Ceph is designed for large-scale storage, so it does replicate and regularly scrub data looking for bitrot. "Move disk" isn't a disruptive operation, so it doesn't cause the guest to check/reread anything...
  3. J

    "Move Disk" data corruption on 4.3

    Just to follow up on this, after updating our Ceph clusters and Proxmox nodes to Jewel 10.2.3, we have moved over 100 disk images without incident or (detected) corruption, including dozens of MySQL servers. The only other change we have made is that we also changed policy to forbid use of...
  4. J

    Request: Serial Terminal logging

    This would be a great feature, though such logs probably belong in /var/log rather than /var/lib.
  5. J

    Ability to group hosts

    While upgrading Proxmox to research some problems (discussed elsewhere on here), at one point we had five Proxmox clusters going due to various combinations of CPU type and Proxmox version. This makes it really tough to find a particular VM. Even at the best of times we have two production...
  6. J

    "Move Disk" data corruption on 4.3

    Another server that had its disks moved around the same time popped up with serious filesystem corruption today. This was an email server, not MySQL, and again the files that got corrupted were largely static system configuration files that had not been updated in months -- years in a couple of...
  7. J

    "Move Disk" data corruption on 4.3

    So far I also have not been able to reproduce the problem, although I haven't had as much time for testing as I would want. Still haven't conducted the read-workload tests I hope to try. As many of the tests involve reinstalling Proxmox over remote IPMI at a glacial pace, it's a very slow...
  8. J

    "Move Disk" data corruption on 4.3

    Yes, my situation was much the same. :( Being focused on recovery, I did not gather nearly the amount of information in retrospect it would be good to have now.
  9. J

    "Move Disk" data corruption on 4.3

    BHM, what error message did MySQL give you when it crashed, and which specific files were corrupted in your case?
  10. J

    "Move Disk" data corruption on 4.3

    Right now, I am a little suspicious of the "delete source" option, though I do not have a strong basis for that. It's mainly that we immediately stopped using it as a safety precaution, and suddenly we can no longer reproduce the issue. Also Black Knight MHT used it and had the problem...
  11. J

    "Move Disk" data corruption on 4.3

    I'm sorry you feel that way. The .frm files that are being corrupted are structural files that have a special purpose and are not written during ordinary operation. They do not contain table data. They do not contain indexes. They are not part of the binlog. They are not part of the innodb...
  12. J

    "Move Disk" data corruption on 4.3

    Please let go of the idea that MySQL is somehow writing to files that aren't being written to, and that this is somehow causing corruption. Your understanding of what, when and how MySQL writes is not correct. The binlog is a special-purpose feature used for replication and has nothing to do...
  13. J

    "Move Disk" data corruption on 4.3

    It is definitely strange that this (so far) affects only servers running MySQL. But in our case, in addition to files that happen not to have been written, even some .frm files got corrupted and those are never written unless the database schema is manually changed, which definitely wasn't the...
  14. J

    "Move Disk" data corruption on 4.3

    Is this reproducible? Did you use the "Delete Source" option? How many VMs did you move at a time? Does your PVE cluster have ECC? Do both your source and target fileservers have ECC? Are there any network errors on the interface of any involved server or switch port? Were you able to get...
  15. J

    "Move Disk" data corruption on 4.3

    It may not be that fixed then, as our test was conducted with librbd 10.2.3-1~bpo80+1. But as of right now I really don't see any reason to lay this issue at the door of Ceph at all.
  16. J

    "Move Disk" data corruption on 4.3

    Correct. Correct. Correct.
  17. J

    "Move Disk" data corruption on 4.3

    A "repro" is a way to reliably reproduce the problem so it can be further examined and resolved. (And if the repro is any good, other people will be able to follow it and reproduce the problem as well.) As it stands, nothing I do outside of production has been able to reproduce the problem, up...
  18. J

    "Move Disk" data corruption on 4.3

    Per the Ceph developers, mismatched client/server versions should not cause any issues without supplemental stupidity (such as using features or tunables not supported by the older client), and they work very hard to preserve compatibility. And even in the case of supplemental stupidity...
  19. J

    "Move Disk" data corruption on 4.3

    Whether or not the Ceph versioning issue is a problem, it is a separate problem, unlikely to be the cause of the issue. This is not a new situation. These two SSD-based Ceph clusters are our smallest ones. Our largest 55TB storage cluster is also running Jewel and has been under heavy load...
  20. J

    "Move Disk" data corruption on 4.3

    What about using new librbd with Proxmox in violation of version-specific dependencies? Is that tested? That is not an area where I want to blaze experimental new trails.