Search results

  1. J

    "Move Disk" data corruption on 4.3

    Whoops, I thought spirit posted the 0.80.8 versions. Mir, where did *you* get 0.80.8 on 4.3? Also, I read the release notes for 0.80.8 vs. 0.80.7 and indeed none of the client-side fixes (and there weren't many) sound relevant so I agree 0.80.7 vs. 0.80.8 is not likely to be a big deal.
  2. J

    "Move Disk" data corruption on 4.3

    apt-cache policy shows the PVE enterprise repo is simply not offering any ceph packages on 4.3. For 3.4: # apt-cache policy librbd1 librbd1: Installed: 0.80.8-1~bpo70+1 Candidate: 0.80.8-1~bpo70+1 Version table: *** 0.80.8-1~bpo70+1 0 500 https://enterprise.proxmox.com/debian/...
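    For reference, a quick way to check this on any node is the sketch below; the package names are the ones discussed in the thread, and the repository output will of course differ per setup:

      # Show installed/candidate versions and which repository each candidate comes from
      apt-cache policy librbd1 ceph-common

      # List the apt sources actually enabled on the node (skipping commented-out lines)
      grep -rv '^#' /etc/apt/sources.list /etc/apt/sources.list.d/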
  3. J

    "Move Disk" data corruption on 4.3

    There is definitely some versioning weirdness. On the 3.4 servers where this problem did not occur, they have ceph 0.80.8 from the enterprise.proxmox.com repository. On the 4.3 servers where this problem did occur, they have ceph 0.80.7 from the debian repository. They don't show anything...
  4. J

    "Move Disk" data corruption on 4.3

    No; there are no pending upgrades for the ceph stuff on the proxmox nodes.
  5. J

    "Move Disk" data corruption on 4.3

    Unless you are claiming MySQL backdates file timestamps, the corrupted files were not written to for weeks before they became corrupted. The modification times and sizes were identical to the backups; only the contents differed. Please stop trying to blame this on MySQL. Spirit: # dpkg -l |...
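    A sketch of the kind of comparison being described, with placeholder paths (the actual table and backup locations are not given in the thread): size and mtime match the backup, but a byte-level compare shows the contents differ.

      # Size and modification time look identical between the live file and the backup...
      stat -c '%s %y %n' /var/lib/mysql/mydb/mytable.ibd /backup/mysql/mydb/mytable.ibd

      # ...yet the contents do not match
      cmp /var/lib/mysql/mydb/mytable.ibd /backup/mysql/mydb/mytable.ibd
      md5sum /var/lib/mysql/mydb/mytable.ibd /backup/mysql/mydb/mytable.ibd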
  6. J

    "Move Disk" data corruption on 4.3

    As referenced in the thread linked above, rbd_cache is set based on the Qemu cache setting. If you are referring to some other cache setting, please be more specific as this is the only relevant setting I am aware of. The filesystem is ext4 (rw,relatime,data=ordered). And to reiterate, the...
  7. J

    "Move Disk" data corruption on 4.3

    Also worth noting: we have two Proxmox clusters, one running 3.4 and one running 4.3. The 3.4 cluster did a lot more migrations yesterday, including up to 5 at once, and thus far they are all fine. The 4.3 cluster was doing them later, one at a time, and we have found multiple problems...
  8. J

    "Move Disk" data corruption on 4.3

    "cache=writeback" is the supported/recommended option for VM's using Ceph pools and sets the RBD cache behavior. ( https://forum.proxmox.com/threads/virtio-disk-caching-or-not.20945/#post-106899 )
  9. J

    "Move Disk" data corruption on 4.3

    Yes, the old disk was removed due to "delete source." No, the corruption was not caused by MySQL crashing as the corrupt tables have not been written to in many weeks. We have now found a second case on another MySQL VM where the process didn't crash until half an hour after the migration...
  10. J

    "Move Disk" data corruption on 4.3

    Yes, it's large, so I put it here: http://pastebin.com/bn4VnTf8 Moving disk images is very common for us as well; we have done a lot of RBD-to-RBD moves as well since the last instance of this issue a couple years ago, which is why an issue like this sucks the oxygen right out of the room.
  11. J

    "Move Disk" data corruption on 4.3

    Data corruption appears to have occurred while using "Move Disk" on a KVM VM running Ubuntu Xenial that is a database server. The MySQL server crashed during the migration and refused to start, citing InnoDB checksum errors in several tables, many of which had not been written to in months...
  12. J

    On 4.3 pve LVM volume groups are inactive at boot

    Hmm, I wouldn't regard these SSD's as particularly slow, and the same machines did work with previous versions with no trouble, but I will certainly give that a shot. Thanks for the suggestion!
  13. J

    On 4.3 pve LVM volume groups are inactive at boot

    We have reinstalled two Proxmox 3.x machines to 4.3 and on both machines, after the install, they come up with this error when booting: Loading, please wait... Volume group "pve" not found Cannot process volume group pve Unable to find LVM volume pve/root Gave up waiting for root device...
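    For context, the workaround usually suggested for this symptom (the LVM scan racing slow device initialization) looks roughly like the sketch below; the rootdelay value is an example, not taken from the thread:

      # One-off, from the (initramfs) emergency shell: activate the VG by hand and resume booting
      lvm vgchange -ay
      exit

      # Persistent: give the root device more time in /etc/default/grub, e.g.
      #   GRUB_CMDLINE_LINUX_DEFAULT="quiet rootdelay=10"
      # then regenerate the bootloader configuration
      update-grub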
  14. J

    Netboot PXE to ceph disk image?

    Is it theoretically possible to PXE boot a diskless server into Proxmox and then have it use a ceph block device as its root filesystem? The incremental cost of putting at least one SSD in a node, however small, doesn't seem like a lot in terms of cost, power, and failure point, but over a...
  15. J

    storage migration virtio failed

    Sure, sorry if I offended. It's sufficient to say that certain causes of migration failure can & should be addressed by the software, and others (like inferior hardware) really cannot.
  16. J

    storage migration virtio failed

    Points 1 & 2 are reasonable where they apply. Points 3 & 4 are, IMO, not. Likewise, risk of storage corruption is also not reasonable. Those are all things that need to be addressed on the software side rather than blaming the user for using too much space or I/O. The average size of disk...
  17. J

    storage migration virtio failed

    This appears to be helping. So far, I have had a handful of migrations fail, applied this change, and then seen no more migrations fail on that server. The failure rate isn't high enough to rule out coincidence, but it's leaning toward unlikely. Note also that this morning I got a...
  18. J

    storage migration virtio failed

    One of the clusters uses SSD's for journals. The other is 100% SSD. Write operations from inside VM's on the same Proxmox node can easily exceed 60MB/sec while a migration on that node gets 20MB/sec. That's still lower than I'd like, but it'd be very nice to get at least that much from storage...
  19. J

    storage migration virtio failed

    Is it possibly related that storage migration performance is pretty awful? It seems to peak at 20MB/sec and often stalls entirely for several seconds at a time. This is on a very lightly utilized Proxmox node with bonded gigabit links to the storage clusters, and the two Ceph clusters are...
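    (For scale: a single gigabit link is 1000 Mbit/s ÷ 8 ≈ 125 MB/s raw, roughly 110-117 MB/s of usable payload after protocol overhead, so a sustained 20 MB/sec is well under a fifth of even one unbonded link.)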
  20. J

    storage migration virtio failed

    The chance of migration failure definitely seems proportional to the I/O activity of the image being migrated.