Hi there,
A few months ago I started having problems with my Proxmox 4.4 server: disks got corrupted while creating or deleting snapshots and backups, and similar issues. Because of this, and because I wanted to move to LVM-thin for better snapshot capabilities, a few weeks ago I upgraded to 5.1 using the OVH template.
The problems not only persist, they are worse. As I already mentioned in other threads, my Windows 10 machine locks up as often as once a day. The screen gets corrupted so I can't see anything, and CPU usage on the VM sits at 100%. I plan to change the disks from VirtIO to IDE, as this VM was performing extremely slowly anyway, so let's see if that improves anything.
My Linux VMs are failing too. From time to time a customer tells me that their website is not working, and I find the VM using 0% CPU and unreachable over SSH, so I have to stop the machine and start it again.
Sometimes I start a snapshot, or ask to delete one, and it fails:
Code:
Task viewer: VM 104 - Delete Snapshot
qcow2: Marking image as corrupt: Data cluster offset 0x1017800 unaligned (L2 offset: 0x160110000, L2 index: 0x200); further corruption events will be suppressed
qemu-img: Could not delete snapshot 'a20171110': Failed to free the cluster and L1 table: Input/output error
TASK ERROR: command '/usr/bin/qemu-img snapshot -d a20171110 /var/lib/vz/images/104/vm-104-disk-1.qcow2' failed: exit code 1
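To make the "unaligned" message above easier to read, here is a minimal Python sketch that parses the log line and checks both offsets against the cluster size. It assumes the image uses the default qcow2 cluster size of 64 KiB, which I have not confirmed for this disk; the function names are only illustrative.

```python
import re

# Assumption: the image was created with the default qcow2 cluster size (64 KiB).
CLUSTER_SIZE = 64 * 1024

# The corruption message exactly as it appears in the task log.
LOG_LINE = (
    "qcow2: Marking image as corrupt: Data cluster offset 0x1017800 unaligned "
    "(L2 offset: 0x160110000, L2 index: 0x200); "
    "further corruption events will be suppressed"
)


def parse_corruption_line(line):
    """Extract the three hex values reported in the corruption message."""
    pattern = (
        r"Data cluster offset (0x[0-9a-fA-F]+) unaligned "
        r"\(L2 offset: (0x[0-9a-fA-F]+), L2 index: (0x[0-9a-fA-F]+)\)"
    )
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not a recognised qcow2 corruption message")
    data_offset, l2_offset, l2_index = (int(g, 16) for g in match.groups())
    return {
        "data_offset": data_offset,
        "l2_offset": l2_offset,
        "l2_index": l2_index,
    }


def misalignment(offset, cluster_size=CLUSTER_SIZE):
    """Return how many bytes the offset lies past the previous cluster boundary."""
    return offset % cluster_size


info = parse_corruption_line(LOG_LINE)
for name in ("data_offset", "l2_offset"):
    extra = misalignment(info[name])
    state = "aligned" if extra == 0 else "unaligned by " + hex(extra)
    print(name, hex(info[name]), state)
```

Under that assumption the L2 table offset falls on a cluster boundary, while the data cluster offset lies 0x7800 bytes past one, which matches what the message complains about.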
After this I cannot start the VM again because the qcow2 image is corrupted, so I need to do a "qemu-img convert" from qcow2 to qcow2 to recover the disk.
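For anyone who wants the recovery step spelled out, this is roughly what it looks like, written as a small Python helper that only prints the commands and does not run them. The source path is the one from the task log; the destination file name is hypothetical, and the helper itself is just a sketch, not an official procedure.

```python
import shlex


def recovery_commands(src, dst, fmt="qcow2"):
    """Build the qemu-img commands: check the source, copy it, check the copy."""
    return [
        ["qemu-img", "check", "-f", fmt, src],
        ["qemu-img", "convert", "-f", fmt, "-O", fmt, src, dst],
        ["qemu-img", "check", "-f", fmt, dst],
    ]


# Path taken from the task log above.
SRC = "/var/lib/vz/images/104/vm-104-disk-1.qcow2"
# Hypothetical destination name, chosen only for this example.
DST = "/var/lib/vz/images/104/vm-104-disk-1.recovered.qcow2"

# Print the commands for review instead of executing them.
for cmd in recovery_commands(SRC, DST):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The commands are meant to be run with the VM stopped. As far as I know, the converted copy contains only the current disk state, so internal snapshots are not carried over.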
I haven't been able to find anything useful in the PVE logs or the VM logs.
Thinking it could be a hardware problem, I've used several utilities to test and stress the hardware. I did this while the server was online, so the tests are not complete (I cannot stress all the RAM while the VMs are running). I plan to run a full test this weekend using the OVH rescue system, but I would say it's not a hardware problem: the RAID reports everything is OK, "badsectors" finds no bad sectors, several RAM tests find no problems, and putting the CPU under load doesn't make the temperature go too high or cause failures.
Any tips on what I can do to find out what's happening?
Thanks a lot,
Luis Miguel