General instability PVE 5.1

ktecho · Nov 23, 2017

Hi there,

A few months ago I started having problems with my Proxmox 4.4 server: corrupted disks while creating or deleting snapshots and backups, and similar issues. Because of this, and because I needed LVM-thin for better snapshot capabilities, I upgraded to 5.1 a few weeks ago using the OVH template.

The problems not only persist, they're worse. As I already mentioned in other threads, my Windows 10 VM locks up as often as once per day. The screen gets corrupted so I can't see anything, and CPU usage on the VM is at 100%. I plan to change the disks from VirtIO to IDE, as this VM was performing extremely slowly anyway, so let's see if that improves anything.

On the other hand, my Linux VMs are failing too. From time to time a customer tells me that their website is not working, and I find the VM using 0% CPU and unreachable through SSH, so I have to stop the machine and start it again.

Sometimes when I create or delete a snapshot, the operation fails:

Code:

Task viewer: VM 104 - Delete Snapshot
qcow2: Marking image as corrupt: Data cluster offset 0x1017800 unaligned (L2 offset: 0x160110000, L2 index: 0x200); further corruption events will be suppressed
qemu-img: Could not delete snapshot 'a20171110': Failed to free the cluster and L1 table: Input/output error
TASK ERROR: command '/usr/bin/qemu-img snapshot -d a20171110 /var/lib/vz/images/104/vm-104-disk-1.qcow2' failed: exit code 1

After that I cannot start the VM again because the qcow2 image is corrupted, so I need to run a qcow2-to-qcow2 "qemu-img convert" to recover the disk.
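For anyone following along, the recovery step described above can be sketched roughly as below. This is only a minimal sketch, not the exact procedure used in this thread: the helper names (`check_cmd`, `convert_cmd`, `recover`) and the file names in the usage note are made up for illustration. It assumes `qemu-img` is on the PATH, that the VM is stopped, and that there is enough free space for a second copy of the image. As far as I know, a plain convert copies only the current disk state, so internal snapshots would not be carried over to the new file.

```python
import subprocess


def check_cmd(image):
    """Build a read-only consistency check (no -r flag, so nothing is repaired)."""
    return ["qemu-img", "check", "-f", "qcow2", image]


def convert_cmd(src, dst):
    """Build a qcow2-to-qcow2 copy of the image into a fresh file."""
    return ["qemu-img", "convert", "-f", "qcow2", "-O", "qcow2", src, dst]


def recover(src, dst, runner=subprocess.run):
    """Run the check for its report, then convert into a new file.

    `runner` is injectable so the command sequence can be inspected
    without actually invoking qemu-img.
    """
    # The check is expected to return non-zero on a damaged image,
    # so its exit code is not treated as fatal here.
    runner(check_cmd(src), check=False)
    # The convert must succeed, otherwise the copy cannot be trusted.
    runner(convert_cmd(src, dst), check=True)
    return dst
```

Usage would be something like `recover("vm-104-disk-1.qcow2", "vm-104-disk-1.recovered.qcow2")`, keeping the original file untouched until the new image has been verified to boot.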

I've been unable to see anything useful on PVE logs or the VM logs.

Thinking that it could be a hardware problem, I've used several utilities to test and stress the hardware. I did that while the server was online, so the tests are not complete (I cannot stress all the RAM while VMs are running). I plan to do a full test this weekend using the OVH rescue system, but I would say it's not a hardware problem: the RAID reports everything is OK, "badsectors" can't find bad sectors, several RAM tests found no problems, and putting the CPU under load doesn't make the temperature go too high or cause failures.

Any tips on what I can do to figure out what's happening?

Thanks a lot,

Luis Miguel

LnxBil · Nov 23, 2017

Is this a "real" server, so do you have a management interface, real hardware logs and such? What do you mean by pve logs, also including syslog and dmesg?

ktecho · Nov 23, 2017

LnxBil said:
Is this a "real" server, so do you have a management interface, real hardware logs and such? What do you mean by pve logs, also including syslog and dmesg?

Yes, it's a real production server. By PVE logs I mean "logs in the proxmox host". Yes, I have also looked at syslog and dmesg.

This happened last night:

Code:

Nov 23 04:30:03 ns204651 vzdump[10705]: INFO: starting new backup job: vzdump 104 100 --mailto ktecho@ktecho.com --mode snapshot --compress lzo --quiet 1 --storage backupDiarios --mailnotification failure
Nov 23 04:30:03 ns204651 vzdump[10705]: INFO: Starting Backup of VM 100 (qemu)
Nov 23 04:30:04 ns204651 qm[10708]: <root@pam> update VM 100: -lock backup
Nov 23 05:30:05 ns204651 vzdump[10705]: VM 100 qmp command failed - VM 100 qmp command 'guest-fsfreeze-freeze' failed - got timeout
Nov 23 05:30:15 ns204651 vzdump[10705]: VM 100 qmp command failed - VM 100 qmp command 'guest-fsfreeze-thaw' failed - got timeout
Nov 23 05:34:21 ns204651 vzdump[10705]: INFO: Finished Backup of VM 100 (01:04:18)

When I woke up, the VM's web servers were not listening, so I had to "stop" and "start" the VM.
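To spot these events without reading the whole syslog by hand, something like the following could be used. It is a rough sketch that assumes the log lines have exactly the format shown in the excerpt above; the regular expression and the function name `find_qmp_timeouts` are my own and not part of any Proxmox tool.

```python
import re

# Matches vzdump lines such as:
#   Nov 23 05:30:05 host vzdump[10705]: VM 100 qmp command failed -
#   VM 100 qmp command 'guest-fsfreeze-freeze' failed - got timeout
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+vzdump\[\d+\]:\s+"
    r".*VM (?P<vmid>\d+) qmp command '(?P<cmd>[^']+)' failed - got timeout"
)


def find_qmp_timeouts(lines):
    """Return (timestamp, vmid, qmp_command) for every qmp timeout line."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append(
                (match.group("ts"), int(match.group("vmid")), match.group("cmd"))
            )
    return hits
```

Feeding it the lines of the host syslog (for example `find_qmp_timeouts(open("/var/log/syslog"))`, assuming that is where vzdump messages end up on the host) gives a list of which VM timed out on which guest agent command and when, which makes it easier to correlate with the moment the guest stopped responding.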

LnxBil · Nov 23, 2017

Yes, that's clear. I also experienced this in the past. The backup process was not able to freeze the guest fs, or so it claims. In my case, the guest OS filesystem was frozen, I got a lot of timeouts from the guest kernel, and the machine seemed to hang. Only a reset of the guest helped.

Do you use qemu-agent inside of the guest?

ktecho · Nov 23, 2017

LnxBil said:
Yes, that's clear. I also experienced this in the past. The backup process was not able to freeze the guest fs, or so it claims. In my case, the guest OS filesystem was frozen, I got a lot of timeouts from the guest kernel, and the machine seemed to hang. Only a reset of the guest helped.

Do you use qemu-agent inside of the guest?

Yes, I do:

Code:

root@ktechoubuntu:/var/www# ps axwww | grep qemu
  956 ?        Ss     0:04 /usr/sbin/qemu-ga --daemonize -m virtio-serial -p /dev/virtio-ports/org.qemu.guest_agent.0
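
A running qemu-ga process does not by itself prove the agent channel is wired up, so it can help to pull the channel method and device path out of that ps line and confirm the device exists in the guest. The snippet below is only a sketch for that; `parse_qemu_ga_args` is a made-up helper, and it assumes the agent was started with the `-m` and `-p` options as in the output above.

```python
import shlex


def parse_qemu_ga_args(ps_line):
    """Extract the channel method (-m) and device path (-p) from a ps line.

    Returns None if the line does not contain a qemu-ga command.
    """
    tokens = shlex.split(ps_line)
    # Locate the qemu-ga executable; everything after it is its argv.
    for index, token in enumerate(tokens):
        if token.endswith("qemu-ga"):
            args = tokens[index + 1:]
            break
    else:
        return None

    options = {}
    remaining = iter(args)
    for token in remaining:
        if token in ("-m", "--method"):
            options["method"] = next(remaining, None)
        elif token in ("-p", "--path"):
            options["path"] = next(remaining, None)
    return options
```

Inside the guest, the returned `path` can then be checked with `os.path.exists(...)`. If the device is missing, the agent is running but has nothing to talk to, which would also explain freeze and thaw timeouts on the host side.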
