General instability PVE 5.1

ktecho

Hi there,

A few months ago I started having problems with my Proxmox 4.4 server: corrupted disks while taking or deleting snapshots and backups, and things like that. Because of this, and because I needed LVM-thin for better snapshot capabilities, a few weeks ago I upgraded to 5.1 using the OVH template.

The problems not only persist, they're worse. As I already mentioned in other threads, my Windows 10 VM locks up as often as once per day: the screen gets corrupted so I can't see anything, and the VM's CPU sits at 100%. I plan to change its disks from VirtIO to IDE, as this VM was performing extremely slowly anyway, so let's see if that improves anything.

On the other hand, my Linux VMs are failing too. From time to time a customer tells me that their website is not working; I then find the VM using 0% CPU and cannot log in through SSH, so I have to stop the machine and start it again.

Sometimes I start a snapshot, or ask to delete one, and it fails:

Code:
Task viewer: VM 104 - Delete Snapshot
qcow2: Marking image as corrupt: Data cluster offset 0x1017800 unaligned (L2 offset: 0x160110000, L2 index: 0x200); further corruption events will be suppressed
qemu-img: Could not delete snapshot 'a20171110': Failed to free the cluster and L1 table: Input/output error
TASK ERROR: command '/usr/bin/qemu-img snapshot -d a20171110 /var/lib/vz/images/104/vm-104-disk-1.qcow2' failed: exit code 1

So now I cannot start the VM again because the qcow2 image is corrupted, and I need to run "qemu-img convert" from qcow2 to qcow2 to recover the disk.
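For anyone following along, here is a rough sketch of the qcow2-to-qcow2 recovery mentioned above. It only builds and prints the qemu-img command lines and runs nothing. The ".repaired" output name and the helper name recovery_commands are made up for illustration, and this assumes the VM is stopped and that there is enough free space for a second copy. Note that "qemu-img convert" copies only the active disk state, so internal snapshots are not carried over to the new image.

```python
import shlex


def recovery_commands(image, repaired):
    """Return qemu-img command lines as argv lists; nothing is executed here."""
    return [
        # Report what qemu-img thinks is wrong with the image.
        ["qemu-img", "check", image],
        # Copy the readable data into a fresh qcow2 file.
        ["qemu-img", "convert", "-f", "qcow2", "-O", "qcow2", image, repaired],
        # Check the new image before swapping it in.
        ["qemu-img", "check", repaired],
    ]


if __name__ == "__main__":
    src = "/var/lib/vz/images/104/vm-104-disk-1.qcow2"
    dst = src + ".repaired"  # hypothetical output name
    for argv in recovery_commands(src, dst):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Review the printed commands and run them by hand. Keep the original file until the VM boots cleanly from the converted image.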

I haven't found anything useful in the PVE logs or the VM logs.

Thinking that it could be a hardware problem, I've run several test utilities to test and stress the hardware. I did that while the server was online, so the tests are not complete (I cannot stress all the RAM while VMs are running). I plan to do a full test this weekend using OVH's rescue system, but I would say it's not a hardware problem: the RAID reports everything is OK, "badsectors" can't find bad sectors, several RAM tests found no problems, and putting the CPU under load doesn't make the temperature go too high or cause failures.

Any tips on what I can do to figure out what's happening?

Thanks a lot,

Luis Miguel
 
Is this a "real" server, i.e. do you have a management interface, real hardware logs and such? What do you mean by PVE logs? Does that include syslog and dmesg?
 
Yes, it's a real production server. By PVE logs I mean the logs on the Proxmox host. Yes, I also looked at syslog and dmesg.

This happened last night:

Code:
Nov 23 04:30:03 ns204651 vzdump[10705]: INFO: starting new backup job: vzdump 104 100 --mailto ktecho@ktecho.com --mode snapshot --compress lzo --quiet 1 --storage backupDiarios --mailnotification failure
Nov 23 04:30:03 ns204651 vzdump[10705]: INFO: Starting Backup of VM 100 (qemu)
Nov 23 04:30:04 ns204651 qm[10708]: <root@pam> update VM 100: -lock backup
Nov 23 05:30:05 ns204651 vzdump[10705]: VM 100 qmp command failed - VM 100 qmp command 'guest-fsfreeze-freeze' failed - got timeout
Nov 23 05:30:15 ns204651 vzdump[10705]: VM 100 qmp command failed - VM 100 qmp command 'guest-fsfreeze-thaw' failed - got timeout
Nov 23 05:34:21 ns204651 vzdump[10705]: INFO: Finished Backup of VM 100 (01:04:18)

When I woke up, the VM's web servers were not listening, so I had to "stop" and "start" the VM.
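In case it helps to spot how often this happens, below is a minimal sketch that scans syslog lines for guest-fsfreeze timeouts like the ones in the log above. The regular expressions are written only against the line format shown in this post, and the function name find_fsfreeze_timeouts is made up for illustration, so treat it as a starting point rather than a complete parser.

```python
import re

# Syslog prefix as seen above: "Nov 23 05:30:05 host vzdump[10705]: message"
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ \w+\[\d+\]: (.*)$")
# Timeout message as seen above for the freeze and thaw commands.
TIMEOUT_RE = re.compile(
    r"VM (\d+) qmp command '(guest-fsfreeze-\w+)' failed - got timeout"
)


def find_fsfreeze_timeouts(lines):
    """Return (timestamp, vmid, command) for every fsfreeze timeout line."""
    hits = []
    for line in lines:
        prefix = LINE_RE.match(line.rstrip("\n"))
        if not prefix:
            continue  # not in the expected syslog format
        timeout = TIMEOUT_RE.search(prefix.group(2))
        if timeout:
            hits.append((prefix.group(1), int(timeout.group(1)), timeout.group(2)))
    return hits


if __name__ == "__main__":
    # /var/log/syslog is the usual location on a Debian-based host.
    with open("/var/log/syslog", errors="replace") as log:
        for stamp, vmid, command in find_fsfreeze_timeouts(log):
            print(stamp, "VM", vmid, command)
```

In the log above, the lock is set at 04:30:04 and the freeze timeout is reported at 05:30:05, so the backup apparently waited about an hour for the guest agent before giving up.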
 
Yes, that's clear. I also experienced this in the past. The backup process was not able to freeze the guest filesystem, or so it claims. In my case, the guest OS filesystem was frozen, I got a lot of timeouts from the guest kernel, and the machine seemed to hang; only resetting the guest helped.

Do you use qemu-agent inside of the guest?
 
Yes, I do:

Code:
root@ktechoubuntu:/var/www# ps axwww | grep qemu
  956 ?        Ss     0:04 /usr/sbin/qemu-ga --daemonize -m virtio-serial -p /dev/virtio-ports/org.qemu.guest_agent.0
 
