Problem rolling back a snapshot on Ceph with HA enabled

PiotrD

Hi,
I have a really weird problem. I have a 2-node cluster with a qdisk and HA enabled on the VMs. I use Ceph as storage; the nodes are just Ceph clients. When I try to roll back a VM snapshot, I get this error:
Code:
Rolling back to snapshot: 93% complete...
Rolling back to snapshot: 94% complete...
Rolling back to snapshot: 95% complete...
Rolling back to snapshot: 96% complete...
Rolling back to snapshot: 97% complete...
Rolling back to snapshot: 98% complete...
Rolling back to snapshot: 99% complete...
Rolling back to snapshot: 100% complete...done.
TASK ERROR: no such VM ('100')

This error is caused by the HA daemon starting the VM on the second node, where it cannot actually start it:
Code:
task started by HA resource agent
TASK ERROR: VM is locked (rollback)

How can I prevent the HA agent from starting the VM on the other node during a snapshot rollback? Maybe I missed something in the configuration.
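For reference, the rollback is triggered like this (the snapshot name is just a placeholder):

Code:
qm rollback 100 <snapname>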

Kind regards,
Piotr D
 
Do you use a separate network for cluster traffic?

No. Cluster traffic and everything else (storage and public connections to the VMs) use a single network, the vmbr0 bridge.

Maybe there is something wrong with my cluster.conf?

Code:
<?xml version="1.0"?>
<cluster config_version="7" name="prd">
  <cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <quorumd allow_kill="0" interval="1" label="proxmox1_qdisk" tko="10" votes="1"/>
  <totem token="54000"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="10.2.31.135" lanplus="1" login="*******" name="ipmi-compute01" passwd="*****" power_wait="5"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.2.31.136" lanplus="1" login="*******" name="ipmi-compute02" passwd="*****" power_wait="5"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="compute01" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi-compute01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="compute02" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi-compute02"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="100"/>
    <pvevm autostart="1" vmid="105"/>
  </rm>
</cluster>
 
Well, this is a bad idea (expected to fail under high load).
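For reference, a dedicated cluster interface on each node could look roughly like the sketch below; the interface name and addresses are only placeholders, and the cluster node names would then have to resolve to those addresses (e.g. via /etc/hosts).

Code:
# /etc/network/interfaces (excerpt, example only)
auto eth2
iface eth2 inet static
    address 10.10.20.11
    netmask 255.255.255.0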

Yeah, I agree that there should be a separate network for cluster traffic. However, I only have two nodes with 2 test VMs, plus three separate Ceph nodes, and there is no real traffic yet because it is a test environment. Moreover, all nodes have 4x1 Gb/s network bonds, so I am sure there is no network overload at the moment. Please explain to me why the RG manager even tries to move the VM during a rollback. Is there some timeout? The VM is stopped during the rollback. Maybe I need to set something, or maybe this is a bug.

Kind regards,
Piotr D
 
Snapshot restore takes too long here. Please open a bug report on https://bugzilla.proxmox.com
 
I assume the network load gets too large when you do a rollback, so cluster communication fails.

I did some additional tests, and the problem is not the network but the interaction between the RG manager and the rollback feature. During a rollback the VM is turned off so that it can be reset to the previous snapshot. However, the RG manager sees it as stopped and tries to bring it back up. My rgmanager log shows this:

Code:
Apr 14 11:02:15 rgmanager [pvevm] VM 100 is not running
Apr 14 11:02:15 rgmanager status on pvevm "100" returned 7 (unspecified)
Apr 14 11:02:16 rgmanager Stopping service pvevm:100
Apr 14 11:02:17 rgmanager [pvevm] VM 100 is already stopped
Apr 14 11:02:17 rgmanager Service pvevm:100 is recovering
Apr 14 11:02:17 rgmanager Recovering failed service pvevm:100
Apr 14 11:02:18 rgmanager start on pvevm "100" returned 1 (generic error)
Apr 14 11:02:18 rgmanager #68: Failed to start pvevm:100; return value: 1
Apr 14 11:02:18 rgmanager Stopping service pvevm:100
Apr 14 11:02:18 rgmanager [pvevm] VM 100 is already stopped
Apr 14 11:02:19 rgmanager Service pvevm:100 is recovering
Apr 14 11:02:19 rgmanager #71: Relocating failed service pvevm:100
Apr 14 11:02:21 rgmanager Service pvevm:100 is stopped

So the RG manager tries to bring the VM back before the rollback is finished, and it cannot because of the VM lock. This behaviour is incorrect; there should be some lock that keeps the RG manager from doing anything until the rollback is finished.
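A possible stopgap, which I have not verified, would be to freeze the rgmanager service around the rollback so its status checks do not trigger a recovery, assuming clusvcadm's freeze/unfreeze options behave as documented:

Code:
clusvcadm -Z pvevm:100       # freeze: rgmanager stops recovering/relocating the service
qm rollback 100 <snapname>   # snapshot name is a placeholder
clusvcadm -U pvevm:100       # unfreeze once the rollback has finished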

Kind regards,
Piotr D
 
Hi,
I created a bug report: https://bugzilla.proxmox.com/show_bug.cgi?id=512. How long does it usually take to get any feedback on it, even something basic like "this is not a bug"? There has not been any update on it yet. I do not want to rush anyone, I am just curious.

Kind regards,
Piotr D
 
Just a quick note on a frequent "bug" related to HA: when you add a VM/CT to HA, the VM/CT in question must be powered off before you add it, and rgmanager then starts it automatically. Adding a running VM/CT to HA gives undefined behavior.
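A minimal sketch of that order of operations, using VM 100 as the example (the HA entry can be added through the GUI or by editing cluster.conf):

Code:
qm stop 100      # power the guest off first
# then add the HA resource, e.g. an entry like this in cluster.conf:
#   <pvevm autostart="1" vmid="100"/>
# rgmanager will start the VM automatically afterwards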
 
@PiotrD, have you tried detaching the VM from Proxmox HA and then rolling back, to see if it completes without the error?

Yes, it works that way. However, the biggest problem with that solution is that you have to re-add the VM to HA every time, and if you use failover domains you have to do it via the console every time, which is very inconvenient.
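Roughly, the cycle looks like this (the snapshot name is a placeholder, and the failover-domain block is only an example of what has to be recreated by hand each time):

Code:
# detach / rollback / re-add cycle
# 1. remove the <pvevm vmid="100"/> entry from HA (GUI or cluster.conf)
# 2. roll back the snapshot:
qm rollback 100 <snapname>
# 3. re-add the VM to HA; with failover domains this means recreating
#    the cluster.conf pieces below by hand

Code:
<failoverdomains>
  <failoverdomain name="prefer01" ordered="1" restricted="0">
    <failoverdomainnode name="compute01" priority="1"/>
    <failoverdomainnode name="compute02" priority="2"/>
  </failoverdomain>
</failoverdomains>
<pvevm autostart="1" vmid="100" domain="prefer01"/>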

As for adding a running VM/CT to HA: I tested it with a running VM and it worked okay.
 
