Reboot has corrupted my VM

yyyy

Hello,

I have HA set up for my VM, but all I did was restart the node and now the VM appears to be corrupted: pressing the start button for the VM fails. All the other VMs that don't have HA are fine, so it looks like the HA function of Proxmox is completely broken. All I did was a simple restart of the machine, and the VM is no longer working.

Requesting HA start for VM 105
service 'vm:105' in error state, must be disabled and fixed first
TASK ERROR: command 'ha-manager set vm:105 --state started' failed: exit code 255
 
Go to your Cluster HA settings, remove the VM from HA, try to start it.

I think I had that happen before.

If it boots, add it to HA again.

Did you ever test your HA beforehand? You store your running VMs on a shared network drive, right?

Hope this helps
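
If the GUI route doesn't help, roughly the same steps can be done from the CLI (just a sketch, assuming VM 105 from your error and the standard ha-manager/qm tools):

ha-manager remove vm:105                # take the VM out of HA management
qm start 105                            # try a plain, non-HA start
ha-manager add vm:105 --state started   # re-add it to HA once it boots fine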
 
Hi, I have Ceph shared storage set up, but I decided to keep the VMs on the local ZFS disk for speed, and so I could use my Ceph cluster for other things not related to running systems (database operations etc.). Does HA not work if I run the VM on the local drive? I can imagine that simply having HA on would add significant network overhead if every VM ran on the Ceph storage.
 
Without Ceph you'd need at least some sort of replication between the nodes.
With Ceph I can't tell, but I bet someone else can.

Did removing the specified VM from the HA (and adding it again later) fix the error state of the VM?

EDIT: With ZFS and lots of RAM on the node you shouldn't cause too much traffic on your network, except when booting or migrating, of course. My databases are pretty much idle though, so I can't speak for that use case.
 
Does HA not work if I run the VM on the local drive?
I am using this exclusively (for a subset of my VMs), so I can confirm that it works.

The pitfall is that before it can work, replication has to have happened (and keep happening frequently). Set it up via <node> --> <vm> --> Replication --> Add.

:cool:
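
If you prefer the CLI, a rough equivalent (assuming VM 105 replicating to a node named pve-3 every 15 minutes; adjust the job ID and node name to your setup):

pvesr create-local-job 105-0 pve-3 --schedule '*/15'   # replicate VM 105's disks to pve-3 every 15 min
pvesr status                                           # confirm the job shows up and syncs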
 
I've added the replication and it shows:

2023-12-10 22:19:03 101-0: end replication job with error: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve-3' root@172.16.82.207 -- pvesr prepare-local-job 101-0 local-zfs:vm-101-disk-0 --last_sync 0' failed: exit code 1

Also, thanks mophxyz, it works now (though I'm still trying to figure out how to fix this replication error).
 
I'm trying to remove this replication job since it's in that failed state, but it keeps saying "Removal Scheduled" and won't disappear.
 
I use network shared storage myself, so I can't offer first-hand experience with replication.
Is your local ZFS pool part of the Ceph cluster?
Just some ideas from https://pve.proxmox.com/wiki/Storage_Replication:

  • ZFS on both storages?
  • Network connection?
  • Free space left on the replication target storage?
  • Storage with same storage ID on Source and Destination?
EDIT: https://forum.proxmox.com/threads/pve-7-replication-is-not-stable-and-logs-are-incompletes.100915/
Apparently jobs may still be running in the background, yes; check with ps aux | grep pvesm
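
If it stays stuck at "Removal Scheduled", forcing the removal from the CLI might work (a guess based on pvesr's options, using the job ID 101-0 from your log):

pvesr delete 101-0 --force   # drop the job config even though the last run failed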

Did you find anything regarding the replication in the syslog? (You'll find it on the specific node in the GUI -> System -> Syslog.)
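
It might also be worth rerunning the exact command from your error log by hand on the source node, since that usually prints a more useful message than the job summary (this is just the command copied from your log, not a confirmed fix):

/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve-3' root@172.16.82.207 -- pvesr prepare-local-job 101-0 local-zfs:vm-101-disk-0 --last_sync 0

If the SSH step itself fails, pvecm updatecerts on the nodes can refresh stale cluster SSH keys; if it's the storage, check that a storage with the ID local-zfs actually exists on the target node (pvesm status there).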
 
Hi Morphxyz,

Thanks for letting me know. Since the network overhead shouldn't be that much, I think I may just move my VMs over to the Ceph storage. But I do have a database VM, and I've heard that running databases off Ceph is not recommended. What do you think in this case?
 
