Reboot has corrupted my VM

yyyy

Hello,

I have HA set up for my VM, but all I did was restart the node and now the VM appears to be corrupted: pressing the start button for the VM fails. All the other VMs that don't have HA are fine, so it looks like the HA function of Proxmox is completely broken. All I did was a simple restart of the machine, and the VM is no longer working.

Requesting HA start for VM 105
service 'vm:105' in error state, must be disabled and fixed first
TASK ERROR: command 'ha-manager set vm:105 --state started' failed: exit code 255
 
Go to your Cluster HA settings, remove the VM from HA, try to start it.

I think I had that happen before.

If it boots, add it to HA again.

Did you ever test your HA beforehand? You store your running VMs on a shared network drive, right?

Hope this helps
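
If the GUI route doesn't help, roughly the same steps can be done from the CLI (just a sketch, assuming VM 105 from your error and the standard ha-manager/qm tools):

ha-manager remove vm:105                # take the VM out of HA management
qm start 105                            # try a plain, non-HA start
ha-manager add vm:105 --state started   # re-add it to HA once it boots fine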
 
Hi, I have Ceph shared storage set up, but I decided to keep the VMs on the local ZFS disk for speed, and so I could use my Ceph cluster for other things not related to running systems (database operations etc.). Does HA not work if I run the VM on the local drive? I can imagine that simply having HA on would add significant network overhead if every VM ran on the Ceph storage.
 
Without Ceph you'd need at least some sort of replication between the nodes.
With Ceph I can't tell, but I bet someone else can.

Did removing the specified VM from the HA (and adding it again later) fix the error state of the VM?

EDIT: With ZFS and lots of RAM on the node you shouldn't cause too much traffic on your network, except when booting or migrating, of course. My databases are pretty much idle though, so I can't speak for that use case.
 
Does HA not work if I run the VM on the local drive?
I am using this exclusively (for a subset of my VMs), so I can confirm that it works.

The pitfall is that before it can work, replication has to have happened (and keep happening frequently). Set it up via <node> --> <vm> --> Replication --> Add.

:cool:
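
If you prefer the CLI, a rough equivalent (assuming VM 105 replicating to a node named pve-3 every 15 minutes; adjust the job ID and node name to your setup):

pvesr create-local-job 105-0 pve-3 --schedule '*/15'   # replicate VM 105's disks to pve-3 every 15 min
pvesr status                                           # confirm the job shows up and syncs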
 
I've added the replication and it shows:

2023-12-10 22:19:03 101-0: end replication job with error: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve-3' root@172.16.82.207 -- pvesr prepare-local-job 101-0 local-zfs:vm-101-disk-0 --last_sync 0' failed: exit code 1

Also, thanks mophxyz, it works now (though I'm still trying to figure out how to fix this replication error).
 
I'm trying to remove this replication job since it's in that failed state, but it keeps saying "Removal Scheduled" and won't disappear.
 
I use network shared storage myself, so I can't offer first-hand experience with replication.
Is your local ZFS pool part of the Ceph cluster?
Just some ideas from https://pve.proxmox.com/wiki/Storage_Replication:

  • ZFS on both storages?
  • Network connection?
  • Free space left on the replication target storage?
  • Storage with same storage ID on Source and Destination?
EDIT: https://forum.proxmox.com/threads/pve-7-replication-is-not-stable-and-logs-are-incompletes.100915/
Apparently jobs may still be running in the background, yes; check with ps aux | grep pvesm
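
If it stays stuck at "Removal Scheduled", forcing the removal from the CLI might work (a guess based on pvesr's options, using the job ID 101-0 from your log):

pvesr delete 101-0 --force   # drop the job config even though the last run failed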

Did you find anything regarding the replication in the syslog? (You'll find it on the specific node in the GUI -> System -> Syslog.)
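
It might also be worth rerunning the exact command from your error log by hand on the source node, since that usually prints a more useful message than the job summary (this is just the command copied from your log, not a confirmed fix):

/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve-3' root@172.16.82.207 -- pvesr prepare-local-job 101-0 local-zfs:vm-101-disk-0 --last_sync 0

If the SSH step itself fails, pvecm updatecerts on the nodes can refresh stale cluster SSH keys; if it's the storage, check that a storage with the ID local-zfs actually exists on the target node (pvesm status there).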
 
Hi Morphxyz,

Thanks for letting me know. Since the network overhead shouldn't be that much, I think I may just move my VMs over to the Ceph storage. But I do have a database VM, and I've heard that running databases off Ceph is not recommended. What do you think in this case?
 
