Windows VM freezes after backup to PMBS

Jackmynet

New Member
Oct 5, 2024
Hi all, we have an issue here that I cannot solve despite all the reading I have done online.

We have a 3-node Proxmox cluster running 3 very critical security monitoring servers.

We have a daily backup of our VMs that runs at 11am. One Windows VM (the most important one, of course!) sometimes freezes after the backup has completed.

The backup is perfect and completes successfully.

It does this randomly, roughly every 2-4 days. The only way to get the VM operational again is to restart the node.

Any idea what may be causing such a situation?
 
We have a daily backup of our VMs that runs at 11am. One Windows VM (the most important one, of course!) sometimes freezes after the backup has completed.
How do you backup your VM's? On a "Proxmox Backup Server", or via vzdump on a network share?
What is the load, i/o, on the Proxmox VE node at the time of the backup?
It would also be interesting to know the config of the VM that freezes from time to time during the backup:

Code:
qm config <vmid>
 
Hi, I am backing up to a Proxmox Backup Server along with my other 2 virtual machines that sit on this cluster. The others run flawlessly.

See attached the config of this VM. It is an identical Windows server to the others.
 

Attachments

Thank you for the config. It looks fine.

Only way to restore the VM to operational is to restart the node.
Really the whole node? Like reboot, reset? The VM cannot be reset, stopped, restarted?

What is the load, i/o, on the Proxmox VE node at the time of the backup?

Is there anything interesting in the logs? Please adjust date and time.

Code:
journalctl --since "2024-11-01 00:00" --until "2024-11-03 16:00" > "$(hostname)-journal.txt"
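If the stop/reset buttons in the GUI hang, it can be worth trying from the CLI before rebooting the whole node. This is just a generic sketch (same `<vmid>` placeholder as above), not specific to your setup:

```shell
# Try a hard stop with a short timeout first:
qm stop <vmid> --timeout 30

# If the QMP socket itself is stuck, killing the KVM process
# is a last resort that still avoids rebooting the node:
kill $(cat /var/run/qemu-server/<vmid>.pid)
```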
 
Yep, it will not allow me to restart or stop the machine. Occasionally I can manage to stop it, but normally not.

I may be doing something incorrectly, but given the urgency of the issue I end up restarting the whole node, which resolves it. I have had to disable these backups since the 25th of November because of this, but see attached the logs from the 22nd and 23rd.

I have read every thread on this and can find nothing relating to my setup that would cause it.
 

Attachments

Hello @Jackmynet and thanks for the logs. What I see:

Code:
Nov 22 11:00:08 pve pvescheduler[521330]: INFO: starting new backup job: vzdump 100 101 102 --fleecing 0 --quiet 1 --mode snapshot --bwlimit 81920 --prune-backups 'keep-last=7' --mailto jack@mynetsecurity.com --mailnotification failure --notes-template '{{guestname}}' --storage mynet-ie-pxmbs                                           
Nov 22 11:04:12 pve pvescheduler[521330]: VM 100 qmp command failed - VM 100 qmp command 'guest-fsfreeze-thaw' failed - got timeout
Nov 22 11:19:53 pve pvedaemon[2962656]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Nov 22 11:23:01 pve pve-ha-lrm[536914]: service vm:100 is in an error state and needs manual intervention. Look up 'ERROR RECOVERY' in the documentation.
Nov 22 11:29:07 pve pvedaemon[535741]: VM quit/powerdown failed
Nov 22 11:29:08 pve pmxcfs[4834]: [quorum] crit: quorum_dispatch failed: 2
Nov 22 11:29:08 pve pmxcfs[4834]: [status] notice: node lost quorum
Nov 22 11:29:08 pve pmxcfs[4834]: [dcdb] crit: cpg_dispatch failed: 2
Nov 22 11:29:08 pve pmxcfs[4834]: [dcdb] crit: cpg_leave failed: 2
Nov 22 11:29:08 pve pmxcfs[4834]: [status] crit: cpg_dispatch failed: 2
Nov 22 11:29:08 pve pmxcfs[4834]: [status] crit: cpg_leave failed: 2
Nov 22 11:29:08 pve pmxcfs[4834]: [quorum] crit: quorum_initialize failed: 2
Nov 22 11:29:08 pve pmxcfs[4834]: [quorum] crit: can't initialize service
Nov 22 11:29:08 pve pmxcfs[4834]: [confdb] crit: cmap_initialize failed: 2
Nov 22 11:29:08 pve pmxcfs[4834]: [confdb] crit: can't initialize service
Nov 22 11:29:08 pve pmxcfs[4834]: [dcdb] notice: start cluster connection
Nov 22 11:29:08 pve pmxcfs[4834]: [dcdb] crit: cpg_initialize failed: 2
Nov 22 11:29:08 pve pmxcfs[4834]: [dcdb] crit: can't initialize service
Nov 22 11:29:08 pve pmxcfs[4834]: [status] notice: start cluster connection
Nov 22 11:29:08 pve pmxcfs[4834]: [status] crit: cpg_initialize failed: 2
Nov 22 11:29:08 pve pmxcfs[4834]: [status] crit: can't initialize service
Nov 22 11:29:08 pve corosync[4946]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Nov 22 11:29:08 pve corosync[4946]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Nov 22 11:29:08 pve corosync[4946]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Nov 22 11:29:08 pve corosync[4946]:   [MAIN  ] Corosync Cluster Engine exiting normally
Nov 22 11:29:08 pve systemd[1]: corosync.service: Deactivated successfully.
Nov 22 11:29:08 pve systemd[1]: Stopped corosync.service - Corosync Cluster Engine.
Nov 22 11:29:08 pve systemd[1]: corosync.service: Consumed 1d 7h 59min 17.974s CPU time.
Nov 22 11:29:10 pve systemd[1]: ceph-osd@5.service: Deactivated successfully.
Nov 22 11:29:10 pve systemd[1]: Stopped ceph-osd@5.service - Ceph object storage daemon osd.5.
Nov 22 11:29:10 pve systemd[1]: ceph-osd@5.service: Consumed 8h 57min 15.289s CPU time.
Nov 22 11:29:11 pve systemd[1]: ceph-osd@0.service: Deactivated successfully.
Nov 22 11:29:11 pve systemd[1]: Stopped ceph-osd@0.service - Ceph object storage daemon osd.0.
Nov 22 11:29:11 pve systemd[1]: ceph-osd@0.service: Consumed 12h 12min 13.945s CPU time.
Nov 22 11:29:11 pve systemd[1]: ceph-osd@3.service: Deactivated successfully.
Nov 22 11:29:11 pve systemd[1]: Stopped ceph-osd@3.service - Ceph object storage daemon osd.3.
Nov 22 11:29:11 pve systemd[1]: ceph-osd@3.service: Consumed 8h 54.873s CPU time.
Nov 22 11:29:14 pve pmxcfs[4834]: [quorum] crit: quorum_initialize failed: 2
Nov 22 11:29:14 pve pmxcfs[4834]: [confdb] crit: cmap_initialize failed: 2
Nov 22 11:29:14 pve pmxcfs[4834]: [dcdb] crit: cpg_initialize failed: 2
Nov 22 11:29:14 pve pmxcfs[4834]: [status] crit: cpg_initialize failed: 2
Nov 22 11:29:14 pve systemd[1]: ceph-osd@4.service: Deactivated successfully.
Nov 22 11:29:14 pve systemd[1]: Stopped ceph-osd@4.service - Ceph object storage daemon osd.4.
Nov 22 11:29:14 pve systemd[1]: ceph-osd@4.service: Consumed 12h 42min 33.854s CPU time.
Nov 22 11:29:14 pve systemd[1]: Removed slice system-ceph\x2dosd.slice - Slice /system/ceph-osd.


Nov 23 11:00:13 pve rclone[1191460]: mount helper error: fusermount: mountpoint is not empty                                                                                                                                                                                                                                                   
Nov 23 11:00:13 pve rclone[1191460]: mount helper error: fusermount: if you are sure this is safe, use the 'nonempty' mount option                                                                                                                                                                                                             
Nov 23 11:00:13 pve rclone[1191460]: Fatal error: failed to mount FUSE fs: fusermount: exit status 1
Nov 23 11:02:12 pve pvestatd[5376]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries

The backup is started, but the filesystem freeze/thaw via the guest agent times out. Further QMP timeouts follow, then the LRM reports the VM in an error state, followed by problems with Corosync. A little later, the node reboots.
Then there are also a lot of error messages from rclone (maybe something unwanted is happening to it?).

There could be several reasons for this. I would check the following:
  • Status of the network during the backup -> is it very busy, much too busy?
  • Proxmox cluster/Corosync -> is the Corosync traffic running on at least one dedicated physical network interface?
  • Load on the node during the backup -> is there perhaps a performance issue?
  • If there is a timeout with the guest agent, disable it for testing.
  • Test VM backup fleecing.
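For the last two points, a rough sketch of the commands (VM ID 100 taken from your logs; `local-lvm` is only a placeholder for a fast local storage to use as the fleecing target):

```shell
# Disable the QEMU guest agent for VM 100 for testing
# (the backup then skips guest-fsfreeze entirely):
qm set 100 --agent enabled=0

# Re-enable it once done testing:
qm set 100 --agent enabled=1

# Run a one-off backup with fleecing enabled:
vzdump 100 --mode snapshot --storage mynet-ie-pxmbs \
    --fleecing enabled=1,storage=local-lvm
```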
 
I'm just going to ask: what version of the virtio utilities do you have installed in the VM? And is it enabled in the VM config?

VM Options / QEMU Guest Agent in the GUI

On my Win10 VM with cpulimit=1.9, I have to raise the priority of qemu-ga in Task Manager to Realtime or I get agent timeouts trying to do a timesync.
 
Hi guys,

We have a dedicated 10Gb fibre network between the servers, on its own NIC, for the cluster and Corosync.

Our servers are not running high loads, and traffic on the network is never high unless I am doing a migration or backup, which would logically be high. When doing the backups, however, we have them limited to 80 MB/s of bandwidth, as this is going to an offsite server and we don't want to trash our upload on the network.

Interestingly, we have an identical secondary server which simply mirrors data from software on our primary server, and it backs up with no problem every time, identical size etc. The third Windows VM also backs up trouble-free.

We have the QEMU guest agent enabled, but I am unsure how to change its priority in Task Manager?

Could the issue be related to the node this VM was running on? I have moved this VM to the node the secondary VM runs on, but have not tried the backups there yet. The reason it was moved is that we had a disk failure on its original node, which is now resolved. We have scheduled to move it back to that node, but I can try running these backups while the VMs are on the other nodes.

Is there a way to monitor the active traffic on the corosync network interfaces?
 
Also see the logs below. These are from today; I was updating this node and analysing the node logs out of curiosity.

Could this be related? None of the other nodes have this in the log in relation to the Google Drive. We used Google Drive as a test for backups before but were not happy with it; the directory is still there, though. Thoughts?

The timeout error about the backup server is understood, as that server is currently down.

How can I remove this Google Drive directory/mountpoint etc.?
 

Attachments

but the directory is still there
From the logs it is evident you have more than just a gdrive directory/mountpoint going on. It appears you have rclone set up with gdrive and a gdrive.service in place. You probably want to reverse whatever was previously installed/run. I can't know that for sure, but here are a few rough pointers/possibilities.

Maybe? - Check what jobs are/were set up using rclone (cron jobs etc.?) and disable them.
Maybe? - Stop that gdrive.service with systemctl stop gdrive.service and then systemctl disable gdrive.service
Maybe? - Delete that gdrive remote instance in rclone config
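A rough sketch of that cleanup, assuming the service and remote really are named gdrive and guessing at the mountpoint path (check `mount | grep rclone` for the real one):

```shell
# Stop and disable the leftover mount service:
systemctl stop gdrive.service
systemctl disable gdrive.service

# Look for scheduled rclone jobs:
crontab -l | grep -i rclone
grep -ri rclone /etc/cron.d/ /etc/cron.daily/ 2>/dev/null

# Unmount the stale FUSE mountpoint if it is still mounted:
fusermount -u /mnt/gdrive 2>/dev/null || true

# Finally, delete the remote from rclone's config:
rclone config delete gdrive
```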

After you are pretty confident you've cleaned up the gdrive/rclone mess, reboot the node & test.

REMEMBER YOU ALWAYS MUST HAVE FULL & RESTORABLE BACKUPS!
 
