Storage Replication regularly stops

FastLaneJB · Aug 6, 2017

This is an awesome feature and I know it's experimental at the moment but it regularly just stops syncing all my VM's. I need to restart the pvesr service and then it'll work again for a day or two before stopping again.

I think it's simply getting stuck on the freeze filesystem sometimes on some containers and that stops all of them working.

For instance it's the 2017-08-06 as I post this and this is the last log file I can see

Code:

2017-08-03 05:30:01 700-0: start replication job
2017-08-03 05:30:01 700-0: guest => CT 700, running => 1
2017-08-03 05:30:01 700-0: volumes => main:subvol-700-disk-1
2017-08-03 05:30:02 700-0: freeze guest filesystem

I think it's when the replication clashes with a backup job as that container is configured to backup at 05:30 exactly every day.

The replication.cfg for this container is as follows:

Code:

local: 700-0
        target kt-proxmox-02
        rate 10

So it's running every 15 minutes as the default.

Think the replication code needs to check if a backup is running and skips replication if that's the case? Or even better can run when the backup works but has some code to maybe not freeze the filesystem at exactly the same time so the replication holds fire just for a mo until the backup has done its freeze.

le_top · Oct 25, 2017

I'am pretty much getting the same experience here.

It may be related to "reboots" of the target machine, which implies I have to remove all zfsStorageCopies before being able to start replication again. (using "zfs destroy vmpool/subvol-102-disk-1" like command line commands).

That's pretty annoying and some kind of automatic recovery would be welcome.

dendi · Oct 25, 2017

Hi,
My 4 nodes cluster works fine with zfs replication. Even after reboot of a target machine.
I'm using kvm

le_top · Oct 25, 2017

Good that it works somewhere ;-).

I am on Proxmox 5.0-33 .

I currently have one LXC syncing on ZFS since 7 hours CET this morning and it is 13:00 CET now (the machines are synced every 15 minutes and the other machines had their last sync at 6:45).
The previous sync took 6.8 seconds according to the replication table.

The log shows that one backup (not replication) started at about 6:00 and finished after 3.5 minutes and another backup started at 7:00:01 and finished after about 13 minutes for that machine (remote storage, so slower).

Therefore it seems that the backup is impacting the replication negatively. I guess that the snapshot for the backup is not taken in a fashion that is completely independent from the replication.

Defining a schedule where this does not happen is difficult as the backup time of all machines takes a few hours. Therefore it is always possible that the backup snapshot is set when the CT is at the critical point for replication (whenever that is exactly).
Replication would have to be stopped for the time of the backup.
Also defining such a cron time specification is not a simple task I think.

Any suggestion to avoid this?

dendi · Oct 25, 2017

The backup doesn't take snapshot at storage level, it uses a kvm function to freeze the VM and intercept writes during backup.
I don't use backups because I also use pve-zsync nightly to a remote storage, having snapshots for the last x days.

le_top · Oct 25, 2017

Ok, my backup report says "backup mode:snapshot".

Here is the transcript (start time: 07:00 CEST). The server names have been changed to "example.com" for privacy:

INFO: starting new backup job: vzdump --compress gzip --mailto support@example.com --mailnotification failure --node p4 --storage p4BackupFtp --mode snapshot --quiet 1 --exclude 200 --all 1
INFO: Starting Backup of VM 102 (lxc)
INFO: status = running
INFO: CT Name: example.com
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
INFO: creating archive '/mnt/pve/p4BackupFtp/dump/vzdump-lxc-102-2017_10_25-05_00_01.tar.gz'
INFO: Total bytes written: 17194987520 (17GiB, 22MiB/s)
INFO: archive file size: 8.09GB
INFO: delete old backup '/mnt/pve/p4BackupFtp/dump/vzdump-lxc-102-2017_10_24-05_00_01.tar.gz'
INFO: remove vzdump snapshot
INFO: Finished Backup of VM 102 (00:12:55)

le_top · Oct 25, 2017

According to the WebGui, VM102 is still syncing at 19:30 CEST.
I havent found the command yet to fix this.

I noticed that I can get a log of the sync - it is in the "freeze" step:
2017-10-25 05:00:00 102-0: start replication job
2017-10-25 05:00:00 102-0: guest => CT 102, running => 1
2017-10-25 05:00:00 102-0: volumes => ZfsStorage:subvol-102-disk-1
2017-10-25 05:00:00 102-0: freeze guest filesystem

FastLaneJB · Oct 25, 2017

dendi said:
The backup doesn't take snapshot at storage level, it uses a kvm function to freeze the VM and intercept writes during backup.
I don't use backups because I also use pve-zsync nightly to a remote storage, having snapshots for the last x days.

It most certainly does use storage level snapshots for LXC backups. KVM is only half of Proxmox.

le_top · Oct 26, 2017

I now got a backup failure. I guess that this is because the replication that started yesterday, is still running according to the WebGUI (probably in the "freeze guest filesystem" state.

This is the content of the mail I received:

can't aquire lock '/var/run/vzdump.lock' - got timeout
VMID NAME STATUS TIME SIZE FILENAME
TOTAL 00:00:00 0KB

Detailed backup logs:
vzdump --mode snapshot --storage p4BackupFtp --mailto support@example.com --quiet 1 --exclude 200 --node p4 --all 1 --mailnotification failure --compress gzip

This is the content of the backup log:

Task viewer: Backup
OutputStatus
Stop
INFO: trying to get global lock - waiting...
ERROR: can't aquire lock '/var/run/vzdump.lock' - got timeout

le_top · Oct 26, 2017

Ok, I did an "lsof" the lockfile:

# lsof /var/run/vzdump.lock
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
task\x20U 9077 root 5wW REG 0,23 0 1923 /run/vzdump.lock

Then I grepped the processes:

# ps -edalf | grep 9077
5 S root 9077 8930 0 80 0 - 81338 poll_s 04:00 ? 00:00:00 task UPID

4:00002375:0474C324:59F15DC1:vzdump::root@pam:
4 S root 9094 9077 0 80 0 - 10441 hrtime 04:00 ? 00:00:01 /usr/bin/lxc-freeze -n 102
0 S root 32156 8075 0 80 0 - 3197 pipe_w 09:28 pts/10 00:00:00 grep 9077

So 'lxc-freeze' is still there and it was launched by 'task"

This led to identifiying somewhat how the tasks are started, examing the perl scripts located in '/usr/share/perl5/PVE' and finding that there is a "clean" way to stop the backup, which failed in my case:

# vzdump --stop 1
stoping backup process 9077 failed

(There is a typo in "stoping" as there are some others in the documentation for vzdump ("runing").)

As that didn't work, I looked for information about "lxc-freeze" and discovered that there is a counter part "lxc-unfreeze" that I then tried:

lxc-unfreeze -n 102

The documentation of lxc-freeze says that it freezes all processes inside the container, but the processes in the container were still running well.

As a result of the unfreeze, the process number 9077 (lxc-freeze) disappeared.

Thanks to the job number from the process list and a search, I found the log file of the task and its contents:

# cat /var/log/pve/tasks/1/UPID

4:00002375:0474C324:59F15DC1:vzdump::root@pam: |& more

INFO: starting new backup job: vzdump --mailto support@example.com --compress lzo --mode snapshot --exclude 200,108 --s
torage bigdisk_archive --mailnotification failure --all 1 --quiet 1
INFO: Starting Backup of VM 102 (lxc)
INFO: status = running
INFO: CT Name: example.com
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
snapshot create failed: starting cleanup
no lock found trying to remove 'backup' lock
ERROR: Backup of VM 102 failed - command '/usr/bin/lxc-freeze -n 102' failed: interrupted by signal
[...]

So finally it is the backup job that was waiting for the lxc-freeze to return.

With this knowledge, I did a final check in the system logs:
daemon.log:Oct 26 09:41:56 p4 vzdump[9077]: ERROR: Backup of VM 102 failed - command '/usr/bin/lxc-freeze -n 102' failed: interrupted by signal
:/var/log# grep freeze *
daemon.log:Oct 26 09:41:56 p4 vzdump[9077]: ERROR: Backup of VM 102 failed - command '/usr/bin/lxc-freeze -n 102' failed: interrupted by signal
daemon.log.1:Oct 21 23:25:19 p4 pvesr[23835]: lxc-freeze: 102: freezer.c: do_freeze_thaw: 64 Failed to get new freezer state for /var/lib/lxc:102
daemon.log.1:Oct 21 23:25:19 p4 pvesr[23835]: lxc-freeze: 102: tools/lxc_freeze.c: main: 107 Failed to freeze /var/lib/lxc:102

So nothing special there: a command failed because it could not get the freeze on October 21st and we get the vzdump failure notification due to the lxc-unfreeze that I issued.

Conclusion:
There are IMHO two issues:
1. 'lxc-freeze' does not return from freezing, it does not freeze under certain conditions and it fails to detect this. This possibly occurred with concurrent freezes (backup/replication);
2. Backup and replication processes interfere with each other. It looks like they should wait for each other, at least during the freezing step.

le_top · Oct 26, 2017

I forgot to check the GUI which shows that I am not done yet:

So while stopping the 'lxc-freeze' with 'lxc-unfreeze' was stopping the backup task, the state of the replication was that it was "syncing". It reports the duration as 23.0h, but I the unfreeze was done more than 23 hours after the start of the replication.

The fact that "Next Sync" is pending is because I disabled that replication. Enabling it did start the replication again and it ended successfully.

le_top · Oct 27, 2017

The problem occurred again this morning. That confirms that it is systematic when the backup occurs.
This time the 'lxc-unfreeze' did not do the trick - I had to kill the 'lxc-freeze' process itself.

I have now time-shifted the replications by one minute which will probably limit the misbehavior, but not avoid it completely.

fabian · Oct 27, 2017

thanks for investigating this issue - please file a bug report and link to this thread! should be fixable by adding more mutual-exclusion to replication and snapshotting.

le_top · Oct 27, 2017

I opened the report: https://bugzilla.proxmox.com/show_bug.cgi?id=1538 .

le_top · Nov 5, 2017

There is another cause for the replication to stop, I do not know yet where to start looking.

At this time the solution is to destroy the zfs filesystems on the target server, but that is not a good solution and it is unsafe (human error is to be expected):

# zfs destroy -r vmpool/subvol-103-disk-1
# zfs destroy -r vmpool/subvol-105-disk-1

I did this because the report indicates volume 'vmpool/subvol-103-disk-1' already exists (same for the other VM that is no longer replicated). On of the three VMs on this machine continued to be replicated.

Replication Log
2017-11-05 12:04:00 103-0: start replication job
2017-11-05 12:04:00 103-0: guest => CT 103, running => 1
2017-11-05 12:04:00 103-0: volumes => ZfsStorage:subvol-103-disk-1
2017-11-05 12:04:01 103-0: freeze guest filesystem
2017-11-05 12:04:02 103-0: create snapshot '__replicate_103-0_1509879840__' on ZfsStorage:subvol-103-disk-1
2017-11-05 12:04:02 103-0: thaw guest filesystem
2017-11-05 12:04:02 103-0: full sync 'ZfsStorage:subvol-103-disk-1' (__replicate_103-0_1509879840__)
2017-11-05 12:04:02 103-0: full send of vmpool/subvol-103-disk-1@__replicate_103-0_1509879840__ estimated size is 1.88G
2017-11-05 12:04:02 103-0: total estimated size is 1.88G
2017-11-05 12:04:02 103-0: TIME SENT SNAPSHOT
2017-11-05 12:04:02 103-0: vmpool/subvol-103-disk-1 name vmpool/subvol-103-disk-1 -
2017-11-05 12:04:02 103-0: volume 'vmpool/subvol-103-disk-1' already exists
2017-11-05 12:04:02 103-0: warning: cannot send 'vmpool/subvol-103-disk-1@__replicate_103-0_1509879840__': signal received
2017-11-05 12:04:02 103-0: cannot send 'vmpool/subvol-103-disk-1': I/O error
2017-11-05 12:04:02 103-0: command 'zfs send -Rpv -- vmpool/subvol-103-disk-1@__replicate_103-0_1509879840__' failed: exit code 1
2017-11-05 12:04:02 103-0: delete previous replication snapshot '__replicate_103-0_1509879840__' on ZfsStorage:subvol-103-disk-1
2017-11-05 12:04:02 103-0: end replication job with error: command 'set -o pipefail && pvesm export ZfsStorage:subvol-103-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_103-0_1509879840__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=p4' root@10.0.0.4 -- pvesm import ZfsStorage:subvol-103-disk-1 zfs - -with-snapshots 1' failed: exit code 255

le_top · Nov 7, 2017

Now I have a backup issue, possibly related to the same root issue:

The partial result from "pct list" is:

104 running snapshot-delete www

The replication seems to work though.

I performed a "pct unlock 104" to unlock the VM - I'll see if backup works again.

le_top · Nov 13, 2017

After another failure (possibly HW error), the backup which was running was interrupted.
So I was initially left with a lock on the machine (removed with 'pct unlock 109'), and I still had a snapshot according to the GUI.

So I requested to remove the snapshot.
The snapshot now is no longer there, but the GUI still says that it is.
When I try to remove it, there is an error that there is none to destroy.
(On a sidenote: usually the same word is used for the same operation in applications. Here we have "Remove", "Destroy" and "Delete" all for getting rid of the snapshot=.

That message is confirmed by the 'zfs list -t snapshot' command:

# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
vmpool/subvol-100-disk-1@__replicate_100-0_1510597320__ 745M - 38.9G -
vmpool/subvol-102-disk-1@__replicate_102-0_1510596960__ 114M - 16.9G -
vmpool/subvol-103-disk-1@__replicate_103-0_1510597020__ 952K - 2.12G -
vmpool/subvol-104-disk-1@__replicate_104-0_1510597560__ 6.04M - 4.58G -
vmpool/subvol-105-disk-1@__replicate_105-0_1510597023__ 97.9M - 8.05G -
vmpool/subvol-109-disk-1@__replicate_109-0_1510597565__ 1.12M - 1.09G -
vmpool/subvol-120-disk-1@__replicate_120-0_1510596965__ 5.53M - 5.46G -
vmpool/subvol-150-disk-1@__replicate_150-0_1510596968__ 55.1M - 3.59G -
vmpool/subvol-223-disk-1@__replicate_223-0_1510596973__ 2.41M - 1.14G -
vmpool/subvol-224-disk-1@__replicate_224-0_1510596976__ 6.98M - 1.40G -

Removing that status from the GUI will be another piece of "pleasure".

[Edit:]
Found it: the snapshot information is written in the container's configuration
In my case '/etc/pve/nodes/p4/lxc/109.conf'

I discovered that after finishing my read on https://pve.proxmox.com/wiki/Linux_Container and looking for the file with 'locate 109 | grep conf | xargs grep vzdump' .
I removed the snapshot information from the configuration file (after making a backup copy) and it is no longer mentioned in the GUI. Hopefully the backup will work.

Search

Search

Storage Replication regularly stops

FastLaneJB

Renowned Member

le_top

Renowned Member

dendi

Renowned Member

le_top

Renowned Member

dendi

Renowned Member

le_top

Renowned Member

le_top

Renowned Member

FastLaneJB

Renowned Member

le_top

Renowned Member

le_top

Renowned Member

le_top

Renowned Member

le_top

Renowned Member

fabian

Proxmox Staff Member

le_top

Renowned Member

le_top

Renowned Member

le_top

Renowned Member

le_top

Renowned Member

We value your privacy