Hi everyone,
the issue is the following:
I set up daily backups for four different containers on two different nodes. All containers are on Ceph storage on SSDs, the nodes form one PVE cluster (three nodes altogether), and the backups go to an NFS directory located on the third node.
Node 1:
CT 203
Node 2:
CT 201
CT 202
CT 204
vzdump.cron:
Code:
0 12 * * 1,2,3,4,5 root vzdump 204 202 201 203 --mailnotification failure --quiet 1 --mode snapshot --storage nfsproxmox --compress lzo

proxmox-version 5.2.8 (running kernel: 4.15.18-4-pve)
ceph version 12.2.8
From my understanding, the backups on each node run serially, so there should be no issue when two backup tasks are spawned at the same time, one on node 1 and one on node 2.
The issue is that the backup of CT 203 hangs and does nothing until it reaches the timeout limit or is stopped manually. This is not the first time this has happened to a container. Previously I would come in the next morning (10 AM), after the backup had started at 4 AM, and find the backup still trying to run.
To avoid the problem I started running the backups manually, or put all containers on one node. Unfortunately I don't have any more verbose logging for this, just the start of the backup and my manual termination of it.
Am I misunderstanding anything about the backup structure or how it is supposed to work?
The issue is quite similar to the one described here (unfortunately it's in German):
https://forum.proxmox.com/threads/in-letzter-zeit-immer-wieder-hängende-backups-von-ceph-nfs.44343/
Stopping the backup then leaves the node in a peculiar state, as described in this thread:
https://forum.proxmox.com/threads/how-do-i-remove-a-ceph-vzdump-snapshot.36573/
I can manually delete the vzdump snapshot from RBD (rbd snap rm ssd_pool/vm-203-disk-0@vzdump), unlock the container (pct unlock 203), and remove the symlink /dev/rbd/ssd_pool/vm-203-disk-0@vzdump (this symlink recreates itself, though); the full sequence is sketched below. After that I am able to make a new manual backup of the container. However, the first RBD device that was mapped for the backup is still there in read-only mode, and I can't get rid of it unless I reboot the node. Also, I can't see any snapshots in the GUI anymore (pct still lists them properly).
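For reference, this is the full cleanup sequence I run (pool name ssd_pool and CT 203 assumed; adjust for your setup):
Code:
# remove the stale vzdump snapshot left over from the aborted backup
rbd snap rm ssd_pool/vm-203-disk-0@vzdump
# release the backup lock on the container
pct unlock 203
# drop the leftover device symlink (it recreates itself, presumably via udev)
rm /dev/rbd/ssd_pool/vm-203-disk-0@vzdump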
As I am not able to restart the nodes frequently but still want to use the vzdump feature, I would be grateful for any help. I am aware that I could space out the backups of the different nodes (see the sketch below), but since four containers should not be too much (normally they only take about ~2 minutes to finish), I decided to post this first.
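If staggering turns out to be the answer, I assume I would just split the job into two cron entries with offset start times, something like this (times purely illustrative):
Code:
# back up the node-1 container first, the node-2 containers half an hour later
30 3 * * 1,2,3,4,5 root vzdump 203 --mailnotification failure --quiet 1 --mode snapshot --storage nfsproxmox --compress lzo
0 4 * * 1,2,3,4,5 root vzdump 201 202 204 --mailnotification failure --quiet 1 --mode snapshot --storage nfsproxmox --compress lzo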
Thanks in advance,
Johannes