v7.2-3 Ceph cluster LXC snapshot stuck

Florius

Hi,
I am unable to create a snapshot backup of one specific LXC container on my Ceph cluster since I migrated everything to Ceph yesterday:

Code:
INFO: starting new backup job: vzdump 110 --remove 0 --notes-template '{{guestname}}' --mode snapshot --node kvm-01 --storage backup
INFO: Starting Backup of VM 110 (lxc)
INFO: Backup started at 2022-05-06 09:27:24
INFO: status = running
INFO: CT Name: galera-01.<HOSTNAME>
INFO: including mount point rootfs ('/') in backup
INFO: found old vzdump snapshot (force removal)
rbd error: error setting snapshot context: (2) No such file or directory
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'

It gets stuck at this point, and I have to reboot the host before I can run `pct unlock 110`. Any advice on where to look for logs or how to troubleshoot?
15 other LXC containers and a couple of KVM VMs went through just fine... Thank you!
 
could you please share the following (a sketch of commands to gather these is included below):
* the container's config
* your storage config and ceph setup (if there's anything special about it, i.e. if it's not a hyperconverged setup created on the GUI)
* a list of the rbd snapshots for the container's root mountpoint - see https://docs.ceph.com/en/quincy/rbd/rbd-snapshot/
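For reference, a minimal sketch of commands that would gather that information (assuming container ID 110 and the rbd storage name `pool1` from the backup log above):

Code:
# container config
pct config 110

# storage configuration and ceph setup
cat /etc/pve/storage.cfg
cat /etc/pve/ceph.conf

# rbd snapshots of the container's root mountpoint
rbd snap ls pool1/vm-110-disk-0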
 
Hi Stoiko. Thank you for your quick reply. Of course!
I installed it via the GUI, so there should be nothing special...

Code:
arch: amd64
cmode: tty
console: 1
cpulimit: 0
cpuunits: 1024
hostname: galera-01.<HOSTNAME>
memory: 2048
net0: name=eth0,bridge=vmbr1,hwaddr=22:DF:CA:0F:92:CA,ip=dhcp,type=veth
onboot: 1
ostype: debian
protection: 0
rootfs: pool1:vm-110-disk-0,size=20G
swap: 0
tty: 2
unprivileged: 1

[vzdump]
#vzdump backup snapshot
arch: amd64
cmode: tty
console: 1
cpulimit: 0
cpuunits: 1024
hostname: galera-01.<HOSTNAME>
memory: 2048
net0: name=eth0,bridge=vmbr1,hwaddr=22:DF:CA:0F:92:CA,ip=dhcp,type=veth
onboot: 1
ostype: debian
protection: 0
rootfs: pool1:vm-110-disk-0,size=20G
snapstate: prepare
snaptime: 1651822044
swap: 0
tty: 2
unprivileged: 1

Code:
# cat storage.cfg
dir: local
    path /var/lib/vz
    content vztmpl,iso,backup

lvmthin: local-lvm
    thinpool data
    vgname pve
    content images,rootdir

pbs: backup
    datastore backup
    server backup.<HOSTNAME>
    content backup
    fingerprint 4b:57:ab:eb:61:94:48:21:db:2b:c5:60:fa:f6:cf:9d:7a:72:b3:51:39:f0:a6:0d:a6:82:f7:6c:62:54:a9:37
    prune-backups keep-all=1
    username root@pam

rbd: pool1
    content rootdir,images
    krbd 0
    pool pool1

cephfs: cephfs
    path /mnt/pve/cephfs
    content backup,vztmpl,iso
    fs-name cephfs

Code:
# cat ceph.conf
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.0.0.250/24
     fsid = eecc145b-1642-48d2-acaf-58b5c0d07a76
     mon_allow_pool_delete = true
     mon_host = 10.0.0.251 10.0.0.252 10.0.0.250
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.0.0.250/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.kvm-01]
     host = kvm-01
     mds_standby_for_name = pve

[mds.kvm-02]
     host = kvm-02
     mds_standby_for_name = pve

[mds.kvm-03]
     host = kvm-03
     mds_standby_for_name = pve

[mon.kvm-01]
     public_addr = 10.0.0.250

[mon.kvm-02]
     public_addr = 10.0.0.251

[mon.kvm-03]
     public_addr = 10.0.0.252

No snapshots:
Code:
# rbd snap ls pool1/vm-110-disk-0
root@kvm-01:/etc/pve#
 
ok - I tried to reproduce the issue, but only partially succeeded (roughly the steps sketched below):

* started a `vzdump` backup of a container and killed it (with sigkill) in the phase before the rbd snapshot was created
* the next try for vzdump resulted in an error (because the container was still 'locked' from the killed backup job)
* however for me the `pct unlock` went fine
* after a `pct unlock` a vzdump resulted in the same messages as you showed - but then the backup ran through
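In command form, the attempt was roughly the following (a sketch only - the container ID, storage name and pgrep pattern are placeholders, not the exact values used):

Code:
# start a snapshot-mode backup and kill it before the rbd snapshot is created
vzdump 110 --mode snapshot --storage backup &
kill -9 $(pgrep -f 'vzdump 110')

# the next attempt fails because the container is still locked
vzdump 110 --mode snapshot --storage backup

# remove the stale lock - afterwards the backup (with the 'force removal' message) ran through
pct unlock 110
vzdump 110 --mode snapshot --storage backup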

as for logs to check (the commands are put together in the sketch below):
* the system journal is always a first go-to log -> `journalctl -b` (see `man journalctl` for options to limit it to a specific time range)
* if it's ceph related, the ceph logs can contain information - /var/log/ceph/ (check these on all nodes)
* it's always good to take a look at the ceph status - `ceph -s`
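Put together, the checks could look like this (the ceph log path assumes the default packaged layout):

Code:
# system journal for the current boot (narrow it down with --since/--until)
journalctl -b

# ceph daemon logs - check these on every node
ls -ltr /var/log/ceph/

# overall cluster health
ceph -s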

I hope this helps!
 
Hi Stoiko. Thanks for giving it a try. After the weekend it is still happening: I deleted the LXC container and created a new one, but the problem persists. I only re-used the ID - I doubt it's a problem with that specific ID?
I am unable to find anything related to this in the (ceph) logs.
The problem only happens when using snapshot mode.

But when the task is stuck, I am unable to cancel it: via the web UI nothing happens, and killing it with `kill -9 <PID>` doesn't work either.

Is there any other advice you can give me? I have no clue anymore - I expected a new LXC container to work...

EDIT: Using stop mode I made a new backup and restored it to a new ID. Making a backup via snapshot works with the new ID. So even though I deleted the LXC container and built a new one with the SAME ID, it doesn't work. I guess something is wrong in Ceph, but I can't find anything in Ceph that should remain after deleting the old LXC container.
Code:
root@kvm-03:~# rbd snap ls vm-110-disk-0 -p pool1
root@kvm-03:~#
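For what it's worth, the workaround from the edit above would look roughly like this (the new ID 111 and the archive volume ID are placeholders):

Code:
# stop-mode backup of the affected container (this worked)
vzdump 110 --mode stop --storage backup

# list the archives on the backup storage to get the volume ID
pvesm list backup

# restore to a new ID - snapshot-mode backups of the new container then work
pct restore 111 <ARCHIVE-VOLID> --storage pool1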
 
grep the ceph logs for the LXC ID?
or also gather some information about the container's rbd image and compare it to other disk images - for example with the commands sketched below
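A possible starting point for that, using the pool/image names from earlier in the thread (the comparison image name is a placeholder):

Code:
# search the ceph logs on each node for the container's image
grep -r 'vm-110-disk-0' /var/log/ceph/

# compare the image's metadata and features with a disk that backs up fine
rbd info pool1/vm-110-disk-0
rbd info pool1/vm-111-disk-0

# check for lingering watchers or locks on the problematic image
rbd status pool1/vm-110-disk-0
rbd lock list pool1/vm-110-disk-0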

I hope this helps!