v7.2-3 Ceph cluster LXC snapshot stuck

Florius

Hi,
I am unable to create a snapshot of one specific LXC container on my Ceph cluster since I migrated everything to Ceph yesterday:

Code:
INFO: starting new backup job: vzdump 110 --remove 0 --notes-template '{{guestname}}' --mode snapshot --node kvm-01 --storage backup
INFO: Starting Backup of VM 110 (lxc)
INFO: Backup started at 2022-05-06 09:27:24
INFO: status = running
INFO: CT Name: galera-01.<HOSTNAME>
INFO: including mount point rootfs ('/') in backup
INFO: found old vzdump snapshot (force removal)
rbd error: error setting snapshot context: (2) No such file or directory
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'

It gets stuck at this point, and I have to reboot the host before I can run `pct unlock 110`. Any advice on where to look for logs or how to troubleshoot this?
15 other LXC containers and a couple of KVM VMs went through just fine... Thank you!
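
For reference, a rough sketch of the generic checks I can run while it hangs (VMID 110 from above; nothing here is specific to Ceph):

Code:
# is the container still marked as locked?
pct config 110 | grep -i lock
grep '^lock:' /etc/pve/lxc/110.conf
# is the backup worker or an rbd call still running?
ps aux | grep -E '[v]zdump|[r]bd'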
 
could you please share:
* the containers config
* your storage config and Ceph setup (if there's anything special about it, i.e. if it's not a hyperconverged setup created on the GUI)
* a list of the rbd snapshots for the container's root mountpoint - see https://docs.ceph.com/en/quincy/rbd/rbd-snapshot/ (example commands below)
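
for gathering those, roughly something like this should do (the container ID and pool name are taken from your backup log above, adjust if needed):

Code:
pct config 110
cat /etc/pve/storage.cfg
rbd snap ls pool1/vm-110-disk-0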
 
Hi Stoiko. Thank you for your quick reply. Of course!
I installed it via the GUI, so there should be nothing special...

Code:
arch: amd64
cmode: tty
console: 1
cpulimit: 0
cpuunits: 1024
hostname: galera-01.<HOSTNAME>
memory: 2048
net0: name=eth0,bridge=vmbr1,hwaddr=22:DF:CA:0F:92:CA,ip=dhcp,type=veth
onboot: 1
ostype: debian
protection: 0
rootfs: pool1:vm-110-disk-0,size=20G
swap: 0
tty: 2
unprivileged: 1

[vzdump]
#vzdump backup snapshot
arch: amd64
cmode: tty
console: 1
cpulimit: 0
cpuunits: 1024
hostname: galera-01.<HOSTNAME>
memory: 2048
net0: name=eth0,bridge=vmbr1,hwaddr=22:DF:CA:0F:92:CA,ip=dhcp,type=veth
onboot: 1
ostype: debian
protection: 0
rootfs: pool1:vm-110-disk-0,size=20G
snapstate: prepare
snaptime: 1651822044
swap: 0
tty: 2
unprivileged: 1

Code:
# cat storage.cfg
dir: local
    path /var/lib/vz
    content vztmpl,iso,backup

lvmthin: local-lvm
    thinpool data
    vgname pve
    content images,rootdir

pbs: backup
    datastore backup
    server backup.<HOSTNAME>
    content backup
    fingerprint 4b:57:ab:eb:61:94:48:21:db:2b:c5:60:fa:f6:cf:9d:7a:72:b3:51:39:f0:a6:0d:a6:82:f7:6c:62:54:a9:37
    prune-backups keep-all=1
    username root@pam

rbd: pool1
    content rootdir,images
    krbd 0
    pool pool1

cephfs: cephfs
    path /mnt/pve/cephfs
    content backup,vztmpl,iso
    fs-name cephfs

Code:
# cat ceph.conf
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.0.0.250/24
     fsid = eecc145b-1642-48d2-acaf-58b5c0d07a76
     mon_allow_pool_delete = true
     mon_host = 10.0.0.251 10.0.0.252 10.0.0.250
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.0.0.250/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.kvm-01]
     host = kvm-01
     mds_standby_for_name = pve

[mds.kvm-02]
     host = kvm-02
     mds_standby_for_name = pve

[mds.kvm-03]
     host = kvm-03
     mds_standby_for_name = pve

[mon.kvm-01]
     public_addr = 10.0.0.250

[mon.kvm-02]
     public_addr = 10.0.0.251

[mon.kvm-03]
     public_addr = 10.0.0.252

No snapshots:
Code:
# rbd snap ls pool1/vm-110-disk-0
root@kvm-01:/etc/pve#
 
ok - tried to reproduce the issue - but only partially succeeded.

* started a `vzdump` backup of a container and killed it (with sigkill) in the phase before the rbd snapshot was created
* the next try for vzdump resulted in an error (because the container was still 'locked' from the killed backup job)
* however for me the `pct unlock` went fine
* after a `pct unlock`, a vzdump run resulted in the same messages as you showed - but then the backup ran through (a sketch for clearing the leftover snapshot state follows below)
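
in case a leftover 'vzdump' snapshot section is still present in the container config (like the `snapstate: prepare` block you posted), it can usually be cleared with pct - a sketch, use with care:

Code:
pct unlock 110
pct delsnapshot 110 vzdump --force   # --force drops the config entry even if removing the disk snapshot fails
pct listsnapshot 110                 # verify the 'vzdump' snapshot is gone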

as for logs to check:
* the system journal is always a good first place to look -> `journalctl -b` (see `man journalctl` for options to limit it to a specific time range)
* if it's Ceph related, the Ceph logs can contain information - /var/log/ceph/ (check these on all nodes)
* it's always good to take a look at the Ceph status - `ceph -s` (example invocations below)
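
to put those into concrete commands (the time range and image name here are just examples taken from this thread):

Code:
journalctl -b --since "2022-05-06 09:00" --until "2022-05-06 10:00"
grep -r 'vm-110-disk-0' /var/log/ceph/    # run this on every node
ceph -s
ceph health detail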

I hope this helps!
 
Hi Stoiko. Thanks for giving it a try. After the weekend it is still happening: I deleted the LXC container and created a new one, but the problem persists. I only re-used the ID, as I doubt the problem is tied to that specific ID?
I am unable to find anything related to this in the (ceph) logs.
The problem only happens when using snapshot mode.

But once the task is stuck I am unable to cancel it: stopping it from the web UI does nothing, and killing it with `kill -9 <PID>` doesn't work either.
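
The fact that `kill -9` has no effect makes me suspect the process is sitting in uninterruptible sleep ('D' state) on the hung rbd call - a quick check, as a sketch:

Code:
# processes in 'D' state are blocked in the kernel and cannot be killed, not even with SIGKILL
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'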

Is there any other advice you can give me? I have no clue anymore; I expected a fresh LXC container to work...

EDIT: Using stop mode I made a new backup and restored it to a new ID. Making a backup via snapshot works with the new ID. So even though I deleted the LXC container and built a new one with the SAME ID, it doesn't work. I guess something is wrong in Ceph, but I can't find anything in Ceph that should have remained after deleting the old LXC container.
Code:
root@kvm-03:~# rbd snap ls vm-110-disk-0 -p pool1
root@kvm-03:~#
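
For reference, these are the rbd-side checks for leftovers I can think of (maybe I'm looking in the wrong place):

Code:
rbd ls pool1                       # all images in the pool
rbd status pool1/vm-110-disk-0     # any clients still watching the image?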
 
grep the ceph-logs for the lxc id?
or also gather some information about the container's rbd-image and compare it to other disk images
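
a rough example of both (vm-111 here is just a placeholder for any container whose snapshots work fine):

Code:
grep -rn 'vm-110' /var/log/ceph/      # on all nodes
rbd info pool1/vm-110-disk-0          # compare format/features ...
rbd info pool1/vm-111-disk-0          # ... with a container that snapshots fine (placeholder ID)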

I hope this helps!
 
