Backup containers on Ceph storage

hahosting

Well-Known Member
Aug 20, 2018
Hi all, we're having trouble backing up Containers on Ceph storage.

Workflow as follows:
  • Container (LXC) on Proxmox 5.3-5 cluster
  • Dedicated Ceph 3/2 replicated pool for storage, with KRBD enabled (storage definition sketched after this list)
  • The RBD image backing the root disk is mapped on the host as /dev/rbd3
  • Backup container via Proxmox GUI or scheduled job:
    • Storage: NFS server
    • Mode: Snapshot
    • Compression: LZO
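
For context, the relevant entry in /etc/pve/storage.cfg looks roughly like this; the storage ID is assumed to match the pool name seen in the backup log, and krbd 1 is the flag that enables the kernel client for the pool:

Code:
rbd: ha-container-32-pool
        pool ha-container-32-pool
        content rootdir,images
        krbd 1
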
The backup completes OK, but the snapshot doesn't delete; it gets stuck in the deleting state. As a result, we can't shut down, migrate, or back up the container again.

The only way of fixing it (so far) is to do the following (commands collected in a sketch after this list):
  • Shut down the Container internally via the console (not the GUI)
  • run "rbd showmapped | grep 538" on the host to get the mapped disks
  • run "rbd unmap -o force /dev/rbd3" and "rbd unmap -o force /dev/rbd4" against the mounted disks to unmount them
  • run "pct unlock 538" to relase the snapshot-delete lock
  • Delete the snapshot via the Proxmox GUI
  • Restart the Container
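
For convenience, here are those commands collected as a rough shell sketch, assuming container ID 538 and the devices /dev/rbd3 and /dev/rbd4 from our case (substitute your own; pct delsnapshot is just the CLI alternative to deleting the snapshot in the GUI):

Code:
# Shut the container down from inside its console first, then on the host:
rbd showmapped | grep 538        # find which /dev/rbdN devices belong to CT 538
rbd unmap -o force /dev/rbd3     # force-unmap the stuck root disk mapping
rbd unmap -o force /dev/rbd4     # force-unmap the stuck vzdump snapshot mapping
pct unlock 538                   # release the snapshot-delete lock
pct delsnapshot 538 vzdump       # or delete the 'vzdump' snapshot via the GUI
pct start 538                    # bring the container back up
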
The full output from the backup job is below, but the standout line for me is:
"can't unmap rbd device /dev/rbd/ha-container-32-pool/vm-538-disk-0@vzdump: rbd: sysfs write failed"

We've seen this before when running a regular VM on KRBD storage (the disk isn't released on migration or backup), but I thought containers needed KRBD to work properly? To me it looks like KRBD isn't unmapping the image during the backup operation.
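
For anyone hitting the same thing, this is roughly how we checked that the vzdump snapshot mapping was still hanging around after the backup (pool and image names are the ones from our log, adjust to yours):

Code:
# Is the snapshot created by vzdump still mapped on the host?
rbd showmapped | grep vzdump
# Watchers on the image itself -- a leftover kernel mapping shows up here
rbd status ha-container-32-pool/vm-538-disk-0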

Has anyone seen this before? Any ideas?

Thanks,
Stuart.

Code:
Virtual Environment 5.3-5
Container 538 (ct-XXXXX.co.uk) on node 'vms603'
INFO: starting new backup job: vzdump 538 --remove 0 --mode snapshot --compress lzo --mailto XXXXX.XXXXX@hahosting.com --storage XXXXX.XX.hahosting.net --node vms603
INFO: Starting Backup of VM 538 (lxc)
INFO: status = running
INFO: CT Name: ct-XXXXX.co.uk
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
2019-03-14 13:39:32.159154 7fa9119f3100 -1 did not load config file, using default settings.
/dev/rbd4
2019-03-14 13:39:33.318460 7f0cb0552100 -1 did not load config file, using default settings.
INFO: creating archive '/mnt/pve/XXXXX.XX.hahosting.net/dump/vzdump-lxc-538-2019_03_14-13_39_30.tar.lzo'
INFO: Total bytes written: 991344640 (946MiB, 18MiB/s)
INFO: archive file size: 466MB
INFO: remove vzdump snapshot
2019-03-14 13:40:30.449460 7fe042842100 -1 did not load config file, using default settings.
rbd: sysfs write failed
can't unmap rbd device /dev/rbd/ha-container-32-pool/vm-538-disk-0@vzdump: rbd: sysfs write failed
INFO: Finished Backup of VM 538 (00:01:00)
INFO: Backup job finished successfully
TASK OK
 
Hi Alwin, thanks for your reply - it's pointed us to the solution.

**holds head in shame for not checking the syslog first**

Multipath was still installed on these servers, a throwback to when we had OCFS2 over iSCSI from an HP P2000 SAN. Now that we're using Ceph, the iSCSI/OCFS2 part has gone, but multipathd was still running on the hosts. The syslog showed:

Code:
root@vms602:~# tail /var/log/syslog
Mar 21 14:47:30 vms602 multipathd[603]: rbd10: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd33: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd41: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd43: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd45: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd47: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd48: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd51: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd12: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd9: unusable path

We did originally blacklist ^rbd within /etc/multipath.conf, but the problem came back the next day. We've now removed multipathd completely, and containers on Ceph storage back up again (rough commands below).
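
For reference, this is roughly what the two approaches looked like on our hosts; the blacklist stanza is shown as a comment and follows the standard multipath.conf devnode syntax, and the package name assumes the stock Debian multipath-tools package:

Code:
# Attempt 1: blacklist rbd devices -- add to /etc/multipath.conf, then restart multipathd.
# This only held until the next day for us.
#   blacklist {
#       devnode "^rbd[0-9]*"
#   }
systemctl restart multipathd

# Attempt 2 (what fixed it for good): remove multipath from the Ceph nodes entirely
systemctl disable --now multipathd
apt-get remove --purge multipath-tools
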

Multipath had "grabbed" the disk and tried to use it as a device-mapper ("dm") disk. When the first backup completed, the lock on the disk wasn't released because of multipath, so the snapshot was never removed and every subsequent backup failed.
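
If you want to confirm the same thing on your own hosts, something like this should show whether device-mapper has claimed an rbd mapping (the device name is just the one from our case):

Code:
# A dm-N entry under holders means device-mapper (multipath) has claimed the device
ls /sys/block/rbd3/holders/
# multipath's own view of its maps and paths
multipath -ll
# lsblk shows any dm device stacked on top of the rbd device
lsblk /dev/rbd3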

Thanks for your help. Hopefully this will be useful to someone else too!!!!!

One final thing: I've read on here that you no longer need a separate KRBD-enabled pool for containers, as they will use KRBD by default even if the pool isn't KRBD enabled - I can confirm this is true. We have VMs and containers on the same Ceph pool (KRBD not enabled); the VMs access storage in user space via librbd (QEMU/KVM), whilst the containers mount their disks with KRBD, even on the same pool. A quick way to check this is sketched below.
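
A rough way to verify it on a node, using our container 538 as the example (image names and VMIDs are from our setup, adjust to yours):

Code:
# Containers: the kernel client maps the image, so it appears in showmapped
rbd showmapped | grep vm-538
# VMs: no kernel mapping; the KVM process instead carries an rbd: drive argument (librbd, user space)
ps aux | grep 'rbd:' | grep -v grep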

Thanks,
Stuart.
 
