Backup containers on Ceph storage

hahosting

Well-Known Member
Aug 20, 2018
Hi all, we're having trouble backing up Containers on Ceph storage.

Workflow as follows:
  • Container (LXC) on Proxmox 5.3-5 cluster
  • Dedicated Ceph 3/2 replicated pool for storage, with KRBD enabled (storage definition sketched after this list)
  • The RBD image backing the root disk is mapped on the host as /dev/rbd3
  • Backup container via Proxmox GUI or scheduled job:
    • Storage: NFS server
    • Mode: Snapshot
    • Compression: LZO
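
For context, the relevant entry in /etc/pve/storage.cfg looks roughly like this; the storage ID is assumed to match the pool name seen in the backup log, and krbd 1 is the flag that enables the kernel client for the pool:

Code:
rbd: ha-container-32-pool
        pool ha-container-32-pool
        content rootdir,images
        krbd 1
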
The backup completes OK, but the snapshot doesn't delete; it gets stuck in the deleting state. As a result, we can't shut down, migrate, or back up the container again.

The only way of fixing it (so far) is to do the following (commands collected in a sketch after this list):
  • Shut down the Container internally via the console (not the GUI)
  • run "rbd showmapped | grep 538" on the host to get the mapped disks
  • run "rbd unmap -o force /dev/rbd3" and "rbd unmap -o force /dev/rbd4" against the mounted disks to unmount them
  • run "pct unlock 538" to relase the snapshot-delete lock
  • Delete the snapshot via the Proxmox GUI
  • Restart the Container
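
For convenience, here are those commands collected as a rough shell sketch, assuming container ID 538 and the devices /dev/rbd3 and /dev/rbd4 from our case (substitute your own; pct delsnapshot is just the CLI alternative to deleting the snapshot in the GUI):

Code:
# Shut the container down from inside its console first, then on the host:
rbd showmapped | grep 538        # find which /dev/rbdN devices belong to CT 538
rbd unmap -o force /dev/rbd3     # force-unmap the stuck root disk mapping
rbd unmap -o force /dev/rbd4     # force-unmap the stuck vzdump snapshot mapping
pct unlock 538                   # release the snapshot-delete lock
pct delsnapshot 538 vzdump       # or delete the 'vzdump' snapshot via the GUI
pct start 538                    # bring the container back up
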
The full output from the backup job is below, but the standout line for me is:
"can't unmap rbd device /dev/rbd/ha-container-32-pool/vm-538-disk-0@vzdump: rbd: sysfs write failed"

We've seen this before when running a regular VM on KRBD storage (the disk isn't released on migration or backup), but I thought containers needed KRBD to work properly? To me it looks like KRBD isn't unmapping the image during the backup operation.
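
For anyone hitting the same thing, this is roughly how we checked that the vzdump snapshot mapping was still hanging around after the backup (pool and image names are the ones from our log, adjust to yours):

Code:
# Is the snapshot created by vzdump still mapped on the host?
rbd showmapped | grep vzdump
# Watchers on the image itself -- a leftover kernel mapping shows up here
rbd status ha-container-32-pool/vm-538-disk-0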

Has anyone seen this before? Any ideas?

Thanks,
Stuart.

Code:
Virtual Environment 5.3-5
Container 538 (ct-XXXXX.co.uk) on node 'vms603'
INFO: starting new backup job: vzdump 538 --remove 0 --mode snapshot --compress lzo --mailto XXXXX.XXXXX@hahosting.com --storage XXXXX.XX.hahosting.net --node vms603
INFO: Starting Backup of VM 538 (lxc)
INFO: status = running
INFO: CT Name: ct-XXXXX.co.uk
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
2019-03-14 13:39:32.159154 7fa9119f3100 -1 did not load config file, using default settings.
/dev/rbd4
2019-03-14 13:39:33.318460 7f0cb0552100 -1 did not load config file, using default settings.
INFO: creating archive '/mnt/pve/XXXXX.XX.hahosting.net/dump/vzdump-lxc-538-2019_03_14-13_39_30.tar.lzo'
INFO: Total bytes written: 991344640 (946MiB, 18MiB/s)
INFO: archive file size: 466MB
INFO: remove vzdump snapshot
2019-03-14 13:40:30.449460 7fe042842100 -1 did not load config file, using default settings.
rbd: sysfs write failed
can't unmap rbd device /dev/rbd/ha-container-32-pool/vm-538-disk-0@vzdump: rbd: sysfs write failed
INFO: Finished Backup of VM 538 (00:01:00)
INFO: Backup job finished successfully
TASK OK
 
Hi Alwin, thanks for your reply - it's pointed us to the solution.

**holds head in shame for not checking the syslog first**

Multipath was still installed on these servers, a throwback to when we had OCFS2 over iSCSI from an HP P2000 SAN. Now that we're using Ceph, the iSCSI/OCFS2 part has gone, but multipathd was still running on the hosts. The syslog showed:

Code:
root@vms602:~# tail /var/log/syslog
Mar 21 14:47:30 vms602 multipathd[603]: rbd10: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd33: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd41: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd43: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd45: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd47: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd48: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd51: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd12: unusable path
Mar 21 14:47:30 vms602 multipathd[603]: rbd9: unusable path

We did originally blacklist ^rbd within /etc/multipath.conf, but the problem came back the next day. We've now removed multipathd completely, and containers on Ceph storage back up again (rough commands below).
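
For reference, this is roughly what the two approaches looked like on our hosts; the blacklist stanza is shown as a comment and follows the standard multipath.conf devnode syntax, and the package name assumes the stock Debian multipath-tools package:

Code:
# Attempt 1: blacklist rbd devices -- add to /etc/multipath.conf, then restart multipathd.
# This only held until the next day for us.
#   blacklist {
#       devnode "^rbd[0-9]*"
#   }
systemctl restart multipathd

# Attempt 2 (what fixed it for good): remove multipath from the Ceph nodes entirely
systemctl disable --now multipathd
apt-get remove --purge multipath-tools
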

Multipath had "grabbed" the disk and tried to use it as a device-mapper ("dm") disk. When the first backup completed, the lock on the disk wasn't released because of multipath, so the snapshot was never removed and every subsequent backup failed.
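
If you want to confirm the same thing on your own hosts, something like this should show whether device-mapper has claimed an rbd mapping (the device name is just the one from our case):

Code:
# A dm-N entry under holders means device-mapper (multipath) has claimed the device
ls /sys/block/rbd3/holders/
# multipath's own view of its maps and paths
multipath -ll
# lsblk shows any dm device stacked on top of the rbd device
lsblk /dev/rbd3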

Thanks for your help. Hopefully this will be useful to someone else too!!!!!

One final thing: I've read on here that you no longer need a separate KRBD-enabled pool for containers, as they will use KRBD by default even if the pool isn't KRBD enabled - I can confirm this is true. We have VMs and containers on the same Ceph pool (KRBD not enabled); the VMs access storage in user space via librbd (QEMU/KVM), whilst the containers mount their disks with KRBD, even on the same pool. A quick way to check this is sketched below.
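
A rough way to verify it on a node, using our container 538 as the example (image names and VMIDs are from our setup, adjust to yours):

Code:
# Containers: the kernel client maps the image, so it appears in showmapped
rbd showmapped | grep vm-538
# VMs: no kernel mapping; the KVM process instead carries an rbd: drive argument (librbd, user space)
ps aux | grep 'rbd:' | grep -v grep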

Thanks,
Stuart.
 
