bugs related to ceph and disk move

grin

I'll be brief.

1) When moving between different ceph pools which have different permissions (I don't yet know which one it chokes on), and you see this in the move log:
Code:
mount: /var/lib/lxc/2003/.copy-volume-1: WARNING: source write-protected, mounted read-only.
then expect the destination to be empty (all 0x00), and if you asked for deletion of the source, say a loud goodbye to your image, because it's gone.
(As a side note, rbd export and rbd import may do it faster, more reliably and without data loss.)
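(Purely to illustrate that side note, not anything PVE does internally: a checked manual copy between the two clusters could look roughly like the sketch below. All config paths and image names are made-up placeholders; run_command dies on a non-zero exit, so a failure can't go unnoticed. On the shell it is simply the corresponding rbd export and rbd import calls.)
Perl:
# Illustration only, not PVE code: copy an image between two ceph clusters
# with rbd export/import via a temporary file. Paths and names are placeholders.
use PVE::Tools;

my $tmp = '/var/tmp/vm-2003-disk-0.raw';
PVE::Tools::run_command(['rbd', '-c', '/etc/ceph/source-cluster.conf',
    'export', 'pool1/vm-2003-disk-0', $tmp]);
PVE::Tools::run_command(['rbd', '-c', '/etc/ceph/target-cluster.conf',
    'import', $tmp, 'pool2/vm-2003-disk-0']);
unlink $tmp;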

2) When you move image-0 from pool1 to pool2 and retain the old image, it's fine, you get image-0 there too. When you move image-2, you'll get image-1 (it just takes the next free number); it's confusing but okay. You may move image-1 into image-2, yay.
Now try to move image-1 back, since you have changed your mind. Well, you really should not, as the result will be that the mount point of image-2 will be replaced by the old content of the mount point of image-1. It will be a mess.

(I was lying a bit at #1: they were completely separate ceph clusters, not just separate pools, but the result is the same when moving between storages with different ceph permissions.)
 
I had time to test, and my guess was wrong: it is not about permissions.

The move-volume operation is buggy: it seems it cannot move images between rbd storages. My guess, without checking the code, is that it forgets to map the target image.

The result is fatal anyway: the copy will fail and the destination will be empty.

Why does nobody check the return value of mountpoint_mount in PVE/LXC.pm?
Perl:
    my @mounted;
    eval {
        # mount and copy
        mkdir $src;
        mountpoint_mount($src_mp, $src, $storage_cfg, $snapname, $rootuid, $rootgid);
        # ^^^^^ no check of retval and $!?
        push @mounted, $src;
        mkdir $dest;
        mountpoint_mount($dst_mp, $dest, $storage_cfg, undef, $rootuid, $rootgid);
        # ^^^^^ no check of retval and $!?
        push @mounted, $dest;

        $bwlimit //= 0;

        run_command([
            'rsync',
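Just to illustrate the kind of check I mean (a sketch of the idea only, with a made-up helper name, not existing PVE code): after each mountpoint_mount the code could at least verify that the directory really is a mountpoint and that it did not end up read-only before rsync is started.
Perl:
# Sketch only (hypothetical helper, not existing PVE code): verify that a
# freshly mounted copy target really is a mountpoint and is writable.
use PVE::Tools;

sub assert_mounted_rw {
    my ($dir) = @_;

    # run_command dies on non-zero exit, so this dies if $dir is not a mountpoint
    PVE::Tools::run_command(['mountpoint', '-q', $dir]);

    # dies if the kernel silently downgraded the mount to read-only
    my $opts = '';
    PVE::Tools::run_command(['findmnt', '-no', 'OPTIONS', '--target', $dir],
        outfunc => sub { $opts .= shift; });
    die "$dir is mounted read-only\n" if $opts =~ /(^|,)ro(,|$)/;
}

# e.g. right after the second mountpoint_mount() above:
# assert_mounted_rw($dest);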
 
Summary:
  • ceph krbd cannot handle the same pool name on multiple clusters, mainly because /dev/rbd/<poolname>/<imagename> does not consider the cluster-id,
  • it means that most PVE functions relying on a "working rbd call" fail silently, mostly due to rbd map failing to cleanly map samepool/sameimage from a different cluster-id: it allocates a new /dev/rbdX device but removes the "old" one,
  • due to the missing [writable] device the rsync fails,
  • using a different pool name works,
  • using a different image name also works (if there is already vm-1234-disk-0 then the new name will be disk-1 and it will not clash).
So, while technically it isn't pve's fault, it's still dangerous: mountpoint_mount doesn't check for errors, PVE::Storage::activate_volumes (I believe) fails as well, and there are basically no checks of success, as seen above.
The failure happens like this: /dev/rbd111 is mounted into src, then rbd creates /dev/rbd112 for dst with the same image, which will be read-only since it's already opened rw; mount mounts it read-only and rsync fails to write.

Right now this could be checked on various levels:
  • on the highest level, the mount is read-only, so rsync must result in a failure, and this can (and really should) be checked,
  • the dst mount is definitely not writable, which could also be checked,
  • since we know the problem, it could be checked whether /dev/rbd/<pool>/<image> already exists, since then it's a sure thing that mapping it "again" (regardless of cluster-id) will fail (see the sketch after this list),
  • various functions may give an error result which is not checked.
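For the third point such a pre-check would be trivial; a sketch (hypothetical helper, not existing PVE code):
Perl:
# Sketch only: refuse to map a pool/image whose udev symlink already exists,
# because a second mapping (possibly from another cluster) would clash with it.
sub check_rbd_symlink_free {
    my ($pool, $image) = @_;
    my $link = "/dev/rbd/$pool/$image";
    die "refusing to map $pool/$image: $link already exists"
        . " (already mapped, possibly from another cluster?)\n"
        if -e $link;
}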
You could set this [SOLVED], but these really should be fixed.
 
mountpoint_mount should (AFAICT) error out if mounting fails. probably in this case mount doesn't fail though, it just "downgrades" the mount to ro. the same applies to rsync in $copy_volume - that should error out if copying fails, then we unmount/cleanup, and then we error out in turn. so either the bug is somewhere else, or something along this error handling chain doesn't work as it should.

just to ensure I got your setup right:
- two rbd storages (with at least one being external) using the same pool name (I guess this is the root cause besides some error handling taking a wrong turn somewhere, and something we might want to forbid until we find a solution)
- move volume of a container from storage A to B, where B's next free volume ID slot is exactly the source volume ID
- returns success even though nothing was copied (and removes source volume if that box is ticked)

could you also post pveversion -v output? I'll file a bug report then with all the info so that we can hopefully reproduce and find a fix.
 
oh, and I have a theory about what's going wrong, although I haven't tried to reproduce the issue in practice yet:

- source volume gets mapped normally
- target volume is not mapped at all, but the same path as the source volume is returned (they'd have the same device path, and it already exists, so we assume it's already mapped, just as if map_volume were called a second time for the source volume)
- source volume is mounted normally (rw)
- target volume (== device path of mapped source volume) is mounted ro, but that doesn't mean the mount call fails, so error handling is not triggered at this point

at this point, both mount dirs have the same data since they are in fact mounts of the same volume, 1x rw, 1x ro

- rsync works flawlessly - source and target are identical, so nothing to copy, and no error
- we think the copying worked, so we proceed with unmounting and cleaning up the tmp mountpoint dirs
- if the box was ticked, the source volume is removed
- target volume is empty, because it was never mapped or mounted or rsynced to

things that we could/should do:
- improve the mapping process to take the cluster id into account (or forbid storage.cfg referencing the same pool name twice, but that is obviously more limiting)
- check after mounting that, if we expect a RW mount, it's not RO for whatever reason
- check that source and target are not the same mount source when copying a volume, as this can only be an error/buggy code path
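a rough sketch of that last safeguard (illustrative only, using findmnt rather than any existing PVE helper):
Perl:
# illustrative only: bail out if the source and target directories are backed
# by the same mount source, which can only happen on a buggy code path
use PVE::Tools;

sub assert_different_mount_source {
    my ($src, $dest) = @_;
    my %source;
    for my $dir ($src, $dest) {
        PVE::Tools::run_command(['findmnt', '-no', 'SOURCE', '--target', $dir],
            outfunc => sub { $source{$dir} = shift; });
    }
    die "source and target are the same device ($source{$src})\n"
        if $source{$src} eq $source{$dest};
}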

the last two are basically safeguards against similar issues cropping up; the first one might be a bit involved, we'll have to see.
 
The easiest way to test is to create two separate ceph clusters and create the same pool on both.

Indeed, it's possible that the target mount already contains the source's content and rsync happily does nothing.

I believe rbd does the wrong thing as well, since it can't map the matching combos into the /dev/rbd/<pool>/<image> schema. I am not sure whether rbd actually gives any indication that something's wrong. Therefore I am not sure you can do a lot of useful things about No. 1.

No. 2 and No. 3 seem to be useful tests anyway.

pveversion -v
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.78-2-pve: 5.4.78-2
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-1
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
 
I believe rbd does the wrong thing as well, since it can't map the matching combos into the /dev/rbd/<pool>/<image> schema. I am not sure whether rbd actually gives any indication that something's wrong. Therefore I am not sure you can do a lot of useful things about No. 1.
since we control the ceph packages we can actually do something (and in this case, it's probably okay for upstream as well ;)) - the info about which mapped image belongs to which cluster is available, just not exposed by the current udev rule. filed #3969 / ceph/pve-storage (mid-term improvement to actually support this scenario) and #3970 / pve-container (stop-gap fix to prevent damage by running into it).
 
