Live migration with ceph sometimes fails

mart.v

Well-Known Member
Mar 21, 2018
Hi everyone,

I'm using the latest Proxmox (6.3) and am seeing a strange issue during live migration of KVM machines running on Ceph block storage (the Ceph cluster was created through Proxmox). The cluster has been running fine for several years (it was previously on Proxmox 5). The issue started only recently; I was not able to determine the exact time or action when it first happened (so I don't really know whether it is linked to some update etc.).

Code:
2021-03-23 09:46:08 starting migration of VM 177 to node 'node7' (10.10.254.107)
2021-03-23 09:46:08 ERROR: Failed to sync data - rbd error: rbd: listing images failed: (2) No such file or directory
2021-03-23 09:46:08 aborting phase 1 - cleanup resources
2021-03-23 09:46:08 ERROR: migration aborted (duration 00:00:01): Failed to sync data - rbd error: rbd: listing images failed: (2) No such file or directory
TASK ERROR: migration aborted

But when I repeat the same action, it eventually completes OK. Sometimes on the first try, sometimes on the second, sometimes on the third...
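
Since the failure is intermittent, it may help to check whether the image listing itself is flaky outside of a migration. A minimal sketch, assuming a POSIX shell on one of the nodes; `count_failures` is a hypothetical helper (not part of Proxmox or Ceph) that runs any command a number of times and reports how often it exited non-zero:

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]
# Hypothetical helper: runs CMD N times, discards its output,
# and prints the number of runs that exited with a non-zero status.
count_failures() {
    attempts=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example usage (<POOLNAME> is a placeholder for the pool holding the VM disk):
#   count_failures 50 rbd ls -l <POOLNAME>
```

If this prints anything other than 0, the listing fails on its own and the migration is only where it becomes visible.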

Any ideas?
Thanks
 
Which version of Ceph are you using?
Code:
ceph --version

What is the output of
Code:
pveceph pool ls
?
 
# ceph --version
ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)

# pveceph pool ls

┌────────────────────┬──────┬──────────┬────────┬───────────────────┬─────────────────┬──────────────────────┬────────────────┐
│ Name               │ Size │ Min Size │ PG Num │ PG Autoscale Mode │ Crush Rule Name │               %-Used │           Used │
╞════════════════════╪══════╪══════════╪════════╪═══════════════════╪═════════════════╪══════════════════════╪════════════════╡
│ XXX_data           │    2 │        1 │     32 │ on                │ fast            │ 0.000898373778909445 │    11920613376 │
├────────────────────┼──────┼──────────┼────────┼───────────────────┼─────────────────┼──────────────────────┼────────────────┤
│ XXX_metadata       │    3 │        2 │     32 │ on                │ fast            │ 4.66130768472794e-05 │      617986826 │
├────────────────────┼──────┼──────────┼────────┼───────────────────┼─────────────────┼──────────────────────┼────────────────┤
│ YYY_data           │    3 │        2 │    128 │ on                │ slow            │    0.812334537506104 │ 23151852322816 │
├────────────────────┼──────┼──────────┼────────┼───────────────────┼─────────────────┼──────────────────────┼────────────────┤
│ YYY_metadata       │    3 │        2 │     32 │ on                │ fast            │ 0.000821315567009151 │    10897280475 │
├────────────────────┼──────┼──────────┼────────┼───────────────────┼─────────────────┼──────────────────────┼────────────────┤
│ proxmox            │    3 │        2 │    512 │ on                │ fast            │    0.772557318210602 │ 45030830518837 │
├────────────────────┼──────┼──────────┼────────┼───────────────────┼─────────────────┼──────────────────────┼────────────────┤
│ proxmox_hdd        │    3 │        2 │     64 │ on                │ slow            │    0.688848495483398 │ 11840960660342 │
├────────────────────┼──────┼──────────┼────────┼───────────────────┼─────────────────┼──────────────────────┼────────────────┤
│ proxmox_ssd_double │    2 │        1 │    512 │ on                │ fast            │    0.559283971786499 │ 16823821653073 │
└────────────────────┴──────┴──────────┴────────┴───────────────────┴─────────────────┴──────────────────────┴────────────────┘
 
hi,

could you also post the outputs from:
* pveceph status
* cat /etc/pve/ceph.conf
* rbd ls -l <POOLNAME> where VM 177 resides
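
If it is not obvious which pool that is: the disk lines in the output of `qm config 177` start with a storage ID, and that storage's entry in /etc/pve/storage.cfg names the pool. A rough sketch for pulling the storage IDs out of the config, assuming the usual `<bus><n>: <storage>:<volume>,...` line format; `disk_storages` is a hypothetical helper and `ceph_vm` in the example is a made-up storage ID:

```shell
#!/bin/sh
# disk_storages: reads `qm config <vmid>` output on stdin and prints the
# storage ID of every disk line (scsiN, virtioN, sataN, ideN, efidiskN).
# Lines without a "<storage>:<volume>" value (e.g. "ide2: none,media=cdrom")
# and non-disk lines (e.g. net0) are skipped.
disk_storages() {
    sed -nE 's/^(scsi|virtio|sata|ide|efidisk)[0-9]+: *([^:,]+):.*/\2/p'
}

# Example usage on a node:
#   qm config 177 | disk_storages | sort -u
# A line such as "scsi0: ceph_vm:vm-177-disk-0,size=32G" (made-up storage ID)
# yields "ceph_vm".
```

Then run `rbd ls -l` against the pool configured for that storage.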
 
Hi,

Code:
# pveceph status
  cluster:
    id:     ecc963a4-009f-4236-87fe-e672a7cb5d49
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum node97,node98,node99,node2,node1 (age 4d)
    mgr: node2(active, since 4d), standbys: node99, node97, node98, node4
    mds: XXX:1 YYY:1 {XXX:0=node8=up:active,YYY:0=node6=up:active} 2 up:standby
    osd: 79 osds: 79 up (since 40h), 79 in (since 4d)
 
  data:
    pools:   7 pools, 1312 pgs
    objects: 31.18M objects, 34 TiB
    usage:   88 TiB used, 40 TiB / 128 TiB avail
    pgs:     1311 active+clean
             1    active+clean+scrubbing+deep
 
  io:
    client:   131 MiB/s rd, 84 MiB/s wr, 997 op/s rd, 4.39k op/s wr

# cat /etc/pve/ceph.conf
Code:
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 172.16.254.0/24
         fsid = ecc963a4-009f-4236-87fe-e672a7cb5d49
         mon_allow_pool_delete = true
         mon_host = 172.16.254.97, 172.16.254.98, 172.16.254.99, 172.16.254.102, 172.16.254.101
         osd_journal_size = 5120
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 172.16.254.0/24


[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring


[mds]
         keyring = /var/lib/ceph/mds/ceph-$id/keyring
         mds_cache_memory_limit = 6442450944


#[osd]
#        keyring = /var/lib/ceph/osd/ceph-$id/keyring


[mds.node6]
         host = node6
         mds_standby_for_name = pve


[mds.node7]
         host = node7
         mds_standby_for_name = pve


[mds.node8]
         host = node8
         mds_standby_for_name = pve


[mds.node97]
         host = node97
         mds_standby_for_name = pve
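
With five monitors in `mon_host`, one thing that might be worth ruling out is that the listing only fails when a particular monitor is contacted. This is only a guess, not a confirmed cause. A minimal sketch, assuming the config layout shown above; `mon_hosts` is a hypothetical helper that prints one monitor address per line, which can then be passed to `rbd -m`:

```shell
#!/bin/sh
# mon_hosts: reads a ceph.conf on stdin and prints each address from the
# "mon_host = a, b, c" line on its own line, with surrounding spaces removed.
mon_hosts() {
    sed -n 's/^[[:space:]]*mon_host[[:space:]]*=[[:space:]]*//p' \
        | tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}

# Example usage (<POOLNAME> is a placeholder):
#   mon_hosts < /etc/pve/ceph.conf | while read -r mon; do
#       rbd -m "$mon" ls -l <POOLNAME> >/dev/null 2>&1 \
#           && echo "$mon ok" || echo "$mon FAILED"
#   done
```

If every monitor answers consistently over several runs, the monitors are probably not the source of the error.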
 

Attachments

  • rbd.txt
    32.3 KB · Views: 4
