[SOLVED] stuck for 4 days on "rescan volumes"

peterG

Member
So, the question is (details below): can I abruptly interrupt or stop this task and simply fire up the VM it has restored?

Longer story with details:
I did a vzdump of a server a few days ago; the backup took about 30 hours because it was compressed through gzip down to about 660 GB (it's a big server with a lot of content on a boot drive plus 2 RAID arrays holding the data, and I vzdumped the whole thing).

I then physically moved the vzdump *.vma.gz to a new node (a single node, NOT part of a cluster, with its own local storage) and performed a restore, which took a bit over 24 hours. The task status shows that it went to 100% complete, and THEN after that it has been stuck on "rescan volumes" for a bit under 4 full days.

Another VM that was created while this restore was running hasn't finished being created either and still shows its task as running. I suspect that the first restore has to complete before any more tasks can finish, but I don't know.
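
For context, the rough command-line equivalent of what I did is sketched below (I used the GUI for the restore; the archive name, dump path, and VMIDs match the task log further down, but treat the hostname and exact commands as illustrative, not what I literally typed):
Code:
# on the old node: full backup of VM 103, gzip-compressed
vzdump 103 --compress gzip --dumpdir /mnt/backup
# copy (or physically move) the resulting archive to the new node
scp /mnt/backup/vzdump-qemu-103-2019_05_25-16_46_23.vma.gz newnode:/mnt/mira2/dump/
# on the new node: restore the archive as VMID 113 onto local-lvm
qmrestore /mnt/mira2/dump/vzdump-qemu-103-2019_05_25-16_46_23.vma.gz 113 --storage local-lvm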

Here is what the task status looks like (this node is on Proxmox VE 5.4-3):
Code:
restore vma archive: zcat /mnt/mira2/dump/vzdump-qemu-103-2019_05_25-16_46_23.vma.gz | vma extract -v -r /var/tmp/vzdumptmp31505.fifo - /var/tmp/vzdumptmp31505
CFG: size: 649 name: qemu-server.conf
DEV: dev_id=1 size: 1127428915200 devname: drive-ide1
DEV: dev_id=2 size: 214748364800 devname: drive-ide3
DEV: dev_id=3 size: 49392123904 devname: drive-scsi0
CTIME: Sat May 25 16:46:26 2019
  Using default stripesize 64.00 KiB.
  Logical volume "vm-113-disk-0" created.
new volume ID is 'local-lvm:vm-113-disk-0'
map 'drive-ide1' to '/dev/pve/vm-113-disk-0' (write zeros = 0)
  Using default stripesize 64.00 KiB.
  Logical volume "vm-113-disk-1" created.
new volume ID is 'local-lvm:vm-113-disk-1'
map 'drive-ide3' to '/dev/pve/vm-113-disk-1' (write zeros = 0)
  Using default stripesize 64.00 KiB.
  Logical volume "vm-113-disk-2" created.
new volume ID is 'local-lvm:vm-113-disk-2'
map 'drive-scsi0' to '/dev/pve/vm-113-disk-2' (write zeros = 0)

progress 1% (read 13915717632 bytes, duration 1003 sec)
..
progress 100% (read 1391569403904 bytes, duration 86490 sec)

total bytes read 1391569403904, sparse bytes 611882180608 (44%)
space reduction due to 4K zero blocks 0.456%
rescan volumes...

And like I said, it's been stuck on this "rescan volumes" part for 4 days.
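
(Side note for anyone in the same spot: a quick, generic way to check whether the restore worker is actually still doing anything is to look at the extract pipeline's process state; <PID> below is whatever ps reports, nothing Proxmox-specific.)
Code:
# is the zcat | vma extract pipeline still running?
ps aux | grep -E '[z]cat|[v]ma extract'
# process state: R = running, S = sleeping, D = uninterruptible I/O wait
grep State /proc/<PID>/status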

lsblk shows this:
Code:
 lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   3.7T  0 disk
├─sda1                         8:1       1007K  0 part
├─sda2                         8:2    0   512M  0 part
└─sda3                         8:3    0   3.7T  0 part
  ├─pve-swap                 253:0    0     8G  0 lvm  [SWAP]
  ├─pve-root                 253:1    0    96G  0 lvm  /
  ├─pve-miraqcow             253:2    0 900.9G  0 lvm  /mnt/mira2
  ├─pve-data_tmeta           253:3    0    88M  0 lvm
  │ └─pve-data-tpool         253:5    0   2.7T  0 lvm
  │   ├─pve-data             253:6    0   2.7T  0 lvm
  │   ├─pve-vm--113--disk--0 253:7    0     1T  0 lvm
  │   ├─pve-vm--113--disk--1 253:8    0   200G  0 lvm
  │   └─pve-vm--113--disk--2 253:9    0    46G  0 lvm
  └─pve-data_tdata           253:4    0   2.7T  0 lvm
    └─pve-data-tpool         253:5    0   2.7T  0 lvm
      ├─pve-data             253:6    0   2.7T  0 lvm
      ├─pve-vm--113--disk--0 253:7    0     1T  0 lvm
      ├─pve-vm--113--disk--1 253:8    0   200G  0 lvm
      └─pve-vm--113--disk--2 253:9    0    46G  0 lvm
sr0                           11:0    1  1024M  0 rom

The config file of the VM in question is this:
Code:
agent: 1
balloon: 4096
bootdisk: scsi0
cores: 6
ide0: none,media=cdrom
ide1: local-lvm:vm-113-disk-0,size=1050G
ide2: none,media=cdrom
ide3: local-lvm:vm-113-disk-1,size=200G
memory: 6144
name: MiRAKermit
net0: virtio=12:A5:1C:99:CD:78,bridge=vmbr0
numa: 0
ostype: win10
scsi0: local-lvm:vm-113-disk-2,discard=on,size=46G
scsihw: virtio-scsi-pci
smbios1: uuid=93449426-4965-4cda-adc0-21120d339bb6
sockets: 1

So the question is: can I interrupt this task, perhaps run a volume rescan manually, and start the VM anyway?
 
For future readers: PVE seems to have gotten stuck.
I did a shutdown -r now to restart the chassis, then ran
qm rescan --vmid 113 (where 113 is the VMID of the VM that got stuck in the restore).
It finished in a couple of seconds. Then I started the VM, and all is fine.
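
As a copy-paste sketch (VMID 113 is from my case; substitute your own):
Code:
# reboot the node to abort the stuck restore task
shutdown -r now
# after the reboot: pick up the restored disks in the VM config
qm rescan --vmid 113
# then start the VM
qm start 113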
 
This happens on my complete cluster, running the latest version of Proxmox. Even a reboot and manually running qm rescan or qm rescan --vmid 1234 does not help. Is there an issue in the community version?

pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.39-3-pve)
 
Hi,
This happens on my complete cluster, running the latest version of Proxmox. Even a reboot and manually running qm rescan or qm rescan --vmid 1234 does not help. Is there an issue in the community version?

pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.39-3-pve)
Can you share the output of pveversion -v and the restore log? Are there any messages in /var/log/syslog? What is the output when you run the rescan command, or does that hang too?
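
(For reference, the requested information can be gathered roughly like this; <VMID> is a placeholder for the affected VM.)
Code:
pveversion -v
# restore/rescan related messages around the time the task hung
grep -iE 'vzdump|qmrestore|rescan|<VMID>' /var/log/syslog
# try the rescan by hand for the affected VM
qm rescan --vmid <VMID>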
 
Can you share the output of pveversion -v and the restore log? Are there any messages in /var/log/syslog? What is the output when you run the rescan command, or does that hang too?


This is happening to me at the moment as well.

Here is my pveversion:

Code:
# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.64-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-5.15: 7.2-13
pve-kernel-helper: 7.2-13
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-4
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-3
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-6
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

In /var/log/syslog I don't find anything that relates to the VM to be restored (I grepped for its ID and cursorily reviewed the entire output).

What is the rescan command you are referring to?

Thanks
 
So I stopped the restore task (as it had indicated that the VM was fully restored before getting stuck at rescanning volumes).

The VM did show up then and I was able to restart it. (No need to restart the entire node.)

That only leaves the question of why the task got stuck.
 
Hi,
So I stopped the restore task (as it had indicated that the VM was fully restored before getting stuck at rescanning volumes).

The VM did show up then and I was able to restart it. (No need to restart the entire node.)
Glad you were able to work around the issue :)

That only leaves the question of why the task got stuck.
What is the output of pvesm status?

What is the rescan command you are referring to?
qm rescan --vmid <ID> with your VM's ID. Does that get stuck if issued manually?
 
What is the output of pvesm status?
Well, now that the issue is gone, it outputs a list of my volumes and all are listed as active (except for one that I disabled). So probably one cannot learn anything from this. But I shall try to keep this command in mind and apply it, if the issue returns...

Interestingly, it also lists my cephpool as active although the cephpool is not working at the moment (which is the whole reason I needed to restore several VMs from the backup, to a disk that is not part of the pool). But this is a topic for another thread that I shall open soon.
 
Well, now that the issue is gone, it outputs a list of my volumes and all are listed as active (except for one that I disabled). So probably one cannot learn anything from this. But I shall try to keep this command in mind and apply it, if the issue returns...

Interestingly, it also lists my cephpool as active although the cephpool is not working at the moment (which is the whole reason I needed to restore several VMs from the backup, to a disk that is not part of the pool). But this is a topic for another thread that I shall open soon.
I'd guess that rescan tried to access the cephpool (it tries to scan all enabled storages) and got stuck?
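
If so, one possible workaround (just a sketch; 'cephpool' is the storage name mentioned above) would be to temporarily disable the unreachable storage, so that restore and rescan skip it:
Code:
# mark the broken storage as disabled so rescan/restore won't touch it
pvesm set cephpool --disable 1
# ...restore the VM, run qm rescan, etc....
# re-enable it once the pool is healthy again
pvesm set cephpool --disable 0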
 
Yes, that's what I thought, too.

Out of curiosity: What is the purpose of rescanning the volumes after restoring?
 
Yes, that's what I thought, too.

Out of curiosity: What is the purpose of rescanning the volumes after restoring?
For VMs, disks that are attached to the VM but not included in the backup will be left over (e.g. if the VM currently has a scsi1 disk, but the backup doesn't contain one). Those are picked up by the rescan, so they show up as unused in the VM config.
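
For illustration, such a left-over disk would then appear in the VM configuration as an unusedN line, e.g. (hypothetical volume name):
Code:
unused0: local-lvm:vm-113-disk-3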
 
