Cloud-init `instance_id` changes on migration to a different hypervisor

Hi,

For some time now, we've been having an issue where Cloud-init re-runs on VMs that have already been operating for months or years. By 'run', I mean that first-boot-only stages run again, such as upgrading packages. This is obviously very problematic in production, as an 'innocent' reboot can trigger all sorts of unexpected actions.

After a lot of testing, we found that this happens after migrating a VM to a different hypervisor. The `instance_id`, which Cloud-init uses to determine whether this is the first boot, suddenly changes.

Proxmox staff previously wrote (https://forum.proxmox.com/threads/rerunning-cloud-init.90932/#post-398172):

The 'instance-id' is basically the hash of the user config and the network config concatenated. cloud-init checks the current instance-id against all previously known ones to see if it has to rerun the different systems and modules, not just the previous instance-id, which leads to strange behavior sometimes.

Is this accurate? As far as I'm aware, migrating a VM from one hypervisor to another changes neither the user config nor the network config. I've attached the output of `cloud-init query -a` from before and after the `instance_id` suddenly changes, i.e. before and after the VM is migrated. You can see that there is no difference except for the ID.
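For anyone who wants to repeat this comparison without eyeballing two large JSON dumps, here is a minimal sketch. It assumes the two `cloud-init query -a` outputs were saved as JSON text; the function names `flatten` and `diff_query_output` and the keys in the usage example are hypothetical, not taken from the attached output.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {dotted.path: leaf_value}."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items


def diff_query_output(before_json, after_json):
    """Return {path: (before_value, after_value)} for every leaf that differs."""
    before = flatten(json.loads(before_json))
    after = flatten(json.loads(after_json))
    return {
        path: (before.get(path), after.get(path))
        for path in sorted(set(before) | set(after))
        if before.get(path) != after.get(path)
    }


# Hypothetical, heavily shortened outputs for illustration only.
before = '{"instance_id": "aaa", "ds": {"meta_data": {"hostname": "vm1"}}}'
after = '{"instance_id": "bbb", "ds": {"meta_data": {"hostname": "vm1"}}}'
print(diff_query_output(before, after))
```

If only ID-related paths show up, the change comes from whatever feeds the ID rather than from the data cloud-init exposes to the guest.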

I can confirm that this behaviour does not occur when rebooting a VM on the same hypervisor.

We're on PVE 8.4.19.
 


Hi @William Edwards,
I am not part of PVE staff, and they will reply directly if they find it necessary. That said, I would check these:
a) `pveversion --verbose` across all hosts
b) consistency of `/etc/resolv.conf` across hosts
c) mount/save/analyze/compare the generated ISO before and after the migration (userdata and networkdata)
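Check (b) can be sketched as follows. This assumes the contents of each host's `/etc/resolv.conf` have already been collected into a dictionary (how they are fetched is left out); the function name `group_hosts_by_content` and the host names are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_hosts_by_content(contents_by_host):
    """Group host names by the SHA-256 digest of their file content.

    Returns a list of host groups, largest group first. More than one
    group means the file is not identical across all hosts.
    """
    groups = defaultdict(list)
    for host, content in sorted(contents_by_host.items()):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        groups[digest].append(host)
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical contents for illustration only.
contents = {
    "pve1": "search example.com\nnameserver 10.0.0.1\n",
    "pve2": "search example.com\nnameserver 10.0.0.1\n",
    "pve3": "search example.com\nnameserver 10.0.0.2\n",
}
print(group_hosts_by_content(contents))
```

This is a byte-for-byte comparison, so it may flag differences (whitespace, comments) that do not end up in the generated network data. Treat a mismatch as a lead to investigate, not as proof.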

Cheers


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi @bbgeek17,

Thanks for your reply.

Your comment on `/etc/resolv.conf` is actually a very good one: I found that five nodes in the cluster have a deviating one. To see whether that's the cause, I took the VM on which I could consistently reproduce the issue with every migration, migrated it to a hypervisor with a deviating `resolv.conf`, and now... the issue doesn't occur anymore at all. It's not getting clearer.

Re the ISO: it doesn't get re-generated on migrate, right?

PVE version is the same on all nodes (with some differences such as minor kernel version).
 
"the issue doesn't occur anymore at all. It's not getting clearer"
It is hard to say what is going on without having hands on the keyboard. That said, the ID is a hash based on userdata and networkdata. By default, the host's `resolv.conf` is part of the network data, and VMs inherit it through the generated ISO.
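To illustrate why a host-level DNS difference would matter, here is a minimal sketch of an ID derived from the two inputs. The thread only states that the ID is a hash of user data and network data; the choice of SHA-1, the plain string concatenation, the function name `sketch_instance_id`, and the sample data are all assumptions made for illustration.

```python
import hashlib


def sketch_instance_id(user_data: str, network_data: str) -> str:
    """Hash the concatenated user data and network data.

    Assumption: SHA-1 over a plain concatenation. The real algorithm
    and input format may differ.
    """
    combined = user_data + network_data
    return hashlib.sha1(combined.encode("utf-8")).hexdigest()


# Hypothetical data: identical user data, nameserver differs per host.
user_data = "#cloud-config\nhostname: vm1\n"
network_on_host_a = "nameserver 10.0.0.1\nsearch example.com\n"
network_on_host_b = "nameserver 10.0.0.2\nsearch example.com\n"

id_on_host_a = sketch_instance_id(user_data, network_on_host_a)
id_on_host_b = sketch_instance_id(user_data, network_on_host_b)
print(id_on_host_a != id_on_host_b)  # prints True
```

Under these assumptions, nothing in the VM's own configuration has to change for the ID to change: a different nameserver inherited from the target host is enough.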

The ISO is indeed regenerated on migration, and if the network and user data files differ, so will the ID. You can mount the ISO, e.g. `mount /dev/sr0 /mnt/temp`, and save the before and after contents for comparison. If the files differ, e.g. due to `resolv.conf`, that would explain your ID change.
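The comparison step could look like the sketch below. It assumes the files from the mounted ISO were copied into two directories, one before and one after the migration; the function names and the file names in the usage note (`user-data`, `network-config`) are assumptions, not confirmed by this thread.

```python
import difflib
from pathlib import Path


def _read_lines(path):
    """Return the file's lines, or an empty list if the file is missing."""
    if not path.is_file():
        return []
    return path.read_text(encoding="utf-8", errors="replace").splitlines()


def diff_saved_iso(before_dir, after_dir):
    """Return {file_name: unified_diff_lines} for every file that differs."""
    before_dir, after_dir = Path(before_dir), Path(after_dir)
    names = sorted(
        {p.name for d in (before_dir, after_dir) for p in d.iterdir() if p.is_file()}
    )
    report = {}
    for name in names:
        old = _read_lines(before_dir / name)
        new = _read_lines(after_dir / name)
        if old != new:
            report[name] = list(
                difflib.unified_diff(
                    old, new, f"before/{name}", f"after/{name}", lineterm=""
                )
            )
    return report
```

For example, `diff_saved_iso("/root/iso-before", "/root/iso-after")` (hypothetical paths) returns an empty dictionary when the contents are identical, and a per-file diff otherwise. A changed nameserver line in the network file would point at `resolv.conf`.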


 
Pretty sure I just found the deviation in `resolv.conf` that causes the hash to change. That was a very useful hint, thanks.

What throws me off is that SSH host keys aren't re-generated every time this happens, even though the `ssh` module runs with `once-per-instance` frequency. I'll think a bit more about that - but I can confirm the instance ID has changed.
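One way to dig into this is to look at which instance IDs the guest has recorded and whether the `ssh` module left a per-instance marker for each. The sketch below assumes cloud-init's usual state layout, i.e. one directory per instance ID under `/var/lib/cloud/instances/` with semaphore files in a `sem/` subdirectory, and a semaphore named `config_ssh`. Both are assumptions that should be verified on the affected VM; the function name is hypothetical.

```python
from pathlib import Path


def instances_with_semaphore(cloud_dir, semaphore="config_ssh"):
    """Map each recorded instance ID to whether the given semaphore exists.

    Assumed layout: <cloud_dir>/instances/<instance-id>/sem/<semaphore>
    """
    instances = Path(cloud_dir) / "instances"
    if not instances.is_dir():
        return {}
    return {
        d.name: (d / "sem" / semaphore).exists()
        for d in sorted(instances.iterdir())
        if d.is_dir()
    }


# On the affected VM (path is an assumption):
# print(instances_with_semaphore("/var/lib/cloud"))
```

If an instance ID from after a migration has no marker, the module did not complete for that ID, which would be consistent with the host keys staying the same.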