Hello
We have one pve node that used to be in a cluster of two, but the second node was removed around a month ago. Our remaining pve node rebooted in the night and when it came back up, it was missing several of our VMs, particularly those on a ZFS pool, 'hdd-pool', which pve no longer seems to acknowledge. For example, we can't create any new VMs on that pool, even though zpool status shows it as ONLINE with no known data errors.
The VMs that weren't created on that particular pool are still listed and are back up and running.
Some log info:
Code:
pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-50
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-25
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-19
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2
Code:
zpool status

  pool: hdd-pool
 state: ONLINE
  scan: scrub repaired 0B in 12h57m with 0 errors on Sun Dec 8 13:21:37 2019
config:

        NAME          STATE     READ WRITE CKSUM
        hdd-pool      ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            sda       ONLINE       0     0     0
            sdb       ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            sdc       ONLINE       0     0     0
            sdd       ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h8m with 0 errors on Sun Dec 8 00:32:57 2019
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdk3    ONLINE       0     0     0
            sdl3    ONLINE       0     0     0

errors: No known data errors

  pool: ssd-pool
 state: ONLINE
  scan: scrub repaired 0B in 0h9m with 0 errors on Sun Dec 8 00:33:14 2019
config:

        NAME                                             STATE     READ WRITE CKSUM
        ssd-pool                                         ONLINE       0     0     0
          mirror-0                                       ONLINE       0     0     0
            ata-Samsung_SSD_860_EVO_1TB_S3Z8NY0M448647R  ONLINE       0     0     0
            ata-Samsung_SSD_860_EVO_1TB_S3Z8NY0M448653Y  ONLINE       0     0     0

errors: No known data errors
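Since the pool itself reports healthy, the next thing we plan to check is whether pve's storage layer still references it at all. A rough sketch of those checks (assuming the storage entry in /etc/pve/storage.cfg is also called 'hdd-pool'; we haven't pasted that file here):
Code:
# does the pve storage layer still list the pool?
pvesm status

# is there still a zfspool entry for it in the storage config? (name assumed)
grep -A 3 "hdd-pool" /etc/pve/storage.cfg

# are the datasets/zvols still visible at the ZFS level?
zfs list -r hdd-pool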
In particular, /etc/pve/qemu-server/ is missing about a dozen configs (there should be ~20):
Code:
ls /etc/pve/qemu-server/
1000.conf 1001.conf 1002.conf 1003.conf 106.conf 112.conf 9000.conf 9001.conf
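It looks like only the config files are gone; the disks themselves still seem to exist. A quick way for us to confirm that at the ZFS level (just a sketch; the actual volume names would come from the listing itself):
Code:
# list the zvols backing VM disks on the affected pool
zfs list -t volume -r hdd-pool

# the matching block devices should also show up here
ls -l /dev/zvol/hdd-pool/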
The last few bits of syslog before the restart (at 03:59:00):
Code:
Dec 27 03:58:31 pve-green corosync[2409]: notice [TOTEM ] A new membership (10.191.0.177:41807416) was formed. Members
Dec 27 03:58:31 pve-green corosync[2409]: warning [CPG ] downlist left_list: 0 received
Dec 27 03:58:31 pve-green corosync[2409]: [TOTEM ] A new membership (10.191.0.177:41807416) was formed. Members
Dec 27 03:58:31 pve-green corosync[2409]: notice [QUORUM] Members[1]: 2
Dec 27 03:58:31 pve-green corosync[2409]: notice [MAIN ] Completed service synchronization, ready to provide service.
Dec 27 03:58:31 pve-green corosync[2409]: [CPG ] downlist left_list: 0 received
Dec 27 03:58:31 pve-green corosync[2409]: [QUORUM] Members[1]: 2
Dec 27 03:58:31 pve-green corosync[2409]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 27 03:58:32 pve-green kernel: [12722782.816174] igb 0000:05:00.2 eno3: igb: eno3 NIC Link is Down
Dec 27 03:58:32 pve-green kernel: [12722782.816302] vmbr0: port 1(eno3) entered disabled state
Dec 27 03:58:32 pve-green kernel: [12722782.816375] vmbr0v4: port 1(eno3.4) entered disabled state
Dec 27 03:58:32 pve-green kernel: [12722782.816480] vmbr0v5: port 1(eno3.5) entered disabled state
Dec 27 03:58:32 pve-green corosync[2409]: notice [TOTEM ] A new membership (10.191.0.177:41807420) was formed. Members
Dec 27 03:58:32 pve-green corosync[2409]: warning [CPG ] downlist left_list: 0 received
Dec 27 03:58:32 pve-green corosync[2409]: [TOTEM ] A new membership (10.191.0.177:41807420) was formed. Members
Dec 27 03:58:32 pve-green corosync[2409]: notice [QUORUM] Members[1]: 2
Dec 27 03:58:32 pve-green corosync[2409]: notice [MAIN ] Completed service synchronization, ready to provide service.
Dec 27 03:58:32 pve-green corosync[2409]: [CPG ] downlist left_list: 0 received
Dec 27 03:58:32 pve-green corosync[2409]: [QUORUM] Members[1]: 2
Dec 27 03:58:32 pve-green corosync[2409]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 27 03:59:00 pve-green systemd[1]: Starting Proxmox VE replication runner...
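We can't tell from this snippet alone what triggered the reboot, so the plan is to dig through the previous boot's journal and the reboot history (standard systemd/wtmp tooling, nothing pve-specific; the journal part only works if persistent journaling is enabled):
Code:
# list the boots the journal knows about
journalctl --list-boots

# tail of the journal from the previous boot (the one that ended around 03:59)
journalctl -b -1 -e

# reboot/shutdown records from wtmp
last -x reboot shutdown | head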
In the web interface, we can still see all the drives for the missing VMs.
Mainly we're trying to figure out how to regain access to 'hdd-pool' so we can recover the data still on it. It would also be nice to know why those VMs disappeared, and why pve rebooted at 03:59 in the morning. At this point the only way we know to recover the VMs is to restore from backups, but if anyone has a faster solution, that would be great to know as well.
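One idea we're toying with, since the disks still seem to exist: hand-write a minimal config pointing at the surviving zvol instead of restoring the whole VM from backup. Very rough sketch only; the VMID 105 and disk name vm-105-disk-0 below are made up for illustration, and the real hardware settings (cores, memory, NIC/MAC, etc.) would have to be filled back in from notes or from the backup:
Code:
# hypothetical example only -- VMID and disk name are placeholders
cat > /etc/pve/qemu-server/105.conf <<'EOF'
name: recovered-vm
ostype: l26
memory: 4096
cores: 2
virtio0: hdd-pool:vm-105-disk-0
bootdisk: virtio0
EOF
After that, qm rescan --vmid 105 should pick up any other leftover disks for that VMID as unused entries. No idea yet whether that's sensible here, so corrections welcome.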
Thanks for any help!!