Hello
We have one pve node that used to be in a cluster of two, but the second node was removed around a month ago. Our remaining pve node rebooted in the night and when it came back up, it was missing several of our VMs, particularly those on a ZFS pool, 'hdd-pool', which pve no longer seems to acknowledge. For example, we can't create any new VMs on that pool, even though zpool status shows it as ONLINE with no known data errors.
The VMs that weren't created on that particular pool are still listed and are back up and running.
Some log info:
Code:
pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-50
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-25
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-19
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2
Code:
zpool status

  pool: hdd-pool
 state: ONLINE
  scan: scrub repaired 0B in 12h57m with 0 errors on Sun Dec 8 13:21:37 2019
config:

        NAME          STATE     READ WRITE CKSUM
        hdd-pool      ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            sda       ONLINE       0     0     0
            sdb       ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            sdc       ONLINE       0     0     0
            sdd       ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h8m with 0 errors on Sun Dec 8 00:32:57 2019
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdk3    ONLINE       0     0     0
            sdl3    ONLINE       0     0     0

errors: No known data errors

  pool: ssd-pool
 state: ONLINE
  scan: scrub repaired 0B in 0h9m with 0 errors on Sun Dec 8 00:33:14 2019
config:

        NAME                                             STATE     READ WRITE CKSUM
        ssd-pool                                         ONLINE       0     0     0
          mirror-0                                       ONLINE       0     0     0
            ata-Samsung_SSD_860_EVO_1TB_S3Z8NY0M448647R  ONLINE       0     0     0
            ata-Samsung_SSD_860_EVO_1TB_S3Z8NY0M448653Y  ONLINE       0     0     0

errors: No known data errors
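Since the pool itself reports healthy, the next thing we plan to check is whether pve's storage layer still references it at all. A rough sketch of those checks (assuming the storage entry in /etc/pve/storage.cfg is also called 'hdd-pool'; we haven't pasted that file here):
Code:
# does the pve storage layer still list the pool?
pvesm status

# is there still a zfspool entry for it in the storage config? (name assumed)
grep -A 3 "hdd-pool" /etc/pve/storage.cfg

# are the datasets/zvols still visible at the ZFS level?
zfs list -r hdd-pool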
In particular, /etc/pve/qemu-server/ is missing about a dozen configs (there should be ~20):
Code:
ls /etc/pve/qemu-server/
1000.conf 1001.conf 1002.conf 1003.conf 106.conf 112.conf 9000.conf 9001.conf
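It looks like only the config files are gone; the disks themselves still seem to exist. A quick way for us to confirm that at the ZFS level (just a sketch; the actual volume names would come from the listing itself):
Code:
# list the zvols backing VM disks on the affected pool
zfs list -t volume -r hdd-pool

# the matching block devices should also show up here
ls -l /dev/zvol/hdd-pool/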
The last few bits of syslog before the restart (at 03:59:00):
Code:
Dec 27 03:58:31 pve-green corosync[2409]: notice [TOTEM ] A new membership (10.191.0.177:41807416) was formed. Members
Dec 27 03:58:31 pve-green corosync[2409]: warning [CPG ] downlist left_list: 0 received
Dec 27 03:58:31 pve-green corosync[2409]: [TOTEM ] A new membership (10.191.0.177:41807416) was formed. Members
Dec 27 03:58:31 pve-green corosync[2409]: notice [QUORUM] Members[1]: 2
Dec 27 03:58:31 pve-green corosync[2409]: notice [MAIN ] Completed service synchronization, ready to provide service.
Dec 27 03:58:31 pve-green corosync[2409]: [CPG ] downlist left_list: 0 received
Dec 27 03:58:31 pve-green corosync[2409]: [QUORUM] Members[1]: 2
Dec 27 03:58:31 pve-green corosync[2409]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 27 03:58:32 pve-green kernel: [12722782.816174] igb 0000:05:00.2 eno3: igb: eno3 NIC Link is Down
Dec 27 03:58:32 pve-green kernel: [12722782.816302] vmbr0: port 1(eno3) entered disabled state
Dec 27 03:58:32 pve-green kernel: [12722782.816375] vmbr0v4: port 1(eno3.4) entered disabled state
Dec 27 03:58:32 pve-green kernel: [12722782.816480] vmbr0v5: port 1(eno3.5) entered disabled state
Dec 27 03:58:32 pve-green corosync[2409]: notice [TOTEM ] A new membership (10.191.0.177:41807420) was formed. Members
Dec 27 03:58:32 pve-green corosync[2409]: warning [CPG ] downlist left_list: 0 received
Dec 27 03:58:32 pve-green corosync[2409]: [TOTEM ] A new membership (10.191.0.177:41807420) was formed. Members
Dec 27 03:58:32 pve-green corosync[2409]: notice [QUORUM] Members[1]: 2
Dec 27 03:58:32 pve-green corosync[2409]: notice [MAIN ] Completed service synchronization, ready to provide service.
Dec 27 03:58:32 pve-green corosync[2409]: [CPG ] downlist left_list: 0 received
Dec 27 03:58:32 pve-green corosync[2409]: [QUORUM] Members[1]: 2
Dec 27 03:58:32 pve-green corosync[2409]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 27 03:59:00 pve-green systemd[1]: Starting Proxmox VE replication runner...
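We can't tell from this snippet alone what triggered the reboot, so the plan is to dig through the previous boot's journal and the reboot history (standard systemd/wtmp tooling, nothing pve-specific; the journal part only works if persistent journaling is enabled):
Code:
# list the boots the journal knows about
journalctl --list-boots

# tail of the journal from the previous boot (the one that ended around 03:59)
journalctl -b -1 -e

# reboot/shutdown records from wtmp
last -x reboot shutdown | head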
In the web interface, we can still see all the drives for the missing VMs.
Mainly we're trying to figure out how to regain access to 'hdd-pool' so we can recover the data still on it. It would also be nice to know why those VMs disappeared, and why pve rebooted at 03:59 in the morning. At this point the only way we know to recover the VMs is to restore from backups, but if anyone has a faster solution, that would be great to know as well.
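One idea we're toying with, since the disks still seem to exist: hand-write a minimal config pointing at the surviving zvol instead of restoring the whole VM from backup. Very rough sketch only; the VMID 105 and disk name vm-105-disk-0 below are made up for illustration, and the real hardware settings (cores, memory, NIC/MAC, etc.) would have to be filled back in from notes or from the backup:
Code:
# hypothetical example only -- VMID and disk name are placeholders
cat > /etc/pve/qemu-server/105.conf <<'EOF'
name: recovered-vm
ostype: l26
memory: 4096
cores: 2
virtio0: hdd-pool:vm-105-disk-0
bootdisk: virtio0
EOF
After that, qm rescan --vmid 105 should pick up any other leftover disks for that VMID as unused entries. No idea yet whether that's sensible here, so corrections welcome.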
Thanks for any help!!