Probable Data Loss - What is happening?

silverstone

[Screenshot: NoVNC console stuck during boot]

I am a bit dumbfounded by 2 Debian VMs on 2 different Proxmox VE Hosts.

The boot process seems stuck as in the screenshot above. No matter how many times I refresh the page, the NoVNC Console always displays this useless page and I cannot do anything.

However ... I could ssh and perform updates, install packages, etc.

Except that, after I rebooted the 2nd Debian VM once EVERYTHING was configured, the NoVNC Console shows the same hostname, old IP address configuration, etc., that the VM had ... when I started working on it.

So it would appear that I lost 3 hours of work and I don't know why.

Any tips would be appreciated.

The installation is Proxmox VE 8.1.1, which I attempted to upgrade to Proxmox VE 8.2.2 (No-Subscription Repository).

EDIT 1: I just checked; I lost everything that I did in the last session on those VMs ... It's like they were NOT writing the changes to disk or something :(.

EDIT 2: Adding "nomodeset" to the "linux" kernel command line in GRUB seems to help. But still, why did I lose 3 hours of work without any warning on the VM??? It's like it automatically rolled back a snapshot or something :(. At least it doesn't appear to be a corruption case ...
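For reference, this is roughly how "nomodeset" can be made persistent inside the Debian guest (a sketch of the usual approach; assumes the stock Debian GRUB setup):

Code:
# One-off test: press 'e' at the GRUB menu and append "nomodeset" to the line starting with "linux".
# To make it persistent inside the guest:
nano /etc/default/grub     # set GRUB_CMDLINE_LINUX_DEFAULT="quiet nomodeset"
update-grub                # regenerate /boot/grub/grub.cfg
reboot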

EDIT 3: I just ran the nightly zpool scrub; everything looks fine. What's even weirder is that the VM seemed to keep the data ACROSS A FEW REBOOTS, but after a 2nd / 3rd reboot everything got reverted back to the original state.
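For reference, the scrub and the follow-up check look roughly like this (pool name rpool, as shown further down in the thread):

Code:
zpool scrub rpool         # start a scrub of the pool
zpool status -v rpool     # check progress and results, including any detected errors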
 
Hi,
please share the VM configuration file (qm config <ID>). Are you maybe using cloud-init or similar, or a cache setting for the disk? Please also check the Task History of the VM and the system logs/journal for anything that could be related. Did you install any third-party scripts?
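A rough sketch of how that information can be gathered on the host (VM ID 105 as used in the rest of the thread; the paths are the usual locations on a Proxmox VE install):

Code:
qm config 105                            # VM configuration (would show a cloud-init drive or a cache= option)
journalctl -b | grep -i ' 105'           # journal entries from the current boot mentioning the VM
grep ':105:' /var/log/pve/tasks/index    # on-disk task history entries for VM 105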
 
Here is the Config:
Code:
root@pve16:~# qm config 105
balloon: 0
boot: order=scsi0;ide2;net0
cores: 2
cpu: host
ide2: none,media=cdrom
memory: 4096
meta: creation-qemu=8.1.5,ctime=1714410786
name: DC2
net0: virtio=BC:24:11:A4:EA:7E,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: l26
scsi0: local-zfs:vm-105-disk-0,backup=1,discard=on,iothread=1,size=32G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=fe3fe77c-b478-4d92-b04a-f618e1308bf0
sockets: 1
vmgenid: d00331e8-9b5d-4c6b-9bab-a550419907d1

I tried changing the following at some point:
- CPU: switched from host to kvm64 then back to host
- Enabled / Disabled NUMA
- Set Cache to Write Through, based on input from one of the Proxmox VE Developers here on this Forum (it might be needed for some machines that refuse to boot); see the sketch below for how that setting is applied
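A sketch of how the writethrough cache mode gets applied to the existing disk (the option string mirrors the config above; treat it as an illustrative example rather than the exact command that was run):

Code:
qm set 105 --scsi0 local-zfs:vm-105-disk-0,backup=1,discard=on,iothread=1,ssd=1,cache=writethrough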

Task History doesn't show anything abnormal. Just the usual stop/start/bring up console/etc.

Logs don't show anything suspicious ...

Installed custom scripts? Not really. I usually use my own scripts to provision the VM (importing it from a .raw file that I generated from a Production VM, using it as a "template"). But I don't use those after the disk has been imported into the VM ...
 
Can you check zpool history | grep vm-105 if any rollbacks happened on the storage level? Do you have a cluster or is this node standalone? I'm asking because, in case of a node failure, automatic recovery can lead to data loss since the last replication, and therefore replication should be done very frequently.
 
It's standalone ...

But I believe Proxmox VE automatically runs a cluster anyway, even if there is a single node.

Code:
root@pve16:~# mount -l | grep pve
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
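To double-check the standalone status, something like this can be run (pvecm ships with Proxmox VE):

Code:
pvecm status           # prints cluster information, or an error if no cluster is configured
ls /etc/pve/nodes/     # a standalone node should only list itself here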

Code:
root@pve16:~# zpool history | grep vm-105
2024-04-29.19:14:00 zfs create -s -V 33554432k rpool/data/vm-105-disk-0

That's probably when I imported it with my importDisk.sh Script:
Code:
#!/bin/bash

# Parameters
basefolder="/tools_nfs/exported-vm"

# List available VM "templates"
ls -l "${basefolder}/"

# Ask for more parameters
read -p "Enter VM ID: " vmid
read -p "Enter VM Image Source: " source
read -p "Enter Source Disk ID: " sourcediskid
read -p "Enter Destination Disk ID: " destinationdiskid
read -p "Enter destination FS: " destinationfs

# Locate the source image file
imagefile=$(ls "${basefolder}/${source}/${source}-disk-${sourcediskid}"*.raw)

# Import the raw image into the VM and attach it as scsi0
qm importdisk "$vmid" "$imagefile" "$destinationfs" --format raw
qm set "$vmid" --scsihw virtio-scsi-pci --scsi0 "${destinationfs}:vm-${vmid}-disk-${destinationdiskid},backup=1,discard=on,iothread=1,ssd=1"
 
@fiona: the only possible cause, though I cannot be 100% sure since I don't remember the exact timing of events, is that I ran a Proxmox VE update/upgrade before/during/after (cannot remember exactly now :() the modifications I was doing inside that VM.

Maybe that, at a reboot of the VM and/or the host (cannot remember exactly now :(), triggered a rollback to a pre-existing ZFS snapshot?

Like: the Proxmox VE `apt update && apt dist-upgrade` "freezes" the current state in place, and that state gets restored after the upgrade is finished (and/or after the Proxmox VE host completes a reboot)?
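One way to check whether such a pre-existing snapshot even exists for that disk (dataset name taken from the zpool history output above):

Code:
zfs list -t snapshot -r rpool/data/vm-105-disk-0    # lists any snapshots of the VM disk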
 
No, there is no such mechanism. Rollbacks are only done when requested and will show up in the task history. Rollbacks done manually on the storage layer would show up in the zpool history (you can still check whether there was a rollback of the whole pool). If the VM was not shut down cleanly and hadn't synced the data to disk yet, that data could get lost, but that seems a bit unlikely since you said it was running for hours.
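A rough way to do the pool-level check mentioned above, and to rule out unsynced guest writes going forward:

Code:
# On the host: look for rollback or destroy operations anywhere on the pool
zpool history rpool | grep -iE 'rollback|destroy'

# Inside the guest, before rebooting: flush pending writes to disk
sync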
 
