Corrupt Filesystem after snapshot

Waiting to see how this turns out... it would be great not to end up with a 30 PB file after a failed snapshot.

We were able to reproduce this every time by putting the VM under heavy load and attempting a live snapshot.
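
In case anyone wants to try the same thing, this is roughly the sequence, as a minimal sketch; the VMID, snapshot name and image path are placeholders, and the load generator is whatever you have at hand (fio, iometer, ...):

# inside the guest: generate sustained disk I/O
# on the host: live snapshot including RAM state
qm snapshot 100 loadtest --vmstate 1
# power the VM off so the image is consistent, then check it
qm shutdown 100
qemu-img check /mnt/pve/<storage>/images/100/vm-100-disk-1.qcow2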
 
The BIG problem persists!!!

Our cluster: 3 Dell servers connected to NetApp NFS storage via 10 GbE,
Proxmox 5.2.9.
Yesterday, after a VM became unbootable (a domain controller, of course), I checked all the qcow2 images and discovered several VMs (6-7 out of 80)
with ERROR cluster ....
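
A sweep like the one below is enough for that check (a minimal sketch; the mount path is just an example, and the results are only reliable for VMs that are powered off):

# check every qcow2 image on the NFS storage, print only the summary lines
for img in /mnt/pve/<nfs-storage>/images/*/*.qcow2; do
    echo "== $img"
    qemu-img check "$img" 2>&1 | tail -n 5
done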

I immediately dist-upgraded all servers, rebooted,
and fixed another broken VM (Windows 2012, virtio-scsi, virtio-win-0.1.141.iso).

With the VM running, I took a snapshot without RAM: no problem.
One minute later I took a second snapshot with RAM, and the disk got corrupted :O

The VM still runs, and booting and cloning with Clonezilla work without problems,
but qemu-img check reports ERRORS.

I hope to hear of a solution.
We have stopped using snapshots, but we are worried about vzdump, which, I suppose, uses the snapshot feature :(
 
Hi Tom, I tried the new kernel but things got worse:
now it corrupts even with a snapshot without RAM.

I only tried it once.

Tomorrow I'll do extensive checks.

A strange thing:

in one of the many tests,
I took a snapshot and qemu-img reported ERROR ....
I did a chkdsk on Windows 2012,
Windows 2012 reported no problem,
and then qemu-img reported all OK :O
 
I cannot see the issue here.

Please describe your setup in full detail (hardware, storage, VM settings).
 
Hi Tom, I did some tests; the problem arises on snapshots with RAM
(previously I made the mistake of checking a live VM with qemu-img).

Tests on updated Proxmox with the kernel you suggested.
Storage: NetApp 9.3, connected via dual 10 GbE Ethernet through an HP switch.

Hosts: 3 Dell servers with 256 GB ECC registered RAM each.

# QEMU guest agent installed in all VMs
# New W2012 created for the test
# New CentOS 7 created for the test
# All qemu-img checks done on powered-off VMs

Powered on with kernel 4.15.18-7

windows 2012

testw201201 no IO
A.1) snapshot no ram
A.2) qemu-img check No errors

A.3) snapshot ram
A.4) qemu-img check No Errors

A.5) snapshot no ram
A.6) qemu-img check No Errors



testw201202 HIGH IO (iometer)
B.1) snapshot no ram
B.2) qemu-img check No Errors

B.3) snapshot with RAM
B.4) qemu-img check ERROR OFLAG_COPIED .... (many errors)
B.5) VM cannot start (does not power on)
B.6) qemu-img check -r all (repair sketch below)
B.7) VM starts
B.8) check disk from the guest: no errors detected
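
Since B.6 is the interesting step, here is the repair sequence as a minimal sketch; the image path is a placeholder, and the VM has to stay powered off for the whole procedure:

# see what is broken first
qemu-img check /path/to/vm-<vmid>-disk-1.qcow2
# let qemu-img repair both leaked clusters and refcount errors
qemu-img check -r all /path/to/vm-<vmid>-disk-1.qcow2
# re-run the plain check to confirm the metadata is clean again
qemu-img check /path/to/vm-<vmid>-disk-1.qcow2

Note that -r all only repairs the qcow2 metadata; it says nothing about the guest filesystem, which is why B.8 still runs a check disk inside the guest.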

testcos701 no IO

C.1) snapshot no ram
C.2) qemu-img check No Errors
C.3) snapshot ram
C.4) qemu-img check No Errors

testcos702 HIGH IO (fio on ext4; example job below)

D.1) snapshot no ram
D.2) qemu-img check No Errors
D.3) snapshot ram
D.4) qemu-img check ERROR

23974 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.

40945 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.
491989/851968 = 57.75% allocated, 4.31% fragmented, 0.00% compressed clusters
Image end offset: 35214917632
D.5) VM starts
D.6) fsck on the filesystem where fio ran: bitmap corrected
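
For completeness, the fio load in D was along these lines (a minimal sketch, not the exact command line; treat every parameter and path as an example):

# random-write load on the ext4 filesystem under test
fio --name=snaptest --directory=/root/fiotest \
    --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --size=4G \
    --numjobs=4 --iodepth=32 \
    --time_based --runtime=600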

VM definitions (uploaded files): W2012 602.txt, CentOS 7 606.txt
Mount definition (uploaded file: mount.txt)

If you need any other information or tests, tell me; I'll keep one server with the new kernel
empty until we fix the problem.

Thanks in advance
 

Attachments

  • mount.txt
  • 602.txt
  • 606.txt
  • interfaces.txt

Hi Tom,
I did the upgrade that you suggested:

wget http://download.proxmox.com/debian/...pve-kernel-4.15.18-7-pve_4.15.18-27_amd64.deb
dpkg -i pve-kernel-4.15.18-7-pve_4.15.18-27_amd64.deb

Now, on the running machine:
dpkg -l | grep kernel
ii pve-kernel-4.15 5.2-8 all Latest Proxmox VE Kernel Image
ii pve-kernel-4.15.17-1-pve 4.15.17-9 amd64 The Proxmox PVE Kernel Image
ii pve-kernel-4.15.18-5-pve 4.15.18-24 amd64 The Proxmox PVE Kernel Image
ii pve-kernel-4.15.18-7-pve 4.15.18-27 amd64 The Proxmox PVE Kernel Image
ii pve-kernel-4.4.67-1-pve 4.4.67-92 amd64 The Proxmox PVE Kernel Image

uname -r

4.15.18-7-pve

TIA
 
Yes, it persists. I did all the tests with the latest Proxmox release and the latest kernel suggested by Tom.
I'm waiting for a suggestion or for another test from Tom.
Until a fix is found I have stopped using snapshots with RAM
and limited other snapshots.
 
Hi Tom, we did another test at another customer:
a cluster with two Supermicro servers and QNAP NFS storage,
tested with a fully updated CentOS 7.

Different Proxmox and kernel versions:

pve-manager/4.4-24/08ba4d2d (running kernel: 4.4.134-1-pve)

same behaviour !!! :O

172.17.1.1:/NFSSamba on /mnt/pve/NFSSamba type nfs (rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.17.1.1,mountvers=3,mountport=30000,mountproto=udp,local_lock=none,addr=172.17.1.1)

Version 4.2.2 Build 20161102


With qemu-img info the qcow2 doesn't appear corrupted:
/usr/bin/qemu-img info /mnt/pve/NFSSamba/images/901/vm-901-disk-2.qcow2

image: /mnt/pve/NFSSamba/images/901/vm-901-disk-2.qcow2
file format: qcow2
virtual size: 15G (16106127360 bytes)
disk size: 3.1G
cluster_size: 65536
Snapshot list:
ID TAG VM SIZE DATE VM CLOCK
1 TESTcorruzione 0 2018-10-18 10:37:33 00:27:08.538
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false

With qemu-img check:
32774 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.


16384 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.
231364/245760 = 94.14% allocated, 11.43% fragmented, 0.00% compressed clusters
Image end offset: 17157914624
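
If I understand the qcow2 format correctly, the corrupt: false from qemu-img info only reflects the corruption flag in the image header, which QEMU sets when it hits broken metadata at runtime, while qemu-img check actually walks the refcount and L2 tables. A quick way to see both on the same image (same path as above):

IMG=/mnt/pve/NFSSamba/images/901/vm-901-disk-2.qcow2
# header flag only (this is where "corrupt: false" comes from)
qemu-img info --output=json "$IMG" | grep '"corrupt"'
# full metadata walk (this is what reported the 32774 errors)
qemu-img check "$IMG"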


Could you try to do the same test?
It could be a big issue.
 
What storage protocols are in use, Paolo? NFS, iSCSI, RBD?

I can confirm that there seem to be three things in play when I experience this failure:

1) The machine is running under load
2) It's on a storage device that is mounted with NFS or RBD
3) It's when we snapshot RAM

I was speaking to another colleague the other day who has also been able to reproduce this, and he said that he *thinks* he hasn't seen this on an iSCSI-mounted storage device.

<D>
 
Hi Dmulk, we use NFS. We couldn't use iSCSI because, I think, with iSCSI you cannot take snapshots; in a cluster, iSCSI is used as a block device with LVM.
 
Hi Tom, I did another test at another client that has a server with SATA disks and this RAID controller: 04:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05).
I created the same VM (CentOS 7 fully updated, QEMU guest agent, virtio-scsi, disk on the local directory /var/lib/vz/images).

Running an I/O task (fio) and snapshotting with RAM damages the disk (according to qemu-img check).
I think Proxmox does something strange when it snapshots a qcow2 disk with RAM.


PVE kernel version:
4.13.13-5-pve

qemu-img check vm-150-disk-1.qcow2
...
ERROR OFLAG_COPIED data cluster: l2_entry=8000000020040000 refcount=2

8192 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
524288/770048 = 68.09% allocated, 3.52% fragmented, 0.00% compressed clusters
Image end offset: 35000025088

VM definition:
agent: 1
balloon: 0
boot: cdn
bootdisk: scsi0
cores: 2
ide2: none,media=cdrom
memory: 2048
name: testcos701
net0: virtio=D6:58:88:23:36:28,bridge=vmbr0
numa: 0
ostype: l26
parent: siram
scsi0: local:150/vm-150-disk-1.qcow2,size=47G
scsihw: virtio-scsi-pci
smbios1: uuid=7f7153d0-ce2b-4561-b4c7-734c2e63db9a
sockets: 1

[siram]
agent: 1
balloon: 0
boot: cdn
bootdisk: scsi0
cores: 2
ide2: none,media=cdrom
machine: pc-i440fx-2.9
memory: 2048
name: testcos701
net0: virtio=D6:58:88:23:36:28,bridge=vmbr0
numa: 0
ostype: l26
scsi0: local:150/vm-150-disk-1.qcow2,size=47G
scsihw: virtio-scsi-pci
smbios1: uuid=7f7153d0-ce2b-4561-b4c7-734c2e63db9a
snaptime: 1539960519
sockets: 1
vmstate: local:150/vm-150-state-siram.raw

I think it's a generalized problem; we have the same problem at 4 customers!!!
And it doesn't seem NFS-related.
I would like to know if others see the same behaviour.
 
It seems I am experiencing this issue too. I had a new Windows Server 2016 VM get corrupted a few weeks back for no apparent reason. Since it was new I just spun up another. However, over this past weekend I had a Windows server qcow file vanish from the NFS storage!?

I've recently started using the Datacenter scheduled backup feature. Prior to that I was just manually making backups and never experienced any issues. I've disabled my automated schedule and will try a few cycles of manual backups to see if the issue persists.
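
For the manual cycles, something along these lines is what I plan to run; the VMID and storage names are placeholders, and the mode should match whatever the scheduled job used:

# manual backup of a single VM
vzdump 100 --mode snapshot --storage <backup-storage> --compress lzo
# afterwards, with the VM shut down, verify the qcow2 is still clean
qemu-img check /mnt/pve/<nfs-storage>/images/100/vm-100-disk-1.qcow2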
 
