Proxmox 4.4 virtio_scsi regression.

tomtom13 · Jan 16, 2017

Just FYI, I can reproduce this without any filesystem, just by having dd write data to a passed-through disk in the VM.
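A minimal sketch of that kind of filesystem-free test, assuming GNU dd and sha256sum are available in the guest. The function name `write_and_verify` and the scratch-file usage are illustrative and not taken from this thread; pointing it at a real passed-through disk overwrites data on that disk.

```shell
# Sketch: write a known random pattern with dd, read it back, compare checksums.
# WARNING: if TARGET is a real disk, the data at its start is overwritten.
write_and_verify() {
    local target="$1" count_mib="${2:-16}"
    local pattern want got
    pattern="$(mktemp)" || return 2
    # Generate the reference data once so both sides hash the same bytes.
    dd if=/dev/urandom of="$pattern" bs=1M count="$count_mib" iflag=fullblock status=none
    # notrunc leaves the target's size alone; fsync forces the data out before dd exits.
    dd if="$pattern" of="$target" bs=1M count="$count_mib" conv=notrunc,fsync status=none
    want="$(sha256sum < "$pattern" | cut -d' ' -f1)"
    # On a real disk, add iflag=direct here (or drop caches first) so the
    # read-back is not simply served from the guest page cache.
    got="$(dd if="$target" bs=1M count="$count_mib" status=none | sha256sum | cut -d' ' -f1)"
    rm -f "$pattern"
    [ "$want" = "$got" ]
}

# Harmless trial run against a scratch file:
#   f="$(mktemp)"; write_and_verify "$f" 4 && echo "match" || echo "MISMATCH"
```

A mismatch with no I/O error reported on the host would match the silent corruption described in this thread, but a single clean pass proves little, since the problem reportedly needs heavy I/O to show up.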

PS: A few days ago I received an email from Proxmox, sent to root at my company domain (which is diverted to my mailbox), letting me know that SMART had discovered a bad sector on a completely unrelated machine. This feature should be better advertised, because it's a really nice bit of automation (and kudos).

e100 · Jan 16, 2017

I changed a VM from virtio to virtio scsi single on Friday. The VM disk is a 4TB Ceph RBD, formatted in the guest as ext4. I made the change so I could run fstrim in the guest.

The kvm process died while vzdump was running on Sunday.
Logs in the guest just abruptly end, and the only log on the host was from vzdump:

Code:

105: Jan 15 06:20:16 INFO: status: 69% (3035027668992/4398583382016), sparse 16% (730775957504), duration 130782, 26/18 MB/s
105: Jan 15 06:45:48 ERROR: VM 105 not running
105: Jan 15 06:45:48 INFO: aborting backup job
105: Jan 15 06:45:48 ERROR: VM 105 not running
105: Jan 15 06:46:05 ERROR: Backup of VM 105 failed - VM 105 not running

Happened on a fully updated enterprise repo Proxmox server.

tomtom13 · Jan 16, 2017

@e100 I think what you have there might be an unrelated issue. The problem people are having here is that virtio_scsi corrupts data between guest and disk without leaving any evidence on the host ...

Also, logs will "just end" if you have a kernel panic within the guest. I think the best way forward would be to set up a serial port between your guests and the host and make it the default console in the guest; if the kernel panics, it will dump its log to the (virtual, of course) serial line, which you can capture on the host. This is how we got reliable crash logging in an inaccessible server rack: the servers were not virtual, but a dedicated machine was connected to the RS-232 port of each server, recording the default console output 24/7. If a kernel crashed, we knew remotely what the culprit was.
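A rough outline of the virtual equivalent on Proxmox, written as a helper that only prints the steps instead of running them. It assumes a Debian-style guest that boots with GRUB; the function name `serial_console_steps` is made up for illustration, and the baud rate is just a common default.

```shell
# Print (do not execute) the steps for giving a guest a serial console.
serial_console_steps() {
    local vmid="$1"
    cat <<EOF
# on the Proxmox host: add a virtual serial port to the VM
qm set ${vmid} -serial0 socket
# in the guest's /etc/default/grub, then run update-grub and reboot
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200"
# on the Proxmox host: attach to the guest's serial console
qm terminal ${vmid}
EOF
}

# Example: serial_console_steps 105
```

Listing `console=ttyS0` last makes it the primary console for kernel messages, so a panic should end up on the serial line where the host can see it.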

e100 · Jan 16, 2017

A kernel panic in the guest leaves the guest running, I'd open the console for that guest in Proxmox and see the panic.

In my case the kvm process itself exited, likely due to a segfault or some other serious error. Might be a different problem but it happened for the first time after changing this guest to use virtio scsi so it may be related.

fabian · Jan 17, 2017

e100 said:
A kernel panic in the guest leaves the guest running, I'd open the console for that guest in Proxmox and see the panic.

In my case the kvm process itself exited, likely due to a segfault or some other serious error. Might be a different problem but it happened for the first time after changing this guest to use virtio scsi so it may be related.

no it's definitely not. please open a new thread/bug - this one here is for pass-through related issues.

fabian · Jan 17, 2017

@users affected by the passthrough issue:

I have a very strong suspicion what causes this, but could you confirm that this happens when passing through sata or sas-with-sata-emulation disks (and if you have such hardware, not when using real scsi disks)?

spirit · Jan 17, 2017

A similar corruption problem was reported on the qemu-devel mailing list today, with LVM + SCSI.

https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg02593.html

Maybe qemu devs could help to investigate ?

rampage · Jan 17, 2017

fabian said:
@users affected by the passthrough issue:

I have a very strong suspicion what causes this, but could you confirm that this happens when passing through sata or sas-with-sata-emulation disks (and if you have such hardware, not when using real scsi disks)?

I'm passing through normal SATA disks

fabian · Jan 17, 2017

rampage said:
I'm passing through normal SATA disks

in that case, I recommend manually changing the "block" to "disk" or "hd" as a workaround as advised earlier in this thread. this will probably also be what we will push into qemu-server tomorrow, either permanently (with a manual override to force scsi-block in case you need to have the full SCSI passthrough functionality for actual SCSI disks) or as a temporary workaround if the qemu devs see this changed behaviour as regression and scsi-block will work again for SATA disks after a qemu bug fix.

tomtom13 · Jan 17, 2017

fabian said:
I have a very strong suspicion what causes this, but could you confirm that this happens when passing through sata or sas-with-sata-emulation disks (and if you have such hardware, not when using real scsi disks)?

I've got setups with:
SATA controller - SATA disk
SAS controller - SATA disk
SAS controller - SAS disk

What will be the net effect for my setups compared to Proxmox 4.3?

fabian · Jan 18, 2017

tomtom13 said:
I've got setups with:
SATA controller - SATA disk
SAS controller - SATA disk
SAS controller - SAS disk

What will be the net effect for my setups compared to Proxmox 4.3?

could you post the output of "sg_inq /dev/XYZ" for each of those combinations after installing "sg3-utils"?

tomtom13 · Jan 19, 2017

@fabian, do you want the output under 4.3? (Also, the SAS controller -> SAS disk setup has had Proxmox removed from it for safety's sake, so it will take some time to get Proxmox back on it.)

fabian · Jan 19, 2017

tomtom13 said:
@fabian, do you want the output under 4.3? (Also, the SAS controller -> SAS disk setup has had Proxmox removed from it for safety's sake, so it will take some time to get Proxmox back on it.)

no, but I would be interested in whether the problem goes away if you do

Code:

echo "madvise" >  /sys/kernel/mm/transparent_hugepage/enabled

before starting the VM.
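To see which mode is active before and after that change, the bracketed word in the same sysfs file can be extracted. A small sketch; the helper name `thp_active_mode` is made up for illustration, and it takes the file content as an argument so it can be tried without touching the host.

```shell
# Print the active transparent-hugepage mode, i.e. the word in brackets.
thp_active_mode() {
    local content="$1"
    # The kernel marks the active choice like this: always [madvise] never
    printf '%s\n' "$content" | sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

# On a host (reading needs no root, writing the file does):
#   thp_active_mode "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"
```

Note that a value written with echo does not survive a reboot.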

tomtom13 · Jan 21, 2017

@fabian, unfortunately I can't do that for you. I've had to manually revert all my Proxmox installations to 4.3, and I'm waiting for the issue to be resolved before moving them back up to 4.4 ... I can't afford any more data loss and time-consuming re-installations.

I thought you guys had managed to reproduce the issue? Or am I wrong?

(Also, I'm off on holiday soon and back in two weeks. After that I may get a spare server to play with, but this issue has consumed much of my time already.)

superbert · Jan 21, 2017

This thread has me concerned about losing data, but I have not (that I know of) had any data loss yet. I've done several reboots (of the guest), btrfs scrub (inside the guest), hammered the disk, restored data, etc., and I appear to be fine, no crashes, nothing interesting in my logs...

FWIW, In my setup, I have a VM running backup software on Debian 8 with a pass-through disk. My "disk" is actually an mdraid device on the host system with the physical disks connected via USB/UASP. Maybe I am ok because my "disk" **is** a block device on the host??

To help narrow down the issue:

Virtual Environment 4.4-5/c43015a5

Code:

agent: 1
boot: cd
bootdisk: scsi0
cores: 2
cpu: host
ide2: none,media=cdrom
memory: 16384
name: {redacted}
net0: virtio={redacted},bridge=vmbr0,firewall=1
net1: virtio={redacted},bridge=vmbr1,firewall=1
numa: 1
onboot: 1
ostype: l26
protection: 1
scsi0: local-lvm:vm-104-disk-1,discard=on,size=100G
scsi1: /dev/disk/by-id/md-uuid-{redacted},backup=0,size=3906836224K
scsihw: virtio-scsi-single
smbios1: uuid=5896dc9c-107d-4cfa-b16c-9ab59a363551
sockets: 2

fabian · Jan 23, 2017

tomtom13 said:
@fabian, unfortunately I can't do that for you. I've had to manually revert all my Proxmox installations to 4.3, and I'm waiting for the issue to be resolved before moving them back up to 4.4 ... I can't afford any more data loss and time-consuming re-installations.

I thought you guys had managed to reproduce the issue? Or am I wrong?

yes - but confirmation from more systems is always a good idea..

the situation is as follows:

since qemu 2.7, scsi-block uses SG_IO to talk to pass through disks
this can cause issues (failing reads and/or writes) if the hypervisor host has very low free memory or very highly fragmented memory (or both)
this was worsened by PVE's kernel defaulting to disabling transparent huge pages (small pages => more fragmentation)

there are two countermeasures we will release this week:

default to scsi-hd (which is not full pass-through) instead of scsi-block for pass-through, with the possibility to "opt-in" to the old behaviour with all the associated risk (until further notice)
enable transparent huge pages for programs explicitly requesting them, such as Qemu (to decrease the risk of running into the issue when using scsi-block)

there is unfortunately no upstream fix in sight - we'll investigate further this week to look for more complete solutions, but the above should minimize the risk for now.

fabian · Jan 23, 2017

superbert said:
This thread has me concerned about losing data, but I have not (that I know of) had any data loss yet. I've done several reboots (of the guest), btrfs scrub (inside the guest), hammered the disk, restored data, etc., and I appear to be fine, no crashes, nothing interesting in my logs...

I could only reproduce it with lots of I/O, and some physical disks triggered it a lot easier than others. you could verify with "qm showcmd VMID" whether the device uses "scsi-block", and not some other "scsi-XX" variant..
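A rough way to do that check is to pipe the `qm showcmd` output through a filter that lists each SCSI device's id and model. This is only a sketch: the function name `scsi_device_models` is hypothetical, and the parsing assumes each device is given as `-device 'scsi-XX,...,id=scsiN'` with the id as the last option.

```shell
# Read a QEMU command line on stdin; print "<id> <model>" per SCSI device.
scsi_device_models() {
    # Pick out each device option string, e.g. scsi-hd,bus=...,id=scsi1
    grep -oE "scsi-(block|hd|cd|generic),[^' ]*" |
        # Reduce it to the trailing id and the leading model name.
        sed -E 's/^(scsi-[a-z]+),.*,id=([^,]+)$/\2 \1/'
}

# On a Proxmox host, with 104 as an example VMID:
#   qm showcmd 104 | scsi_device_models
```

Any line reporting `scsi-block` marks a disk that is exposed to the issue discussed in this thread.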

superbert said:
FWIW, In my setup, I have a VM running backup software on Debian 8 with a pass-through disk. My "disk" is actually an mdraid device on the host system with the physical disks connected via USB/UASP. Maybe I am ok because my "disk" **is** a block device on the host??

mdraid can already cause data corruption on its own (hence it's not supported by PVE). but since it is not a physical device that is passed into the VM in your case, you might be "saved" by the layer of abstraction provided by MD (see the start of the thread, where the issue was also not reproducible with "disks" provided via iSCSI).

superbert · Jan 24, 2017

It looks like my mdraid pass-through is running as scsi-hd, so I am indeed "saved"!

Code:

'file=/dev/disk/by-id/md-uuid-{redacted},if=none,id=drive-scsi1,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi1.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi1,id=scsi1'

fabian said:
mdraid can already cause data corruption on its own (hence it's not supported by PVE)

Can you clarify mdraid's data corruption issue and the resulting stance by Proxmox? Are we talking about bitrot and how md doesn't scrub? Resync errors after power failure?

fabian · Jan 24, 2017

superbert said:
It looks like my mdraid pass-through is running as scsi-hd, so I am indeed "saved"!

Code:

'file=/dev/disk/by-id/md-uuid-{redacted},if=none,id=drive-scsi1,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi1.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi1,id=scsi1'

Can you clarify mdraid's data corruption issue and the resulting stance by Proxmox? Are we talking about bitrot and how md doesn't scrub? Resync errors after power failure?

IIRC, using O_DIRECT to write to an mdraid array potentially leading to a corrupt array was (one of ?) the original reason(s?) for the "MDRAID is not supported" stance.

wbumiller · Jan 24, 2017

If you'd like a technical explanation:

With O_DIRECT the kernel doesn't cache the data to be written. Instead it passes the pointer down to the storage layer. The software raid in turn passes the same pointer down to the handler for each individual disk. Each of these handlers then uses that pointer to read the data from the buffer independently and asynchronously.
This means that if one thread modifies the data while another is still waiting for the write() to finish, each RAID disk, since they perform the copying independently, might write a different state of the buffer.

While this seems like a bad thing to do there are two important things to remember:
a) Any random unprivileged user program in a guest VM can do this, causing the host's RAID to be considered inconsistent/broken.
b) There are cases where you legitimately know before the write() finishes that you won't need the data, so you don't care about writing a consistent state. The most obvious one (and the number 1 reason for the corruption) is swap space: if the kernel starts swapping out memory of, for example, a program which is just about to exit(), the data currently in flight effectively becomes "useless", and the kernel starts recycling that memory block before the write() finishes. This causes the same kind of corruption, since the kernel doesn't know that the single physical hard drive it sees and thinks it's writing to is in fact part of a software RAID on a hypervisor.
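To make the race concrete, here is a toy simulation in which ordinary files stand in for the shared buffer and the two RAID1 member disks. It does not use O_DIRECT or md at all; it only replays the ordering described above (leg 0 copies, the buffer changes, leg 1 copies), and the function name `simulate_mirror_race` is invented for this sketch.

```shell
# Toy model: two mirror legs copy the same buffer at different moments.
simulate_mirror_race() {
    local dir
    dir="$(mktemp -d)" || return 2
    printf 'AAAA' > "$dir/buffer"     # data handed to write()
    cp "$dir/buffer" "$dir/leg0"      # leg 0 copies the buffer first
    printf 'BBBB' > "$dir/buffer"     # another thread modifies it mid-flight
    cp "$dir/buffer" "$dir/leg1"      # leg 1 copies the buffer later
    if cmp -s "$dir/leg0" "$dir/leg1"; then
        echo "consistent"
    else
        echo "mismatch"
    fi
    rm -rf "$dir"
}

simulate_mirror_race   # prints "mismatch"
```

Both copies succeed without any error, yet the two legs now hold different data, which is why the array ends up being considered inconsistent.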

Kernel bug entry: https://bugzilla.kernel.org/show_bug.cgi?id=99171

The problem is: using an intermediate buffer is basically the opposite of what O_DIRECT is meant to do (although the documentation does state it only makes the kernel try to skip any such buffers), and locking out other threads sharing the same memory would be a performance killer.
While my personal stance on this is that the whole point of a RAID1 system is to write the same data to all underlying disks, other people might argue that O_DIRECT is an exception here. The documentation warns that it has to be used with caution, and the open(2) manpage contains a nice quote from Linus:

man 2 open said:
"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances."—Linus
