Proxmox 4.4 virtio_scsi regression.

could you post the output of "sg_inq /dev/XYZ" for each of those combinations after installing "sg3-utils"?
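(For anyone following along, a minimal sketch of the commands being asked for, assuming a Debian/Ubuntu guest; replace sdX with each disk in question:)

Code:
apt-get install sg3-utils   # package that ships sg_inq
sg_inq /dev/sdX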

Here's some output for the SATA controller - SATA disk combination:

Code:
root@omv3-kvm:~# sg_inq /dev/sda
standard INQUIRY:
  PQual=0  Device_type=0  RMB=1  LU_CONG=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
  [RelAdr=0]  WBus16=0  Sync=0  [Linked=0]  [TranDis=0]  CmdQue=0
  [SPI: Clocking=0x0  QAS=0  IUS=0]
    length=96 (0x60)   Peripheral device type: disk
Vendor identification: ATA
Product identification: WDC WD20EARX-00P
Product revision level: AB51
Unit serial number:      WD-WCAZA9734804
root@omv3-kvm:~# sg_inq /dev/sdb
standard INQUIRY:
  PQual=0  Device_type=0  RMB=1  LU_CONG=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
  [RelAdr=0]  WBus16=0  Sync=0  [Linked=0]  [TranDis=0]  CmdQue=0
  [SPI: Clocking=0x0  QAS=0  IUS=0]
    length=96 (0x60)   Peripheral device type: disk
Vendor identification: ATA
Product identification: ST32000644NS
Product revision level: SN11
Unit serial number:             9WM1PNPB
root@omv3-kvm:~# sg_inq /dev/sdc
standard INQUIRY:
  PQual=0  Device_type=0  RMB=1  LU_CONG=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
  [RelAdr=0]  WBus16=0  Sync=0  [Linked=0]  [TranDis=0]  CmdQue=0
  [SPI: Clocking=0x0  QAS=0  IUS=0]
    length=96 (0x60)   Peripheral device type: disk
Vendor identification: ATA
Product identification: WDC WD20EARS-00J
Product revision level: 0A80
Unit serial number:      WD-WCAYY0101692
root@omv3-kvm:~# sg_inq /dev/sdd
standard INQUIRY:
  PQual=0  Device_type=0  RMB=1  LU_CONG=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
  [RelAdr=0]  WBus16=0  Sync=0  [Linked=0]  [TranDis=0]  CmdQue=0
  [SPI: Clocking=0x0  QAS=0  IUS=0]
    length=96 (0x60)   Peripheral device type: disk
Vendor identification: ATA
Product identification: ST3000DM001-9YN1
Product revision level: CC9E
Unit serial number:             Z1F0LGFR

I might be able to get a SAS controller - SATA disk result later.
 
no, but I would be interested in whether the problem goes away if you do

Code:
echo "madvise" >  /sys/kernel/mm/transparent_hugepage/enabled

before starting the VM.
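to double-check which policy is active afterwards, the bracketed value in that file should be the one currently in effect:

Code:
cat /sys/kernel/mm/transparent_hugepage/enabled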
hi,

with this change, it seems to fix my problem; there are no more read/write errors


Edit:

Still having read/write errors.
 
Is there any update on this issue? About to deploy Rockstor on Proxmox.

please read the whole thread. updated packages are available on pvetest and will move to the regular repositories soon.
 
yes - but confirmation from more systems is always a good idea.

the situation is as follows:
  • since qemu 2.7, scsi-block uses SG_IO to talk to passed-through disks
  • this can cause issues (failing reads and/or writes) if the hypervisor host has very low free memory or very highly fragmented memory (or both)
  • this was worsened by PVE's kernel defaulting to disabling transparent huge pages (small pages => more fragmentation)
there are two countermeasures we will release this week:
  • default to scsi-hd (which is not full pass-through) instead of scsi-block for pass-through, with the possibility to "opt in" to the old behaviour with all the associated risk (until further notice)
  • enable transparent huge pages for programs explicitly requesting them, such as Qemu (to decrease the risk of running into the issue when using scsi-block)
there is unfortunately no upstream fix in sight - we'll investigate further this week to look for more complete solutions, but the above should minimize the risk for now.
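as a rough sketch for checking whether a host is in that situation, the buddy allocator statistics give an approximate picture of memory fragmentation (interpretation is only indicative):

Code:
# free pages per allocation order (columns, low to high order) and per zone;
# lots of free pages only in the low orders suggests a fragmented host
cat /proc/buddyinfo
# overall free memory
free -m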

I was not aware of fixes that you guys are making (or those were not communicated properly).
From your statement I can gather that the fix is essentially making sure that I won't use virtio_scsi - but the more inept and performance-hitting virtio_blk (I know that it's still virtio_scsi, but it will mask it behind a "file-like" emulation that is the domain of virtio_blk). So this is not a fix per se - it's making sure that a non-functioning driver is not used, at the expense of performance.

Essentially I will try to get some spare server soon; for the time being I can't stop production servers.

@wbumiller
Don't take this as picking on you, but your theory is in my case slightly defunct (unless you mean _specifically_ mdraid). I get corruption on a system where there is:
- 48 GB of RAM, and only 1 VM created with 20 GB allocated,
- not a single process running on the VM - of course there is always something running, but when demonstrating the error nothing was active, so I could see errors in syslog with nothing writing to / reading from the drive (the drive just sits there with no FS and I get errors),
- no swap space - I despise it - so there is nothing on either the host or the guest.
 
I was not aware of fixes that you guys are making (or those were not communicated properly).
From your statement I can gather that the fix is essentially making sure that I won't use virtio_scsi - but the more inept and performance-hitting virtio_blk (I know that it's still virtio_scsi, but it will mask it behind a "file-like" emulation that is the domain of virtio_blk). So this is not a fix per se - it's making sure that a non-functioning driver is not used, at the expense of performance.

Essentially I will try to get some spare server soon; for the time being I can't stop production servers.

the following two fixes were released:
  • qemu-server >= 4.0-106: only use direct pass-through via scsi-block when explicitly requested, use scsi-hd by default
  • pve-kernel-4.4.35-2-pve >= 4.4.35-79: enable transparent huge pages for programs explicitly requesting it via madvise
the first one means the issue does not occur anymore when running the default configuration, but you lose the full pass-through (which should only be relevant if you really need to issue raw SCSI / ATA commands to the devices). the second one makes the issue less likely to occur when using full pass-through with scsi-block (because it reduces memory fragmentation, which seems to be the root cause).
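as a sketch of what this looks like in a VM config - the 'scsiblock' drive option name is taken from the current qm(1) documentation, and the device path is just a placeholder:

Code:
# /etc/pve/qemu-server/<vmid>.conf
scsihw: virtio-scsi-pci
# default after the update: the passed-through disk is attached as scsi-hd
scsi0: /dev/disk/by-id/ata-EXAMPLE-DISK
# explicit opt-in back to scsi-block full pass-through (with the associated risk):
#scsi0: /dev/disk/by-id/ata-EXAMPLE-DISK,scsiblock=1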

neither of the changes has anything to do with switching from virtio-scsi to virtio-blk. hope this clears the situation up a bit - if you have more questions, feel free to ask!
 
b) There are cases where you legitimately know before the write() finishes that you won't be needing the data, so you don't care about writing a consistent state. The most obvious one (and the number 1 reason for the corruption) is swap space: if the kernel starts swapping out memory of, for example, a program which is just about to exit(), the data currently in flight effectively becomes "useless", and the kernel starts recycling that memory block before the write() finishes. This causes the same kind of corruption, since the guest kernel doesn't know that the single physical hard drive it actually sees and thinks it's writing the data to is in fact part of a software raid on a hypervisor.

Kernel bug entry: https://bugzilla.kernel.org/show_bug.cgi?id=99171
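For reference, a sketch of how such divergence between the RAID1 legs can be made visible on the host (/dev/md0 is a placeholder for the array in question):

Code:
# ask md to compare the mirror legs without repairing anything
echo check > /sys/block/md0/md/sync_action
# a non-zero count means the legs hold different data for some sectors
cat /sys/block/md0/md/mismatch_cnt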

@wbumiller I have followed this [1] to the dot, or so I believe:

Take a virtual machine, give it a disk - put the image on a software raid and tell qemu to disable caching (iow. use O_DIRECT, because the guest already does caching anyway).
Run linux in the VM, add part of the/a disk on the raid as swap, and cause the guest to start swapping a lot.

After writing some TBs to a Gen3 SSD while swapping in and out on a 4-device RAID1 (4 loop devices on that SSD forming a RAID with XFS on the host, storing the QCOW2 image of the guest), on a Debian guest with qemu cache=none, the RAID is all clean.

I was running: stress-ng -m 4 --vm-bytes 40G --vm-hang 4

Would you have a more precisely defined reproducer, i.e. with qemu, not the synthetic one that the kernel BZ started with - something that I could verify on current kernels?

Thank you.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=99171#c5
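For context, roughly the host-side setup described above as a sketch (file paths, sizes and the mount point are placeholders):

Code:
# four backing files on the SSD, attached as loop devices
truncate -s 50G /ssd/leg0.img /ssd/leg1.img /ssd/leg2.img /ssd/leg3.img
for f in /ssd/leg*.img; do losetup -f "$f"; done
# 4-way RAID1 over the loop devices, XFS on top, guest qcow2 image stored there
mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/raidtest
# the guest runs with cache=none, uses part of its disk as swap, and inside it:
#   stress-ng -m 4 --vm-bytes 40G --vm-hang 4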
 
IIRC, using O_DIRECT to write to an mdraid array potentially leading to a corrupt array was (one of ?) the original reason(s?) for the "MDRAID is not supported" stance.

And also, @tomtom13, as long as this is the "original reason" ... maybe the wiki needs updating if it's not an actual issue.
 
For the record, I also failed to reproduce this on the said PVE 4 with a Jessie guest, RAID1 and 32 workers (older stress-ng has limits for one hog) shredding the full swap (32x the size of the available 1 GB RAM), 2 TB written in total. Neither PVE 4 nor Debian (8.0.0) were upgraded.
 
Why not actually try the original reproducer from the linked bugzilla, which was confirmed to still trigger just yesterday, instead of some completely different pattern...?
Note that Wolfgang was just mentioning a more "realistic" scenario where this can happen, but for us the more "realistic" one is not required as a reason to not support mdraid; a VM user being able to trivially wreck the whole RAID on the host by making it go out of sync through an O_DIRECT write is more than enough for us, and just hand-waving that away as "synthetic" won't cut it.
 
Why not actually try the original reproducer from the linked bugzilla, which was confirmed to still trigger just yesterday, instead of some completely different pattern...?
Note that Wolfgang was just mentioning a more "realistic" scenario where this can happen, but for us the more "realistic" one is not required as a reason to not support mdraid; a VM user being able to trivially wreck the whole RAID on the host by making it go out of sync through an O_DIRECT write is more than enough for us, and just hand-waving that away as "synthetic" won't cut it.

Thank you, Thomas. I just wanted to know if there was (ever) a reproducer of the said behaviour (with qemu), not just some (synthetic) reproducer; I now understand that there was not.

EDIT: Also, even that test case was called invalid, but I still have to confirm why. I prefer to post things only when I have some evidence.

To answer your question, I will just quote the very first response in the same bug report:
I'm not convinced this is a meaningful testcase. Any userspace application that modifies a data buffer in one thread while another thread is writing that buffer to disk is certain to not get predictable data back when reading it later. Whether this situation results in a mismatch among raid mirrors is not terribly meaningful.

I happen to agree with that, just as I happen to agree with the following:
It'd be nice for it to be consistent, but giving up the performance of zero-copy operations to avoid what can only be garbage doesn't seem like a great tradeoff to me. And it is long-known behaviour thanks to direct access by the kernel on mirrored swap devices.

I happen to be in the camp of not wanting to give up performance (in certain scenarios). I also - mistakenly, given the admission above - filed a BZ report for you [1] asking for the said behaviour (with qemu) to be documented (as PVE defaults to O_DIRECT).

I do not like how PVE's wiki disparages MDRAID but provides no alternative other than ZFS (as of today); I am NOT asking for support.

For me personally, your rationale alone is circular - it makes no sense for Wolfgang to draw such a conclusion unless you have seen such behaviour with qemu.
Note that Wolfgang was just mentioning a more "realistic" scenario where this can happen, but for us the more "realistic" one is not required as a reason to not support mdraid; a VM user being able to trivially wreck the whole RAID on the host by making it go out of sync through an O_DIRECT write is more than enough for us, and just hand-waving that away as "synthetic" won't cut it.

Also, it does not support the kernel BZ report (as I cannot reproduce it), and it does not support your own wiki regarding MDRAID; other statements on the reasons why you do not support it (again, I am not asking you to provide support, but e.g. to keep the language as it once used to be [1]) are vague at best from what I found.

Just to make it clear, it is alright for you to make your own choices about what you support out of the box, but creating an artificial (the confirmed test case is such), virtual smear campaign is not.

Thank you for your understanding.

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=5235#c10
 
>Why not actually try the original reproducer from the linked bugzilla, which was confirmed to still trigger just yesterday,
>instead of some completely different pattern...?

@t.lamprecht, because Hannes Reinecke (= an experienced kernel hacker) made this bold statement

"Which means that the test case is actually invalid; you either would
need drop O_DIRECT or modify the buffer after write() to arrive with
a valid example."

at

https://lore.kernel.org/all/d5227ac.../T/#m0d19459261a92d8f8f6a20583562c352bf823096

and i for myself failed to find any reference/real world example/report/bugzilla entry on this bug being triggered by a virtual machine.

i don't have a problem with proxmox not supporting mdraid. i understand the decisions.

but i have a problem with O_DIRECT being a "valid" or "supported" feature of mdraid, when it can easily be broken by userspace.

and i have a problem with that, if nobody knows it and that information is only hidden in a dead old bugzilla ticket or in the depths of the proxmox forum.

there should be some big "warning sign".
 
@t.lamprecht, because Hannes Reinecke (= an experienced kernel hacker) made this bold statement

"Which means that the test case is actually invalid; you either would
need drop O_DIRECT or modify the buffer after write() to arrive with
a valid example."

Yeah, and as Wolfgang wrote, one can trigger a modification of the buffer after the write was issued; this can be due to a simple bug or a targeted attack on, e.g., a hosting platform. And even if it's a bug: bugs are, well, quite widespread, so a storage technology should be resilient to those.
In any case, the fact that there is a test case that can make this happen is enough for it to be a problem; the kernel cannot tell the difference whether a user-space behavior is made possible through a made-up test case or a (also made-up) "real" program. I mean, basically all modern (security) attacks are quite gibberish and odd in what they do code-wise; that's not an argument to not fix an issue, though.
The "re-use just swapped-out memory" case is one vector Wolfgang singled out, but there can be others. And sure, O_DIRECT is a PITA interface, and if everything behaved well this would not be a problem, but file systems and storage technology hold user data, and as such are simply held to a higher standard; as mentioned by Wolfgang/Fabian, that's why md-RAID is just not an option for Proxmox VE.
and i for myself failed to find any reference/real world example/report/bugzilla entry on this bug being triggered by a virtual machine.
I'm not sure what you mean here; the virtual machine is just running the guest OS and its user space, and as long as the disk cache is "none" or the disks are directly passed through, any code running there can trigger it. I mean, this thread was opened due to a user running into problems that were very real-world for them.
but i have a problem with O_DIRECT being a "valid" or "supported" feature of mdraid, when it can easily be broken by userspace.

and i have a problem with that, if nobody knows it and that information is only hidden in a dead old bugzilla ticket or in the depths of the proxmox forum.

there should be some big "warning sign".
Yes, I get and understand your sentiment; this is not nice.

FWIW, the upcoming ZFS 2.3 adds O_DIRECT support in a safer manner; besides the fact that ZFS has scrubbing and repairing built in, they also seem to be aware of the issue, as the PR mentions "O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests." and further down:
To ensure data integrity for all data written using O_DIRECT, all user pages are made stable in the event one of the following is required:
  • Checksum
  • Compression
  • Encryption
  • Parity

By making the user pages stable, we make sure the contents of the user provided buffer can not be changed after any of the above operations have taken place.
And sure, the implementation could also have bugs, but as it's designed to actually be safe with the problematic behavior that breaks MD-RAID, it's quite likely that the ZFS developers will rather fix it if anything comes up, and not try to discuss the problem away as being "synthetic".
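As a sketch of how that surfaces for users, assuming a dataset named rpool/data and the 'direct' property with its values as described in the OpenZFS 2.3 docs:

Code:
# how O_DIRECT requests are handled for the dataset (standard | always | disabled)
zfs get direct rpool/data
# 'standard' honours O_DIRECT where alignment and feature constraints allow it
zfs set direct=standard rpool/data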
 
