[SOLVED] VMs losing disk access, file systems getting corrupted

plofkat

Active Member
Mar 20, 2013
Previous system: Proxmox 2.x
Upgraded to: 4.x

Additional info: the VM disk images are on a shared folder mounted with FUSE (MooseFS).
This has been the case for several years, running without issues on Proxmox 2.x.

VMs under heavy load lose disk access.

Linux VM logs:
Jun 7 20:28:33 backup00 kernel: [99516.446504] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:28:33 backup00 kernel: [99516.446535] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:28:33 backup00 kernel: [99516.446538] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:28:33 backup00 kernel: [99516.446540] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:28:33 backup00 kernel: [99516.446543] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:28:33 backup00 kernel: [99516.446545] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:28:33 backup00 kernel: [99516.446552] sd 0:0:1:0: [sdb] DEVICE RESET operation started
Jun 7 20:28:33 backup00 kernel: [99516.446580] sd 0:0:1:0: DEVICE RESET operation complete.
Jun 7 20:28:33 backup00 kernel: [99516.446744] scsi target0:0:1: control msgout: c.
Jun 7 20:28:33 backup00 kernel: [99516.446869] scsi target0:0:1: has been reset
Jun 7 20:28:33 backup00 kernel: [99516.447025] sd 0:0:1:0: [sdb] BUS RESET operation started
Jun 7 20:28:33 backup00 kernel: [99516.449442] sym0: SCSI BUS reset detected.
Jun 7 20:28:33 backup00 kernel: [99516.453395] sym0: SCSI BUS has been reset.
Jun 7 20:28:33 backup00 kernel: [99516.488883] sd 0:0:1:0: BUS RESET operation complete.
Jun 7 20:28:33 backup00 kernel: [99516.684263] sym0: unknown interrupt(s) ignored, ISTAT=0x1 DSTAT=0x80 SIST=0x0
Jun 7 20:45:46 backup00 kernel: [100549.500915] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500931] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500934] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500937] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500939] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500942] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500944] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500946] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500949] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500951] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500954] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500956] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500958] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500961] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500963] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500966] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500968] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500970] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500973] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500975] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500977] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500980] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500982] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500985] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500987] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500990] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500992] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500994] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.500997] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.500999] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.501001] sd 0:0:1:0: [sdb] ABORT operation started
Jun 7 20:45:46 backup00 kernel: [100549.501004] sd 0:0:1:0: ABORT operation failed.
Jun 7 20:45:46 backup00 kernel: [100549.501010] sd 0:0:1:0: [sdb] DEVICE RESET operation started
Jun 7 20:45:46 backup00 kernel: [100549.501038] sd 0:0:1:0: DEVICE RESET operation complete.
Jun 7 20:45:46 backup00 kernel: [100549.501206] scsi target0:0:1: control msgout: c.
Jun 7 20:45:46 backup00 kernel: [100549.501333] scsi target0:0:1: has been reset
Jun 7 20:45:46 backup00 kernel: [100549.501485] sd 0:0:1:0: [sdb] BUS RESET operation started
Jun 7 20:45:46 backup00 kernel: [100549.518340] sym0: SCSI BUS reset detected.
Jun 7 20:45:46 backup00 kernel: [100549.522294] sym0: SCSI BUS has been reset.
Jun 7 20:45:46 backup00 kernel: [100549.556819] sd 0:0:1:0: BUS RESET operation complete.
Jun 7 20:45:46 backup00 kernel: [100549.684681] sym0: unknown interrupt(s) ignored, ISTAT=0x1 DSTAT=0x80 SIST=0x0
Jun 7 20:58:12 backup00 kernel: [101295.709863] sd 0:0:0:0: [sda] ABORT operation started
Jun 7 20:58:12 backup00 kernel: [101295.709869] sd 0:0:0:0: ABORT operation failed.
Jun 7 20:58:12 backup00 kernel: [101295.709875] sd 0:0:0:0: [sda] DEVICE RESET operation started
Jun 7 20:58:12 backup00 kernel: [101295.709901] sd 0:0:0:0: DEVICE RESET operation complete.
Jun 7 20:58:12 backup00 kernel: [101295.710076] scsi target0:0:0: control msgout: c.
Jun 7 20:58:12 backup00 kernel: [101295.710203] scsi target0:0:0: has been reset
Jun 7 20:58:12 backup00 kernel: [101295.710358] sd 0:0:0:0: [sda] BUS RESET operation started
Jun 7 20:58:12 backup00 kernel: [101295.712784] sym0: SCSI BUS reset detected.
Jun 7 20:58:12 backup00 kernel: [101295.716737] sym0: SCSI BUS has been reset.
Jun 7 20:58:12 backup00 kernel: [101295.752244] sd 0:0:0:0: BUS RESET operation complete.
Jun 7 20:58:12 backup00 kernel: [101295.999263] sym0: unknown interrupt(s) ignored, ISTAT=0x1 DSTAT=0x80 SIST=0x0
Jun 7 20:58:53 backup00 kernel: [101336.653170] sd 0:0:0:0: [sda] ABORT operation started
Jun 7 20:58:53 backup00 kernel: [101336.653176] sd 0:0:0:0: ABORT operation failed.
Jun 7 20:58:53 backup00 kernel: [101336.653179] sd 0:0:0:0: [sda] ABORT operation started
Jun 7 20:58:53 backup00 kernel: [101336.653215] sd 0:0:0:0: ABORT operation failed.
Jun 7 20:58:53 backup00 kernel: [101336.653222] sd 0:0:0:0: [sda] DEVICE RESET operation started
Jun 7 20:58:53 backup00 kernel: [101336.653250] sd 0:0:0:0: DEVICE RESET operation complete.
Jun 7 20:58:53 backup00 kernel: [101336.653412] scsi target0:0:0: control msgout: c.
Jun 7 20:58:53 backup00 kernel: [101336.653540] scsi target0:0:0: has been reset
Jun 7 20:58:53 backup00 kernel: [101336.653696] sd 0:0:0:0: [sda] BUS RESET operation started
Jun 7 20:58:53 backup00 kernel: [101336.656098] sym0: SCSI BUS reset detected.
Jun 7 20:58:53 backup00 kernel: [101336.660051] sym0: SCSI BUS has been reset.
Jun 7 20:58:53 backup00 kernel: [101336.695442] sd 0:0:0:0: BUS RESET operation complete.
Jun 7 20:58:53 backup00 kernel: [101336.785158] sym0: unknown interrupt(s) ignored, ISTAT=0x1 DSTAT=0x80 SIST=0x0
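
For anyone trying to see which disks are affected and how often, a log like the one above can be summarized with a few lines of Python. This is only a sketch: the function name `summarize_scsi_events` is made up for illustration, and the pattern is based solely on the message format shown in this thread, so it may need adjusting for other drivers or kernel versions.

```python
import re
from collections import Counter

# Matches the SCSI error-handler lines seen in this thread, e.g.
#   "sd 0:0:1:0: [sdb] ABORT operation started"
# Assumption: other drivers may word these messages differently.
EVENT_RE = re.compile(
    r"sd (?P<addr>\d+:\d+:\d+:\d+): \[(?P<dev>\w+)\] "
    r"(?P<op>ABORT|DEVICE RESET|BUS RESET) operation started"
)


def summarize_scsi_events(lines):
    """Count ABORT / DEVICE RESET / BUS RESET attempts per block device."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("op"))] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the log above.
    sample = [
        "Jun 7 20:28:33 backup00 kernel: [99516.446504] sd 0:0:1:0: [sdb] ABORT operation started",
        "Jun 7 20:28:33 backup00 kernel: [99516.446552] sd 0:0:1:0: [sdb] DEVICE RESET operation started",
        "Jun 7 20:58:12 backup00 kernel: [101295.709863] sd 0:0:0:0: [sda] ABORT operation started",
    ]
    for (dev, op), n in sorted(summarize_scsi_events(sample).items()):
        print(f"{dev}: {op} x{n}")
```

Because it matches on the message text rather than the timestamp prefix, it should work on both syslog-style lines and the bracketed human-readable timestamp format.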
 
Update:

Switching the VM disk images to writethrough cache seems to improve the situation. However, fsyncs to the FUSE file system are noticeably slower on Proxmox 4 than they were on 2.x.
 
Hi,

PVE 4 uses ext4, which enables write barriers by default. That is why you get fewer fsyncs per second than on PVE 2.
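
To put a number on the fsync difference, a rough measurement on the storage mount can help. The following is a minimal Python sketch, not a tool anyone in this thread used. The function name `fsyncs_per_second` and its parameters are assumptions for illustration. It is similar in spirit to the FSYNCS/SECOND figure that `pveperf` reports, but the values are not directly comparable.

```python
import os
import tempfile
import time


def fsyncs_per_second(directory, duration=1.0, block=b"x" * 4096):
    """Append a small block and fsync repeatedly; return fsyncs per second.

    `directory` should be on the file system under test (for example the
    FUSE mount that holds the VM images).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    count = 0
    try:
        start = time.monotonic()
        deadline = start + duration
        while time.monotonic() < deadline:
            os.write(fd, block)
            os.fsync(fd)  # force the data to stable storage
            count += 1
        elapsed = time.monotonic() - start
    finally:
        # Always clean up the temporary test file.
        os.close(fd)
        os.unlink(path)
    return count / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    # Assumption: the system temp directory stands in for the real mount point.
    rate = fsyncs_per_second(tempfile.gettempdir())
    print(f"{rate:.0f} fsyncs/second")
```

Running it once on the old and once on the new host, against the same mount, gives a like-for-like comparison. Run it outside backup windows, since other I/O will skew the result.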
 
I have narrowed down the issue some more.

It seems the issue starts when the hosts run backups of the virtual machines.
I start getting I/O errors on some of my virtual machines, and the file systems get corrupted on some of them (those with larger disk images, ranging from 250 to 500 GB).

This is NOT an ideal situation, going from version 2.x to 4.x should be an improvement, not make things worse.

To be clear, this exact setup has been working properly for YEARS without any data loss, and now after upgrading to 4.x I start losing data.
 
This is the type of shit I have to deal with on all my VMs:


[Tue Jun 13 07:22:50 2017] sym0: unknown interrupt(s) ignored, ISTAT=0x1 DSTAT=0x80 SIST=0x0
[Tue Jun 13 07:23:31 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:23:31 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:23:31 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:23:31 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:23:31 2017] sd 0:0:1:0: [sdb] DEVICE RESET operation started
[Tue Jun 13 07:23:31 2017] sd 0:0:1:0: DEVICE RESET operation complete.
[Tue Jun 13 07:23:31 2017] scsi target0:0:1: control msgout: c.
[Tue Jun 13 07:23:31 2017] scsi target0:0:1: has been reset
[Tue Jun 13 07:23:31 2017] sd 0:0:1:0: [sdb] BUS RESET operation started
[Tue Jun 13 07:23:31 2017] sym0: SCSI BUS reset detected.
[Tue Jun 13 07:23:31 2017] sym0: SCSI BUS has been reset.
[Tue Jun 13 07:23:31 2017] sd 0:0:1:0: BUS RESET operation complete.
[Tue Jun 13 07:23:31 2017] sym0: unknown interrupt(s) ignored, ISTAT=0x1 DSTAT=0x80 SIST=0x0
[Tue Jun 13 07:25:53 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:25:53 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:25:53 2017] sd 0:0:1:0: [sdb] DEVICE RESET operation started
[Tue Jun 13 07:25:53 2017] sd 0:0:1:0: DEVICE RESET operation complete.
[Tue Jun 13 07:25:53 2017] scsi target0:0:1: control msgout: c.
[Tue Jun 13 07:25:53 2017] scsi target0:0:1: has been reset
[Tue Jun 13 07:25:53 2017] sd 0:0:1:0: [sdb] BUS RESET operation started
[Tue Jun 13 07:25:53 2017] sym0: SCSI BUS reset detected.
[Tue Jun 13 07:25:53 2017] sym0: SCSI BUS has been reset.
[Tue Jun 13 07:25:53 2017] sd 0:0:1:0: BUS RESET operation complete.
[Tue Jun 13 07:25:53 2017] sym0: unknown interrupt(s) ignored, ISTAT=0x1 DSTAT=0x80 SIST=0x0
[Tue Jun 13 07:26:34 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:26:34 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:26:34 2017] sd 0:0:1:0: [sdb] DEVICE RESET operation started
[Tue Jun 13 07:26:34 2017] sd 0:0:1:0: DEVICE RESET operation complete.
[Tue Jun 13 07:26:34 2017] scsi target0:0:1: control msgout: c.
[Tue Jun 13 07:26:34 2017] scsi target0:0:1: has been reset
[Tue Jun 13 07:26:34 2017] sd 0:0:1:0: [sdb] BUS RESET operation started
[Tue Jun 13 07:26:34 2017] sd 0:0:1:0: BUS RESET operation complete.
[Tue Jun 13 07:26:34 2017] sym0: SCSI BUS reset detected.
[Tue Jun 13 07:26:34 2017] sym0: SCSI BUS has been reset.
[Tue Jun 13 07:26:34 2017] sym0: unknown interrupt(s) ignored, ISTAT=0x1 DSTAT=0x80 SIST=0x0
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] DEVICE RESET operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: DEVICE RESET operation complete.
[Tue Jun 13 07:28:24 2017] scsi target0:0:1: control msgout: c.
[Tue Jun 13 07:28:24 2017] scsi target0:0:1: has been reset
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: [sdb] BUS RESET operation started
[Tue Jun 13 07:28:24 2017] sd 0:0:1:0: BUS RESET operation complete.
[Tue Jun 13 07:28:24 2017] sym0: SCSI BUS reset detected.
[Tue Jun 13 07:28:24 2017] sym0: SCSI BUS has been reset.
[Tue Jun 13 07:28:24 2017] sym0: unknown interrupt(s) ignored, ISTAT=0x1 DSTAT=0x80 SIST=0x0
[Tue Jun 13 07:30:01 2017] sd 0:0:1:0: [sdb] ABORT operation started
[Tue Jun 13 07:30:01 2017] sd 0:0:1:0: ABORT operation failed.
[Tue Jun 13 07:30:01 2017] sd 0:0:1:0: [sdb] DEVICE RESET operation started
[Tue Jun 13 07:30:01 2017] sd 0:0:1:0: DEVICE RESET operation complete.
[Tue Jun 13 07:30:01 2017] scsi target0:0:1: control msgout: c.
[Tue Jun 13 07:30:01 2017] scsi target0:0:1: has been reset
[Tue Jun 13 07:30:01 2017] sd 0:0:1:0: [sdb] BUS RESET operation started
[Tue Jun 13 07:30:01 2017] sym0: SCSI BUS reset detected.
[Tue Jun 13 07:30:01 2017] sym0: SCSI BUS has been reset.
[Tue Jun 13 07:30:01 2017] sd 0:0:1:0: BUS RESET operation complete.
[Tue Jun 13 07:30:01 2017] sym0: unknown interrupt(s) ignored, ISTAT=0x1 DSTAT=0x80 SIST=0x0
 
In case anyone is interested: this was due to a networking issue with corosync.
 
Well, I was interested :) Was it due to corosync sharing a network interface with your storage layer? corosync seems to be very sensitive to lag under high-bandwidth load.
 
Basically, corosync did not like our WAN link at all.
After removing the remote server from the cluster, I still have the storage on the same network and there are no issues.
It seems that corosync died every time the latency on the WAN link exceeded about 15 ms.

For now it runs very well alongside the storage layer, but I am still going to split it off during our next maintenance cycle, just to be safe.

I miss the ability to migrate servers between sites on the fly, but hey, what can you do, it seems progress always has a price.
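
For anyone who wants to check their own inter-site link against the roughly 15 ms figure mentioned above, here is a small Python sketch that scans the output of Linux `ping` for round-trip times above a threshold. The function name `latency_exceedances` and the sample addresses are hypothetical. The 15 ms default is only the value observed in this thread, not a documented corosync limit.

```python
import re

# Linux ping prints lines such as:
#   "64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=3.21 ms"
TIME_RE = re.compile(r"time=(?P<ms>\d+(?:\.\d+)?) ms")


def latency_exceedances(ping_output, threshold_ms=15.0):
    """Return (samples, over): all RTTs found and those above threshold_ms."""
    samples = [float(m.group("ms")) for m in TIME_RE.finditer(ping_output)]
    over = [ms for ms in samples if ms > threshold_ms]
    return samples, over


if __name__ == "__main__":
    # Hypothetical sample output; replace with real output captured
    # from a ping between two cluster nodes.
    sample_output = (
        "64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=3.21 ms\n"
        "64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=18.4 ms\n"
        "64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=7.90 ms\n"
    )
    samples, over = latency_exceedances(sample_output)
    print(f"{len(over)} of {len(samples)} samples above threshold: {over}")
```

ICMP round-trip time is only a proxy for what corosync experiences, so treat the result as a hint rather than proof.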
 
