[URGENT] High Disk Write Spike Causes Whole System to Crash

MasterGberry

Hi. My Proxmox hypervisor reports a huge spike in disk writes, up to 200 PB a second (which obviously isn't possible), and then the CPU on my SAN shoots up to 100% and the SAN stays unresponsive until the entire system is rebooted. I need to know who to point fingers at. Is this some glitch in Proxmox that causes my SAN to freak out, or vice versa? If the SAN were just crashing by itself, surely the Proxmox graphs wouldn't show this huge spike in IO? I have a screenshot attached.

5kUCiZ.png

Here is what I have from the /var/log/messages file:

Code:
Aug 14 05:11:54 proxmox1 kernel: lost page write due to I/O error on dm-6 (this one is higher up in the log)
Aug 14 05:12:02 proxmox1 kernel: __ratelimit: 991 callbacks suppressed
Aug 14 05:12:02 proxmox1 kernel: lost page write due to I/O error on dm-3 (a lot more of these higher up in the log)
Aug 14 09:39:52 proxmox1 kernel: lost page write due to I/O error on dm-3 (some of these)
Aug 14 10:47:14 proxmox1 kernel: __ratelimit: 1040 callbacks suppressed ( A lot of these)
Aug 14 11:11:55 proxmox1 kernel: vmbr0: port 3(tap103i0) entering disabled state
Aug 14 11:11:55 proxmox1 kernel: vmbr0: port 3(tap103i0) entering disabled state
Aug 14 11:12:14 proxmox1 kernel: vmbr0: port 4(tap106i0) entering disabled state
Aug 14 11:12:14 proxmox1 kernel: vmbr0: port 4(tap106i0) entering disabled state
Aug 14 11:12:17 proxmox1 kernel: device tap103i0 entered promiscuous mode
Aug 14 11:12:17 proxmox1 kernel: HTB: quantum of class 10001 is big. Consider r2q change.
Aug 14 11:12:17 proxmox1 kernel: vmbr0: port 3(tap103i0) entering forwarding state
Aug 14 11:12:54 proxmox1 kernel: device tap106i0 entered promiscuous mode
Aug 14 11:12:54 proxmox1 kernel: vmbr0: port 4(tap106i0) entering forwarding state

Where should I be looking to make sure this does not happen again? I am happy to provide as much server info as I can and appreciate any help in solving this urgent matter.

EDIT: The SAN messages log shows nothing unusual. I believe there is some proxmox issue that is spiking the system out of control randomly (this has happened a few times over the course of a few months).

EDIT2: I was trying to run an automated backup yesterday for the first time. It was proceeding fine at first, and then it crashed, maybe during one of the later VMs? Or it crashed when it finished? All the backups appear to be intact...
 

Do you run snapshot backup for a container?
 
There is an issue with the current lvm2, but it looks fixed in our latest lvm2 packages.

Upgrade using pvetest and test again. If you don't want to switch to pvetest, you need to wait until 3.1 final.
see http://forum.proxmox.com/threads/15433-Proxmox-VE-3-1-beta-(pvetest)

How long has this been an issue? As previously mentioned, I have had this exact same crash 1-2 other times (I don't think I was backing anything up those times). But those crashes were about 2 months ago, I think? I might not have been on 3.0 yet. I just want to make sure this doesn't occur too often. Thanks.
 
I guess another interesting question would be this. I also have my SAN on a Proxmox node with a single VM, the SAN. Would it be possible to write a bash script that does ping checks (which I already use) and, if the ping checks fail for, say, 10-15 minutes, force restarts the VM so the SAN is brought back online? Then I would have to restart all the VMs on the actual hypervisors too, though... hmm, this might be difficult to coordinate.
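For what it's worth, a rough sketch of that watchdog idea could look like the following. It is written in Python rather than bash, and everything specific in it is an assumption: the SAN address `192.0.2.10`, the VMID `100`, the 10-minute grace period, and the choice of `qm stop` followed by `qm start` as the "force restart". It only covers the SAN guest; restarting the VMs on the other hypervisors afterwards is not handled.

```python
import subprocess
import time

SAN_IP = "192.0.2.10"    # hypothetical SAN address, replace with your own
SAN_VMID = "100"         # hypothetical VMID of the SAN guest
GRACE_SECONDS = 10 * 60  # how long pings may fail before acting
CHECK_INTERVAL = 30      # seconds between ping checks


def ping_ok(host):
    """Return True if a single ICMP ping gets a reply within 5 seconds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "5", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_restart(first_failure, now, grace=GRACE_SECONDS):
    """True once pings have failed continuously for at least `grace` seconds."""
    return first_failure is not None and now - first_failure >= grace


def restart_vm(vmid):
    """Hard-stop the guest, then start it again, using the qm CLI."""
    subprocess.run(["qm", "stop", vmid], check=False)
    subprocess.run(["qm", "start", vmid], check=True)


def watch():
    first_failure = None  # monotonic time of the first failed ping in a row
    while True:
        now = time.monotonic()
        if ping_ok(SAN_IP):
            first_failure = None
        else:
            if first_failure is None:
                first_failure = now
            if should_restart(first_failure, now):
                restart_vm(SAN_VMID)
                # Give the guest a fresh grace period to boot.
                first_failure = None
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    watch()
```

A single successful ping resets the timer, so only a continuous outage triggers the restart. It would have to run on the node that hosts the SAN VM, since `qm` acts on local guests.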
 
This just happened again now, and I was not doing any backup tasks. I need to get this resolved. Why is Proxmox misfiring and overloading the SAN? Or is the SAN just crashing, and Proxmox is showing weird things in the disk I/O graphs? I am willing to provide as much info as needed to get this resolved ASAP. Thanks.

Untitled.png
Untitled2.png

Code:
Aug 17 08:32:38 proxmox1 kernel: connection1:0: detected conn error (1011)
Aug 17 08:34:38 proxmox1 kernel: session1: session recovery timed out after 120 secs
Aug 17 08:34:38 proxmox1 kernel: __ratelimit: 646 callbacks suppressed
Aug 17 08:34:38 proxmox1 kernel: lost page write due to I/O error on dm-4
Aug 17 08:34:38 proxmox1 kernel: lost page write due to I/O error on dm-4
Aug 17 08:34:38 proxmox1 kernel: lost page write due to I/O error on dm-4
Aug 17 08:34:38 proxmox1 kernel: lost page write due to I/O error on dm-4
Aug 17 08:34:38 proxmox1 kernel: lost page write due to I/O error on dm-4
Aug 17 08:34:38 proxmox1 kernel: lost page write due to I/O error on dm-4

EDIT: I found the log segment above in /var/log/messages. Looks like the SAN is going to 100% CPU usage and then the hypervisor can't communicate with it?
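In the excerpt above, the iSCSI connection error (08:32:38) comes first, the session recovery timeout follows 120 seconds later (08:34:38), and only then do the lost page writes start. To check whether the earlier crashes follow the same order, something like this small Python sketch could summarize /var/log/messages. The patterns are taken only from the kernel lines quoted in this thread; the event names and the `summarize` helper are made up for illustration.

```python
import re
from collections import Counter

# Patterns copied from the kernel messages quoted in this thread.
PATTERNS = {
    "iscsi_conn_error": re.compile(r"detected conn error \((\d+)\)"),
    "iscsi_recovery_timeout": re.compile(r"session recovery timed out"),
    "lost_page_write": re.compile(r"lost page write due to I/O error on (\S+)"),
}

# Classic syslog layout: "Aug 17 08:32:38 proxmox1 kernel: <message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) kernel: (.*)$")


def summarize(lines):
    """Return (first_seen, counts): first timestamp and hit count per event."""
    first_seen = {}
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp, _host, message = match.groups()
        for name, pattern in PATTERNS.items():
            if pattern.search(message):
                counts[name] += 1
                first_seen.setdefault(name, stamp)
    return first_seen, counts


if __name__ == "__main__":
    with open("/var/log/messages", errors="replace") as fh:
        first_seen, counts = summarize(fh)
    for name in PATTERNS:
        print(f"{name:24} first={first_seen.get(name, '-')}  count={counts[name]}")
```

If the connection error always shows up before the I/O errors, that would at least be consistent with the SAN dropping off first and the write errors being a consequence, though it does not say why the SAN stopped answering.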
 

Bump. I ran a backup yesterday around 10-11 PST and it all seemed fine. The SAN crashed again later, around 3 AM. Could this have been a result of my backup? If so, do you suggest I upgrade to 3.1 now that it has been released? Are there any other logs besides /var/log/messages that I can look in to try to find what is going wrong?

Thanks.
 
