[URGENT] High Disk Write Spike Causes Whole System to Crash

Discussion in 'Proxmox VE: Installation and configuration' started by MasterGberry, Aug 14, 2013.

  1. MasterGberry

    MasterGberry New Member

    Joined:
    Jul 9, 2013
    Messages:
    15
    Likes Received:
    0
    Hi. My Proxmox hypervisor reports a huge spike in disk writes, up to 200 PB per second (which obviously isn't possible), and then the CPU on my SAN shoots up to 100% and the SAN stays unresponsive until the entire system is rebooted. I need to know where to point fingers. Is this some glitch in Proxmox that causes my SAN to freak out, or vice versa? If the SAN were just crashing by itself, wouldn't the Proxmox graphs show no such I/O spike? I have attached a screenshot.

    5kUCiZ.png

    Here is what I have from the /var/log/messages file:

    Code:
    Aug 14 05:11:54 proxmox1 kernel: lost page write due to I/O error on dm-6 (this one is higher up in the log)
    Aug 14 05:12:02 proxmox1 kernel: __ratelimit: 991 callbacks suppressed
    Aug 14 05:12:02 proxmox1 kernel: lost page write due to I/O error on dm-3 (a lot more of these higher up in the log)
    Aug 14 09:39:52 proxmox1 kernel: lost page write due to I/O error on dm-3 (some of these)
    Aug 14 10:47:14 proxmox1 kernel: __ratelimit: 1040 callbacks suppressed ( A lot of these)
    Aug 14 11:11:55 proxmox1 kernel: vmbr0: port 3(tap103i0) entering disabled state
    Aug 14 11:11:55 proxmox1 kernel: vmbr0: port 3(tap103i0) entering disabled state
    Aug 14 11:12:14 proxmox1 kernel: vmbr0: port 4(tap106i0) entering disabled state
    Aug 14 11:12:14 proxmox1 kernel: vmbr0: port 4(tap106i0) entering disabled state
    Aug 14 11:12:17 proxmox1 kernel: device tap103i0 entered promiscuous mode
    Aug 14 11:12:17 proxmox1 kernel: HTB: quantum of class 10001 is big. Consider r2q change.
    Aug 14 11:12:17 proxmox1 kernel: vmbr0: port 3(tap103i0) entering forwarding state
    Aug 14 11:12:54 proxmox1 kernel: device tap106i0 entered promiscuous mode
    Aug 14 11:12:54 proxmox1 kernel: vmbr0: port 4(tap106i0) entering forwarding state
    Where should I be looking to make sure this does not happen again? I am happy to provide as much server info as I can and appreciate any help in solving this urgent matter.

    EDIT: The SAN's messages log shows nothing unusual. I believe some Proxmox issue is randomly spiking the system out of control (this has happened a few times over the course of a few months).

    EDIT2: I was trying to run an automated backup yesterday for the first time. It was proceeding fine at first, and then maybe it crashed towards one of the later VMs? Or it crashed when it finished? All the backups appear to be intact...
     
    #1 MasterGberry, Aug 14, 2013
    Last edited: Aug 14, 2013
  2. dietmar

    dietmar Proxmox Staff Member
    Staff Member

    Joined:
    Apr 28, 2005
    Messages:
    16,433
    Likes Received:
    300
    Do you run snapshot backup for a container?
     
  3. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,302
    Likes Received:
    131
    The petabyte spike in the RRD graph is a bug; I sometimes see these huge values when shutting down VMs (KVM guests).

    (So it is possible that your problem is not related to the disks.)
     
  4. MasterGberry

    MasterGberry New Member

    Joined:
    Jul 9, 2013
    Messages:
    15
    Likes Received:
    0
    Yes, I ran snapshot with LZO. Is this not advised?
     
  5. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,448
    Likes Received:
    387
    There is an issue with the current lvm2, but it looks fixed in our latest lvm2 packages.

    Upgrade using pvetest and test again. If you don't want to switch to pvetest, you need to wait until 3.1 final.
    See http://forum.proxmox.com/threads/15433-Proxmox-VE-3-1-beta-(pvetest)
     
  6. MasterGberry

    MasterGberry New Member

    Joined:
    Jul 9, 2013
    Messages:
    15
    Likes Received:
    0
    How long has this been an issue? As previously mentioned, I have had this exact crash 1-2 other times (I don't think I was backing up anything those times). Those were about 2 months ago, I think, and I might not have been on 3.0 yet. I just want to make sure this doesn't occur too often. Thanks.
     
  7. MasterGberry

    MasterGberry New Member

    Joined:
    Jul 9, 2013
    Messages:
    15
    Likes Received:
    0
    I guess another interesting question is this: my SAN runs as a single VM on its own Proxmox node. Would it be possible to write a bash script that does ping checks (which I already use) and, if the checks fail for, say, 10-15 minutes, force-restarts the VM so the SAN is brought back online? Then I would have to restart all the VMs on the actual hypervisors too, though... hmm, this might be difficult to coordinate.
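    A rough sketch of what such a watchdog could look like, assuming it runs on the node that hosts the SAN VM (qm only acts on local VMs). The IP address, VMID, interval and threshold are placeholders I made up, and the function names are just for illustration. It does not handle restarting the VMs on the other hypervisors.

```shell
#!/usr/bin/env bash
# Watchdog sketch: restart a VM after its address has failed ping checks
# for a sustained period. All values below are placeholders (assumptions).

SAN_IP="${SAN_IP:-192.0.2.10}"          # hypothetical SAN address
SAN_VMID="${SAN_VMID:-100}"             # hypothetical VMID of the SAN VM
CHECK_INTERVAL="${CHECK_INTERVAL:-60}"  # seconds between checks
MAX_FAILURES="${MAX_FAILURES:-10}"      # 10 failed checks x 60 s = about 10 minutes

# Return 0 if the host answers a single ping within 5 seconds.
host_is_up() {
    ping -c 1 -W 5 "$1" >/dev/null 2>&1
}

# Hard-stop the VM and start it again with Proxmox's qm tool.
restart_vm() {
    qm stop "$1"
    qm start "$1"
}

# Run one check. Takes the current failure count and prints the new one.
# Once the threshold is reached, the VM is restarted and the count resets,
# which also gives the VM a full window to boot before the next restart.
check_once() {
    local failures="$1"
    if host_is_up "$SAN_IP"; then
        failures=0
    else
        failures=$((failures + 1))
        if [ "$failures" -ge "$MAX_FAILURES" ]; then
            # Send qm output to stderr so it does not mix with the count.
            restart_vm "$SAN_VMID" >&2
            failures=0
        fi
    fi
    echo "$failures"
}

# Loop forever only when started with --run, so the functions can be
# sourced and tried on their own first.
if [ "${1:-}" = "--run" ]; then
    failures=0
    while true; do
        failures=$(check_once "$failures")
        sleep "$CHECK_INTERVAL"
    done
fi
```

    Started with --run it would loop until killed. Requiring several consecutive failures rather than a single missed ping should avoid restarting the SAN on a brief network blip.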
     
  8. MasterGberry

    MasterGberry New Member

    Joined:
    Jul 9, 2013
    Messages:
    15
    Likes Received:
    0
    This just happened again, and I was not running any backup tasks. I need to get this resolved. Why is Proxmox misfiring and overloading the SAN? Or is the SAN just crashing and Proxmox is showing weird things in the disk I/O graph? I am willing to provide as much info as needed to get this resolved ASAP. Thanks.

    Untitled.png Untitled2.png

    Code:
    Aug 17 08:32:38 proxmox1 kernel: connection1:0: detected conn error (1011)
    Aug 17 08:34:38 proxmox1 kernel: session1: session recovery timed out after 120 secs
    Aug 17 08:34:38 proxmox1 kernel: __ratelimit: 646 callbacks suppressed
    Aug 17 08:34:38 proxmox1 kernel: lost page write due to I/O error on dm-4
    Aug 17 08:34:38 proxmox1 kernel: lost page write due to I/O error on dm-4
    Aug 17 08:34:38 proxmox1 kernel: lost page write due to I/O error on dm-4
    Aug 17 08:34:38 proxmox1 kernel: lost page write due to I/O error on dm-4
    Aug 17 08:34:38 proxmox1 kernel: lost page write due to I/O error on dm-4
    Aug 17 08:34:38 proxmox1 kernel: lost page write due to I/O error on dm-4
    EDIT: I found the log excerpt above in /var/log/messages. It looks like the SAN goes to 100% CPU usage and then the hypervisor can't communicate with it?
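    For sorting through a log like this, here is a small sketch that counts the "lost page write" errors per dm device and pulls out the iSCSI connection events. The function names are made up for illustration, and because the kernel rate-limits these messages (the "__ratelimit: N callbacks suppressed" lines), the counts are only a lower bound.

```shell
#!/usr/bin/env bash
# Count "lost page write" errors per device-mapper device.
# Reads kernel log lines on stdin and prints "<device> <count>" per line.
count_lost_writes() {
    grep -o 'lost page write due to I/O error on dm-[0-9]*' \
        | awk '{ count[$NF]++ } END { for (dev in count) print dev, count[dev] }' \
        | sort
}

# Print only the iSCSI connection errors and session recovery timeouts.
iscsi_events() {
    grep -E 'detected conn error|session recovery timed out'
}
```

    Usage would be along the lines of `count_lost_writes < /var/log/messages` and `iscsi_events < /var/log/messages`. Comparing the timestamps of the iSCSI events with the first write errors should show whether the connection dropped before the I/O errors started.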
     
    #8 MasterGberry, Aug 17, 2013
    Last edited: Aug 17, 2013
  9. MasterGberry

    MasterGberry New Member

    Joined:
    Jul 9, 2013
    Messages:
    15
    Likes Received:
    0
    Bump. I ran a backup yesterday around 10-11 PST and it all seemed fine. The SAN crashed again later, around 3 AM. Could this have been a result of my backup? If so, do you suggest I upgrade to 3.1 now that it has been released? Are there any other logs besides /var/log/messages I can look in to find what is going wrong?

    Thanks.
     