restarted a node, some kvm's on other nodes panic

Discussion in 'Proxmox VE: Installation and configuration' started by RobFantini, Feb 9, 2017.

  1. RobFantini (Active Member, Proxmox Subscriber)
    We have a Ceph cluster.

    A minute after restarting one node, at least 3 key KVMs panicked.

    Screenshot attached.

    The KVMs are on 2 different nodes.
     

    Attached Files:

  2. RobFantini (Active Member, Proxmox Subscriber)
    2 of the 3 KVMs had high memory usage.

    One did not have swap.

    All 3 were busy with disk I/O.

    One of the nodes uses on-board SATA; the other is a recent high-end Supermicro with an IT-mode HBA.

    Code:
     # pveversion -v
    proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
    pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
    pve-kernel-4.4.35-1-pve: 4.4.35-77
    pve-kernel-4.4.35-2-pve: 4.4.35-79
    lvm2: 2.02.116-pve3
    corosync-pve: 2.4.0-1
    libqb0: 1.0-1
    pve-cluster: 4.0-48
    qemu-server: 4.0-108
    pve-firmware: 1.1-10
    libpve-common-perl: 4.0-91
    libpve-access-control: 4.0-23
    libpve-storage-perl: 4.0-73
    pve-libspice-server1: 0.12.8-1
    vncterm: 1.2-1
    pve-docs: 4.4-3
    pve-qemu-kvm: 2.7.1-1
    pve-container: 1.0-93
    pve-firewall: 2.0-33
    pve-ha-manager: 1.0-40
    ksm-control-daemon: 1.2-1
    glusterfs-client: 3.5.2-2+deb8u3
    lxc-pve: 2.0.7-1
    lxcfs: 2.0.6-pve1
    criu: 1.6.0-1
    novnc-pve: 0.5-8
    smartmontools: 6.5+svn4324-1~pve80
    zfsutils: 0.6.5.8-pve14~bpo80
    ceph: 10.2.5-1~bpo80+1
    
     
  3. tom (Proxmox Staff Member)
    Post your VM config:

    > qm config VMID

    What OS do you run inside, in detail?
     
  4. RobFantini (Active Member, Proxmox Subscriber)
    All three run Debian Jessie.
    Code:
    boot: cn
    bootdisk: scsi0
    cores: 2
    memory: 1024
    name: fbcadmin
    net0: virtio=DE:60:C3:F6:55:23,bridge=vmbr1
    numa: 0
    onboot: 1
    ostype: l26
    protection: 1
    scsi0: ceph-kvm3:vm-100-disk-1,discard=on,size=8G
    smbios1: uuid=195cf837-ebaa-49c2-95e9-5ba7a0869cb0
    sockets: 1
    
     
  5. RobFantini (Active Member, Proxmox Subscriber)
    Also, none of the systems logged out-of-memory errors, so it is probably not a memory issue. Note: the memory in the config above was 512 MB yesterday.

    Kernel running, per uname -a:
    Linux fbcadmin 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux
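
    A minimal way to double-check that, assuming standard kernel logging on a Jessie guest (generic commands, not from the original post):
    Code:
    # look for OOM-killer activity in the guest's kernel log
    grep -i "out of memory" /var/log/kern.log /var/log/syslog
    dmesg | grep -i oom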
     
  6. RobFantini (Active Member, Proxmox Subscriber)
    And I can update these to Debian Stretch if a newer kernel would help.
     
  7. RobFantini (Active Member, Proxmox Subscriber)
    After some research, since I have 8 nodes, I'll try using 5 for OSDs and 3 for VMs. I am not sure yet where to place the 3 mons.
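
    If it helps, on PVE 4.x a monitor can be created on whichever nodes are picked for that role, roughly like this (a sketch only; the node names are hypothetical and the exact subcommand should be checked against the installed pveceph version):
    Code:
    # run once on each of the three nodes chosen to host a mon
    root@vmnode1:~# pveceph createmon
    root@vmnode2:~# pveceph createmon
    root@vmnode3:~# pveceph createmon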
     
  8. tom (Proxmox Staff Member)
    Make sure that you use the virtio-scsi controller (not LSI); see the VM options. I remember some panics when using LSI recently, but I did not debug them further, as a modern OS should use virtio-scsi anyway.
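
    As a minimal sketch of that change from the CLI, using VM 100 from the config above (the controller type only takes effect after the VM is stopped and started again):
    Code:
    # switch the SCSI controller of VM 100 to virtio-scsi
    qm set 100 --scsihw virtio-scsi-pci
    # verify the setting
    qm config 100 | grep scsihw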
     
  9. udo (Well-Known Member, Proxmox Subscriber)
    Hi Rob,
    I'm not sure if this helps with this issue, but I had a separate Ceph cluster (8 nodes) where the mons ran on the PVE nodes.
    So I would run the mons on the VM nodes.

    Was the restarted node an OSD+mon node? There is an issue where the OSD stop is not recognised early enough, because the mon also dies too fast. If you restart a node and shut down the ceph-osd daemons first, the VMs see roughly 20 seconds less I/O stall.

    Normally the VMs should handle a short I/O stall without trouble, but perhaps not?! (I don't know whether discard is also a problem in this case.)

    Udo
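
    A rough sketch of what "shut down the ceph-osd first" could look like before rebooting an OSD+mon node, assuming the systemd units that ship with Ceph Jewel (commands assumed, not from the original post):
    Code:
    # keep CRUSH from rebalancing while the node is briefly down
    ceph osd set noout
    # stop all OSD daemons on this node, then reboot it
    systemctl stop ceph-osd.target
    reboot
    # once the node is back and its OSDs have rejoined
    ceph osd unset noout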
     
  10. udo (Well-Known Member, Proxmox Subscriber)
    Answering myself:
    I got an email that this bug (#18516) is solved now, but I don't know how long it will take for these changes to get into a Ceph release (I guess 10.2.6).

    Udo
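
    One generic way to see which Ceph build is installed and what the repository currently offers while waiting for that fix (plain apt/ceph commands, nothing specific to this bug):
    Code:
    ceph --version
    apt-get update && apt-cache policy ceph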
     
  11. RobFantini (Active Member, Proxmox Subscriber)
    They are set to LSI; I'll do the switch. Thank you.
     
  12. RobFantini (Active Member, Proxmox Subscriber)
    Udo: yes, the restarted node ran mon+OSD.
     