Random Reboot (pve 5.3.5)

Discussion in 'Proxmox VE: Installation and configuration' started by zorg, Dec 26, 2018.

  1. zorg

    zorg New Member

    Joined:
    Dec 26, 2018
    Messages:
    5
    Likes Received:
    0
    Hello

    Since installing in September, I have been getting random reboots on my 4-node cluster.

    Each node reboots about once a week.

    There is no reason given in the logs. I have put corosync in debug mode, but still nothing.

    The only thing I have found is that when I test with omping, it loses 1 packet out of 600.

    But nothing about the watchdog is logged.

    I'm using Intel cards with the ixgbe driver. I have seen posts in the forum about this, but nothing is clear about a solution.

    I'm stuck because I don't know where to look or what to do to help determine the problem.

    Hope someone can help.

    Zorg

    Code:
    pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
    pve-kernel-4.15: 5.2-12
    pve-kernel-4.15.18-9-pve: 4.15.18-30
    pve-kernel-4.15.17-3-pve: 4.15.17-14
    ceph: 12.2.10-1~bpo90+1
    corosync: 2.4.4-pve1
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.1-3
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-43
    libpve-guest-common-perl: 2.0-18
    libpve-http-server-perl: 2.0-11
    libpve-storage-perl: 5.0-33
    libqb0: 1.0.3-1~bpo9
    lvm2: 2.02.168-pve6
    lxc-pve: 3.0.2+pve1-5
    lxcfs: 3.0.2-2
    novnc-pve: 1.0.0-2
    proxmox-widget-toolkit: 1.0-22
    pve-cluster: 5.0-31
    pve-container: 2.0-31
    pve-docs: 5.3-1
    pve-edk2-firmware: 1.20181023-1
    pve-firewall: 3.0-16
    pve-firmware: 2.0-6
    pve-ha-manager: 2.0-5
    pve-i18n: 1.0-9
    pve-libspice-server1: 0.14.1-1
    pve-qemu-kvm: 2.12.1-1
    pve-xtermjs: 1.0-5
    qemu-server: 5.0-43
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
     
  2. sb-jw

    sb-jw Active Member

    Joined:
    Jan 23, 2018
    Messages:
    325
    Likes Received:
    29
    Hey,

    Can you post some log entries (dmesg, syslog, PVE logs, etc.) from 5 minutes before and after a reboot?
    Let us know which hardware you use and how you have configured it.
    What is the load, and how many VMs do you have?
    Do you have any metrics from your monitoring?
    Could you update PVE and the firmware of your servers?
    Do you use the HA functionality of PVE?
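
To collect the log window asked for above, a minimal Python sketch could filter syslog lines down to those within five minutes of a reboot. It assumes the traditional syslog timestamp prefix (for example "Dec 26 10:15:01") with English month names; that prefix carries no year, so the year is taken from the reboot time. The function name lines_near and the sample log lines are hypothetical.

```python
from datetime import datetime, timedelta

def lines_near(lines, reboot_time, minutes=5):
    """Return the syslog lines stamped within `minutes` of reboot_time."""
    window = timedelta(minutes=minutes)
    selected = []
    for line in lines:
        try:
            # Traditional syslog prefix is 15 characters: "Dec 26 10:15:01".
            # It has no year, so borrow the year of the reboot (assumption).
            stamp = datetime.strptime(
                f"{reboot_time.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if abs(stamp - reboot_time) <= window:
            selected.append(line)
    return selected

# Hypothetical sample data, for illustration only.
sample = [
    "Dec 26 10:09:59 node1 example: too early",
    "Dec 26 10:12:30 node1 example: shortly before the reboot",
    "Dec 26 10:18:00 node1 example: shortly after the reboot",
    "Dec 26 10:21:00 node1 example: too late",
]
reboot = datetime(2018, 12, 26, 10, 15, 0)
nearby = lines_near(sample, reboot)
for line in nearby:
    print(line)
```

On a real node the lines would come from a file such as /var/log/syslog, and the reboot time from whatever records it, for instance the output of last reboot.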
     
  3. zorg

    zorg New Member

    Joined:
    Dec 26, 2018
    Messages:
    5
    Likes Received:
    0
    Thanks

    I have 12 VMs on this node (all Linux) and 38 VMs across all my nodes.

    I have a Ceph cluster on 6 other servers to store my VMs.

    My 4 nodes are in a Proxmox cluster, so I guess corosync is on, and I think the watchdog is too (but my VMs are not configured to use HA).

    I have attached files with the load and logs to this post.

    The hardware of my hypervisors:


    Motherboard: TYAN Computer Corporation
    Network card:
    03:00.0 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection (rev 01)
    03:00.1 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection (rev 01)

    CPU
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 32
    On-line CPU(s) list: 0-31
    Thread(s) per core: 2
    Core(s) per socket: 8
    Socket(s): 2
    NUMA node(s): 4
    Vendor ID: AuthenticAMD
    CPU family: 21
    Model: 1
    Model name: AMD Opteron(TM) Processor 6272
    Stepping: 2
    CPU MHz: 2100.070
    CPU max MHz: 2100.0000
    CPU min MHz: 1400.0000
    BogoMIPS: 4200.13
    Virtualization: AMD-V
    L1d cache: 16K
    L1i cache: 64K
    L2 cache: 2048K
    L3 cache: 6144K
    NUMA node0 CPU(s): 0-7
    NUMA node1 CPU(s): 8-15
    NUMA node2 CPU(s): 16-23
    NUMA node3 CPU(s): 24-31
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 nodeid_msr topoext perfctr_core perfctr_nb cpb hw_pstate ssbd vmmcall arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
     

    Attached Files:

  4. sb-jw

    sb-jw Active Member

    Joined:
    Jan 23, 2018
    Messages:
    325
    Likes Received:
    29
    It looks like your nodes have public IPs. Is it possible to put them behind a firewall? You may be receiving a "magic packet" that forces your application or server to crash, or you may be under a DDoS attack. Is your password secure? Your password may have been compromised, and someone may be doing bad things on your nodes.

    What about your network: do you have your own VLANs for external and internal communication, or do you share a VLAN with many other dedicated server customers? Do you share a NIC between different service types?

    Could you please upload more log files? There are many other interesting log files that could help us see what's going on.

    If you do not use the HA function, have you tried turning the watchdog off to see whether the problem persists?
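
Before turning anything off, it can help to see which watchdog driver the kernel has actually loaded. According to the Proxmox VE HA documentation, PVE falls back to the Linux software watchdog (softdog) when no hardware watchdog module is configured. The following is a minimal Python sketch that scans text in /proc/modules format for a few watchdog driver names; the list of names is an assumption and not exhaustive, and the function name loaded_watchdogs is hypothetical.

```python
# Assumed, non-exhaustive list of Linux watchdog driver module names.
WATCHDOG_MODULES = {"softdog", "ipmi_watchdog", "iTCO_wdt", "sp5100_tco"}

def loaded_watchdogs(proc_modules_text):
    """Return the watchdog modules found in /proc/modules-style text."""
    found = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and fields[0] in WATCHDOG_MODULES:
            found.append(fields[0])
    return sorted(found)

# Hypothetical excerpt in /proc/modules format, for illustration only.
sample = (
    "softdog 16384 2 - Live 0x0000000000000000\n"
    "ixgbe 290816 0 - Live 0x0000000000000000\n"
)
print(loaded_watchdogs(sample))
```

On a node, the input would be the contents of /proc/modules, for example open("/proc/modules").read().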

    Is anything running before the node fails, such as a backup, a cron job, unattended updates, or something else?
     
  5. zorg

    zorg New Member

    Joined:
    Dec 26, 2018
    Messages:
    5
    Likes Received:
    0
    How do I turn off the watchdog?
     
  6. sb-jw

    sb-jw Active Member

    Joined:
    Jan 23, 2018
    Messages:
    325
    Likes Received:
    29
    It depends on the system you are using; you should take a look at your server or mainboard manual. Normally it is a BIOS option.
     
  7. zorg

    zorg New Member

    Joined:
    Dec 26, 2018
    Messages:
    5
    Likes Received:
    0
    I have turned the watchdog off in the BIOS, but after 4 days one of my nodes decided to reboot, still with nothing in the logs.
    It seems to me that other people on the forum have this kind of erratic problem, but no one seems to have a clue.

    I do not really believe it is a DDoS, because nothing in my monitoring indicates that.

    So I'm still working on it.
     
  8. sb-jw

    sb-jw Active Member

    Joined:
    Jan 23, 2018
    Messages:
    325
    Likes Received:
    29
    A DDoS attack does not need multiple Gbit/s of bandwidth; it depends on the system and its configuration. Often 1 Mbit/s can be enough, depending on the attack itself. There are multiple ways to do that.

    What about all my other questions?
     
  9. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,302
    Likes Received:
    131
    Do you have the latest AMD CPU microcode installed (through a BIOS update or apt install amd64-microcode)?

    Also, if you have UEFI, do you have a kernel panic trace file in /sys/fs/pstore/?
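
Checking for saved panic traces can be scripted. Below is a minimal Python sketch that lists whatever files exist in the pstore directory mentioned above; the function name list_pstore_dumps is hypothetical, and the sketch makes no assumption about how the files are named. Reading that directory usually requires root, so an unreadable directory is treated the same as an empty one here.

```python
from pathlib import Path

def list_pstore_dumps(pstore_dir="/sys/fs/pstore"):
    """Return sorted file names in the pstore directory, or [] if none."""
    root = Path(pstore_dir)
    if not root.is_dir():
        return []
    try:
        return sorted(entry.name for entry in root.iterdir() if entry.is_file())
    except OSError:
        # Typically a permission error when not running as root.
        return []

dumps = list_pstore_dumps()
if dumps:
    print("Saved crash records:", ", ".join(dumps))
else:
    print("No pstore records found (or pstore is not readable).")
```

Any files that do show up are worth copying off the node and attaching to the thread, since they may hold the kernel messages from just before the reboot.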
     
  10. Manny Vazquez

    Manny Vazquez Member

    Joined:
    Jul 12, 2017
    Messages:
    88
    Likes Received:
    1
    I am going through the same situation. Is there any solution to this?
     