Random Reboot (PVE 5.3-5)

zorg

Active Member
Dec 26, 2018
Hello

Since installing in September, I have had random reboots on my 4-node cluster.

Each node reboots about once a week.

There is no reason given in the logs. I have put corosync in debug mode, but still nothing.

The only thing I have found is that when I test with omping, it loses 1 packet out of 600.

But nothing about the watchdog is logged.

I'm using Intel cards with the ixgbe driver. I have seen posts on the forum about this, but none gives a clear solution.

I'm stuck because I don't know where to look or what to do to determine the problem.

Hope someone can help

Zorg

Code:
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.17-3-pve: 4.15.17-14
ceph: 12.2.10-1~bpo90+1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
 
Hey,

Can you post some log entries (dmesg, syslog, PVE logs, etc.) from 5 minutes before and after a reboot?
Let us know which hardware you use and how you configured it.
What load and how many VMs do you have?
Do you have any metrics from your monitoring?
Could you update PVE and the firmware of your servers?
Do you use the HA functionality of PVE?
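
One possible way to cut a large syslog down to the 5 minutes before and after a reboot is sketched below. It is only a sketch under stated assumptions: the function name `lines_around` is made up for illustration, the log is assumed to use the traditional syslog timestamp prefix (for example `Dec 26 10:15:01`), and because that prefix carries no year, the year is taken from the reboot time you pass in.

```python
from datetime import datetime, timedelta


def lines_around(lines, reboot_time, minutes=5):
    """Return the syslog lines stamped within +/- `minutes` of `reboot_time`.

    Assumes the traditional syslog prefix "Mon DD HH:MM:SS" (15 characters).
    The year is missing from that prefix, so it is borrowed from `reboot_time`.
    """
    window = timedelta(minutes=minutes)
    selected = []
    for line in lines:
        try:
            stamp = datetime.strptime(
                f"{reboot_time.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            # No parsable timestamp at the start of the line: skip it.
            continue
        if abs(stamp - reboot_time) <= window:
            selected.append(line)
    return selected
```

For example, `lines_around(open("/var/log/syslog"), datetime(2018, 12, 26, 10, 15))` would keep only the entries between 10:10 and 10:20 on that day; the date and time here are placeholders, not values taken from this thread.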
 
Thanks

I have 12 VMs on this node (all Linux) and 38 VMs across all my nodes.

I have a Ceph cluster on 6 other servers to store my VMs.

My 4 nodes are in a Proxmox cluster, so I guess corosync is on, and I think the watchdog is too (but my VMs are not configured to use HA).

I have attached files to this post with the load and the logs.

My hypervisor hardware:


Motherboard: TYAN Computer Corporation
Network card:
03:00.0 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection (rev 01)
03:00.1 Ethernet controller: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection (rev 01)

CPU
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 21
Model: 1
Model name: AMD Opteron(TM) Processor 6272
Stepping: 2
CPU MHz: 2100.070
CPU max MHz: 2100.0000
CPU min MHz: 1400.0000
BogoMIPS: 4200.13
Virtualization: AMD-V
L1d cache: 16K
L1i cache: 64K
L2 cache: 2048K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 nodeid_msr topoext perfctr_core perfctr_nb cpb hw_pstate ssbd vmmcall arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
 

Attachments

  • syslog.txt (244.5 KB)
  • myproxmox11.png (211.6 KB)

It looks like your nodes have public IPs; is it possible to put them behind a firewall? Possibly you are receiving a "magic packet" that forces your application or server to crash, or you are under a DDoS attack. Is your password secure? Possibly your password has been compromised and someone is doing bad things on your nodes.

What about your network: do you have your own VLANs for external and internal communication, or do you share a VLAN with many other dedicated-server customers? Do you share a NIC between different service types?

Could you please upload more log files? There are many other interesting log files that can help us see what's going on.

If you do not use the HA function, have you tried turning the watchdog off to see whether the problem persists?
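
Before turning anything off, it may help to see which watchdog drivers the kernel has loaded at all. The snippet below is a rough sketch, not a Proxmox tool: the function name `loaded_watchdog_modules` and the list of module names are assumptions chosen for illustration, and a loaded module does not by itself prove that the watchdog is armed.

```python
# Kernel module names treated as watchdog drivers in this sketch.
# This list is an assumption and is not exhaustive.
WATCHDOG_MODULES = {"softdog", "iTCO_wdt", "ipmi_watchdog", "sp5100_tco"}


def loaded_watchdog_modules(proc_modules_text):
    """Return the sorted watchdog driver names found in /proc/modules content.

    Each line of /proc/modules starts with the module name, followed by
    size, reference count, dependencies, state and load address.
    """
    found = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] in WATCHDOG_MODULES:
            found.append(fields[0])
    return sorted(found)
```

On a node you could call it as `loaded_watchdog_modules(open("/proc/modules").read())`; an empty list would suggest that none of the listed drivers is loaded.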

Is there anything running before the node fails, like a backup, a cron job, unattended updates, or something else?
 
It depends on the system you are using; you should take a look at your server or mainboard manual. Normally it is a BIOS option.
 
I have turned the watchdog off in the BIOS, but after 4 days one of my nodes rebooted again, still with nothing in the logs.
It seems to me that other people on the forum have this kind of erratic problem, but no one seems to have a clue.

I do not really believe it is a DDoS, because nothing in my monitoring indicates that.

So I'm still working on it.
 
I do not really believe it is a DDoS, because nothing in my monitoring indicates that.
A DDoS attack does not need multiple Gbit of bandwidth; it depends on the system and configuration. Often 1 Mbit can be enough, depending on the attack itself. There are multiple ways to do that.

What about all my other questions?
 
