Daily node crash

aviel900

Hey,
I have a cluster with 3 nodes, and all nodes except the master node crash at least once per day. I can't find the error. Before I added them to the cluster, they ran without problems as standalone Proxmox hosts. The syslog shows no errors; it just stops logging, and after a hard reset the nodes start normally.
 
Hello,

Can you please provide us with the following:

1. The cluster network configuration: # cat /etc/network/interfaces
2. The Corosync config: # cat /etc/pve/corosync.conf
3. The output of: # pveversion -v
4. The syslog of the crashed node: /var/log/syslog
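
If it is easier, the four items above could be gathered into a single text file with a small script. This is only a sketch: the function names, the `report.txt` output name, and the 500-line syslog tail are made up for illustration. Only the paths and the `pveversion -v` command come from the list above.

```python
import subprocess
from pathlib import Path

# Paths and command taken from the list above.
FILES = ["/etc/network/interfaces", "/etc/pve/corosync.conf", "/var/log/syslog"]
COMMANDS = [["pveversion", "-v"]]


def read_file(path, tail_lines=None):
    """Return the file content, or a placeholder if it cannot be read."""
    try:
        text = Path(path).read_text(errors="replace")
    except OSError as exc:
        return f"<could not read {path}: {exc}>"
    if tail_lines is not None:
        # Keep only the last `tail_lines` lines (assumed to be a positive number).
        text = "\n".join(text.splitlines()[-tail_lines:])
    return text


def run_command(argv):
    """Return the combined output of a command, or a placeholder on failure."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<could not run {' '.join(argv)}: {exc}>"
    return result.stdout + result.stderr


def build_report(files=FILES, commands=COMMANDS, syslog_tail=500):
    """Assemble one text report with a header line per section."""
    sections = []
    for path in files:
        # Only the syslog is truncated; the config files are short.
        tail = syslog_tail if path.endswith("syslog") else None
        sections.append(f"===== {path} =====\n{read_file(path, tail)}")
    for argv in commands:
        sections.append(f"===== {' '.join(argv)} =====\n{run_command(argv)}")
    return "\n\n".join(sections) + "\n"
```

Run as root on the crashed node, something like `Path("report.txt").write_text(build_report())` would write the report. Public addresses should be masked before posting it.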
 
Sure, the interfaces file looks as follows:
Code:
source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback


auto vmbr0

iface vmbr0 inet static
        address 5.xxx.xxx.226
        netmask 255.255.255.224
        network 5.xxx.xxx.224
        broadcast 5.xxx.xxx.255
        gateway 5.xxx.xxx.225
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0

iface vmbr0 inet6 static
        address xxxx:xxxx:0000:xxxx:0000:0000:0000:b000/64
        gateway xxxx:xxxx:0000:xxxx:0000:0000:0000:0001

auto vmbr0:1
iface vmbr0:1 inet static
        address 192.168.40.3
        netmask 255.255.255.0
        gateway 192.168.40.1

Corosync
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: hdd1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.40.3
    ring1_addr: 5.xxx.xxx.226
    ring2_addr: xxxx:xxxx:0000:xxxx:0000:0000:0000:b000
  }
  node {
    name: proxmox
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.40.4
    ring1_addr: 193.xxx.xxx.186
    ring2_addr: xxxx:xxxx:0000:xxxx:0000:0000:0000:0009
  }
  node {
    name: ssd1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.40.2
    ring1_addr: 5.xxx.xxx.240
    ring2_addr: xxxx:xxxx:0000:xxxx:0000:0000:0000:a000
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: xxxxcluster1
  config_version: 7
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  interface {
    linknumber: 2
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

pveversion
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-6-pve)
pve-manager: 7.1-12 (running version: 7.1-12/b3c09de3)
pve-kernel-helper: 7.1-14
pve-kernel-5.13: 7.1-9
pve-kernel-5.4: 6.3-7
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-4-pve: 5.13.19-9
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-7
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-5
libpve-guest-common-perl: 4.1-1
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.1-1
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-2
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-7
pve-cluster: 7.1-3
pve-container: 4.1-4
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-6
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-2
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

hdd1 is the master node. Here is a Pastebin link to the syslog of ssd1. The last crash was yesterday (April 10th) at 11:43.
https://pastebin.com/0gvySGkb
 
Thank you for the outputs!

May I ask what the PHP scripts in your cluster's cron jobs are doing?
Code:
Apr 10 10:46:01 ssd1 CRON[238487]: (root) CMD (/usr/local/emps/bin/php /usr/local/virtualizor/scripts/powercron.php >> /var/virtualizor/log/powercron 2>&1)
Apr 10 10:47:01 ssd1 CRON[238685]: (root) CMD (/usr/local/emps/bin/php /usr/local/virtualizor/scripts/powercron.php >> /var/virtualizor/log/powercron 2>&1)

I ask because I see the node lose quorum after the PHP cron job runs. Regarding your Corosync config, we recommend a dedicated physical NIC for the Corosync traffic [0] in order to provide failover.

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
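
Because the syslog simply stops at each crash, one way to check whether every crash really follows the cron run is to look for gaps in the timestamps and print the last lines logged before each gap. A minimal sketch, assuming the traditional syslog timestamp format seen above (`Apr 10 10:46:01`), English month names, and an arbitrary 10-minute threshold. `parse_timestamp` and `find_gaps` are illustrative names, and the year has to be supplied because this timestamp format does not record it.

```python
from datetime import datetime, timedelta


def parse_timestamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line, or return None."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, min_gap=timedelta(minutes=10), context=5):
    """Return (last_time, next_time, last_lines) for every silent period."""
    gaps = []
    recent = []      # the most recent `context` timestamped lines
    previous = None  # timestamp of the previous parsed line
    for line in lines:
        line = line.rstrip("\n")
        stamp = parse_timestamp(line, year)
        if stamp is None:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and stamp - previous >= min_gap:
            # Logging was silent for at least `min_gap`: likely crash and reboot.
            gaps.append((previous, stamp, list(recent)))
        recent = (recent + [line])[-context:]
        previous = stamp
    return gaps
```

Passing an open `/var/log/syslog` file object and the year to `find_gaps` returns one entry per silent period; the third element of each entry holds the last lines written before the node went quiet, which shows whether a `CRON` line is always among them.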
 
It's not possible for me to get a second NIC. The PHP scripts only serve the customer GUI and synchronize data with WHMCS, such as when the rental period ends and how much traffic has been used.
 
And if the PHP script were the cause, why would the "proxmox" node crash as well? The third node, called "proxmox", does not have the GUI installed.
 
To narrow down the issue, watch the node's monitoring graphs after the PHP scripts run and check load, IO, netin, and netout, to see whether these scripts consume a lot of the node's resources.
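
If the graphs are too coarse, similar figures can be sampled on the node itself around the minute the cron job fires. A rough sketch that reads the 1-minute load average from `/proc/loadavg` and the byte counters from `/proc/net/dev`. The function names, the 5-second interval, and the choice of `vmbr0` (the bridge from the interfaces file above) are assumptions, and disk IO is left out for brevity.

```python
import time


def parse_loadavg(text):
    """Return the 1-minute load average from /proc/loadavg content."""
    return float(text.split()[0])


def parse_net_dev(text, iface):
    """Return (rx_bytes, tx_bytes) for `iface` from /proc/net/dev content."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            fields = rest.split()
            # Field 0 is received bytes, field 8 is transmitted bytes.
            return int(fields[0]), int(fields[8])
    raise KeyError(iface)


def sample(iface):
    """Read the current load and byte counters from /proc."""
    with open("/proc/loadavg") as f:
        load = parse_loadavg(f.read())
    with open("/proc/net/dev") as f:
        rx, tx = parse_net_dev(f.read(), iface)
    return load, rx, tx


def monitor(iface="vmbr0", interval=5, samples=24):
    """Print load and network throughput every `interval` seconds."""
    _, prev_rx, prev_tx = sample(iface)
    for _ in range(samples):
        time.sleep(interval)
        load, rx, tx = sample(iface)
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} load={load:.2f} "
              f"in={(rx - prev_rx) / interval / 1024:.1f} KiB/s "
              f"out={(tx - prev_tx) / interval / 1024:.1f} KiB/s")
        prev_rx, prev_tx = rx, tx
```

Calling `monitor()` shortly before a full minute and redirecting the output to a file on a disk that survives the reset keeps the last samples before a crash.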
 
I have now uninstalled the GUI, including the PHP cron jobs, but I don't think it will change anything. I will send you the new syslog after the next crash.
 
The master node (hdd1) is not crashing; in its logs you only see that the other nodes went offline (by the way, the additional GUI is also installed on the master node). The third node (proxmox) just stops logging, so there is nothing there that points to the problem either.