Daily node crash

aviel900 · Apr 10, 2022

Hey,
I have a cluster with 3 nodes, and all nodes except the master node crash at least once per day. I can't find the error. Before I added them to the cluster they worked without problems, also with Proxmox, just standalone. There are no errors in the syslog; it just stops logging, and after a hard reset the nodes start normally.

Moayad · Apr 11, 2022

Hello,

Can you please provide us with the following:

1. the cluster network configuration: # cat /etc/network/interfaces
2. the Corosync config: # cat /etc/pve/corosync.conf
3. the output of: # pveversion -v
4. the syslog of the crashed node: /var/log/syslog
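
The four items above could be gathered in one go with something like the following sketch. The report path and the 500-line tail are arbitrary assumptions, not Proxmox conventions; missing files or commands are noted in the report instead of aborting the script.

```shell
#!/bin/sh
# Collect the requested diagnostics into one report file.
# REPORT is a hypothetical default location; override it as needed.
report="${REPORT:-/tmp/pve-report-$(uname -n).txt}"

section() {
    # Print a header, run the command, and note a failure instead of stopping.
    printf '\n===== %s =====\n' "$*"
    "$@" 2>&1 || printf '(command failed or file missing: %s)\n' "$*"
}

{
    section cat /etc/network/interfaces
    section cat /etc/pve/corosync.conf
    section pveversion -v
    section tail -n 500 /var/log/syslog
} > "$report"

echo "wrote $report"
```

Public addresses can be masked in the resulting file before posting it.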

aviel900 · Apr 11, 2022

Sure, the interfaces file looks as follows:

Code:

source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback


auto vmbr0

iface vmbr0 inet static
        address 5.xxx.xxx.226
        netmask 255.255.255.224
        network 5.xxx.xxx.224
        broadcast 5.xxx.xxx.255
        gateway 5.xxx.xxx.225
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0

iface vmbr0 inet6 static
                address xxxx:xxxx:0000:xxxx:0000:0000:0000:b000/64
                gateway xxxx:xxxx:0000:xxxx:0000:0000:0000:0001

auto vmbr0:1
iface vmbr0:1 inet static
        address 192.168.40.3
        netmask 255.255.255.0
        gateway 192.168.40.1

Corosync

Code:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: hdd1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.40.3
    ring1_addr: 5.xxx.xxx.226
    ring2_addr: xxxx:xxxx:0000:xxxx:0000:0000:0000:b000
  }
  node {
    name: proxmox
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.40.4
    ring1_addr: 193.xxx.xxx.186
    ring2_addr: xxxx:xxxx:0000:xxxx:0000:0000:0000:0009
  }
  node {
    name: ssd1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.40.2
    ring1_addr: 5.xxx.xxx.240
    ring2_addr: xxxx:xxxx:0000:xxxx:0000:0000:0000:a000
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: xxxxcluster1
  config_version: 7
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  interface {
    linknumber: 2
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

pveversion

Code:

proxmox-ve: 7.1-1 (running kernel: 5.13.19-6-pve)
pve-manager: 7.1-12 (running version: 7.1-12/b3c09de3)
pve-kernel-helper: 7.1-14
pve-kernel-5.13: 7.1-9
pve-kernel-5.4: 6.3-7
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-4-pve: 5.13.19-9
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-7
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-5
libpve-guest-common-perl: 4.1-1
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.1-1
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-2
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-7
pve-cluster: 7.1-3
pve-container: 4.1-4
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-6
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-2
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

hdd1 is the master node. Here is a Pastebin link to the syslog of ssd1. The last crash was yesterday (10 April) at 11:43.
https://pastebin.com/0gvySGkb

Moayad · Apr 11, 2022

Thank you for the outputs!

May I ask what the PHP scripts in your cluster's cron jobs do?

Code:

Apr 10 10:46:01 ssd1 CRON[238487]: (root) CMD (/usr/local/emps/bin/php /usr/local/virtualizor/scripts/powercron.php >> /var/virtualizor/log/powercron 2>&1)
Apr 10 10:47:01 ssd1 CRON[238685]: (root) CMD (/usr/local/emps/bin/php /usr/local/virtualizor/scripts/powercron.php >> /var/virtualizor/log/powercron 2>&1)

I ask because I see the node lose quorum after the PHP cron job runs. Regarding your Corosync config, we recommend a physical NIC for the Corosync traffic [0] in order to provide failover.

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
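
To see how cron runs and cluster events line up in time, the syslog can be filtered for both. This is only a sketch: the function name `cluster_timeline` is made up for illustration, and the list of process names is an assumption about which daemons are worth watching. `pvecm status` and `corosync-cfgtool -s` show the current quorum and link state and are only called if they are installed.

```shell
#!/bin/sh
# Show cron and cluster-stack lines from a syslog file, in file order.
# Usage: cluster_timeline [FILE]   (defaults to /var/log/syslog)
cluster_timeline() {
    grep -E ' (CRON|corosync|pmxcfs|pve-ha-lrm|pve-ha-crm|watchdog-mux)\[' \
        "${1:-/var/log/syslog}"
}

# Live quorum and link state, only if the tools exist on this host.
if command -v pvecm >/dev/null 2>&1; then
    pvecm status
fi
if command -v corosync-cfgtool >/dev/null 2>&1; then
    corosync-cfgtool -s
fi
```

Running `cluster_timeline /var/log/syslog | less` on the crashed node and on the master for the same time window makes it easier to compare what each side logged.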

aviel900 · Apr 11, 2022

It's not possible for me to get a second NIC. The PHP scripts are only for the customer GUI; they synchronize data with WHMCS, such as when the rental period ends and how much traffic has been used.

aviel900 · Apr 11, 2022

And if it were caused by the PHP script, why would the "proxmox" node crash as well? The GUI is not installed on the third node, called "proxmox".

Moayad · Apr 12, 2022

aviel900 said:
And if it were caused by the PHP script, why would the "proxmox" node crash as well? The GUI is not installed on the third node, called "proxmox".

To narrow down the issue, watch the node's monitoring after the PHP scripts run and check load, IO, netin, and netout, to find out whether these scripts consume a lot of the nodes' resources.
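
If the graphs are too coarse, load and network counters can also be sampled from the shell and written to disk, so the last values before a hang survive the hard reset. This is a minimal sketch: the function name `sample`, the log path, and the 5-second interval are assumptions; `vmbr0` is the bridge from the posted config.

```shell
#!/bin/sh
# Print one sample: timestamp, load averages, rx/tx byte counters of an interface.
# Usage: sample [IFACE] [NETDEV_FILE]   (defaults: vmbr0, /proc/net/dev)
sample() {
    iface="${1:-vmbr0}"
    netdev="${2:-/proc/net/dev}"
    load=$(cut -d' ' -f1-3 /proc/loadavg 2>/dev/null)
    # In /proc/net/dev, the 1st number after "iface:" is rx bytes, the 9th is tx bytes.
    net=$(awk -v i="$iface" '$0 ~ "^ *" i ":" {
              sub(/^ +/, "")
              split($0, a, /[: ]+/)
              print "rx=" a[2], "tx=" a[10]
          }' "$netdev" 2>/dev/null)
    printf '%s load=%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$load" "$iface" "${net:-not-found}"
}

# Example loop (commented out so the file can be sourced safely):
# while :; do sample vmbr0 >> /root/node-samples.log; sync; sleep 5; done
```

Comparing the samples taken right after the cron job runs with those from quiet periods shows whether the scripts cause a noticeable spike.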

aviel900 · Apr 12, 2022

I have now uninstalled the GUI, including the PHP cron jobs, but I don't think it will change anything. I'll send you the new syslog after the next crash.

aviel900 · Apr 12, 2022

There we are:
https://pastebin.com/Kbzx5hQA

The crash was at 14:01.

Moayad · Apr 12, 2022

Doesn't the syslog from the other nodes give you any hint around the time ssd1 crashed?

aviel900 · Apr 12, 2022

The master node (hdd1) is not crashing; in its logs you only see that the other nodes went offline. (By the way, the additional GUI is also installed on the master node.) The third node (proxmox) just stops logging, so there is also nothing that points to the problem.


