pvesr Call Trace: (servers going offline)

efinley · Sep 23, 2019

I have two clusters running the following version:
root@vsys07:/var/log# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-10-pve: 4.15.18-32
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

both clusters were updated from 5.X

randomly a node will become inaccessible. It goes grey in the GUI, all VMs get a question mark next to them and I'm also unable to SSH into the host. The SSH session does NOT time out or get rejected, it just hangs forever. The VMs on the inaccessible host continue to run. I can log into the console, but almost any pve or ssh related command will hang and will not exit with Ctrl-C.

A 'reboot' always hangs on 'A stop job for PVE guest VMs' (<-- wording from memory)
Cold booting the server is the only thing that brings it back.

in /var/log/messages at the time the server becomes inaccessible, I see pvesr Call Trace messages (see attached file).

Any help or insight into this problem would appreciated, I'm getting tired of driving into the office in the middle of the night to reboot servers.

aaron · Sep 27, 2019

What do you have in /var/log/kern* and /var/log/syslog* in the affected time frame before the server becomes unresponsive?

efinley · Sep 27, 2019

aaron said:
What do you have in /var/log/kern* and /var/log/syslog* in the affected time frame before the server becomes unresponsive?

I've attached those files. It appears that corosync and/or pvesr are having issues.

Thanks for your attention to this.

efinley · Sep 28, 2019

aaron said:
What do you have in /var/log/kern* and /var/log/syslog* in the affected time frame before the server becomes unresponsive?

Another data point:

I was able to log into the console and 'systemctl restart corosync'. and corosync seemed to come back up and regain quorum - but I still can't SSH into the box and that node in the GUI then changes from a grey question mark to a red X, just like as if it was powered down.

efinley · Sep 28, 2019

aaron said:
What do you have in /var/log/kern* and /var/log/syslog* in the affected time frame before the server becomes unresponsive?

Another data point:

if I 'systemctl restart sshd' it restarts and the old hung sessions are still hung and I still can't ssh into the server.

If there are any other commands you want me to run on the console while it's in this hung state, let me know.

efinley · Sep 28, 2019

aaron said:
What do you have in /var/log/kern* and /var/log/syslog* in the affected time frame before the server becomes unresponsive?

another data point:

If i'm logged into the console I can NOT ssh out of the server, but all the VMs and containers continue to run without issue - reading/writing to disk, reading/writing to the network.

aaron · Sep 30, 2019

Do you have HA enabled?
Can you post the content of /etc/corosync/corosync.conf

efinley · Sep 30, 2019

aaron said:
Do you have HA enabled?
Can you post the content of /etc/corosync/corosync.conf

Code:

root@vsys07:/etc/pve# cat corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vsys06
    nodeid: 1
    quorum_votes: 1
    ring0_addr: X.Y.241.2
  }
  node {
    name: vsys07
    nodeid: 2
    quorum_votes: 1
    ring0_addr: X.Y.241.3
  }
  node {
    name: vsys09
    nodeid: 3
    quorum_votes: 1
    ring0_addr: X.Y.241.4
  }
  node {
    name: vsys11
    nodeid: 4
    quorum_votes: 1
    ring0_addr: X.Y.241.5
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: LindonHQ
  config_version: 4
  interface {
    bindnetaddr: X.Y.241.2
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

efinley · Sep 30, 2019

aaron said:
Do you have HA enabled?
Can you post the content of /etc/corosync/corosync.conf

HA is not enabled.

Another data point and this may be the most relevant one... I have 14 nodes running proxmox 6.0:

6 of them are standalone - all stable.

4 of them are in the first cluster - only 3 of them have gone offline

4 of them are in the second cluster - only 3 of them have gone offline

so in total, only 6 of them have gone offline - 3 in each cluster - these 6 systems that have gone offline are all 1.5TB RAM systems, all the others have less than 512GB RAM. So it may have something to do with the high RAM content... However, these 6 systems are also utilized the heaviest so it may also be the high utilization.

aaron · Oct 1, 2019

Do you use the X.Y.241 network for other things than corosync?
If not, is the corosync the only network on the physical port?

With version 3 of corosync in PVE6 the traffic changed from multicast to unicast. Which means that there will be a bit more corosync packages on the network.
The symptoms you see could very well be related to a clogged network.

efinley · Oct 2, 2019

aaron said:
Do you use the X.Y.241 network for other things than corosync?
If not, is the corosync the only network on the physical port?

With version 3 of corosync in PVE6 the traffic changed from multicast to unicast. Which means that there will be a bit more corosync packages on the network.
The symptoms you see could very well be related to a clogged network.

Yes, all traffic is on X.Y.241 (10GE, but only about 15% utilization max). I will try putting corosync on it's own network, which will hopefull fix it - but it seems like a pretty serious bug in corosync if missing a packet can make a server become unresponsive. It seems like missing a packet or two would make it fall out of quorum, but then recover when packets started flowing again.

j2mc · Oct 2, 2019

Are you using ZFS as your root storage? While your logs are different, your symptoms are similar to what I was experiencing, also it seems your logs just stop writing when it becomes mostly unresponsive. If so check out this thread:
https://forum.proxmox.com/threads/proxmox-v6-servers-freeze-zvol-blocked-for-more-than-120s.57765/ (There is a workaround on page 2)

efinley · Oct 4, 2019

I am using ZFS as my storage... However, the thread you referenced doesn't seem to be related. When my nodes go offline, the VMs and containers keep running just fine. My heaviest node has 83 VMs/containers that are very active on the filesystem and network. They all keep running... for days if I let them. Also, I don't ever see any kernel messages about zfs/zvol or other zfs processes hanging.

efinley · Nov 1, 2019

I'm reviving this thread. I've moved the clustering network to it's own physical NIC on each node going across an isolated switch. This is did not fix the problem. I currently have 1 node that is in the 'grey' state - VMs continuing to run and function normally, but I can't SSH into or out of it and it shows up with a grey '?' on the GUI.

I do have remote access to the console that I could share with someone if anyone wants to take a look.

efinley · Nov 1, 2019

I just figured out from this post https://forum.proxmox.com/threads/pve-5-4-11-corosync-3-x-major-issues.56124/post-262788 that I can bring my 'grey' node back online by restarting corosync and pve-cluster on all nodes in the cluster. After doing this, I can now SSH into and out of the once 'grey' node.

This is the biggest clue yet.

pvesr Call Trace: (servers going offline)

efinley

Active Member

Attachments

aaron

Proxmox Staff Member

efinley

Active Member

Attachments

efinley

Active Member

efinley

Active Member

efinley

Active Member

aaron

Proxmox Staff Member

efinley

Active Member

efinley

Active Member

aaron

Proxmox Staff Member

efinley

Active Member

j2mc

Member

efinley

Active Member

efinley

Active Member

efinley

Active Member

We value your privacy