pvesr Call Trace: (servers going offline)

efinley

I have two clusters running the following version:

Code:
root@vsys07:/var/log# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-10-pve: 4.15.18-32
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

Both clusters were updated from 5.X.

Randomly, a node will become inaccessible. It goes grey in the GUI, all VMs get a question mark next to them, and I'm also unable to SSH into the host. The SSH session does NOT time out or get rejected; it just hangs forever. The VMs on the inaccessible host continue to run. I can log into the console, but almost any pve- or ssh-related command hangs and will not exit with Ctrl-C.

A 'reboot' always hangs on 'A stop job for PVE guest VMs' (<-- wording from memory)
Cold booting the server is the only thing that brings it back.
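
A forced reboot over the console via magic SysRq may be quicker than a physical cold boot when a node gets this wedged; a rough sketch, assuming sysrq is enabled (or can be enabled) on the node:

Code:
# allow all magic SysRq functions for the running kernel
echo 1 > /proc/sys/kernel/sysrq
# s = sync, u = remount read-only, b = reboot immediately (no clean shutdown)
echo s > /proc/sysrq-trigger
echo u > /proc/sysrq-trigger
echo b > /proc/sysrq-trigger

If the box is truly hung the sync and remount may never complete, so this is only a small step up from pulling the power.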

In /var/log/messages, at the time the server becomes inaccessible, I see pvesr Call Trace messages (see attached file).
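
The relevant traces can usually be pulled out of the logs with something along these lines (standard log locations on a stock PVE/Debian install):

Code:
# hung-task reports and the surrounding call traces from the kernel log
grep -B 2 -A 25 -E 'Call Trace|blocked for more than' /var/log/messages /var/log/kern.log
# the same from the journal, kernel messages only
journalctl -k | grep -B 2 -A 25 'Call Trace'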

Any help or insight into this problem would be appreciated; I'm getting tired of driving into the office in the middle of the night to reboot servers.
 


What do you have in /var/log/kern* and /var/log/syslog* in the affected time frame before the server becomes unresponsive?
 

Another data point:

I was able to log into the console and run 'systemctl restart corosync'. Corosync seemed to come back up and regain quorum - but I still can't SSH into the box, and that node in the GUI then changes from a grey question mark to a red X, just as if it were powered down.
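
For anyone following along, membership and quorum on a node can be checked with something like:

Code:
# PVE's view of cluster membership and quorum
pvecm status
# corosync's votequorum state
corosync-quorumtool -s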
 

Another data point:

If I run 'systemctl restart sshd', it restarts, but the old hung sessions are still hung and I still can't SSH into the server.

If there are any other commands you want me to run on the console while it's in this hung state, let me know.
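
One thing that tends to be informative in this state is a list of processes stuck in uninterruptible sleep (D state), since that usually points at whichever I/O path is blocked; for example:

Code:
# tasks stuck in uninterruptible sleep (D state) and what they are waiting on
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
# kernel stack of one of the stuck processes (1234 is a placeholder PID)
cat /proc/1234/stack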
 

Another data point:

If I'm logged into the console I can NOT SSH out of the server, but all the VMs and containers continue to run without issue, reading/writing to disk and reading/writing to the network.
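
The grey '?' typically just means the status daemons have stopped reporting, so it can help to check whether pvestatd and the cluster filesystem (pmxcfs) still respond, e.g.:

Code:
# are the status/cluster daemons still alive?
systemctl status pvestatd pvedaemon pveproxy pve-cluster corosync
# does pmxcfs still answer? wrap in timeout so the shell doesn't hang too
timeout 5 ls /etc/pve
timeout 5 cat /etc/pve/.members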
 
Do you have HA enabled?
Can you post the content of /etc/corosync/corosync.conf
 

Code:
root@vsys07:/etc/pve# cat corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vsys06
    nodeid: 1
    quorum_votes: 1
    ring0_addr: X.Y.241.2
  }
  node {
    name: vsys07
    nodeid: 2
    quorum_votes: 1
    ring0_addr: X.Y.241.3
  }
  node {
    name: vsys09
    nodeid: 3
    quorum_votes: 1
    ring0_addr: X.Y.241.4
  }
  node {
    name: vsys11
    nodeid: 4
    quorum_votes: 1
    ring0_addr: X.Y.241.5
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: LindonHQ
  config_version: 4
  interface {
    bindnetaddr: X.Y.241.2
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
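
With that config, the actual state of the corosync links and the address corosync is bound to can be checked with something like:

Code:
# per-link status as corosync sees it
corosync-cfgtool -s
# which address/port corosync is actually listening on (default is UDP 5405)
ss -ulpn | grep corosync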
 

HA is not enabled.

Another data point, and this may be the most relevant one... I have 14 nodes running Proxmox 6.0:

6 of them are standalone - all stable.

4 of them are in the first cluster - only 3 of them have gone offline

4 of them are in the second cluster - only 3 of them have gone offline

So in total, only 6 of them have gone offline - 3 in each cluster - and these 6 systems are all 1.5TB RAM machines, while all the others have less than 512GB of RAM. So it may have something to do with the large amount of RAM... However, these 6 systems are also the most heavily utilized, so it may just as well be the high utilization.
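
If the pattern really does track the 1.5TB boxes, it might be worth watching memory pressure on them before a hang; a rough check:

Code:
# overall memory/swap usage in GiB
free -g
# reclaim and dirty-page pressure indicators
grep -E 'MemFree|MemAvailable|Dirty|Writeback' /proc/meminfo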
 
Do you use the X.Y.241 network for other things than corosync?
If not, is the corosync the only network on the physical port?

With version 3 of corosync in PVE 6 the traffic changed from multicast to unicast, which means there will be a few more corosync packets on the network.
The symptoms you see could very well be related to a clogged network.
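
Corosync is much more sensitive to latency and packet loss than to raw bandwidth, so it can be worth sanity-checking the link between two nodes and looking for link or token complaints in the logs, for example:

Code:
# 200 probes at 5/second between two cluster nodes; watch for loss and latency spikes
ping -c 200 -i 0.2 X.Y.241.3
# any corosync link or token trouble around the time of a hang?
grep -iE 'corosync.*(link|token)' /var/log/syslog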
 

Yes, all traffic is on X.Y.241 (10GbE, but only about 15% utilization max). I will try putting corosync on its own network, which will hopefully fix it - but it seems like a pretty serious bug in corosync if missing a packet can make a server become unresponsive. It seems like missing a packet or two should make it fall out of quorum, and then recover once packets start flowing again.
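
On PVE 6 / corosync 3 a second link can also be added alongside the existing one instead of moving everything; a sketch of what one nodelist entry might look like (the 10.10.10.x subnet is made up for illustration, the other nodes would get matching ring1_addr entries, and config_version in the totem section has to be bumped):

Code:
node {
  name: vsys06
  nodeid: 1
  quorum_votes: 1
  ring0_addr: X.Y.241.2
  # hypothetical dedicated corosync subnet
  ring1_addr: 10.10.10.2
}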
 
I am using ZFS as my storage... However, the thread you referenced doesn't seem to be related. When my nodes go offline, the VMs and containers keep running just fine. My heaviest node has 83 VMs/containers that are very active on the filesystem and network. They all keep running... for days if I let them. Also, I don't ever see any kernel messages about zfs/zvol or other zfs processes hanging.
 
I'm reviving this thread. I've moved the cluster network to its own physical NIC on each node, going across an isolated switch. This did not fix the problem. I currently have 1 node in the 'grey' state - the VMs continue to run and function normally, but I can't SSH into or out of it and it shows up with a grey '?' in the GUI.

I do have remote access to the console that I could share with someone if anyone wants to take a look.
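
While it's sitting in that state, a dump of all blocked tasks from the console would probably show exactly where things are stuck; assuming magic SysRq is available, something like:

Code:
# dump all tasks in uninterruptible (blocked) state to the kernel ring buffer
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
# the resulting stack traces show up here
dmesg | tail -n 200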
 
