Nodes Reboot After Upgrade to 7.1

Hi,

We are operating a PVE cluster consisting of 7 nodes. Last week the cluster was upgraded from 6.4 to 7.1. Since the upgrade, we have been experiencing seemingly random reboots on the nodes. Could you please help us solve this problem?
 
Please provide the syslog of such a reboot for one node.
If possible, also provide the logs from a different node for the time frame in which the first node rebooted.
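For example, something along these lines could be used to export the relevant time window from the journal (a sketch; adjust the date/time to the actual reboot):

Code:
# run on the rebooted node and on one other node, with the window around the reboot
journalctl --since "2022-01-27 08:00" --until "2022-01-27 09:00" > syslog-$(hostname).txt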
 
Hi,
The syslogs for the rebooted node and for one other node (covering the same time period) are attached as a zip file. The reboot happened on Thu Jan 27 at 08:38.
 

Attachments

  • logs.zip (327.6 KB)
Thank you for the syslogs.

Could you also provide the output of pveversion -v?
The following looks like a C++ exception, for whatever reason. Do you have a custom QEMU version installed?
Code:
Jan 27 03:17:31 cluster-s7 QEMU[1287121]: terminate called after throwing an instance of 'std::system_error'
Jan 27 03:17:31 cluster-s7 QEMU[1287121]:   what():  Resource deadlock avoided

How is the network load and latency? How is the CPU load and IO Wait?
It seems your network is kind of unstable and you're using HA.

Please provide your Corosync config (/etc/pve/corosync.conf) and your network config (/etc/network/interfaces).
A stable network with low latency is a requirement for HA [0].


[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_requirements_3
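As a rough first check, something like the following could show the knet link status and the latency towards another node (a sketch; the peer address is a placeholder):

Code:
# knet link status as seen from the local node
corosync-cfgtool -s
# rough latency check towards another cluster node (placeholder address)
ping -c 100 -i 0.2 10.0.0.12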
 
Dear Mira,

There are two log files in the zip file. cluster-s2 is the one that rebooted at around 08:38, so would it be possible to focus on that log file? For the requested info, please see below for node 2, the one that rebooted:

pveversion:

Code:
cluster-s2:~# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.4: 6.4-12
pve-kernel-5.0: 6.0-11
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.162-1-pve: 5.4.162-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.4-1
proxmox-backup-file-restore: 2.1.4-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
pve-zsync: 2.2.1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1


interfaces:

Code:
auto lo
iface lo inet loopback

iface enp1s0f0 inet manual

iface enp1s0f1 inet manual

iface enp3s0f0u14u2 inet manual

iface enp3s0f0u14u2c2 inet manual

auto vmbr0
iface vmbr0 inet static
        address ------------/--
        gateway ---------------
        bridge-ports enp1s0f0
        bridge-stp off
        bridge-fd 0

auto vmbr1
iface vmbr1 inet static
        address -------/-
        bridge-ports enp1s0f1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094


Corosync conf:

Code:
cluster-s2:# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: NAME-s1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.11
    ring1_addr: GLOBAL_IP_NODE_1
    ring2_addr: GLOBAL_IP_NODE_1
  }
  node {
    name: NAME-s2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.12
    ring1_addr: GLOBAL_IP_NODE_2
    ring2_addr: GLOBAL_IP_NODE_2
  }
  node {
    name: NAME-s3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.13
    ring1_addr: GLOBAL_IP_NODE_3
    ring2_addr: GLOBAL_IP_NODE_3
  }
  node {
    name: NAME-s4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.0.0.14
    ring1_addr: GLOBAL_IP_NODE_4
    ring2_addr: GLOBAL_IP_NODE_4
  }
  node {
    name: NAME-s5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.0.0.15
    ring1_addr: GLOBAL_IP_NODE_5
    ring2_addr: GLOBAL_IP_NODE_5
  }
  node {
    name: NAME-s6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.0.0.16
    ring1_addr: GLOBAL_IP_NODE_6
    ring2_addr: GLOBAL_IP_NODE_6
  }
  node {
    name: NAME-s7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.0.0.17
    ring1_addr: GLOBAL_IP_NODE_7
    ring2_addr: GLOBAL_IP_NODE_7
  }
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: NAME
  config_version: 13
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  interface {
    linknumber: 2
    knet_transport: sctp
  }
  ip_version: ipv4-6
  link_mode: active
  ttl: 100
  secauth: on
  version: 2
}

At the time of the reboot, network load, CPU load, and iowait were all okay.

Please take into account that we have another cluster in another DC (on a different version); it has more timeouts, but it does not restart at all.

Moreover, this cluster only had ring0_addr in the past; after the restarts we added ring1_addr and ring2_addr, just in case the issue is related to the LAN.
 
Hi,
Is there a known bug with the latest QEMU/KVM?
Or maybe it only happens after an upgrade from
  • QEMU 5.2 -> QEMU 6.1

Any ideas?
Reboots are less frequent today, but QEMU crashes all day long.
 
Does QEMU also crash on node 2?
If not, please provide the pveversion -v output also for node 7.

The reboots are most likely a combination of unstable network and HA.
Try disabling HA by removing all HA resources (Datacenter -> HA) and then restarting pve-ha-lrm and pve-ha-crm (systemctl restart pve-ha-crm.service pve-ha-lrm.service) or rebooting all nodes.
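On the CLI, that could look roughly like this (a sketch; `vm:100` is a placeholder resource ID):

Code:
# list currently configured HA resources
ha-manager status
# remove a resource from HA management (repeat for each resource)
ha-manager remove vm:100
# afterwards, on every node:
systemctl restart pve-ha-crm.service pve-ha-lrm.service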
 
Hi,

I don't think your theory is correct: if it is about an unstable network, why was it working fine with Proxmox 6.4?

I am also attaching the QEMU logs of node1 and node7 (all nodes have similar logs).

P.S.: the pveversion output should be identical on all nodes in our cluster.


pveversion:
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.4: 6.4-12
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.162-1-pve: 5.4.162-2
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.4-1
proxmox-backup-file-restore: 2.1.4-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
pve-zsync: 2.2.1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
 

Attachments

  • node1_qemu.txt (32.8 KB)
  • node7_qemu.txt (72.1 KB)
What's the output of kvm --version of those nodes?
 
They are all:

Code:
QEMU emulator version 6.1.0 (pve-qemu-kvm_6.1.0)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers


Moreover:

Code:
apt-cache policy pve-qemu-kvm
pve-qemu-kvm:
  Installed: 6.1.0-3
  Candidate: 6.1.0-3
 
Could you provide the VM config (qm config <VMID>) of one of the VMs that crash?

And also the storage config (/etc/pve/storage.cfg).
 
VM config:

Code:
agent: 1
balloon: 2048
boot: cdn
bootdisk: scsi0
cores: 4
cpu: host
memory: 6144
name: apache-pulsar-zk-s1-monitor
net0: virtio=06:E5:33:B0:D0:68,bridge=vmbr1,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: hdd_vm:vm-192-disk-0,discard=on,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=e8573514-0e81-48c4-9a02-f74d04a1e2e2
sockets: 1


/etc/pve/storage.cfg:

Code:
dir: local
        path /var/lib/vz
        content rootdir,images,vztmpl,iso,snippets
        prune-backups keep-all=1
        shared 0

rbd: hdd_vm
        content images,rootdir
        krbd 0
        pool hdd_vm

rbd: ssd_vm
        content images,rootdir
        krbd 0
        pool ssd_vm

cephfs: cloudfr3
        path /mnt/pve/cloudfr3
        content vztmpl,iso,backup

rbd: hdd_vm_erosure33
        content rootdir,images
        krbd 0
        pool hdd_vm_erosure33
 
A colleague hinted at an issue with librbd. Could you try enabling `KRBD` on the affected storages to see if it improves the situation?
This will use the RBD implementation of the kernel rather than the librbd one. Sometimes this even improves performance, but comes at the cost of not supporting all the latest features.
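For example, the option could be toggled per RBD storage roughly like this (a sketch based on the storage names in your storage.cfg; running VMs typically only pick up the change after a stop/start or migration):

Code:
pvesm set hdd_vm --krbd 1
pvesm set ssd_vm --krbd 1
pvesm set hdd_vm_erosure33 --krbd 1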
 
Hi Mira,

Thanks for the info, it improved the situation significantly, but there are still some issues.
Previously, nodes were rebooting, or VMs were stopping without a reboot, 5-10 times a day.
Now a node reboots once a day or less.
 
Do you still have link flapping in your cluster?
You can check the journal/syslog for Corosync messages that contain link: X is down and link: X is up where X is the link number configured in the config.
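For example, something like this could filter those messages out of the journal (a sketch):

Code:
journalctl -u corosync | grep -E "link: [0-9]+ is (down|up)"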
 
Hi,

According to Zabbix the link and its traffic are never down, but according to Corosync it sometimes flaps.
However, it has 3 links and never goes below 2 active links, so that should not be a problem, should it?

Code:
[KNET ] link: host: 5 link: 0 is down
[KNET ] host: host: 5 has 2 active links
[KNET ] rx: host: 5 link: 0 is up
[KNET ] host: host: 5 has 3 active links
[KNET ] link: host: 6 link: 1 is down
[KNET ] host: host: 6 has 2 active links
[KNET ] rx: host: 6 link: 1 is up
[KNET ] host: host: 6 has 3 active links


For example, today we started a stopped VM and the hypervisor rebooted immediately, and we don't see any relevant logs in the syslog:
Code:
Jan 31 12:24:19 CLUSTER-s6 pvestatd[2610]: status update time (7.095 seconds)
Jan 31 12:25:49 CLUSTER-s6 pvestatd[2610]: status update time (7.106 seconds)
Jan 31 12:26:29 CLUSTER-s6 pvestatd[2610]: status update time (7.079 seconds)
Jan 31 12:27:49 CLUSTER-s6 pvestatd[2610]: status update time (7.085 seconds)
Jan 31 12:29:19 CLUSTER-s6 pvestatd[2610]: status update time (6.999 seconds)
Jan 31 12:29:29 CLUSTER-s6 pvestatd[2610]: status update time (7.075 seconds)
Jan 31 12:29:39 CLUSTER-s6 pvestatd[2610]: status update time (6.953 seconds)
Jan 31 12:30:39 CLUSTER-s6 pvestatd[2610]: status update time (7.075 seconds)
Jan 31 12:32:09 CLUSTER-s6 pvestatd[2610]: status update time (7.050 seconds)
Jan 31 12:33:29 CLUSTER-s6 pvestatd[2610]: status update time (7.124 seconds)
Jan 31 12:36:19 CLUSTER-s6 pvestatd[2610]: status update time (7.015 seconds)
Jan 31 12:36:42 CLUSTER-s6 pvestatd[2610]: status update time (10.038 seconds)
Jan 31 12:37:00 CLUSTER-s6 pmxcfs[2079]: [status] notice: received log
Jan 31 12:37:18 CLUSTER-s6 corosync[2238]:   [KNET  ] link: host: 7 link: 0 is down
Jan 31 12:37:18 CLUSTER-s6 corosync[2238]:   [KNET  ] host: host: 7 has 2 active links
Jan 31 12:37:21 CLUSTER-s6 corosync[2238]:   [KNET  ] rx: host: 7 link: 0 is up
Jan 31 12:37:21 CLUSTER-s6 corosync[2238]:   [KNET  ] host: host: 7 has 3 active links
Jan 31 12:37:29 CLUSTER-s6 pvestatd[2610]: status update time (7.122 seconds)
Jan 31 12:37:42 CLUSTER-s6 pvestatd[2610]: status update time (10.093 seconds)
Jan 31 12:38:09 CLUSTER-s6 pvestatd[2610]: status update time (7.125 seconds)
Jan 31 12:38:32 CLUSTER-s6 pveproxy[4171942]: worker exit
Jan 31 12:38:32 CLUSTER-s6 pveproxy[2948]: worker 4171942 finished
Jan 31 12:38:32 CLUSTER-s6 pveproxy[2948]: starting 1 worker(s)
Jan 31 12:38:32 CLUSTER-s6 pveproxy[2948]: worker 1242409 started
Jan 31 12:38:42 CLUSTER-s6 pvestatd[2610]: status update time (9.999 seconds)
Jan 31 12:39:11 CLUSTER-s6 pmxcfs[2079]: [status] notice: received log
Jan 31 12:39:51 CLUSTER-s6 pmxcfs[2079]: [status] notice: received log
Jan 31 12:39:52 CLUSTER-s6 pvestatd[2610]: status update time (10.049 seconds)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Jan 31 12:43:52 CLUSTER-s6 sys>
Jan 31 12:43:52 CLUSTER-s6 lvm[733]:   1 logical volume(s) in volume group "ceph-fc60368b-c992-48fe-ab59-6e47c7765e1f" monitored
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] Linux version 5.13.19-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.13.19->
Jan 31 12:43:52 CLUSTER-s6 lvm[733]:   1 logical volume(s) in volume group "ceph-eab49c3a-ec47-4f9d-90e2-506b2d88d66b" monitored
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.13.19-3-pve root=UUID=ea2aa03d-c302-46fb-8a13-5900edabca30 ro vga=normal nomodeset modprobe.blacklist=b>
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] KERNEL supported cpus:
Jan 31 12:43:52 CLUSTER-s6 lvm[733]:   1 logical volume(s) in volume group "ceph-2d7c5f8c-b195-4824-9b31-9df170f6ab96" monitored
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000]   Intel GenuineIntel
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000]   AMD AuthenticAMD
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000]   Hygon HygonGenuine
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000]   Centaur CentaurHauls
Jan 31 12:43:52 CLUSTER-s6 lvm[733]:   1 logical volume(s) in volume group "ceph-a23f6000-61a3-450f-991d-4c2740b30e2c" monitored
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000]   zhaoxin   Shanghai
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] BIOS-provided physical RAM map:
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000075daffff] usable
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] BIOS-e820: [mem 0x0000000075db0000-0x0000000075ffffff] reserved
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] BIOS-e820: [mem 0x0000000076000000-0x00000000d8b2bfff] usable
 
Hi @mira,

Imagine the network is not stable: does Proxmox reboot itself without writing any log or other indication?
I'm very interested in getting an answer to this question as well. Today I noticed an entire cluster reboot while trying to fix a network issue on one of the nodes of my 6-node cluster, and I couldn't find any usable info in the logs pointing to Proxmox itself rebooting all the nodes :(
 
Please provide the output of pveversion -v.
Do you have HA enabled?

We do write to the log before fencing, but it might not reach the disk anymore in time.
If you want as much information as possible, configure a remote syslog via UDP.
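A minimal sketch, assuming rsyslog is in use and `192.0.2.10` is a placeholder for the remote log host:

Code:
# /etc/rsyslog.d/90-remote.conf -- forward all messages via UDP (single @)
*.* @192.0.2.10:514
# then apply the change with: systemctl restart rsyslog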
 
