Nodes Reboot After Upgrade to 7.1

Hi,

We are operating a PVE cluster consisting of 7 nodes. Last week the cluster was upgraded from 6.4 to 7.1. Since the upgrade, we have been experiencing seemingly random reboots on the nodes. Could you please help us solve this problem?
 
Please provide the syslog of such a reboot for one node.
If possible, also provide the logs from a different node for the time frame in which the first node rebooted.
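For example, something along these lines could be used to export the relevant time window from the journal (a sketch; adjust the date/time to the actual reboot):

Code:
# run on the rebooted node and on one other node, with the window around the reboot
journalctl --since "2022-01-27 08:00" --until "2022-01-27 09:00" > syslog-$(hostname).txt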
 
Hi,
The syslogs for the rebooted node and for one other node (covering the same time period) are attached as a zip file. The reboot happened on Thu Jan 27 at 08:38.
 

Attachments

  • logs.zip (327.6 KB)
Thank you for the syslogs.

Could you also provide the output of pveversion -v?
The following looks like a C++ exception, for whatever reason. Do you have a custom QEMU version installed?
Code:
Jan 27 03:17:31 cluster-s7 QEMU[1287121]: terminate called after throwing an instance of 'std::system_error'
Jan 27 03:17:31 cluster-s7 QEMU[1287121]:   what():  Resource deadlock avoided

How is the network load and latency? How is the CPU load and IO Wait?
It seems your network is kind of unstable and you're using HA.

Please provide your Corosync config (/etc/pve/corosync.conf) and your network config (/etc/network/interfaces).
A stable network with low latency is a requirement for HA [0].


[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_requirements_3
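As a rough first check, something like the following could show the knet link status and the latency towards another node (a sketch; the peer address is a placeholder):

Code:
# knet link status as seen from the local node
corosync-cfgtool -s
# rough latency check towards another cluster node (placeholder address)
ping -c 100 -i 0.2 10.0.0.12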
 
Dear Mira,

There are two log files in the zip file. cluster-s2 is the one that rebooted at around 08:38, so would it be possible to focus on that log file? For the requested info, please see below for node 2, the one that rebooted:

pveversion:

Code:
cluster-s2:~# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.4: 6.4-12
pve-kernel-5.0: 6.0-11
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.162-1-pve: 5.4.162-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.4-1
proxmox-backup-file-restore: 2.1.4-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
pve-zsync: 2.2.1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1


interfaces:

Code:
auto lo
iface lo inet loopback

iface enp1s0f0 inet manual

iface enp1s0f1 inet manual

iface enp3s0f0u14u2 inet manual

iface enp3s0f0u14u2c2 inet manual

auto vmbr0
iface vmbr0 inet static
        address ------------/--
        gateway ---------------
        bridge-ports enp1s0f0
        bridge-stp off
        bridge-fd 0

auto vmbr1
iface vmbr1 inet static
        address -------/-
        bridge-ports enp1s0f1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094


Corosync conf:

Code:
cluster-s2:# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: NAME-s1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.11
    ring1_addr: GLOBAL_IP_NODE_1
    ring2_addr: GLOBAL_IP_NODE_1
  }
  node {
    name: NAME-s2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.12
    ring1_addr: GLOBAL_IP_NODE_2
    ring2_addr: GLOBAL_IP_NODE_2
  }
  node {
    name: NAME-s3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.13
    ring1_addr: GLOBAL_IP_NODE_3
    ring2_addr: GLOBAL_IP_NODE_3
  }
  node {
    name: NAME-s4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.0.0.14
    ring1_addr: GLOBAL_IP_NODE_4
    ring2_addr: GLOBAL_IP_NODE_4
  }
  node {
    name: NAME-s5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.0.0.15
    ring1_addr: GLOBAL_IP_NODE_5
    ring2_addr: GLOBAL_IP_NODE_5
  }
  node {
    name: NAME-s6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.0.0.16
    ring1_addr: GLOBAL_IP_NODE_6
    ring2_addr: GLOBAL_IP_NODE_6
  }
  node {
    name: NAME-s7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.0.0.17
    ring1_addr: GLOBAL_IP_NODE_7
    ring2_addr: GLOBAL_IP_NODE_7
  }
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: NAME
  config_version: 13
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  interface {
    linknumber: 2
    knet_transport: sctp
  }
  ip_version: ipv4-6
  link_mode: active
  ttl: 100
  secauth: on
  version: 2
}

At the time of the reboot, network load, CPU load, and iowait were all okay.

Please take into account that we have another cluster in another DC (on a different version); it has more timeouts, but it does not restart at all.

Moreover, this cluster only had ring0_addr in the past; after the restarts we added ring1_addr and ring2_addr, just in case the issue is related to the LAN.
 
Hi,
Is there a known bug with the latest QEMU/KVM?
Or maybe it only happens after an upgrade from
  • QEMU 5.2 -> QEMU 6.1

Any ideas?
Reboots are less frequent today, but QEMU crashes all day long.
 
Does QEMU also crash on node 2?
If not, please provide the pveversion -v output also for node 7.

The reboots are most likely a combination of unstable network and HA.
Try disabling HA by removing all HA resources (Datacenter -> HA) and then restarting pve-ha-lrm and pve-ha-crm (systemctl restart pve-ha-crm.service pve-ha-lrm.service) or rebooting all nodes.
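On the CLI, that could look roughly like this (a sketch; `vm:100` is a placeholder resource ID):

Code:
# list currently configured HA resources
ha-manager status
# remove a resource from HA management (repeat for each resource)
ha-manager remove vm:100
# afterwards, on every node:
systemctl restart pve-ha-crm.service pve-ha-lrm.service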
 
Hi,

I don't think your theory is correct: if it is about an unstable network, why was it working fine with Proxmox 6.4?

I am also attaching the QEMU logs of node1 and node7 (all nodes have similar logs).

P.S.: the pveversion output should be identical on all nodes in our cluster.


pveversion:
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.4: 6.4-12
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.162-1-pve: 5.4.162-2
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.4-1
proxmox-backup-file-restore: 2.1.4-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
pve-zsync: 2.2.1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
 

Attachments

  • node1_qemu.txt (32.8 KB)
  • node7_qemu.txt (72.1 KB)
What's the output of kvm --version of those nodes?
 
They are all:

Code:
QEMU emulator version 6.1.0 (pve-qemu-kvm_6.1.0)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers


Moreover:

Code:
apt-cache policy pve-qemu-kvm
pve-qemu-kvm:
  Installed: 6.1.0-3
  Candidate: 6.1.0-3
 
Could you provide the VM config (qm config <VMID>) of one of the VMs that crash?

And also the storage config (/etc/pve/storage.cfg).
 
VM config:

Code:
agent: 1
balloon: 2048
boot: cdn
bootdisk: scsi0
cores: 4
cpu: host
memory: 6144
name: apache-pulsar-zk-s1-monitor
net0: virtio=06:E5:33:B0:D0:68,bridge=vmbr1,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: hdd_vm:vm-192-disk-0,discard=on,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=e8573514-0e81-48c4-9a02-f74d04a1e2e2
sockets: 1


/etc/pve/storage.cfg:

Code:
dir: local
        path /var/lib/vz
        content rootdir,images,vztmpl,iso,snippets
        prune-backups keep-all=1
        shared 0

rbd: hdd_vm
        content images,rootdir
        krbd 0
        pool hdd_vm

rbd: ssd_vm
        content images,rootdir
        krbd 0
        pool ssd_vm

cephfs: cloudfr3
        path /mnt/pve/cloudfr3
        content vztmpl,iso,backup

rbd: hdd_vm_erosure33
        content rootdir,images
        krbd 0
        pool hdd_vm_erosure33
 
A colleague hinted at an issue with librbd. Could you try enabling `KRBD` on the affected storages to see if it improves the situation?
This will use the RBD implementation of the kernel rather than the librbd one. Sometimes this even improves performance, but comes at the cost of not supporting all the latest features.
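For example, the option could be toggled per RBD storage roughly like this (a sketch based on the storage names in your storage.cfg; running VMs typically only pick up the change after a stop/start or migration):

Code:
pvesm set hdd_vm --krbd 1
pvesm set ssd_vm --krbd 1
pvesm set hdd_vm_erosure33 --krbd 1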
 
Hi Mira,

Thanks for the info, it improved the situation significantly, but there are still some issues.
Previously, nodes were rebooting, or VMs were stopping without a reboot, 5-10 times a day.
Now a node reboots once a day or less.
 
Do you still have link flapping in your cluster?
You can check the journal/syslog for Corosync messages that contain link: X is down and link: X is up where X is the link number configured in the config.
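For example, something like this could filter those messages out of the journal (a sketch):

Code:
journalctl -u corosync | grep -E "link: [0-9]+ is (down|up)"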
 
Hi,

According to Zabbix the link and its traffic are never down, but according to Corosync it sometimes flaps.
However, it has 3 links and never goes below 2 active links, so that should not be a problem, should it?

Code:
[KNET ] link: host: 5 link: 0 is down
[KNET ] host: host: 5 has 2 active links
[KNET ] rx: host: 5 link: 0 is up
[KNET ] host: host: 5 has 3 active links
[KNET ] link: host: 6 link: 1 is down
[KNET ] host: host: 6 has 2 active links
[KNET ] rx: host: 6 link: 1 is up
[KNET ] host: host: 6 has 3 active links


For example, today we started a stopped VM and the hypervisor rebooted immediately, and we don't see any relevant logs in the syslog:
Code:
Jan 31 12:24:19 CLUSTER-s6 pvestatd[2610]: status update time (7.095 seconds)
Jan 31 12:25:49 CLUSTER-s6 pvestatd[2610]: status update time (7.106 seconds)
Jan 31 12:26:29 CLUSTER-s6 pvestatd[2610]: status update time (7.079 seconds)
Jan 31 12:27:49 CLUSTER-s6 pvestatd[2610]: status update time (7.085 seconds)
Jan 31 12:29:19 CLUSTER-s6 pvestatd[2610]: status update time (6.999 seconds)
Jan 31 12:29:29 CLUSTER-s6 pvestatd[2610]: status update time (7.075 seconds)
Jan 31 12:29:39 CLUSTER-s6 pvestatd[2610]: status update time (6.953 seconds)
Jan 31 12:30:39 CLUSTER-s6 pvestatd[2610]: status update time (7.075 seconds)
Jan 31 12:32:09 CLUSTER-s6 pvestatd[2610]: status update time (7.050 seconds)
Jan 31 12:33:29 CLUSTER-s6 pvestatd[2610]: status update time (7.124 seconds)
Jan 31 12:36:19 CLUSTER-s6 pvestatd[2610]: status update time (7.015 seconds)
Jan 31 12:36:42 CLUSTER-s6 pvestatd[2610]: status update time (10.038 seconds)
Jan 31 12:37:00 CLUSTER-s6 pmxcfs[2079]: [status] notice: received log
Jan 31 12:37:18 CLUSTER-s6 corosync[2238]:   [KNET  ] link: host: 7 link: 0 is down
Jan 31 12:37:18 CLUSTER-s6 corosync[2238]:   [KNET  ] host: host: 7 has 2 active links
Jan 31 12:37:21 CLUSTER-s6 corosync[2238]:   [KNET  ] rx: host: 7 link: 0 is up
Jan 31 12:37:21 CLUSTER-s6 corosync[2238]:   [KNET  ] host: host: 7 has 3 active links
Jan 31 12:37:29 CLUSTER-s6 pvestatd[2610]: status update time (7.122 seconds)
Jan 31 12:37:42 CLUSTER-s6 pvestatd[2610]: status update time (10.093 seconds)
Jan 31 12:38:09 CLUSTER-s6 pvestatd[2610]: status update time (7.125 seconds)
Jan 31 12:38:32 CLUSTER-s6 pveproxy[4171942]: worker exit
Jan 31 12:38:32 CLUSTER-s6 pveproxy[2948]: worker 4171942 finished
Jan 31 12:38:32 CLUSTER-s6 pveproxy[2948]: starting 1 worker(s)
Jan 31 12:38:32 CLUSTER-s6 pveproxy[2948]: worker 1242409 started
Jan 31 12:38:42 CLUSTER-s6 pvestatd[2610]: status update time (9.999 seconds)
Jan 31 12:39:11 CLUSTER-s6 pmxcfs[2079]: [status] notice: received log
Jan 31 12:39:51 CLUSTER-s6 pmxcfs[2079]: [status] notice: received log
Jan 31 12:39:52 CLUSTER-s6 pvestatd[2610]: status update time (10.049 seconds)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Jan 31 12:43:52 CLUSTER-s6 sys>
Jan 31 12:43:52 CLUSTER-s6 lvm[733]:   1 logical volume(s) in volume group "ceph-fc60368b-c992-48fe-ab59-6e47c7765e1f" monitored
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] Linux version 5.13.19-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.13.19->
Jan 31 12:43:52 CLUSTER-s6 lvm[733]:   1 logical volume(s) in volume group "ceph-eab49c3a-ec47-4f9d-90e2-506b2d88d66b" monitored
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.13.19-3-pve root=UUID=ea2aa03d-c302-46fb-8a13-5900edabca30 ro vga=normal nomodeset modprobe.blacklist=b>
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] KERNEL supported cpus:
Jan 31 12:43:52 CLUSTER-s6 lvm[733]:   1 logical volume(s) in volume group "ceph-2d7c5f8c-b195-4824-9b31-9df170f6ab96" monitored
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000]   Intel GenuineIntel
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000]   AMD AuthenticAMD
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000]   Hygon HygonGenuine
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000]   Centaur CentaurHauls
Jan 31 12:43:52 CLUSTER-s6 lvm[733]:   1 logical volume(s) in volume group "ceph-a23f6000-61a3-450f-991d-4c2740b30e2c" monitored
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000]   zhaoxin   Shanghai
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] BIOS-provided physical RAM map:
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000075daffff] usable
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] BIOS-e820: [mem 0x0000000075db0000-0x0000000075ffffff] reserved
Jan 31 12:43:52 CLUSTER-s6 kernel: [    0.000000] BIOS-e820: [mem 0x0000000076000000-0x00000000d8b2bfff] usable
 
Hi @mira,

Imagine the network is not stable: does Proxmox reboot itself without writing any log or other indication?
I'm very interested in getting an answer to this question as well. Today I noticed an entire cluster reboot while trying to fix a network issue on one of the nodes of my 6-node cluster, and I couldn't find any usable info in the logs pointing to Proxmox itself rebooting all the nodes :(
 
Please provide the output of pveversion -v.
Do you have HA enabled?

We do write to the log before fencing, but it might not reach the disk anymore in time.
If you want as much information as possible, configure a remote syslog via UDP.
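A minimal sketch, assuming rsyslog is in use and `192.0.2.10` is a placeholder for the remote log host:

Code:
# /etc/rsyslog.d/90-remote.conf -- forward all messages via UDP (single @)
*.* @192.0.2.10:514
# then apply the change with: systemctl restart rsyslog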
 
