Hello, we updated only one of our clusters to the new version last weekend. Yesterday the cluster rebooted two times, and three more times today. Every time it comes back, one or more OSDs are down, different ones each time; if the affected OSD is destroyed and recreated (roughly the commands shown right after the log below), there seems to be no further problem with that OSD so far. This is what is logged every time the cluster reboots:
Jul 22 18:24:58 int101 pmxcfs[3477]: [status] notice: received log
Jul 22 18:24:58 int101 pmxcfs[3477]: [status] notice: received log
Jul 22 18:24:58 int101 pmxcfs[3477]: [status] notice: received log
Jul 22 18:24:58 int101 pmxcfs[3477]: [status] notice: received log
Jul 22 18:25:00 int101 systemd[1]: Starting Proxmox VE replication runner...
Jul 22 18:25:01 int101 systemd[1]: pvesr.service: Succeeded.
Jul 22 18:25:01 int101 systemd[1]: Finished Proxmox VE replication runner.
Jul 22 18:25:33 int101 pveproxy[1632072]: worker exit
Jul 22 18:25:33 int101 pveproxy[4182]: worker 1632072 finished
Jul 22 18:25:33 int101 pveproxy[4182]: starting 1 worker(s)
Jul 22 18:25:33 int101 pveproxy[4182]: worker 1741226 started
Jul 22 18:26:00 int101 systemd[1]: Starting Proxmox VE replication runner...
Jul 22 18:26:01 int101 systemd[1]: pvesr.service: Succeeded.
Jul 22 18:26:01 int101 systemd[1]: Finished Proxmox VE replication runner.
Jul 22 18:26:51 int101 pmxcfs[3477]: [dcdb] notice: data verification successful
Jul 22 18:27:00 int101 systemd[1]: Starting Proxmox VE replication runner...
Jul 22 18:27:01 int101 systemd[1]: pvesr.service: Succeeded.
Jul 22 18:27:01 int101 systemd[1]: Finished Proxmox VE replication runner.
Jul 22 18:27:08 int101 smartd[2041]: Device: /dev/sdj [SAT], CHECK POWER STATUS spins up disk (0x80 -> 0xff)
Jul 22 18:27:18 int101 corosync[3865]: [KNET ] link: host: 5 link: 0 is down
Jul 22 18:27:18 int101 corosync[3865]: [KNET ] link: host: 4 link: 0 is down
Jul 22 18:27:18 int101 corosync[3865]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Jul 22 18:27:18 int101 corosync[3865]: [KNET ] host: host: 5 has no active links
Jul 22 18:27:18 int101 corosync[3865]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Jul 22 18:27:18 int101 corosync[3865]: [KNET ] host: host: 4 has no active links
Jul 22 18:27:18 int101 kernel: ixgbe 0000:05:00.0 enp5s0f0: NIC Link is Down
Jul 22 18:27:18 int101 kernel: vmbr16: port 1(enp5s0f0.16) entered disabled state
Jul 22 18:27:18 int101 kernel: vmbr17: port 1(enp5s0f0.17) entered disabled state
Jul 22 18:27:18 int101 kernel: vmbr18: port 1(enp5s0f0.18) entered disabled state
Jul 22 18:27:18 int101 kernel: vmbr91: port 1(enp5s0f0.91) entered disabled state
Jul 22 18:27:18 int101 kernel: ixgbe 0000:05:00.1 enp5s0f1: NIC Link is Down
Jul 22 18:27:19 int101 corosync[3865]: [KNET ] link: host: 3 link: 0 is down
Jul 22 18:27:19 int101 corosync[3865]: [KNET ] link: host: 2 link: 0 is down
Jul 22 18:27:19 int101 corosync[3865]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 22 18:27:19 int101 corosync[3865]: [KNET ] host: host: 3 has no active links
Jul 22 18:27:19 int101 corosync[3865]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 22 18:27:19 int101 corosync[3865]: [KNET ] host: host: 2 has no active links
Jul 22 18:27:20 int101 corosync[3865]: [TOTEM ] Token has not been received in 3712 ms
Jul 22 18:27:21 int101 corosync[3865]: [TOTEM ] A processor failed, forming new configuration: token timed out (4950ms), waiting 5940ms for consensus.
Jul 22 18:27:27 int101 corosync[3865]: [QUORUM] Sync members[1]: 1
Jul 22 18:27:27 int101 corosync[3865]: [QUORUM] Sync left[4]: 2 3 4 5
Jul 22 18:27:27 int101 corosync[3865]: [TOTEM ] A new membership (1.1f55) was formed. Members left: 2 3 4 5
Jul 22 18:27:27 int101 corosync[3865]: [TOTEM ] Failed to receive the leave message. failed: 2 3 4 5
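For reference, this is roughly how I destroy and recreate an affected OSD (the standard pveceph/ceph commands; the OSD id and the device are placeholders):

# mark the OSD out and stop its service
ceph osd out <id>
systemctl stop ceph-osd@<id>
# remove it and clean the disk, then create a new OSD on the same device
pveceph osd destroy <id> --cleanup
pveceph osd create /dev/sdX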
This never happened before. Any ideas on how to debug and fix this?
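From the log it looks like both ixgbe ports (enp5s0f0/enp5s0f1) lose link at the same moment and corosync then loses all knet links, so the next time it happens I plan to check the network side first, something like this (interface names taken from the log above):

# corosync / cluster view
pvecm status
corosync-cfgtool -s
# link state and error counters on the ixgbe ports
ethtool enp5s0f0
ip -s link show enp5s0f0
# corosync and pmxcfs messages around the event
journalctl -u corosync -u pve-cluster --since "today"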
Regarding the OSDs, I see these messages after the cluster has restarted, until ceph health is OK again:
Jul 22 18:31:31 int101 ceph-osd[3921]: 2021-07-22T18:31:31.035+0200 7f05c3c40700 -1 --2- 10.10.40.101:0/3921 >> [v2:10.10.40.105:6830/567869,v1:10.10.40.105:6831/567869] conn(0x55b38983b000 0x55b395c39400 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:10.10.40.105:6830/567869,v1:10.10.40.105:6831/567869] is using msgr V1 protocol
But the mon dump shows v2 addresses configured for all monitors:
root@int101:~# ceph mon dump
epoch 8
fsid b70b6772-1c34-407d-a701-462c14fde916
last_changed 2021-07-18T10:24:22.240410+0200
created 2018-03-01T10:14:32.869926+0100
min_mon_release 16 (pacific)
election_strategy: 1
0: [v2:10.10.40.101:3300/0,v1:10.10.40.101:6789/0] mon.int101
1: [v2:10.10.40.102:3300/0,v1:10.10.40.102:6789/0] mon.int102
2: [v2:10.10.40.103:3300/0,v1:10.10.40.103:6789/0] mon.int103
3: [v2:10.10.40.105:3300/0,v1:10.10.40.105:6789/0] mon.int105
dumped monmap epoch 8
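If it is relevant, I can also check the msgr2 settings and the daemon versions, something like:

# confirm all daemons are on the same release
ceph versions
# check whether msgr2 binding is enabled for the OSDs
ceph config get osd ms_bind_msgr2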
Regards