[SOLVED] Proxmox 6.1 Cluster lost after reboot

Haider Jarral

I rebooted all 11 nodes of my cluster at the same time yesterday, which was a big mistake. It took almost 24 hours for the nodes to come back up and get out of a reboot loop. Now all nodes are up, but there is no cluster and no quorum. The nodes can ping each other fine. Logs and debug output are attached.


root@dellr730-1:~# pveversion
pve-manager/6.1-8/806edfe1 (running kernel: 5.3.18-3-pve)

I have already rebooted the nodes, rebooted the switch, and restarted services; nothing has worked.


no-cluster.jpg

root@dellr730-1:~# pveversion --verbose
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.8-pve1
ceph-fuse: 14.2.8-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
root@dellr730-1:~#
 
can you try the following on all nodes:
Code:
# to stop fencing in case you have HA configured
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm
# stop clustering
systemctl stop pve-cluster corosync

then, on all nodes do the following and keep it open for monitoring:
Code:
journalctl -u pve-cluster -u corosync

then, node by node, do the following: watch the logs on all nodes and wait for the situation to stabilize before proceeding with the next node
Code:
systemctl start pve-cluster

Code:
pvecm status
corosync-cfgtool -sb

the output of these two commands provides additional input to assess the current situation.

once more than 50% of the nodes are back in the quorum, /etc/pve on those nodes should be writable again.
once all nodes are connected to the cluster, you should start HA again:
Code:
systemctl start pve-ha-crm
systemctl start pve-ha-lrm

and possibly restart pveproxy, pvedaemon and pve-firewall, depending on their status.
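
To make the "more than 50%" rule concrete, here is a minimal sketch of the majority calculation. It assumes every node carries exactly one vote; the function name quorum_needed is made up for illustration and is not part of any Proxmox tool.

```shell
#!/bin/sh
# Minimum number of votes that forms a strict majority (more than 50%).
# Assumption: one vote per node, so expected votes == number of nodes.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

quorum_needed 11   # an 11-node cluster needs 6 nodes to be quorate
```

Under that assumption, an 11-node cluster stays quorate as long as at least 6 nodes can see each other.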
 
Thanks @fabian, the commands you mentioned helped recover the cluster. I can see the nodes green now.

Code:
# pvecm status
Cluster information
-------------------
Name:             proxmox-cluster
Config Version:   22
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Apr 20 01:48:05 2020
Quorum provider:  corosync_votequorum
Nodes:            10
Node ID:          0x00000001
Ring ID:          1.ba2c
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   11
Highest expected: 11
Total votes:      10
Quorum:           6 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.11 (local)
0x00000002          1 192.168.1.14
0x00000003          1 192.168.1.10
0x00000004          1 192.168.1.16
0x00000005          1 192.168.1.15
0x00000007          1 192.168.1.18
0x00000008          1 192.168.1.19
0x00000009          1 192.168.1.20
0x0000000a          1 192.168.1.21
0x0000000b          1 192.168.1.13
root@dellr730-1:~#
root@dellr730-1:~#
root@dellr730-1:~# corosync-cfgtool -sb
Printing link status.
Local node ID 1
LINK ID 0
    addr    = 192.168.1.11
    status    = 33333133333
root@dellr730-1:~#


But as soon as I enabled the LRM and CRM, the same reboot loop started again. What should I capture or look at to narrow down the problem?
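
The pvecm status output above shows 10 of 11 expected votes, so one node has not rejoined. A small sketch for spotting which one, given the membership table on stdin. It assumes node IDs are contiguous starting at 1, which need not hold on every cluster, and missing_node_ids is a made-up helper name.

```shell
#!/bin/sh
# Print the node IDs (decimal) in 1..expected that are absent from the
# "Membership information" table read from stdin.
# Assumption: node IDs are contiguous and start at 1.
missing_node_ids() {
    expected=$1
    # First column of the table holds hex IDs such as 0x0000000a.
    present=$(awk '$1 ~ /^0x[0-9a-fA-F]+$/ { print $1 }')
    id=1
    while [ "$id" -le "$expected" ]; do
        hex=$(printf '0x%08x' "$id")
        echo "$present" | grep -qix "$hex" || echo "$id"
        id=$((id + 1))
    done
}

# Usage on a cluster node (not run here):
#   pvecm status | missing_node_ids 11
```

Fed the membership list from this post, it reports node ID 6 as the one that is missing.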
 
sorry, that was my mistake. you need to start the LRM first, then the CRM.
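
For reference, the whole sequence with the corrected HA start order could be sketched as below. This is only an illustration of the steps from this thread, not an official procedure: the run wrapper, the DRY_RUN variable and the function names are made up, and by default the script only prints the commands instead of executing them.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 to really run the commands (as root, on a PVE node).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1 (all nodes): stop HA so nothing gets fenced, then stop clustering.
stop_cluster_stack() {
    run systemctl stop pve-ha-lrm
    run systemctl stop pve-ha-crm
    run systemctl stop pve-cluster corosync
}

# Step 2 (one node at a time): start clustering again and inspect the state.
start_clustering() {
    run systemctl start pve-cluster
    run pvecm status
    run corosync-cfgtool -sb
}

# Step 3 (once all nodes have rejoined): start HA, LRM first, then CRM.
start_ha() {
    run systemctl start pve-ha-lrm
    run systemctl start pve-ha-crm
}
```

Keeping the three steps as separate functions leaves room to watch the journalctl output between nodes, as described earlier, rather than running everything in one go.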
 
That did it, thank you @fabian. Where should I begin troubleshooting? Any pointers on what I should look at so that this does not happen again, such as increasing the number of seconds a node waits before rebooting when the nodes don't all come up at the same time?

fixed.jpg
 
Everything went fine after that, but none of my VMs are booting now; they have been stuck starting up for 15 minutes. Is there any chance that all my VMs and their data were lost during these reboot cycles? If so, is there any way to recover them?


Code:
~# ceph status
  cluster:
    id:     01bcf2d5-6e96-4d50-81ec-cd6bb55c500e
    health: HEALTH_WARN
            2 osds down
            Long heartbeat ping times on back interface seen, longest is 221731.966 msec
            Long heartbeat ping times on front interface seen, longest is 226239.164 msec
            Reduced data availability: 309 pgs inactive, 144 pgs peering
            Degraded data redundancy: 766/2048625 objects degraded (0.037%), 5 pgs degraded, 5 pgs undersized
            166 slow ops, oldest one blocked for 7638 sec, daemons [osd.103,osd.15,osd.30,osd.64,osd.73,osd.76,osd.81,osd.9,mon.dellr730-1,mon.dellr730-2] have slow ops.
 
  services:
    mon: 3 daemons, quorum dellr730-1,dellr730-2,dellr730-3 (age 2h)
    mgr: hp1(active, since 6h), standbys: dellr730-1
    osd: 105 osds: 100 up (since 109s), 102 in (since 3m); 76 remapped pgs
 
  data:
    pools:   3 pools, 1792 pgs
    objects: 682.88k objects, 2.6 TiB
    usage:   8.2 TiB used, 47 TiB / 55 TiB avail
    pgs:     17.299% pgs not active
             766/2048625 objects degraded (0.037%)
             12113/2048625 objects misplaced (0.591%)
             1408 active+clean
             159  activating
             143  peering
             68   active+clean+remapped
             7    activating+remapped
             5    active+undersized+degraded
             1    remapped+peering
             1    active+clean+scrubbing+deep
 
root@dellr730-1:~#
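
The "17.299% pgs not active" figure can be reproduced from the per-state PG counts in that output. A rough sketch, assuming that "not active" means every PG whose state string does not contain the word "active" (so activating and peering count as not active); pgs_not_active_pct is a made-up helper name.

```shell
#!/bin/sh
# Read `ceph status` text on stdin and print the percentage of PGs whose
# state does not contain "active".
# Assumption: PG state lines are the only lines of the form "<count> <state>".
pgs_not_active_pct() {
    awk '
        NF == 2 && $1 ~ /^[0-9]+$/ {
            total += $1
            if ($2 !~ /active/) inactive += $1
        }
        END { if (total > 0) printf "%.3f\n", 100 * inactive / total }
    '
}

# Usage on a Ceph node (not run here):
#   ceph status | pgs_not_active_pct
```

With the counts shown above, 159 + 143 + 7 + 1 = 310 of 1792 PGs are not active, which gives 17.299.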
 
And the weirdness continues: nodes have started crashing again.

Code:
root@sm1:~# tail -f /var/log/messages
Apr 20 03:15:51 sm1 kernel: [41625.332130] device tap536i1 entered promiscuous mode
Apr 20 03:15:51 sm1 kernel: [41625.342800] vmbr1: port 3(tap536i1) entered blocking state
Apr 20 03:15:51 sm1 kernel: [41625.342803] vmbr1: port 3(tap536i1) entered disabled state
Apr 20 03:15:51 sm1 kernel: [41625.343957] vmbr1: port 3(tap536i1) entered blocking state
Apr 20 03:15:51 sm1 kernel: [41625.343960] vmbr1: port 3(tap536i1) entered forwarding state
Apr 20 03:15:53 sm1 kernel: [41627.832430] device tap537i1 entered promiscuous mode
Apr 20 03:15:53 sm1 kernel: [41627.843210] vmbr1: port 4(tap537i1) entered blocking state
Apr 20 03:15:53 sm1 kernel: [41627.843212] vmbr1: port 4(tap537i1) entered disabled state
Apr 20 03:15:53 sm1 kernel: [41627.844555] vmbr1: port 4(tap537i1) entered blocking state
Apr 20 03:15:53 sm1 kernel: [41627.844558] vmbr1: port 4(tap537i1) entered forwarding state
Apr 20 03:51:09 sm1 kernel: [43743.438188] pvesr           D    0 243331      1 0x00000000
Apr 20 03:51:09 sm1 kernel: [43743.438191] Call Trace:
Apr 20 03:51:09 sm1 kernel: [43743.438202]  __schedule+0x2bb/0x660
Apr 20 03:51:09 sm1 kernel: [43743.438208]  ? filename_parentat.isra.58.part.59+0xf7/0x180
Apr 20 03:51:09 sm1 kernel: [43743.438210]  schedule+0x33/0xa0
Apr 20 03:51:09 sm1 kernel: [43743.438217]  rwsem_down_write_slowpath+0x2e3/0x480
Apr 20 03:51:09 sm1 kernel: [43743.438220]  down_write+0x3d/0x40
Apr 20 03:51:09 sm1 kernel: [43743.438222]  filename_create+0x8e/0x180
Apr 20 03:51:09 sm1 kernel: [43743.438224]  do_mkdirat+0x59/0x110
Apr 20 03:51:09 sm1 kernel: [43743.438226]  __x64_sys_mkdir+0x1b/0x20
Apr 20 03:51:09 sm1 kernel: [43743.438230]  do_syscall_64+0x5a/0x130
Apr 20 03:51:09 sm1 kernel: [43743.438233]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 20 03:51:09 sm1 kernel: [43743.438236] RIP: 0033:0x7f49d22d00d7
Apr 20 03:51:09 sm1 kernel: [43743.438243] Code: Bad RIP value.
Apr 20 03:51:09 sm1 kernel: [43743.438244] RSP: 002b:00007ffc5376ceb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
Apr 20 03:51:09 sm1 kernel: [43743.438246] RAX: ffffffffffffffda RBX: 00005598ac571260 RCX: 00007f49d22d00d7
Apr 20 03:51:09 sm1 kernel: [43743.438247] RDX: 00005598ac2951f4 RSI: 00000000000001ff RDI: 00005598b04b9bb0
Apr 20 03:51:09 sm1 kernel: [43743.438248] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000004
Apr 20 03:51:09 sm1 kernel: [43743.438249] R10: 0000000000000000 R11: 0000000000000246 R12: 00005598ad93d7e8
Apr 20 03:51:09 sm1 kernel: [43743.438249] R13: 00005598b04b9bb0 R14: 00005598b00fd420 R15: 00000000000001ff
 
that looks like your Ceph cluster has not yet fully started/caught up. are your Ceph network and your corosync network separated?
 
you should really think about separating them physically. otherwise Ceph traffic can easily lead to quorum loss, which leads to fencing, which leads to more Ceph traffic to rebalance, which makes forming/keeping a quorum even harder, ...
 
yes, but a cold-start of both PVE and Ceph clusters creates extra load on the cluster (the state falls apart when shutting down, and needs to be synced up again on start - simultaneously). if you then also start all your HA guests while Ceph is still rebalancing and Corosync loses connections, it will be very hard to get back to a stable, quorate state.
 
So I found the culprit: the stack links were dropping a lot of packets, which made the cluster unstable. Thank you for all your help @fabian. I really appreciate it.

Is there a timer I can increase so that a node waits longer for quorum before rebooting?

stacklink.jpeg
 
no, the watchdog timer is not configurable. once the threshold is reached, we need to fence, otherwise other nodes cannot safely treat the node as fenced/gone.
 
