[SOLVED] Proxmox 6.1 Cluster lost after reboot

Haider Jarral

Well-Known Member
Aug 18, 2018
I happened to reboot all 11 nodes of my cluster yesterday at the same time, big mistake! It took almost 24 hours for everything to come back up and get out of the reboot loop. Now all nodes are up, but there is no cluster or quorum. The nodes can ping each other fine, but still no cluster. Logs/debugs attached.


root@dellr730-1:~# pveversion
pve-manager/6.1-8/806edfe1 (running kernel: 5.3.18-3-pve)

Already rebooted nodes, rebooted switch, restarted services, nothing is working


no-cluster.jpg
 

Attachments

  • tail -f :var:log.txt (10.7 KB)
  • pvecm status.txt (63.3 KB)
  • ping works.txt (2.1 KB)
  • journalctl -u pve-cluster.txt (159.4 KB)
  • journalctl -u corosync.txt (63.3 KB)
root@dellr730-1:~# pveversion --verbose
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.8-pve1
ceph-fuse: 14.2.8-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
root@dellr730-1:~#
 
Can you try the following on all nodes:
Code:
# to stop fencing in case you have HA configured
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm
# stop clustering
systemctl stop pve-cluster corosync

Then, on all nodes, run the following and keep it open for monitoring:
Code:
journalctl -f -u pve-cluster -u corosync

Then, node by node, run the following; watch the logs on all nodes and wait for the situation to stabilize before proceeding with the next node:
Code:
systemctl start pve-cluster

Code:
pvecm status
corosync-cfgtool -sb

and provide the output so we can assess the current situation.

Once more than 50% of the nodes are back in the quorum, /etc/pve on those nodes should be writable again.
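For reference, quorum is a simple majority of the configured votes; with 11 nodes at one vote each that works out to 6, which you can cross-check against the Votequorum section of pvecm status:
Code:
# votes needed for quorum = floor(total_votes / 2) + 1
# 11 nodes x 1 vote: floor(11 / 2) + 1 = 6
pvecm status | grep -E 'Expected votes|Total votes|Quorum:'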
Once all nodes are connected to the cluster, you should start HA again:
Code:
systemctl start pve-ha-crm
systemctl start pve-ha-lrm

and possibly restart pveproxy, pvedaemon and pve-firewall, depending on their status.
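For example (a sketch; check first and only restart what is actually stuck):
Code:
systemctl status pveproxy pvedaemon pve-firewall
# only if one of them is failed or hung:
systemctl restart pveproxy pvedaemon pve-firewall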
 
Thanks @fabian, the commands you mentioned helped recover the cluster. I can see the nodes green now.

Code:
# pvecm status
Cluster information
-------------------
Name:             proxmox-cluster
Config Version:   22
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Apr 20 01:48:05 2020
Quorum provider:  corosync_votequorum
Nodes:            10
Node ID:          0x00000001
Ring ID:          1.ba2c
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   11
Highest expected: 11
Total votes:      10
Quorum:           6 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.11 (local)
0x00000002          1 192.168.1.14
0x00000003          1 192.168.1.10
0x00000004          1 192.168.1.16
0x00000005          1 192.168.1.15
0x00000007          1 192.168.1.18
0x00000008          1 192.168.1.19
0x00000009          1 192.168.1.20
0x0000000a          1 192.168.1.21
0x0000000b          1 192.168.1.13
root@dellr730-1:~#
root@dellr730-1:~#
root@dellr730-1:~# corosync-cfgtool -sb
Printing link status.
Local node ID 1
LINK ID 0
    addr    = 192.168.1.11
    status    = 33333133333
root@dellr730-1:~#


But as soon as I enabled the LRM and CRM, the same reboot loop continued. What should I capture or look at to narrow down the problem?
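For reference, this is the sort of thing I can pull from the affected nodes (standard journal/HA commands, just my guess at what is relevant; happy to grab anything else):
Code:
# HA and cluster stack logs around the time of the reboots
journalctl -u pve-ha-lrm -u pve-ha-crm -u corosync -u pve-cluster --since "1 hour ago"
# current HA state as seen by the cluster
ha-manager status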
 
Sorry, that was my mistake. You need to start the LRM first, then the CRM.
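So, the corrected order (the same commands as above, just swapped):
Code:
systemctl start pve-ha-lrm
systemctl start pve-ha-crm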
 
That did it, thank you @fabian. Where can I begin troubleshooting? Any pointers on what I should look at so that such an issue does not happen again, for example increasing the number of seconds a node waits for quorum before rebooting in case the nodes don't all come up at the same time?

fixed.jpg
 
That all went fine, but none of my VMs are booting up now; they have been stuck starting for 15 minutes. Is there any chance that all my VMs and their data are gone after these reboot cycles? If so, is there any way to recover them?


Code:
~# ceph status
  cluster:
    id:     01bcf2d5-6e96-4d50-81ec-cd6bb55c500e
    health: HEALTH_WARN
            2 osds down
            Long heartbeat ping times on back interface seen, longest is 221731.966 msec
            Long heartbeat ping times on front interface seen, longest is 226239.164 msec
            Reduced data availability: 309 pgs inactive, 144 pgs peering
            Degraded data redundancy: 766/2048625 objects degraded (0.037%), 5 pgs degraded, 5 pgs undersized
            166 slow ops, oldest one blocked for 7638 sec, daemons [osd.103,osd.15,osd.30,osd.64,osd.73,osd.76,osd.81,osd.9,mon.dellr730-1,mon.dellr730-2] have slow ops.
 
  services:
    mon: 3 daemons, quorum dellr730-1,dellr730-2,dellr730-3 (age 2h)
    mgr: hp1(active, since 6h), standbys: dellr730-1
    osd: 105 osds: 100 up (since 109s), 102 in (since 3m); 76 remapped pgs
 
  data:
    pools:   3 pools, 1792 pgs
    objects: 682.88k objects, 2.6 TiB
    usage:   8.2 TiB used, 47 TiB / 55 TiB avail
    pgs:     17.299% pgs not active
             766/2048625 objects degraded (0.037%)
             12113/2048625 objects misplaced (0.591%)
             1408 active+clean
             159  activating
             143  peering
             68   active+clean+remapped
             7    activating+remapped
             5    active+undersized+degraded
             1    remapped+peering
             1    active+clean+scrubbing+deep
 
root@dellr730-1:~#
 
And the weirdness continues: nodes have started crashing again.

Code:
root@sm1:~# tail -f /var/log/messages
Apr 20 03:15:51 sm1 kernel: [41625.332130] device tap536i1 entered promiscuous mode
Apr 20 03:15:51 sm1 kernel: [41625.342800] vmbr1: port 3(tap536i1) entered blocking state
Apr 20 03:15:51 sm1 kernel: [41625.342803] vmbr1: port 3(tap536i1) entered disabled state
Apr 20 03:15:51 sm1 kernel: [41625.343957] vmbr1: port 3(tap536i1) entered blocking state
Apr 20 03:15:51 sm1 kernel: [41625.343960] vmbr1: port 3(tap536i1) entered forwarding state
Apr 20 03:15:53 sm1 kernel: [41627.832430] device tap537i1 entered promiscuous mode
Apr 20 03:15:53 sm1 kernel: [41627.843210] vmbr1: port 4(tap537i1) entered blocking state
Apr 20 03:15:53 sm1 kernel: [41627.843212] vmbr1: port 4(tap537i1) entered disabled state
Apr 20 03:15:53 sm1 kernel: [41627.844555] vmbr1: port 4(tap537i1) entered blocking state
Apr 20 03:15:53 sm1 kernel: [41627.844558] vmbr1: port 4(tap537i1) entered forwarding state
Apr 20 03:51:09 sm1 kernel: [43743.438188] pvesr           D    0 243331      1 0x00000000
Apr 20 03:51:09 sm1 kernel: [43743.438191] Call Trace:
Apr 20 03:51:09 sm1 kernel: [43743.438202]  __schedule+0x2bb/0x660
Apr 20 03:51:09 sm1 kernel: [43743.438208]  ? filename_parentat.isra.58.part.59+0xf7/0x180
Apr 20 03:51:09 sm1 kernel: [43743.438210]  schedule+0x33/0xa0
Apr 20 03:51:09 sm1 kernel: [43743.438217]  rwsem_down_write_slowpath+0x2e3/0x480
Apr 20 03:51:09 sm1 kernel: [43743.438220]  down_write+0x3d/0x40
Apr 20 03:51:09 sm1 kernel: [43743.438222]  filename_create+0x8e/0x180
Apr 20 03:51:09 sm1 kernel: [43743.438224]  do_mkdirat+0x59/0x110
Apr 20 03:51:09 sm1 kernel: [43743.438226]  __x64_sys_mkdir+0x1b/0x20
Apr 20 03:51:09 sm1 kernel: [43743.438230]  do_syscall_64+0x5a/0x130
Apr 20 03:51:09 sm1 kernel: [43743.438233]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 20 03:51:09 sm1 kernel: [43743.438236] RIP: 0033:0x7f49d22d00d7
Apr 20 03:51:09 sm1 kernel: [43743.438243] Code: Bad RIP value.
Apr 20 03:51:09 sm1 kernel: [43743.438244] RSP: 002b:00007ffc5376ceb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
Apr 20 03:51:09 sm1 kernel: [43743.438246] RAX: ffffffffffffffda RBX: 00005598ac571260 RCX: 00007f49d22d00d7
Apr 20 03:51:09 sm1 kernel: [43743.438247] RDX: 00005598ac2951f4 RSI: 00000000000001ff RDI: 00005598b04b9bb0
Apr 20 03:51:09 sm1 kernel: [43743.438248] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000004
Apr 20 03:51:09 sm1 kernel: [43743.438249] R10: 0000000000000000 R11: 0000000000000246 R12: 00005598ad93d7e8
Apr 20 03:51:09 sm1 kernel: [43743.438249] R13: 00005598b04b9bb0 R14: 00005598b00fd420 R15: 00000000000001ff
 
That looks like your Ceph cluster has not yet fully started/caught up. Are your Ceph network and your corosync network separated?
 
You should really think about separating them physically. Otherwise Ceph traffic can easily lead to quorum loss, which leads to fencing, which leads to more Ceph traffic for rebalancing, which makes forming/keeping a quorum even harder, and so on.
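As a rough sketch (not your actual config), a second, dedicated corosync link in /etc/pve/corosync.conf could look like this; the 10.10.10.x addresses are placeholders for a separate physical network, and config_version has to be bumped on every edit:
Code:
totem {
  cluster_name: proxmox-cluster
  config_version: 23
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

nodelist {
  node {
    name: dellr730-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.11
    # placeholder address on a dedicated, corosync-only network
    ring1_addr: 10.10.10.11
  }
  # ... one entry per node, each with its own ring1_addr
}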
 
Yes, but a cold start of both the PVE and the Ceph cluster creates extra load (the state falls apart when shutting down and needs to be synced up again on start, simultaneously). If you then also start all your HA guests while Ceph is still rebalancing and corosync is losing connections, it will be very hard to get back to a stable, quorate state.
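As an aside, for a planned full shutdown the usual approach is to pause Ceph recovery/rebalancing first (standard Ceph OSD flags, nothing specific to your setup):
Code:
# before shutting the nodes down
ceph osd set noout
ceph osd set norebalance
# ... maintenance / reboots ...
# once all OSDs are back up and stable
ceph osd unset norebalance
ceph osd unset noout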
 
So I found the culprit: it was the stack links, which were dropping a lot of packets and causing the unstable cluster. Thank you for all your help @fabian. Really appreciate it.

Is there any timer I can increase so a node waits longer for quorum before rebooting?

stacklink.jpeg
 
No, the watchdog timer is not configurable. Once the threshold is reached, the node needs to fence itself, otherwise the other nodes cannot safely treat it as fenced/gone.
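If it helps, you can at least check whether the watchdog is armed at all on a given node; the LRM only arms it while it is active (i.e. has HA resources). Something like:
Code:
# per-node LRM/CRM state (idle vs. active)
ha-manager status
# watchdog-mux and LRM logs around a fencing event
journalctl -u watchdog-mux -u pve-ha-lrm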
 
