Two out of Three nodes offline after network outage

stringman

Member
Jan 31, 2020
I have a three-node hyperconverged cluster running PVE 6.2-6 and Ceph. After recovering from a network problem in which our VLANs reverted to an older configuration, and after waiting overnight, only one of the three nodes shows as online. A fencing message went out this morning about one of the nodes:

The node 'engpve03' failed and needs manual intervention.

The PVE HA manager tries to fence it and recover the
configured HA resources to a healthy node if possible.

Current fence status: SUCCEED
fencing: acknowledged - got agent lock for node 'engpve03'


Overall Cluster status:
-----------------------

{
   "manager_status" : {
      "master_node" : "engpve02",
      "node_status" : {
         "engpve01" : "online",
         "engpve02" : "online",
         "engpve03" : "fence"
      },
      "service_status" : {
         "vm:106" : {
            "node" : "engpve03",
            "state" : "fence",
            "uid" : "pzYAZMtHOexujrFOU7jQDQ"
         },
         "vm:107" : {
            "node" : "engpve01",
            "running" : 1,
            "state" : "started",
            "uid" : "pOoLZ42BGI+aQLfnqZOoMg"
         },
         "vm:116" : {
            "node" : "engpve02",
            "running" : 1,
            "state" : "started",
            "uid" : "HTscCYkjwviZDhm4sybpEg"
         },
         "vm:127" : {
            "node" : "engpve03",
            "state" : "fence",
            "uid" : "lohRoemTldpf8dfKtKypDg"
         },
         "vm:133" : {
            "node" : "engpve03",
            "state" : "fence",
            "uid" : "QKNtrLn1Th/qQezJ4CeVXA"
         }
      },
      "timestamp" : 1612014332
   },
   "node_status" : {
      "engpve01" : "online",
      "engpve02" : "online",
      "engpve03" : "unknown"
   }
}

I've rebooted each node, one at a time. Ceph reports all three metadata servers and managers online and two of the three monitors online. There is no quorum on engpve01.
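For reference, the membership and quorum state can be checked directly with the stock tools (output omitted here; these are just the standard Proxmox/corosync utilities):

root@engpve01:~# pvecm status             # expected vs. actual votes, quorate yes/no
root@engpve01:~# corosync-quorumtool -s   # the same votequorum view straight from corosync
root@engpve01:~# corosync-cfgtool -s      # per-link knet status for each configured host

Here is the corosync and pve-cluster log from the current boot on engpve01: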

root@engpve01:~# journalctl -b -u corosync -u pve-cluster

-- Logs begin at Sat 2021-01-30 11:59:34 CST, end at Sat 2021-01-30 12:28:24 CST. --
Jan 30 11:59:47 engpve01 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 30 11:59:47 engpve01 pmxcfs[3056]: [quorum] crit: quorum_initialize failed: 2
Jan 30 11:59:47 engpve01 pmxcfs[3056]: [quorum] crit: can't initialize service
Jan 30 11:59:47 engpve01 pmxcfs[3056]: [confdb] crit: cmap_initialize failed: 2
Jan 30 11:59:47 engpve01 pmxcfs[3056]: [confdb] crit: can't initialize service
Jan 30 11:59:47 engpve01 pmxcfs[3056]: [dcdb] crit: cpg_initialize failed: 2
Jan 30 11:59:47 engpve01 pmxcfs[3056]: [dcdb] crit: can't initialize service
Jan 30 11:59:47 engpve01 pmxcfs[3056]: [status] crit: cpg_initialize failed: 2
Jan 30 11:59:47 engpve01 pmxcfs[3056]: [status] crit: can't initialize service
Jan 30 11:59:48 engpve01 systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 30 11:59:48 engpve01 systemd[1]: Starting Corosync Cluster Engine...
Jan 30 11:59:48 engpve01 corosync[3285]: [MAIN ] Corosync Cluster Engine 3.0.4 starting up
Jan 30 11:59:48 engpve01 corosync[3285]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jan 30 11:59:49 engpve01 corosync[3285]: [TOTEM ] Initializing transport (Kronosnet).
Jan 30 11:59:49 engpve01 corosync[3285]: [TOTEM ] kronosnet crypto initialized: aes256/sha256
Jan 30 11:59:49 engpve01 corosync[3285]: [TOTEM ] totemknet initialized
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jan 30 11:59:49 engpve01 corosync[3285]: [SERV ] Service engine loaded: corosync configuration map access [0]
Jan 30 11:59:49 engpve01 corosync[3285]: [QB ] server name: cmap
Jan 30 11:59:49 engpve01 corosync[3285]: [SERV ] Service engine loaded: corosync configuration service [1]
Jan 30 11:59:49 engpve01 corosync[3285]: [QB ] server name: cfg
Jan 30 11:59:49 engpve01 corosync[3285]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jan 30 11:59:49 engpve01 corosync[3285]: [QB ] server name: cpg
Jan 30 11:59:49 engpve01 corosync[3285]: [SERV ] Service engine loaded: corosync profile loading service [4]
Jan 30 11:59:49 engpve01 corosync[3285]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
Jan 30 11:59:49 engpve01 corosync[3285]: [WD ] Watchdog not enabled by configuration
Jan 30 11:59:49 engpve01 corosync[3285]: [WD ] resource load_15min missing a recovery key.
Jan 30 11:59:49 engpve01 corosync[3285]: [WD ] resource memory_used missing a recovery key.
Jan 30 11:59:49 engpve01 corosync[3285]: [WD ] no resources configured.
Jan 30 11:59:49 engpve01 corosync[3285]: [SERV ] Service engine loaded: corosync watchdog service [7]
Jan 30 11:59:49 engpve01 corosync[3285]: [QUORUM] Using quorum provider corosync_votequorum
Jan 30 11:59:49 engpve01 corosync[3285]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jan 30 11:59:49 engpve01 corosync[3285]: [QB ] server name: votequorum
Jan 30 11:59:49 engpve01 corosync[3285]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jan 30 11:59:49 engpve01 corosync[3285]: [QB ] server name: quorum
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 0)
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 1 has no active links
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 0)
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 2 has no active links
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 2 has no active links
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 2 has no active links
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 3 has no active links
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 3 has no active links
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 30 11:59:49 engpve01 corosync[3285]: [TOTEM ] A new membership (1.21f08) was formed. Members joined: 1
Jan 30 11:59:49 engpve01 corosync[3285]: [KNET ] host: host: 3 has no active links
Jan 30 11:59:49 engpve01 corosync[3285]: [QUORUM] Members[1]: 1
Jan 30 11:59:49 engpve01 corosync[3285]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 30 11:59:49 engpve01 systemd[1]: Started Corosync Cluster Engine.
Jan 30 11:59:53 engpve01 pmxcfs[3056]: [status] notice: update cluster info (cluster name engpve, version = 3)
Jan 30 11:59:53 engpve01 pmxcfs[3056]: [dcdb] notice: members: 1/3056
Jan 30 11:59:53 engpve01 pmxcfs[3056]: [dcdb] notice: all data is up to date
Jan 30 11:59:53 engpve01 pmxcfs[3056]: [status] notice: members: 1/3056
Jan 30 11:59:53 engpve01 pmxcfs[3056]: [status] notice: all data is up to date
Jan 30 11:59:55 engpve01 corosync[3285]: [KNET ] rx: host: 2 link: 0 is up
Jan 30 11:59:55 engpve01 corosync[3285]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 30 11:59:55 engpve01 corosync[3285]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Jan 30 11:59:55 engpve01 corosync[3285]: [KNET ] pmtud: Global data MTU changed to: 1397
Jan 30 11:59:57 engpve01 corosync[3285]: [TOTEM ] A new membership (1.21f2b) was formed. Members
Jan 30 11:59:57 engpve01 corosync[3285]: [QUORUM] Members[1]: 1
Jan 30 11:59:57 engpve01 corosync[3285]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 30 11:59:59 engpve01 corosync[3285]: [TOTEM ] A new membership (1.21f2f) was formed. Members
Jan 30 11:59:59 engpve01 corosync[3285]: [QUORUM] Members[1]: 1
Jan 30 11:59:59 engpve01 corosync[3285]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 30 12:00:01 engpve01 corosync[3285]: [TOTEM ] A new membership (1.21f33) was formed. Members
Jan 30 12:00:01 engpve01 corosync[3285]: [QUORUM] Members[1]: 1
Jan 30 12:00:01 engpve01 corosync[3285]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 30 12:00:03 engpve01 corosync[3285]: [TOTEM ] A new membership (1.21f37) was formed. Members
Jan 30 12:00:03 engpve01 corosync[3285]: [QUORUM] Members[1]: 1
Jan 30 12:00:03 engpve01 corosync[3285]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 30 12:00:05 engpve01 corosync[3285]: [TOTEM ] A new membership (1.21f3b) was formed. Members
Jan 30 12:00:05 engpve01 corosync[3285]: [QUORUM] Members[1]: 1

# several pages of this

Jan 30 12:08:57 engpve01 pmxcfs[3056]: [status] notice: cpg_send_message retried 1 times
Jan 30 12:08:59 engpve01 corosync[3285]: [TOTEM ] A new membership (1.22353) was formed. Members
Jan 30 12:08:59 engpve01 corosync[3285]: [QUORUM] Members[1]: 1
Jan 30 12:08:59 engpve01 corosync[3285]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 30 12:09:01 engpve01 corosync[3285]: [TOTEM ] A new membership (1.22357) was formed. Members
Jan 30 12:09:01 engpve01 corosync[3285]: [QUORUM] Members[1]: 1
Jan 30 12:09:01 engpve01 corosync[3285]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 30 12:09:03 engpve01 corosync[3285]: [TOTEM ] A new membership (1.2235b) was formed. Members
Jan 30 12:09:03 engpve01 corosync[3285]: [QUORUM] Members[1]: 1

# several more pages of this, ending with

Jan 30 12:29:00 engpve01 corosync[3285]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 30 12:29:02 engpve01 corosync[3285]: [TOTEM ] A new membership (1.22c8f) was formed. Members
Jan 30 12:29:02 engpve01 corosync[3285]: [QUORUM] Members[1]: 1
Jan 30 12:29:02 engpve01 corosync[3285]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 30 12:29:04 engpve01 corosync[3285]: [TOTEM ] A new membership (1.22c93) was formed. Members
Jan 30 12:29:04 engpve01 corosync[3285]: [QUORUM] Members[1]: 1
Jan 30 12:29:04 engpve01 corosync[3285]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 30 12:29:06 engpve01 corosync[3285]: [TOTEM ] A new membership (1.22c97) was formed. Members
Jan 30 12:29:06 engpve01 corosync[3285]: [QUORUM] Members[1]: 1
Jan 30 12:29:06 engpve01 corosync[3285]: [MAIN ] Completed service synchronization, ready to provide service.

It looks like all of that stopped twelve hours ago.

root@engpve01:~# date
Sat 30 Jan 2021 12:36:47 PM CST
 
From the first node I can ssh to the second, and from the second to the third, but the first and third nodes cannot see each other. They can't ping or ssh to each other, even though they're on the same subnet:

10: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether a0:36:9f:21:15:5e brd ff:ff:ff:ff:ff:ff
inet 10.2.0.144/22 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::a236:9fff:fe21:155e/64 scope link
valid_lft forever preferred_lft forever

8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether a0:36:9f:21:0d:10 brd ff:ff:ff:ff:ff:ff
inet 10.2.0.142/22 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::a236:9fff:fe21:d10/64 scope link
valid_lft forever preferred_lft forever
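To narrow down where the path between the two nodes breaks, a few standard checks from either end should show whether it's a routing or an ARP problem. (I'm using 10.2.0.142 below purely as an example of the peer's vmbr0 address from the output above; substitute whichever node you're testing against.)

root@engpve01:~# ping -c 3 10.2.0.142      # basic reachability toward the peer's vmbr0 address
root@engpve01:~# ip route get 10.2.0.142   # which interface and source address the kernel actually picks
root@engpve01:~# ip neigh show 10.2.0.142  # whether ARP for the peer ever resolves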

I'm unable to authenticate on the third node, though I can still ssh to it with a certificate from the second. The authentication options don't even show up in the web GUI on the third node.
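I suspect the login failure is a symptom of the lost quorum rather than a separate problem: without quorum, pmxcfs mounts /etc/pve read-only, and GUI authentication on that node can fail even though sshd is fine. A couple of checks on the third node should confirm it:

root@engpve03:~# pvecm status                                      # does this node think the cluster is quorate?
root@engpve03:~# systemctl status pve-cluster pvedaemon pveproxy   # are the PVE services themselves running?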
 
root@engpve01:~# ceph -s
  cluster:
    id:     0544a545-75f6-4c11-a96f-32585f7ba767
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            467 slow ops, oldest one blocked for 12132 sec, mon.engpve01 has slow ops
            1/3 mons down, quorum engpve01,engpve02

  services:
    mon: 3 daemons, quorum engpve01,engpve02 (age 0.308351s), out of quorum: engpve03
    mgr: engpve02(active, since 3h), standbys: engpve03, engpve01
    mds: cephfs:1/1 {0=engpve03=up:replay} 2 up:standby
    osd: 48 osds: 48 up (since 3h), 48 in (since 5h)

  data:
    pools:   4 pools, 1312 pgs
    objects: 1.87M objects, 6.8 TiB
    usage:   19 TiB used, 16 TiB / 35 TiB avail
    pgs:     1312 active+clean

  io:
    client: 763 B/s wr, 0 op/s rd, 0 op/s wr
 
The network configuration on the first node had changed so that it wasn't using the correct NIC to communicate externally. After correcting the network configuration on that node the cluster came back up, but 16 out of 48 OSDs are still down, and PGs are all undersized and degraded.
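To see which OSDs are still down, and on which host, the usual Ceph views are enough; individual OSD services can also be restarted without rebooting the whole node (the OSD id 12 below is just an example, substitute one of the actual down ones):

root@engpve01:~# ceph osd stat                   # quick count of up/in OSDs
root@engpve01:~# ceph osd tree down              # only the down OSDs, grouped by host
root@engpve01:~# systemctl restart ceph-osd@12   # restart one stuck OSD on its host (example id)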
 
1 OSD host down, all three monitors up, but two metadata servers with question marks...

OSDs are coming up after rebooting the first node again.
 
It looks like one of my statically routed NICs decided to pick up a DHCP address via the bridged interface while the other VLANs were down. Why did this happen? How do I disable this feature?
 
It looks like one of my statically routed NICs decided to pick up a DHCP address via the bridged interface while the other VLANs were down. Why did this happen? How do I disable this feature?

What do you mean by 'statically routed NIC'? A static route? A NIC with a static IP address? That would be a Debian/Linux issue.
 
Last edited:
What do you mean by 'statically routed NIC'? A static route? A NIC with a static IP address? That would be a Debian/Linux issue.
The NICs I'm using for Ceph traffic are directly wired to each other with static IPs in a 10.15.15.x subnet, and they talk to each other over static routes. After the network outage I noticed that one of them had changed to a 10.2.0.x address in the GUI, even though the interfaces file still had the static 10.15.15.x IP. I had to change it back in the GUI to get everything communicating again.
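For what it's worth, the stanza for one of those Ceph interfaces looks roughly like this (the interface name, addresses, and route below are illustrative, not my exact config). An interface declared "inet static" should never start a DHCP client on its own, so it's probably worth checking for a stray dhclient process or a leftover "inet dhcp" stanza:

# /etc/network/interfaces -- example stanza for a directly wired Ceph link
auto ens2f0
iface ens2f0 inet static
        address 10.15.15.1
        netmask 255.255.255.255
        # static /32 route to the peer node wired to this port (illustrative)
        post-up ip route add 10.15.15.2/32 dev ens2f0

root@engpve01:~# ps aux | grep dhclient                  # any DHCP client still running?
root@engpve01:~# grep -ni dhcp /etc/network/interfaces   # any interface still declared as dhcp?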
 