[SOLVED] Shutting down any node makes VMs unavailable

bensode · May 22, 2019

Good morning. After updating from 5.4-3 to 5.4-5 on my 7-node cluster, I experience a complete network failure of all VMs in the cluster if I shut down any one node. Once the node goes down, we also lose the ability to console into any VM, and the VMs all disconnect from the network. If we restart the node, normal operation resumes once it is back online, and the VMs are all reachable and in the same state they were in. This was not the behavior prior to 5.4-5, so I'm curious whether we've hit a bug of some sort.

What information can I supply to help with this issue? I need to shut down 5 of the nodes for hardware maintenance (two were done last week, before the patch, without issue).

Thanks!

Jeff Wadsworth · May 22, 2019

What is your shared storage?

bensode · May 22, 2019

Ceph with 108 SSD drives split evenly between the hosts.

Jeff Wadsworth · May 22, 2019

What is your OSD pool default size? The min? If you shut off a node, what is the status of Ceph?

bensode · May 23, 2019

How can I get the OSD pool default size? When I shut off a node, Ceph goes into HEALTH_WARN, reporting that the OSDs for the host that went down are offline.

Rafael Barasuol Rohden · May 23, 2019

I'm in a similar situation.
I'm using version 5.4.5.

When I shut down one of the nodes, all the VMs also turned off, and they only became active again after I powered the node back on.

My architecture with 6 nodes:

An iSCSI storage, using LVM to share the disk between nodes - 8T
An iSCSI storage, using LVM to share the disk between nodes - 4T
An NFS share for backup storage.
An NFS share for temporary storage for testing.

All storage is connected through a Dell 10Gb switch and a VLAN, using two NICs dedicated to this.
Each node has two NICs.
All are on the same network: 172.16.0.0

So, my question is: Why?

Jeff Wadsworth · May 23, 2019

I am running some tests on a fresh install of 5.4. So far, the VMs work fine with the loss of 1 node in a 3-node cluster with Ceph (3 OSDs per node), using a 3/1 OSD pool for the test. If it was 3/2, even one node going offline would halt your VMs. Is your VM using the Ceph storage for its image?
To check your pool size (replicas) in the GUI, just select a node and, under Ceph, check the Pools. It will have a "Size/min" entry.
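
The same two values can also be read from a shell on any node with `ceph osd pool get`. A minimal sketch, assuming a pool named `rbd` (a placeholder; substitute your own pool name). The `run` helper and the `DRY_RUN` variable are illustrative and not part of Ceph: by default the script only prints the commands, and setting `DRY_RUN=0` executes them.

```shell
#!/bin/sh
# Sketch: report a pool's replica count (size) and its I/O threshold (min_size).
# "run" and DRY_RUN are hypothetical helpers for this example, not Ceph features.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"    # dry run: only show the command
    else
        "$@"         # execute the command for real
    fi
}

show_pool_replication() {
    pool="$1"
    run ceph osd pool get "$pool" size
    run ceph osd pool get "$pool" min_size
}

# "rbd" is a placeholder pool name; pass your own as the first argument.
show_pool_replication "${1:-rbd}"
```

Run with `DRY_RUN=0` on a cluster node, this should report the same numbers the GUI shows in the "Size/min" column.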

bensode · May 23, 2019

Ah, OK. I wasn't sure whether you were asking about PGs or something else. It's 2/2 and distributed across all the nodes, so no two copies of a PG will be on the same node. I don't believe it is a storage issue. What is happening is that the guests continue to run but lose all networking. While the one node is down, I can still shell into every other node and bring up the web GUI on each. But anyone with a connection to a guest loses it; RDP sessions resume if the node is not down for too long. Console connections from noVNC fail. Further, this was not the behavior prior to the latest kernel and patch to 5.4-4.

Rafael Barasuol Rohden · May 24, 2019

This problem happened in the cluster where our live VMs run. If it had happened in a test environment I would not be so nervous.
I was updating the nodes and, in my case, the problem only happened when the last node was disconnected. I was not going to upgrade that node, only remove it from the cluster.

Bengt Nolin · May 25, 2019

bensode: If you run 2/2 Ceph pools, then when one node goes down you likely cannot access much of the pool anymore, since only 1 copy remains of any object whose PG had a replica on the node that went down.

2/2 means 2 replicas, with access to objects denied when fewer than 2 are available.
3/2 means 3 replicas, with access denied when fewer than 2 are available. You can therefore lose 1 node: if you have host replication rules for the pools (your failure domain for the pool is host), no PG can exist twice on the same host, so you still have 2 or 3 copies of all PGs and the objects in them.

An example from the Ceph documentation on pools:

ceph osd pool set data min_size 2

"This ensures that no object in the data pool will receive I/O with fewer than min_size replicas."

This might be in addition to other issues of course, I'm only pointing out that you probably want to reconsider your ceph pool configuration.
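
The size/min_size rule described above can be put into a few lines of arithmetic. This is a simplified sketch, not how Ceph itself computes PG state: it assumes a host failure domain, so every lost host removes at most one replica of a PG, and it looks at the worst-case PG that had a replica on each lost host. The function name and the "active"/"blocked" labels are illustrative.

```shell
#!/bin/sh
# Sketch: does the worst-affected placement group still accept I/O
# after some hosts are lost? Assumes failure domain = host.

pg_io_state() {
    size="$1"; min_size="$2"; lost_hosts="$3"
    remaining=$((size - lost_hosts))
    if [ "$remaining" -lt 0 ]; then
        remaining=0
    fi
    if [ "$remaining" -ge "$min_size" ]; then
        echo "active ($remaining of $size replicas)"
    else
        echo "blocked ($remaining of $size replicas, need $min_size)"
    fi
}

pg_io_state 2 2 1   # the 2/2 pool from this thread, one node down
pg_io_state 3 2 1   # 3/2, one node down
pg_io_state 3 2 2   # 3/2, two nodes down
```

Under these assumptions the 2/2 case is blocked as soon as one node is down, while 3/2 keeps serving I/O with one node down and blocks only when a second is lost.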

bensode · May 30, 2019

@Bengt Nolin -- I can rebuild those pools; that's not a problem, and it's good to know. But I'm positive that this is not the cause.

I have three nodes only with NVMe drives and their own Ceph pool and OSDs. The other four nodes have SSD drives and their own Ceph pool and OSDs. If I evacuate all guests from one of the NVMe nodes and reboot or shut it down, I lose networking on every VM in the cluster. This never happened before the patching -- we've done several patches with kernel updates requiring reboots without this symptom. Once the NVMe node returns online the VMs magically have networking and consoles. How can this not be directly caused by something in the patch? I wish I could get a staff response on this issue as I really don't want to rebuild the cluster and then possibly face this being an issue again with any patches down the road ...

bensode · May 31, 2019

Looks like I have no other option now but to rebuild this cluster tonight.

David Calvache Casas · May 31, 2019

Please post the output of ceph -s, ceph health detail, and pvecm status.

You don't have to rebuild the pool; just increase the replica size to 3.
Something like:

ceph osd pool set POOL_NAME size 3

bensode · May 31, 2019

I'd like to have only two replicas, but I will increase it to three to help flesh this out. If I only wanted 2 replicas across 5 nodes (I'm planning to remove two of the hosts), is 2/1 safe, or do we really need 3/2? Below are the results you asked for.

ceph -s
  cluster:
    id:     2c6baf2f-c1d4-4da6-9dbd-fc1b850074cc
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum prdpvessd01,prdpvessd03,prdpvessd05
    mgr: prdpvessd03(active), standbys: prdpvessd05, prdpvessd01
    osd: 68 osds: 68 up, 68 in

  data:
    pools:   1 pools, 2048 pgs
    objects: 632.16k objects, 2.41TiB
    usage:   4.52TiB used, 54.9TiB / 59.4TiB avail
    pgs:     2048 active+clean

  io:
    client: 16.5KiB/s rd, 11.7MiB/s wr, 2op/s rd, 492op/s wr

ceph health detail
HEALTH_OK

pvecm status
Quorum information
------------------
Date: Fri May 31 09:19:28 2019
Quorum provider: corosync_votequorum
Nodes: 7
Node ID: 0x00000001
Ring ID: 5/1436
Quorate: Yes

Votequorum information
----------------------
Expected votes: 7
Highest expected: 7
Total votes: 7
Quorum: 4
Flags: Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000005          1 10.0.3.140
0x00000006          1 10.0.3.141
0x00000001          1 10.0.3.145 (local)
0x00000007          1 10.0.3.146
0x00000002          1 10.0.3.147
0x00000003          1 10.0.3.148
0x00000004          1 10.0.3.149
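
The quorum figures in this output follow from simple majority arithmetic: with 7 expected votes, 4 are required, so one node going down should not by itself cost the Proxmox cluster its quorum. A minimal sketch, assuming one vote per node as in the membership table; the function names are illustrative.

```shell
#!/bin/sh
# Sketch: majority quorum arithmetic, assuming one vote per node.

quorum_needed() {
    # Majority of the expected votes: floor(n / 2) + 1
    echo $(( $1 / 2 + 1 ))
}

is_quorate() {
    expected="$1"; present="$2"
    [ "$present" -ge "$(quorum_needed "$expected")" ]
}

quorum_needed 7    # prints 4, matching "Quorum: 4" in the output

if is_quorate 7 6; then
    echo "6 of 7 votes: quorate"
fi
```

By the same arithmetic, a 7-node cluster stays quorate with up to three nodes down.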

David Calvache Casas · Jun 3, 2019

2/1 is OK in a test environment.

In a production system it's a no-go. If you lose one node, nothing happens, but if something else fails before the rebuild finishes, it's game over. Be ready for a mass restore from backups, because you will have data loss for sure.

With 3/2, if you lose a second node before the rebuild finishes, nothing happens except that the system freezes; as soon as you recover one node, you can get back to work.

Sorry for my bad English.

bensode · Jun 5, 2019

Thanks for the help. I'm going to rebuild the environment correctly.

Bengt Nolin · Jun 5, 2019

Did you check the Proxmox and Ceph logs from the time of the outage?
It would be interesting to see what the cluster thought happened, for example whether it lost quorum. Similarly with Ceph: what did the monitors think happened?

Did you set Ceph to "noout" during the node reboot?
Since you have NVMe disks, you may well saturate your network during recovery, causing problems if you do not have dedicated physical links for Ceph.
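
For planned maintenance, the noout flag is typically set before the node goes down and cleared once its OSDs are back, so that Ceph does not mark them out and start rebalancing in the meantime. A minimal sketch of that sequence; the `run` helper, the `DRY_RUN` variable, and the function names are illustrative and not part of Ceph. By default the script only prints the commands, and `DRY_RUN=0` executes them.

```shell
#!/bin/sh
# Sketch: wrap a planned node reboot in the Ceph "noout" flag.
# "run" and DRY_RUN are hypothetical helpers for this example.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"    # dry run: only show the command
    else
        "$@"         # execute the command for real
    fi
}

maintenance_begin() {
    run ceph osd set noout      # keep down OSDs from being marked out
}

maintenance_end() {
    run ceph osd unset noout    # restore normal behavior
    run ceph -s                 # confirm the cluster health afterwards
}

maintenance_begin
# ... shut down or reboot the node here, then wait for its OSDs to rejoin ...
maintenance_end
```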

bensode · Jun 6, 2019

Unfortunately, I did not keep the logs before I rebuilt. I believe that the Ceph storage was the culprit, given the pool setup. Initially there were not very many guests on the cluster, and many were added over the weeks between the last two patches. I can see how losing a node in 2/2 would cause the storage to drop and produce those symptoms once utilization increased.

bensode · Jun 8, 2019

Bengt Nolin said:
bensode: If you run 2/2 Ceph pools, then when one node goes down you likely cannot access much of the pool anymore, since only 1 copy remains of any object whose PG had a replica on the node that went down.

2/2 means 2 replicas, with access to objects denied when fewer than 2 are available.
3/2 means 3 replicas, with access denied when fewer than 2 are available. You can therefore lose 1 node: if you have host replication rules for the pools (your failure domain for the pool is host), no PG can exist twice on the same host, so you still have 2 or 3 copies of all PGs and the objects in them.

An example from the Ceph documentation on pools:

ceph osd pool set data min_size 2

"This ensures that no object in the data pool will receive I/O with fewer than min_size replicas."

This might be in addition to other issues of course, I'm only pointing out that you probably want to reconsider your ceph pool configuration.

I'm curious, as I didn't see it mentioned in the documentation, whether this can be changed after the pool is created. Would it be possible to adjust min_size and size on a pool that is already in use, provided there is enough room to accommodate the change? In other words, could I have gone from 2/2 down to 2/1, or up from 2/2 to 3/2, without rebuilding? David mentioned above that I could increase to 3/2, but is the reverse true?

sb-jw · Jun 9, 2019

Yes, you can increase this. I have done this too on a production Ceph storage, but you really have to make sure that you have enough space and that your CRUSH rule is correct (distributing the data across nodes, not across OSDs).

I haven't tried decreasing the replicas, but I think this should work too, because it only changes the replica count; no change to the CRUSH map is needed.
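
As a rough way to check the "enough space" condition before raising the replica count, raw usage can be scaled by the ratio of the new size to the old one. This is only a back-of-the-envelope sketch: it assumes all used space belongs to the replicated pool being changed and ignores any overhead or headroom Ceph needs for recovery. The function name is illustrative, and the figures are taken from the `ceph -s` output posted earlier in the thread (4.52TiB used, 59.4TiB total).

```shell
#!/bin/sh
# Sketch: estimate raw usage after changing a pool's replica count.
# Arguments: current raw usage, current size, new size.

estimate_raw_usage() {
    awk -v used="$1" -v old="$2" -v new="$3" \
        'BEGIN { printf "%.2f\n", used * new / old }'
}

# 4.52 TiB used with 2 replicas, going to 3 replicas:
estimate_raw_usage 4.52 2 3    # prints 6.78
```

By this estimate, roughly 6.78TiB of the 59.4TiB raw capacity would be in use after the change, which leaves ample room in that example.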
