[SOLVED] Shutting down any node makes VMs unavailable

Discussion in 'Proxmox VE: Installation and configuration' started by bensode, May 22, 2019.

  1. bensode

    bensode Member

    Joined:
    Jan 9, 2019
    Messages:
    40
    Likes Received:
    3
    Good morning. After updating my 7-node cluster from 5.4-3 to 5.4-5, I experience a complete network failure of all VMs in the cluster if I shut down any one node. Once the node goes down we also lose the ability to console into any VM, and the VMs all disconnect from the network. If we restart the node, normal operation resumes once it comes back online: the VMs are all reachable and in the same state they were in. This was not the behavior prior to 5.4-5, so I'm curious whether we've hit a bug of sorts.

    What information can I supply to help with this issue? I need to shut down 5 of the nodes for hardware maintenance (two were done last week, before the patch, without issue).

    Thanks!
     
  2. Jeff Wadsworth

    Jeff Wadsworth New Member

    Joined:
    Aug 1, 2015
    Messages:
    11
    Likes Received:
    0
    What is your shared storage?
     
  3. bensode

    bensode Member

    Joined:
    Jan 9, 2019
    Messages:
    40
    Likes Received:
    3
    Ceph, with 108 SSD drives split evenly between the hosts.
     
  4. Jeff Wadsworth

    Jeff Wadsworth New Member

    Joined:
    Aug 1, 2015
    Messages:
    11
    Likes Received:
    0
    What is your OSD pool default size? The min? If you shut off a node, what is the status of Ceph?
     
  5. bensode

    bensode Member

    Joined:
    Jan 9, 2019
    Messages:
    40
    Likes Received:
    3
    How can I get the OSD pool default size? When I shut off a node, Ceph goes into HEALTH_WARN because the OSDs for the host that went down are offline.
     
  6. Rafael Barasuol Rohden

    Joined:
    Oct 9, 2017
    Messages:
    2
    Likes Received:
    0
    I'm in a similar situation.
    I'm using version 5.4.5.

    When I shut down one of the nodes, all the VMs also went down and only became active again after powering the node back on.

    My architecture, with 6 nodes:

    An iSCSI storage, using LVM to share the disk between nodes - 8T
    An iSCSI storage, using LVM to share the disk between nodes - 4T
    An NFS share for backup storage.
    An NFS share for temporary test storage.

    All storages are connected through a Dell 10Gb switch and a VLAN, over two dedicated NICs for this (each node has two NICs).
    All on the same network: 172.16.0.0

    So, my question is: why?
     
  7. Jeff Wadsworth

    Jeff Wadsworth New Member

    Joined:
    Aug 1, 2015
    Messages:
    11
    Likes Received:
    0
    I am running some tests on a fresh install of 5.4. So far, the VMs work fine with the loss of 1 node in a 3-node cluster with Ceph (3 OSDs per node), using a 3/1 pool for the test. If it were 2/2, even one node going offline would halt your VMs. Is your VM using the Ceph storage for its image?
    To check your pool size (replicas) in the GUI, just select a node and, under Ceph, check the Pools. It will have a "Size/min" entry.
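    On the command line, a rough equivalent for checking the replica settings (the pool name `rbd` here is only an example; substitute your own pool):

```shell
# Show the replica count and the minimum required for I/O
# for a single pool (pool name "rbd" is an example)
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# Or list every pool with its size/min_size in one go
ceph osd pool ls detail
```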
     
  8. bensode

    bensode Member

    Joined:
    Jan 9, 2019
    Messages:
    40
    Likes Received:
    3
    Ah, OK, I wasn't sure whether you were asking about PGs or what. It's 2/2 and distributed across all the nodes, so no two replicas of a PG will be on the same node. I don't believe it is a storage issue. What is happening is that the guests continue to run but lose all networking. While the one node is down I can shell into each node and bring up the web GUI for each, but anyone with a connection to a guest loses it ... RDP sessions resume if not down for too long. The noVNC console fails to connect. Further, this was not the behavior prior to the latest kernel and patch to 5.4-4.
     
  9. Rafael Barasuol Rohden

    Joined:
    Oct 9, 2017
    Messages:
    2
    Likes Received:
    0
    This problem happened in our production cluster, where the live VMs are. If it had happened in a test environment I would not be so nervous.
    I was updating the nodes, and in my case the problem only happened when the last node was disconnected. That node I was not going to upgrade, but to remove from the cluster.
     
  10. Bengt Nolin

    Bengt Nolin New Member
    Proxmox Subscriber

    Joined:
    Feb 19, 2019
    Messages:
    7
    Likes Received:
    1
    bensode: If you run 2/2 Ceph pools, it means that when one node goes down you likely cannot access much of the pool anymore, since you only have 1 copy of the objects in the PGs that partially existed on the node that went down.

    2/2 means 2 replicas, and access to objects is denied when below 2.
    3/2 means 3 replicas, and access to objects is denied when below 2, so you can lose 1 node: if you have host replication rules for the pools, no PG can exist twice on the same host (your failure domain for the pool is host), and you still have 2 or 3 copies of all PGs and the objects in those PGs.

    From the ceph documentation on pools example:

    ceph osd pool set data min_size 2

    "This ensures that no object in the data pool will receive I/O with fewer than min_size replicas."

    This might be in addition to other issues of course, I'm only pointing out that you probably want to reconsider your ceph pool configuration.
     
  11. bensode

    bensode Member

    Joined:
    Jan 9, 2019
    Messages:
    40
    Likes Received:
    3
    @Bengt Nolin -- I can rebuild those pools, that's not a problem, and good to know, but I'm positive that this is not the cause.

    I have three nodes with only NVMe drives and their own Ceph pool and OSDs. The other four nodes have SSD drives and their own Ceph pool and OSDs. If I evacuate all guests from one of the NVMe nodes and reboot or shut it down, I lose networking on every VM in the cluster. This never happened before the patching -- we've done several patches with kernel updates requiring reboots without this symptom. Once the NVMe node comes back online the VMs magically have networking and consoles again. How can this not be directly caused by something in the patch? I wish I could get a staff response on this issue, as I really don't want to rebuild the cluster and then possibly face this again with any patch down the road ...
     
  12. bensode

    bensode Member

    Joined:
    Jan 9, 2019
    Messages:
    40
    Likes Received:
    3
    Looks like I have no other option now but to rebuild this cluster tonight.
     
  13. David Calvache Casas

    Joined:
    Jun 14, 2013
    Messages:
    31
    Likes Received:
    2
    Please post: ceph -s, ceph health detail, pvecm status.


    So you don't have to rebuild the pool, just increase the replica size to 3.
    Something like:

    ceph osd pool set POOL_NAME size 3
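    A hedged companion sketch (POOL_NAME is a placeholder): after raising size to 3, min_size usually goes to 2 so that one node can still be lost without blocking I/O:

```shell
# Companion to the size change: keep serving I/O
# as long as at least 2 replicas of each object remain
ceph osd pool set POOL_NAME min_size 2

# Verify the new settings took effect
ceph osd pool get POOL_NAME size
ceph osd pool get POOL_NAME min_size
```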
     
    #13 David Calvache Casas, May 31, 2019
    Last edited: May 31, 2019
  14. bensode

    bensode Member

    Joined:
    Jan 9, 2019
    Messages:
    40
    Likes Received:
    3
    I'd prefer to keep only two replicas but will increase it to three to help flesh this out. If I only wanted 2 replicas across 5 nodes (I'm planning to remove two of the hosts), is it safe to run 2/1, or do we really need 3/2? Below are the results of what you asked for.

    ceph -s

      cluster:
        id:     2c6baf2f-c1d4-4da6-9dbd-fc1b850074cc
        health: HEALTH_OK

      services:
        mon: 3 daemons, quorum prdpvessd01,prdpvessd03,prdpvessd05
        mgr: prdpvessd03(active), standbys: prdpvessd05, prdpvessd01
        osd: 68 osds: 68 up, 68 in

      data:
        pools:   1 pools, 2048 pgs
        objects: 632.16k objects, 2.41TiB
        usage:   4.52TiB used, 54.9TiB / 59.4TiB avail
        pgs:     2048 active+clean

      io:
        client: 16.5KiB/s rd, 11.7MiB/s wr, 2op/s rd, 492op/s wr


    ceph health detail
    HEALTH_OK

    pvecm status
    Quorum information
    ------------------
    Date: Fri May 31 09:19:28 2019
    Quorum provider: corosync_votequorum
    Nodes: 7
    Node ID: 0x00000001
    Ring ID: 5/1436
    Quorate: Yes

    Votequorum information
    ----------------------
    Expected votes: 7
    Highest expected: 7
    Total votes: 7
    Quorum: 4
    Flags: Quorate

    Membership information
    ----------------------
    Nodeid Votes Name
    0x00000005 1 10.0.3.140
    0x00000006 1 10.0.3.141
    0x00000001 1 10.0.3.145 (local)
    0x00000007 1 10.0.3.146
    0x00000002 1 10.0.3.147
    0x00000003 1 10.0.3.148
    0x00000004 1 10.0.3.149
     
  15. David Calvache Casas

    Joined:
    Jun 14, 2013
    Messages:
    31
    Likes Received:
    2
    2/1 in a test environment is OK.

    In a production system it's a no-go. If you lose one node, nothing happens, but if something else occurs before the rebuild finishes, it's game over. Be ready for a mass restoration from backups, because you will have data loss for sure.

    With 3/2, if you lose a second node before the rebuild, nothing happens except that the system freezes; as soon as you recover one node, you can get back to work.


    Sorry for my bad english.. :)
     
    bensode likes this.
  16. bensode

    bensode Member

    Joined:
    Jan 9, 2019
    Messages:
    40
    Likes Received:
    3
    Thanks for the help. I'm going to rebuild the environment correctly.
     
  17. Bengt Nolin

    Bengt Nolin New Member
    Proxmox Subscriber

    Joined:
    Feb 19, 2019
    Messages:
    7
    Likes Received:
    1
    Did you check the Proxmox and Ceph logs from during the outage?
    It would be interesting to see what the cluster thought happened, e.g. whether it lost quorum. Similarly with Ceph: what did the monitors think happened?

    Did you set Ceph to "noout" during the node reboot?
    Since you have NVMe disks you may well saturate your network when repairing, causing problems if you do not have exclusive physical links for Ceph.
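    For reference, a minimal sketch of the usual planned-maintenance flow with the standard Ceph flag commands (run from any node with admin access):

```shell
# Before the planned reboot: stop Ceph from marking down OSDs
# "out" and triggering a rebalance while the node is away
ceph osd set noout

# ... reboot the node, wait for its OSDs to come back "up" ...

# After the node has rejoined, clear the flag again
ceph osd unset noout

# Confirm the flag is gone and the cluster is healthy
ceph -s
```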
     
  18. bensode

    bensode Member

    Joined:
    Jan 9, 2019
    Messages:
    40
    Likes Received:
    3
    Unfortunately I did not keep the logs before I rebuilt. I believe the Ceph storage with the 2/2 pool setup was the culprit. Initially there were not very many guests on the cluster, and over the weeks between the last two patches many were added. I can see how losing a node at 2/2 would cause the storage to drop and produce those symptoms, given the increased utilization.
     
    Bengt Nolin likes this.
  19. bensode

    bensode Member

    Joined:
    Jan 9, 2019
    Messages:
    40
    Likes Received:
    3
    Curious, as I didn't see it mentioned in the documentation whether this can be changed after creation. Would it be possible to adjust the min_size and size while the pool is already in use, assuming there is enough room to accommodate it? In other words, could I have gone from 2/2 down to 2/1, or up from 2/2 to 3/2, without rebuilding? David mentioned above that I could increase to 3/2, but is the reverse true?

     
  20. sb-jw

    sb-jw Active Member

    Joined:
    Jan 23, 2018
    Messages:
    445
    Likes Received:
    37
    Yes, you can increase this. I have done it too, on a production Ceph storage, but you have to really make sure you have enough space and that your CRUSH rule is correct (distribute the data across nodes, not across OSDs).

    I haven't tried decreasing the replicas, but I think that should work too, because it's only the replica count; no change in the CRUSH map is needed.
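    A hedged sketch of checking the two constraints mentioned above before changing a pool in place (POOL_NAME is a placeholder):

```shell
# 1) Make sure there is enough free capacity for a third replica
ceph df

# 2) Inspect the CRUSH rules: the pool's rule should choose its
#    failure domain at "host" level, not "osd", so replicas land
#    on different nodes
ceph osd crush rule dump

# Then change the replica count in place and watch recovery
ceph osd pool set POOL_NAME size 3
ceph -s
```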
     
    bensode likes this.