Updating nodes with Ceph

Discussion in 'Proxmox VE: Installation and configuration' started by ethaniel86, May 22, 2019.

  1. ethaniel86

    ethaniel86 New Member

    Joined:
    Oct 6, 2015
    Messages:
    7
    Likes Received:
    0
    Is there a KB we can refer to on how to properly update a cluster with Ceph storage? Do we update one node at a time, or can we do it in parallel? Thanks!

    We are on Proxmox 5.3 and are currently updating to the latest release, 5.4.
     
  2. rhonda

    rhonda Proxmox Staff Member
    Staff Member

    Joined:
    Sep 3, 2018
    Messages:
    47
    Likes Received:
    11
    In general I suggest upgrading one node at a time and moving all containers and VMs off that node for the duration. Especially when it comes to kernel upgrades, you will want to reboot anyway. :)

    Try to move VMs and containers to a node that runs the same or a newer stack, not the other way round, to avoid potential side effects. There shouldn't be any between 5.3 and 5.4, but this is general advice.
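    For reference, a minimal per-node sequence could look like the sketch below; the VM/CT IDs and the target node are placeholders, and pveupgrade is essentially a wrapper around apt dist-upgrade:

    Code:
    # sketch only -- <vmid>, <ctid> and <target-node> are placeholders
    qm migrate <vmid> <target-node> --online    # live-migrate each VM off the node
    pct shutdown <ctid>
    pct migrate <ctid> <target-node>            # containers are migrated offline here
    apt update
    apt dist-upgrade                            # or: pveupgrade
    systemctl reboot                            # if a new kernel was installed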
     
  3. ethaniel86

    ethaniel86 New Member

    Joined:
    Oct 6, 2015
    Messages:
    7
    Likes Received:
    0
    Alright, thanks. Are there any known issues with cloud-init and 5.4? Most of our OS templates depend on cloud-init for provisioning.
     
  4. rhonda

    rhonda Proxmox Staff Member
    Staff Member

    Joined:
    Sep 3, 2018
    Messages:
    47
    Likes Received:
    11
    We are not aware of any new issues with 5.4 and cloud-init. A few things have been fixed, but if it worked for you with 5.3, it is expected to work the same with 5.4.
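    If you want to sanity-check cloud-init after the upgrade, a quick clone from one of your templates is usually enough. Roughly (template ID, new VM ID, user and key path are placeholders):

    Code:
    qm clone <template-id> <new-id> --name ci-test --full
    qm set <new-id> --ciuser <user> --sshkey ~/.ssh/id_rsa.pub --ipconfig0 ip=dhcp
    qm start <new-id>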
     
  5. RokaKen

    RokaKen New Member
    Proxmox Subscriber

    Joined:
    Oct 11, 2018
    Messages:
    18
    Likes Received:
    4
    I realize the above advice has been the "official" response and has been consistent among staff answers. However, I'm concerned about "unnecessary" rebalancing of PGs triggered by rebooting a node (and related side effects). I would like to propose the following for comment by staff and by members with more extensive Ceph experience:

    Code:
    # Node maintenance

        # stop and wait for scrub and deep-scrub operations
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    ceph status

        # set the cluster to maintenance mode
    ceph osd set norecover
    ceph osd set nobackfill
    ceph osd set norebalance
    ceph osd set noout

        # for each node 1..N:

        # migrate VMs and CTs off the node (GUI or CLI)

        # run updates
    apt update && pveupgrade

        # determine which OSDs are on the node, and verify it is OK to reboot
    systemctl status system-ceph\\x2dosd.slice
    ceph osd ok-to-stop <id> [<ids>...]
    shutdown -r now

        # wait for the node to come back online and rejoin quorum,
        # then continue with the next node

    # restore normal operation
    ceph osd unset noout
    ceph osd unset norebalance
    ceph osd unset nobackfill
    ceph osd unset norecover

        # when all PGs are active again, re-enable scrub and deep-scrub
    ceph status
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub

        # done
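    One thing I would add between unsetting the recovery flags and re-enabling scrubs is a crude wait loop, so the scrub flags are not cleared while PGs are still settling. This is only a heuristic based on the plain-text output of ceph pg stat:

    Code:
    # keep polling while any PG still reports a non-clean state
    while ceph pg stat | grep -Eq 'degraded|peering|remapped|undersized|backfill'; do
        sleep 30
    done
    ceph status    # confirm manually before unsetting noscrub/nodeep-scrub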
    
    Thanks
     
    #5 RokaKen, May 22, 2019
    Last edited: May 22, 2019
  6. sb-jw

    sb-jw Active Member

    Joined:
    Jan 23, 2018
    Messages:
    445
    Likes Received:
    37
    @RokaKen normally it's enough to set noout; for scrubbing you should define time ranges during the night, or at other times when there isn't much traffic.

    On my setup I only set noout (which I keep set permanently because of the cluster size) and have had no problems with this to date. There was no trouble in the cluster; if you have a correct crushmap there should be no reason for problems. And if you have problems during planned maintenance, what are you going to do when a node actually fails?
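    For the scrub time ranges, something along these lines should work on Luminous; the hours here are only an example window:

    Code:
    # runtime injection (example window 01:00-06:00); to persist it, set
    # osd_scrub_begin_hour / osd_scrub_end_hour in the [osd] section of ceph.conf
    ceph tell osd.* injectargs '--osd_scrub_begin_hour 1 --osd_scrub_end_hour 6'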
     
  7. RokaKen

    RokaKen New Member
    Proxmox Subscriber

    Joined:
    Oct 11, 2018
    Messages:
    18
    Likes Received:
    4
    Thanks @sb-jw. I was considering limiting the scrub time ranges, but those ranges will also coincide with planned maintenance windows most of the time. As for the rest, I didn't have any specific problem -- it's more my perception of unnecessary churn in the cluster from dropping multiple OSDs during a node reboot. I realize that most options beyond 'noout' are overkill, but they might provoke discussion. Node failure scenarios should probably be a different thread/topic.
     
  8. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,172
    Likes Received:
    191
    The default timeout before recovery starts is 600 seconds. If the node boots faster than that, it shouldn't trigger a rebalance, but every new object written in the meantime will either be distributed to other nodes or copied over once the node is back up. In general, some data movement is always to be expected. But as @sb-jw said, it should not be a problem; otherwise there is an issue with the cluster itself.

    EDIT: Of course, setting noout is a good measure to take.
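    That 600 second default is mon_osd_down_out_interval. If a node is expected to stay down longer, it can be checked and temporarily raised, roughly like this (the mon ID and the 1200 value are only examples):

    Code:
    # query the current value via the monitor's admin socket (run on the mon host)
    ceph daemon mon.<id> config get mon_osd_down_out_interval
    # raise it for a longer maintenance window, restore it afterwards
    ceph tell mon.* injectargs '--mon_osd_down_out_interval 1200'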
     
    RokaKen likes this.