Cluster nodes "offline" but working

Discussion in 'Proxmox VE: Installation and configuration' started by jarenas, Apr 12, 2018.

  1. jarenas

    jarenas Member

    Joined:
    Mar 7, 2018
    Messages:
    33
    Likes Received:
    0
    I have created a new cluster with 4 nodes. The problem is that when I reboot them, all of them are working at first, but after some minutes some of them say they are disconnected.

    [Screenshot: cluster overview with some nodes shown as offline]

    When this happens I run the following in the node's shell:

    service corosync restart

    After executing this command the node is online again, but then the next one (cp3) turns "offline".

    I don't know what is happening.

    Can somebody help me?

    Thanks, regards!
     
    #1 jarenas, Apr 12, 2018
    Last edited: Apr 13, 2018
  2. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,664
    Likes Received:
    309
    Hi,

    try to restart pvestatd on the whole cluster.
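    For example, a minimal sketch of doing that from one shell, assuming the node names cp1-cp4 used in this cluster and working root SSH access between the nodes:

    Code:
    # restart pvestatd on every node (sketch; assumes root SSH to each node)
    for node in cp1 cp2 cp3 cp4; do
        ssh root@"$node" 'systemctl restart pvestatd'
    done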
     
  3. jarenas

    jarenas Member

    Joined:
    Mar 7, 2018
    Messages:
    33
    Likes Received:
    0
    I have done it and nothing happens :(
     
  4. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    34
    Try all 4

    service pve-cluster restart
    service pveproxy restart
    service pvedaemon restart
    service pvestatd restart
     
  5. jarenas

    jarenas Member

    Joined:
    Mar 7, 2018
    Messages:
    33
    Likes Received:
    0
    I've run these commands on three nodes, but on one of them I'm getting this:

    root@cp2:~# service pve-cluster restart
    Job for pve-cluster.service failed because the control process exited with error code.
    See "systemctl status pve-cluster.service" and "journalctl -xe" for details.


    journalctl output:

    Apr 15 11:15:28 cp2 pvestatd[232476]: status update error: Connection refused
    Apr 15 11:15:29 cp2 pveproxy[510975]: worker exit
    Apr 15 11:15:29 cp2 pveproxy[510976]: worker exit
    Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510975 finished
    Apr 15 11:15:29 cp2 pveproxy[510791]: starting 1 worker(s)
    Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510976 finished
    Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510978 started
    Apr 15 11:15:29 cp2 pveproxy[510977]: worker exit
    Apr 15 11:15:29 cp2 pveproxy[510978]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
    Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510977 finished
    Apr 15 11:15:29 cp2 pveproxy[510791]: starting 2 worker(s)
    Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510979 started
    Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510980 started
    Apr 15 11:15:29 cp2 pveproxy[510979]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
    Apr 15 11:15:29 cp2 pveproxy[510980]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
    Apr 15 11:15:32 cp2 pve-ha-crm[3460]: ipcc_send_rec[1] failed: Connection refused
    Apr 15 11:15:32 cp2 pve-ha-lrm[4838]: ipcc_send_rec[1] failed: Connection refused
    Apr 15 11:15:32 cp2 pve-ha-crm[3460]: ipcc_send_rec[2] failed: Connection refused
    Apr 15 11:15:32 cp2 pve-ha-lrm[4838]: ipcc_send_rec[2] failed: Connection refused
    Apr 15 11:15:32 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused
    Apr 15 11:15:32 cp2 pve-ha-lrm[4838]: ipcc_send_rec[3] failed: Connection refused
    Apr 15 11:15:34 cp2 pveproxy[510978]: worker exit
    Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510978 finished
    Apr 15 11:15:34 cp2 pveproxy[510791]: starting 1 worker(s)
    Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510986 started
    Apr 15 11:15:34 cp2 pveproxy[510979]: worker exit
    Apr 15 11:15:34 cp2 pveproxy[510980]: worker exit
    Apr 15 11:15:34 cp2 pveproxy[510986]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
    Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510979 finished
    Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510980 finished
    Apr 15 11:15:34 cp2 pveproxy[510791]: starting 2 worker(s)
    Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510987 started
    Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510988 started
    Apr 15 11:15:34 cp2 pveproxy[510987]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
    Apr 15 11:15:34 cp2 pveproxy[510988]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.


    I think that this is happening because it can't mount the pve configuration filesystem (/etc/pve).
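    A quick way to check that theory (a sketch using standard commands, not something from this thread):

    Code:
    # is the pmxcfs cluster filesystem mounted on /etc/pve?
    mount | grep /etc/pve
    # state and recent log of the service that provides it
    systemctl status pve-cluster
    journalctl -u pve-cluster -n 50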
     
  6. r.jochum

    r.jochum Member

    Joined:
    Mar 26, 2018
    Messages:
    88
    Likes Received:
    13
    What does
    Code:
    $ systemctl
    show? Any errors?

    Especially look for "pve-ha-crm.service".
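    A shorter way to see only the problem units (a sketch; plain systemd, nothing Proxmox-specific):

    Code:
    # list only units that are currently in a failed state
    systemctl --failed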
     
  7. jarenas

    jarenas Member

    Joined:
    Mar 7, 2018
    Messages:
    33
    Likes Received:
    0
    These have errors:

    pve-cluster.service loaded failed failed The Proxmox VE cluster filesystem
    pvesr.service loaded failed failed Proxmox VE replication runner
    zfs-mount.service loaded failed failed Mount ZFS filesystems
    zfs-share.service loaded failed failed ZFS file system shares

    And I also get this when I execute systemctl status pve-ha-crm.service:

    systemctl status pve-ha-crm.service
    ● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
    Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled; vendor preset: enabled)
    Active: active (running) since Thu 2018-04-12 16:44:08 CEST; 2 days ago
    Main PID: 3460 (pve-ha-crm)
    Tasks: 1 (limit: 36864)
    Memory: 81.2M
    CPU: 19.103s
    CGroup: /system.slice/pve-ha-crm.service
    └─3460 pve-ha-crm

    Apr 15 13:00:00 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused
    Apr 15 13:00:05 cp2 pve-ha-crm[3460]: ipcc_send_rec[1] failed: Connection refused
    Apr 15 13:00:05 cp2 pve-ha-crm[3460]: ipcc_send_rec[2] failed: Connection refused
    Apr 15 13:00:05 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused
    Apr 15 13:00:10 cp2 pve-ha-crm[3460]: ipcc_send_rec[1] failed: Connection refused
    Apr 15 13:00:10 cp2 pve-ha-crm[3460]: ipcc_send_rec[2] failed: Connection refused
    Apr 15 13:00:10 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused
    Apr 15 13:00:15 cp2 pve-ha-crm[3460]: ipcc_send_rec[1] failed: Connection refused
    Apr 15 13:00:15 cp2 pve-ha-crm[3460]: ipcc_send_rec[2] failed: Connection refused
    Apr 15 13:00:15 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused


    Thanks! Regards

     
  8. r.jochum

    r.jochum Member

    Joined:
    Mar 26, 2018
    Messages:
    88
    Likes Received:
    13
    Do you have the right IP set for "pvelocalhost"? I've seen this before where pvelocalhost was wrong, but that was a single node.
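    For reference, a sketch of what that /etc/hosts entry usually looks like on a node; the IP here matches the bindnetaddr shown later in this thread, and the domain is just a placeholder:

    Code:
    # /etc/hosts on cp1 (example values, hypothetical domain)
    127.0.0.1       localhost
    10.85.20.101    cp1.example.local cp1 pvelocalhost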
     
  9. jarenas

    jarenas Member

    Joined:
    Mar 7, 2018
    Messages:
    33
    Likes Received:
    0
    I don't know what you mean by the IP set for pvelocalhost. I've got two IPs configured in /etc/network/interfaces, one for my LAN network and another for Ceph.
     
  10. r.jochum

    r.jochum Member

    Joined:
    Mar 26, 2018
    Messages:
    88
    Likes Received:
    13
  11. jarenas

    jarenas Member

    Joined:
    Mar 7, 2018
    Messages:
    33
    Likes Received:
    0
  12. GadgetPig

    GadgetPig Member

    Joined:
    Apr 26, 2016
    Messages:
    138
    Likes Received:
    19
    For node "cp2" that shows offline, could you post that server's output of

    #cat /etc/corosync/corosync.conf
     
    #12 GadgetPig, Apr 16, 2018
    Last edited: Apr 16, 2018
  13. jarenas

    jarenas Member

    Joined:
    Mar 7, 2018
    Messages:
    33
    Likes Received:
    0
    cat /etc/corosync/corosync.conf
    Code:
    logging {
      debug: off
      to_syslog: yes
    }

    nodelist {
      node {
        name: cp1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: cp1
      }
      node {
        name: cp2
        nodeid: 2
        quorum_votes: 1
        ring0_addr: cp2
      }
      node {
        name: cp3
        nodeid: 3
        quorum_votes: 1
        ring0_addr: cp3
      }
      node {
        name: cp4
        nodeid: 4
        quorum_votes: 1
        ring0_addr: cp4
      }
    }

    quorum {
      provider: corosync_votequorum
    }

    totem {
      cluster_name: cpx
      config_version: 4
      interface {
        bindnetaddr: 10.85.20.101
        ringnumber: 0
      }
      ip_version: ipv4
      secauth: on
      version: 2
    }
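    Since the ring0_addr entries are hostnames rather than IP addresses, a quick check that they all resolve on each node might be worthwhile (sketch):

    Code:
    # confirm each cluster hostname resolves to the expected address on this node
    getent hosts cp1 cp2 cp3 cp4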
     
  14. jarenas

    jarenas Member

    Joined:
    Mar 7, 2018
    Messages:
    33
    Likes Received:
    0
    After doing this I have this problem:
     

    Attached Files:

  15. jarenas

    jarenas Member

    Joined:
    Mar 7, 2018
    Messages:
    33
    Likes Received:
    0
    Nobody knows anything?
     
  16. r.jochum

    r.jochum Member

    Joined:
    Mar 26, 2018
    Messages:
    88
    Likes Received:
    13
    Hi jarenas,

    I've had a talk with Alwin; the root cause could be either your switch and multicast packets, or an incorrect /etc/hosts on each node.

    First make sure you can resolve "cpX" from each node. If that works, have a look at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network

    Start omping on each node, then restart corosync. If omping reports errors at the same time the host goes down in PVE, it's your switch.

    Ask here if you need help with your switch.

    I hope you get that fixed.
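    For reference, the multicast test from the linked cluster network documentation; run it on all nodes at roughly the same time (node names as used in this thread):

    Code:
    # short burst test (~10000 packets, 1 ms apart)
    omping -c 10000 -i 0.001 -F -q cp1 cp2 cp3 cp4
    # longer ~10 minute test to catch IGMP snooping / querier problems
    omping -c 600 -i 1 -q cp1 cp2 cp3 cp4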
     
    #16 r.jochum, May 7, 2018
    Last edited: May 8, 2018
  17. jarenas

    jarenas Member

    Joined:
    Mar 7, 2018
    Messages:
    33
    Likes Received:
    0
    Yes, it is a switch problem (with multicast), so in the end I tried to convert the cluster to unicast, following the steps on this page:

    https://pve.proxmox.com/wiki/Multicast_notes

    But I had problems:

    https://forum.proxmox.com/threads/need-to-restore-corosync-conf-file.43286/#post-207623

    This is my original corosync.conf, and I have marked (with arrows) the things that I think I have to modify in the file:

    Code:
    logging {
      debug: off
      to_syslog: yes
    }

    nodelist {
      node {
        name: cp1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: cp1
      }
      node {
        name: cp2
        nodeid: 2
        quorum_votes: 1
        ring0_addr: cp2
      }
      node {
        name: cp3
        nodeid: 3
        quorum_votes: 1
        ring0_addr: cp3
      }
      node {
        name: cp4
        nodeid: 4
        quorum_votes: 1
        ring0_addr: cp4
      }
    }

    quorum {
      provider: corosync_votequorum
    }

    totem {
      <-- Here I think that I have to add "transport: udpu"
      cluster_name: cp-oficina
      config_version: 4
      interface {
        bindnetaddr: 10.85.20.101
        ringnumber: 0
      }
      ip_version: ipv4
      secauth: on
      version: 2    <-- And here I think that I have to change the version
    }
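    For reference, a sketch of that change based on the Multicast notes page, not confirmed in this thread: transport: udpu goes into the totem block and config_version is the field that gets incremented, while "version: 2" is the totem protocol version and stays unchanged.

    Code:
    totem {
      cluster_name: cp-oficina
      config_version: 5        # bumped from 4 so the change propagates
      transport: udpu          # unicast UDP instead of multicast
      interface {
        bindnetaddr: 10.85.20.101
        ringnumber: 0
      }
      ip_version: ipv4
      secauth: on
      version: 2               # totem protocol version, unchanged
    }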
     
  18. r.jochum

    r.jochum Member

    Joined:
    Mar 26, 2018
    Messages:
    88
    Likes Received:
    13
    Don't go UNICAST, you'll likely run into a lot of trouble (lots of traffic); fix your switch :)
     
  19. jarenas

    jarenas Member

    Joined:
    Mar 7, 2018
    Messages:
    33
    Likes Received:
    0
    I would like to, but at this moment it's not possible: the switch holds a lot of important configuration, it needs an update to work with multicast, and if we update it, maybe that configuration won't work anymore.

    Thanks, regards!
     
  20. canifer

    canifer New Member

    Joined:
    Mar 1, 2019
    Messages:
    1
    Likes Received:
    0
    Hello, I just had the same problem after a power outage. Just run
    Code:
    systemctl restart corosync
    and it's connected to the cluster again, just like that.

    Thank you.
     