Node with question mark

Discussion in 'Proxmox VE: Installation and configuration' started by decibel83, Feb 5, 2018.

  1. decibel83

    decibel83 Member

    Joined:
    Oct 15, 2008
    Messages:
    140
    Likes Received:
    0
    Hi,

    suddenly one node in my Proxmox 5.1 cluster became unavailable in the web interface and its icon turned grey with a question mark, as in the following screenshot:

    Screen Shot 2018-02-05 at 09.18.32.png

    This has happened three times on three different nodes (I rebooted node01 and node02 to solve the problem).

    All virtual machines on the node are running fine, and the web interface on the failed node is reachable, but it shows the same picture (if I connect to the node03 web interface, all nodes are green except node03 itself).

    All nodes are correctly pingable from node03.

    In the datacenter summary all 11 nodes are Online.

    I don't see any error in /var/log/syslog on node03.

    The call to the status API (https://node1:8006/api2/json/cluster/status) returns all nodes as online, but node03 has "level":null instead of "level":"":

    Code:
    {"data":[{"nodes":11,"id":"cluster","type":"cluster","name":"mycluster","version":11,"quorate":1},
    {"nodeid":6,"id":"node/node10","local":0,"name":"node10","online":1,"ip":"192.168.60.10","level":"","type":"node"},
    {"ip":"192.168.60.2","level":"","type":"node","nodeid":10,"id":"node/node02","local":0,"name":"node02","online":1},
    {"ip":"192.168.60.11","level":"","type":"node","nodeid":2,"id":"node/node11","local":0,"name":"node11","online":1},
    {"ip":"192.168.60.3","level":null,"type":"node","nodeid":7,"id":"node/node03","local":0,"name":"node03","online":1},
    {"online":1,"nodeid":8,"id":"node/node01","name":"node01","local":1,"ip":"192.168.60.1","level":"","type":"node"},
    {"name":"node09","local":0,"nodeid":11,"id":"node/node09","online":1,"type":"node","level":"","ip":"192.168.60.9"},
    {"level":"","type":"node","ip":"192.168.60.5","id":"node/node05","nodeid":1,"local":0,"name":"node05","online":1},
    {"ip":"192.168.60.6","level":"","type":"node","id":"node/node06","nodeid":3,"name":"node06","local":0,"online":1},
    {"ip":"192.168.60.8","type":"node","level":"","online":1,"local":0,"name":"node08","id":"node/node08","nodeid":5},
    {"level":"","type":"node","ip":"192.168.60.4","nodeid":9,"id":"node/node04","local":0,"name":"node04","online":1},
    {"ip":"192.168.60.7","type":"node","level":"","name":"node07","local":0,"id":"node/node07","nodeid":4,"online":1}]}
    This is the pvecm status output from node03:

    Code:
    root@node03:~# pvecm status
    Quorum information
    ------------------
    Date:             Mon Feb  5 10:09:54 2018
    Quorum provider:  corosync_votequorum
    Nodes:            11
    Node ID:          0x00000007
    Ring ID:          8/1256
    Quorate:          Yes
    
    Votequorum information
    ----------------------
    Expected votes:   11
    Highest expected: 11
    Total votes:      11
    Quorum:           6
    Flags:            Quorate
    
    Membership information
    ----------------------
        Nodeid      Votes Name
    0x00000008          1 192.168.60.1
    0x0000000a          1 192.168.60.2
    0x00000007          1 192.168.60.3 (local)
    0x00000009          1 192.168.60.4
    0x00000001          1 192.168.60.5
    0x00000003          1 192.168.60.6
    0x00000004          1 192.168.60.7
    0x00000005          1 192.168.60.8
    0x0000000b          1 192.168.60.9
    0x00000006          1 192.168.60.10
    0x00000002          1 192.168.60.11
    All nodes are updated to the latest version of Proxmox and the latest kernel (I updated all packages yesterday):

    Code:
    root@node03:~# pveversion -v
    proxmox-ve: 5.1-38 (running kernel: 4.13.13-5-pve)
    pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
    pve-kernel-4.13.4-1-pve: 4.13.4-26
    pve-kernel-4.13.13-2-pve: 4.13.13-33
    pve-kernel-4.13.13-5-pve: 4.13.13-38
    pve-kernel-4.13.13-3-pve: 4.13.13-34
    libpve-http-server-perl: 2.0-8
    lvm2: 2.02.168-pve6
    corosync: 2.4.2-pve3
    libqb0: 1.0.1-1
    pve-cluster: 5.0-19
    qemu-server: 5.0-20
    pve-firmware: 2.0-3
    libpve-common-perl: 5.0-25
    libpve-guest-common-perl: 2.0-14
    libpve-access-control: 5.0-7
    libpve-storage-perl: 5.0-17
    pve-libspice-server1: 0.12.8-3
    vncterm: 1.5-3
    pve-docs: 5.1-16
    pve-qemu-kvm: 2.9.1-6
    pve-container: 2.0-18
    pve-firewall: 3.0-5
    pve-ha-manager: 2.0-4
    ksm-control-daemon: 1.2-2
    glusterfs-client: 3.8.8-1
    lxc-pve: 2.1.1-2
    lxcfs: 2.0.8-1
    criu: 2.11.1-1~bpo90
    novnc-pve: 0.6-4
    smartmontools: 6.5+svn4324-1
    zfsutils-linux: 0.7.4-pve2~bpo9
    Could you help me please?

    Thanks!
     
    #1 decibel83, Feb 5, 2018
    Last edited: Feb 5, 2018
  2. Hugo Matos

    Hugo Matos New Member

    Joined:
    Mar 9, 2016
    Messages:
    1
    Likes Received:
    0
    Hi,
    I have the same problem, and the only way to fix it is rebooting.

    Does anyone else have this problem?
     


  3. masterdaweb

    masterdaweb Member

    Joined:
    Apr 17, 2017
    Messages:
    67
    Likes Received:
    1
    Same problem here, I've just posted another thread about this.

    I think this is caused by the latest updates.
     
  4. fadmedi

    fadmedi New Member

    Joined:
    Feb 17, 2018
    Messages:
    2
    Likes Received:
    0
    Hello everyone,

    I am on version 5.1.
    I have the same problem; it has happened 3 times since last month on 2 different nodes. I have to reboot the node to rectify the issue, which is not a permanent resolution.
    In the console, my cluster status is OK. My cluster consists of 3 nodes.
    One node is not responding in the web interface, but its containers work fine.
    Please help me, I would like to find a permanent resolution for this problem.

    upload_2018-2-17_17-43-36.png


    upload_2018-2-17_17-42-26.png
     
    #4 fadmedi, Feb 17, 2018
    Last edited: Feb 17, 2018
  5. dcsapak

    dcsapak Proxmox Staff Member
    Staff Member

    Joined:
    Feb 1, 2016
    Messages:
    2,167
    Likes Received:
    191
    You should check whether the pvestatd daemon is still running, and whether there is a storage that blocks (e.g. NFS). pvestatd is responsible for collecting and sending that status information across the cluster; if it hangs or crashes (most often because of an error with a storage), it stops sending that information.
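
    A rough sketch of those checks (the timeout values are arbitrary; <nfs-storage> stands for whatever storage ID is configured, which PVE mounts under /mnt/pve/):

    Code:
    # is pvestatd still running?
    systemctl status pvestatd
    # poll all configured storages; a blocking NFS storage will make this hang
    timeout 30 pvesm status || echo "a storage is not answering"
    # probe the NFS mount point directly without blocking the shell forever
    timeout 10 df -h /mnt/pve/<nfs-storage> || echo "mount point is blocking"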
     
  6. lastb0isct

    lastb0isct Member

    Joined:
    Dec 29, 2015
    Messages:
    61
    Likes Received:
    0
    I've been having this issue as well: whenever I initiate a backup, the system faults and is thrown into this state. There really isn't anything in the logs to go on. Restarting services doesn't seem to fix the issue either. My backups are sent to an NFS share, but it NEVER had issues like this before Proxmox 5.x.

    pvestatd is still running and a restart doesn't solve the issue.
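
    In case it helps someone else debugging this: a hung NFS server usually leaves pvestatd in uninterruptible sleep ("D" state), which a plain restart cannot clear. A small sketch to confirm that (standard procps/util-linux commands, nothing PVE-specific):

    Code:
    # show state and kernel wait channel of pvestatd; "D" means stuck in the kernel, often on NFS I/O
    ps -o pid,stat,wchan:32,cmd -C pvestatd
    # list the NFS mounts that could be the culprit
    findmnt -t nfs,nfs4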
     
  7. masterdaweb

    masterdaweb Member

    Joined:
    Apr 17, 2017
    Messages:
    67
    Likes Received:
    1
    Same here, every week.
     
  8. fadmedi

    fadmedi New Member

    Joined:
    Feb 17, 2018
    Messages:
    2
    Likes Received:
    0
    While waiting for an update from the Proxmox team, I have decided to downgrade one node to version 5.0. I have not had the issue for 4 days now.
     
  9. masterdaweb

    masterdaweb Member

    Joined:
    Apr 17, 2017
    Messages:
    67
    Likes Received:
    1
    In my case everything runs fine for 2-3 weeks, and then it happens.

    I'm investigating whether it could be caused by using unicast instead of multicast.

    Are you using unicast too?
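
    For anyone comparing setups, whether a cluster runs unicast (UDPU) or multicast can be read from the corosync configuration. A small sketch, assuming the default PVE paths:

    Code:
    # "transport: udpu" in the totem section means unicast; with no transport line, corosync defaults to multicast
    grep -A 10 '^totem' /etc/pve/corosync.conf
    # corosync's runtime view (only shows the key if it was set explicitly)
    corosync-cmapctl | grep totem.transport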
     
  10. lastb0isct

    lastb0isct Member

    Joined:
    Dec 29, 2015
    Messages:
    61
    Likes Received:
    0
    I'm having issues with this constantly now, not just when backing up. I'm really not able to find any clues as to why this is happening. Even simple power-offs of CTs cause it now. Is there anyone on the Proxmox team who would be able to help us with this?!
     
  11. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    12,825
    Likes Received:
    304
    Without knowing your setup, I assume the issue is somewhere in your cluster network. Most issues are caused when the storage and cluster networks are not separated and/or the network does not meet the requirements regarding latency and reliability, or when the storage is overloaded.

    Do you have a separate cluster network? Test with omping whether it works reliably.
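
    A sketch of such a test, along the lines of what the PVE documentation suggests (run the same command on all nodes in parallel; the node names are placeholders for your own):

    Code:
    # ~10 seconds of rapid probes between the listed nodes; packet loss should stay at 0%
    omping -c 10000 -i 0.001 -F -q node1 node2 node3
    # longer run (~10 minutes) to catch multicast querier timeouts
    omping -c 600 -i 1 -q node1 node2 node3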

    We can analyse your setup in depth by logging into your cluster via SSH; please contact our enterprise support team (subscription needed).
     
  12. venk25

    venk25 New Member

    Joined:
    Feb 5, 2018
    Messages:
    6
    Likes Received:
    0
    I ran into this grey question mark situation yesterday on PVE 5.1-43. Single-node setup (installed PVE fresh 2 weeks ago) with all default options; no cluster. Rebooting the node fixed the issue.

    After reboot, I applied all updates - now at 5.1-46. Let’s see if this happens again.
     
  13. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    33
  14. Jospeh Huber

    Jospeh Huber Member

    Joined:
    Apr 18, 2016
    Messages:
    52
    Likes Received:
    2
    5.1-46: this has now happened for the second time this week, on the same node in my cluster.
    When I restart pvestatd on the node, the KVM guests become visible again.
    Some of the LXC containers are running and some are dead.
    "pct list" hangs...
     
  15. masterdaweb

    masterdaweb Member

    Joined:
    Apr 17, 2017
    Messages:
    67
    Likes Received:
    1
    It's happening every week for me too. I have 12 nodes, and when it happens I have to stop all Proxmox services on every node:

    Code:
    service pve-cluster stop
    service corosync stop
    service pvestatd stop
    service pveproxy stop
    service pvedaemon stop

    and then:

    Code:
    service pve-cluster start
    service corosync start
    service pvestatd start
    service pveproxy start
    service pvedaemon start
     
  16. Jospeh Huber

    Jospeh Huber Member

    Joined:
    Apr 18, 2016
    Messages:
    52
    Likes Received:
    2
    Thanks for the hint, I will also try the "solution" from here (we don't have ZFS):
    https://forum.proxmox.com/threads/p...el-tainted-pvestatd-frozen.38408/#post-189727
     
  17. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    33
    It didn't work in my case.

    I had to restart the node to get the issue fixed.
     
  18. Jospeh Huber

    Jospeh Huber Member

    Joined:
    Apr 18, 2016
    Messages:
    52
    Likes Received:
    2
    Unfortunately, neither solution works for me.
    Every day one node crashes... unusable.
    @tom Is there an update planned for this issue?
     
  19. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    33
    Yes, I also face the same issue.

    I have been having sleepless nights for the last week.

    Proxmox is a nightmare now; every day 2 or 3 nodes crash for me. I have 25 nodes with LXC.

    No reply from Proxmox so far.
     
  20. Kaijia Feng

    Kaijia Feng New Member

    Joined:
    Mar 8, 2017
    Messages:
    5
    Likes Received:
    0
    Same issue here on a 16-node LXC cluster. It only began with a recent update and reboot, so I also suspect this to be a kernel issue. But I also notice the issue persists for an hour or two every time, then everything goes back to normal. So instead of rebooting every node, I just wait (of course, this is not a solution for a hosting provider).
     