Search results

  1.

    Proxmox 6 Cluster with Proxmox 5.4?

    Got it, thanks. Appreciate the response.
  2.

    Proxmox 6 Cluster with Proxmox 5.4?

    No, they all became part of a healthy quorum. That happened while upgrading the 5.4 nodes to corosync 3: whenever one of the nodes got upgraded to corosync 3, it was no longer part of the quorum, so my question was how to migrate the VMs onto the upgraded nodes so I could upgrade the other ones.
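
    For reference, guests can be moved off a node from the CLI before upgrading it; a minimal sketch, where the VM/CT IDs and the target node name are placeholders:

        # live-migrate a running VM to another cluster node
        qm migrate 100 target-node --online
        # containers use restart migration (the CT is briefly stopped and started on the target)
        pct migrate 200 target-node --restart
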
  3.

    all nodes got rebooted and there is no log - Cluster disaster

    I really appreciate the time you put into posting here; it really helped me. As a last question, can you give me a clue: do you configure those IPMI watchdogs before using them, or do you basically just add them in /etc/default/pve-ha-manager?
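
    For context, a minimal sketch of what /etc/default/pve-ha-manager typically looks like when selecting an IPMI watchdog; the module name is hardware-dependent and given here as an assumption:

        # /etc/default/pve-ha-manager
        # select watchdog module (default is softdog)
        WATCHDOG_MODULE=ipmi_watchdog
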
  4.

    Proxmox 6 Cluster with Proxmox 5.4?

    In the link you provided it says: "Note: changes to any VM/CT or the cluster in general are not allowed for the duration of the upgrade!" Then somewhere below it says: "Move important Virtual Machines and Containers: If any VMs and CTs need to keep running for the duration of the upgrade...
  5.

    all nodes got rebooted and there is no log - Cluster disaster

    Again, I remember reading this at "https://pve.proxmox.com/wiki/High_Availability", so I commented out everything in my /etc/default/pve-ha-manager, because I still have no idea how to configure the hardware watchdog. And apparently this softdog timer is so low that an undetected interruption by...
  6.

    all nodes got rebooted and there is no log - Cluster disaster

    Thanks again for shedding light on this. Unfortunately, I didn't set any of the servers to rely on the hardware watchdogs; this is the output of the commands you suggested:
    # dmesg | grep -i watch
    # fuser -v /dev/watchdog
                         USER        PID ACCESS COMMAND
    /dev/watchdog...
  7.

    all nodes got rebooted and there is no log - Cluster disaster

    Thanks for your info, it gave me some ideas. But what do you mean by my "watchdog device"? According to what I read, the system should be using the default Linux watchdog (softdog). Also, there is no module set in /etc/default/pve-ha-manager. Do you have any idea where I can get more logs and detail...
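
    A few commands that can help identify which watchdog is in use; a sketch, assuming a standard PVE setup where watchdog-mux holds /dev/watchdog:

        # show any uncommented watchdog configuration for pve-ha-manager
        grep -v '^#' /etc/default/pve-ha-manager
        # check which watchdog driver the kernel loaded (softdog is the fallback)
        dmesg | grep -i -e watchdog -e softdog
        lsmod | grep -e softdog -e ipmi_watchdog
        # show which process currently holds the watchdog device
        fuser -v /dev/watchdog
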
  8.

    all nodes got rebooted and there is no log - Cluster disaster

    Unfortunately, there was nothing in the logs except the watchdog warning "watchdog-mux[1153]: client watchdog expired - disable watchdog updates". Is there anywhere else I should check to find more logs about this? Regarding your theory, I'm wondering how that could be possible; all nodes are...
  9.

    all nodes got rebooted and there is no log - Cluster disaster

    The cluster has 13 nodes (14 with the new faulty node). 5 nodes have a 10Gb NIC and a 100Gb NIC and are SSD Ceph nodes; 3 nodes with only a 10Gb NIC are HDD Ceph nodes; the rest are just compute nodes. Syslogs and daemon.logs for that period are attached. The reboots happened at 19:23; you can see in...
  10.

    all nodes got rebooted and there is no log - Cluster disaster

    UPDATE: I've found this on one of the nodes: "watchdog-mux[1153]: client watchdog expired - disable watchdog updates". Apparently something triggered the watchdog. But how can another node restarting its network service cause this?
  11.

    all nodes got rebooted and there is no log - Cluster disaster

    Hi folks, I added a new node to my cluster today, then I realized the new node's network configuration might have an issue in that it cannot communicate with the Ceph IP ranges. I restarted the network service using "systemctl restart networking"; after this, the disaster happened and I...
  12.

    I reinstalled a node in the cluster and now the cluster is messy

    Not at the moment, because we update our nodes one by one. But there was a time when all nodes had the same version, and this issue was still there.
  13.

    I reinstalled a node in the cluster and now the cluster is messy

    # pvecm status
    Quorum information
    ------------------
    Date:             Thu Jan 2 17:43:42 2020
    Quorum provider:  corosync_votequorum
    Nodes:            13
    Node ID:          0x00000006
    Ring ID:          3/2800
    Quorate:          Yes

    Votequorum information
    ----------------------
    Expected votes...
  14.

    I reinstalled a node in the cluster and now the cluster is messy

    Yes, I have read that and followed it exactly as instructed. As I mentioned in my statement, I even ran "pvecm updatecerts" on the nodes, but the issue is still there. That's why I'm sharing it here.
  15.

    I reinstalled a node in the cluster and now the cluster is messy

    Hi folks, we have a cluster consisting of 14 nodes. For various reasons, I had to remove 3 of these nodes, reinstall Proxmox, and bring them back into the cluster; this happened over time. Now the issue is that none of my nodes can SSH to the nodes which have been re-installed. They...
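
    One approach that is often suggested for stale SSH data after a node reinstall (a sketch, not necessarily the fix for this thread; the node name is a placeholder):

        # on each node: regenerate and redistribute cluster certificates and SSH data
        pvecm updatecerts --force
        # drop the stale host key of a reinstalled node, then retry SSH
        ssh-keygen -R reinstalled-node
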
  16.

    [SOLVED] No such cluster node

    Does anyone know what happens when restarting the pve-cluster and corosync services? While restarting those services, the node appears offline in the Proxmox GUI. Do the VMs on that node go offline or experience service interruptions / reboots?
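
    A sketch of restarting and then verifying the cluster layer; these services handle cluster communication and /etc/pve rather than the guests themselves, but whether HA fencing can trigger depends on quorum, so treat the behaviour as something to verify:

        systemctl restart corosync pve-cluster
        # confirm quorum and membership afterwards
        pvecm status
        journalctl -u corosync -u pve-cluster --since "10 minutes ago"
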
  17.

    Real thin provisioning with CEPH

    But even if I check my pool's used size with 'rados df', it shows the provisioned size, not the actual size.
  18.

    Real thin provisioning with CEPH

    I know I can see the actual size using those commands, but the issue is that my storage in Proxmox is reduced by the provisioned size, not the actual size. For example, I have a 1TB pool and a Ceph (RBD) storage in Proxmox; once I create an empty VM with a 100GB virtual disk, my storage has 900GB left...
  19.

    Real thin provisioning with CEPH

    Hi guys, we have Proxmox and Ceph as our storage. I understand that by turning on discard on the VMs and running fstrim we can reclaim the unused storage. But my question is: why, as soon as I create a VM with e.g. 200GB of storage from the Ceph pool, does Proxmox deduct that 200GB from the Ceph pool? Even...
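
    For reference, a minimal sketch of comparing provisioned versus actually used space and reclaiming it from inside a guest; the pool and image names are placeholders, and discard must be enabled on the virtual disk for fstrim to reach Ceph:

        # inside the guest: release unused blocks on all mounted filesystems
        fstrim -av
        # on a Ceph node: provisioned vs. actually used space of one RBD image
        rbd du mypool/vm-100-disk-0
        # pool-level usage
        ceph df
        rados df
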
  20.

    Linux Kernel Version Stuck, Will Not Upgrade

    Updating GRUB didn't make a difference. I will try reinstalling it as well. I realized this issue happened only on two of my nodes, which are using ZFS drives for boot. Could this be related?
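
    A possible angle for the ZFS-booted nodes (a sketch, assuming they boot from an ESP managed by Proxmox rather than by plain GRUB): on such systems the kernels are synced with proxmox-boot-tool (pve-efiboot-tool on older 6.x releases), so update-grub alone may not pick up a new kernel.

        # list the ESPs and kernels the boot tool manages
        proxmox-boot-tool status
        # copy the currently installed kernels onto the ESPs and refresh the boot entries
        proxmox-boot-tool refresh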