Search results

  1. [SOLVED] Ghost monitor in CEPH cluster

    I did it. I even deleted the whole /var/lib/ceph folder and all ceph*-related services in /etc/systemd/.. and rebooted that node, but pveceph purge still says: root@pve-node4:~# pveceph purge detected running ceph services- unable to purge data What does pveceph purge check for as "running ceph...
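    Not part of the thread snippet, just a minimal way to list which leftover Ceph units the node still knows about (the exact paths pveceph purge inspects are not shown here):

      # list every ceph systemd unit, including failed or merely loaded ones
      systemctl list-units --all 'ceph*'
      # look for stale unit files or enablement symlinks left after wiping /var/lib/ceph
      ls -l /etc/systemd/system/ceph* /etc/systemd/system/multi-user.target.wants/ceph* 2>/dev/null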
  2. [SOLVED] Ghost monitor in CEPH cluster

    Nothing changes :( root@pve-node4:~# pveceph purge detected running ceph services- unable to purge data root@pve-node4:~# pveceph createmon monitor 'pve-node4' already exists root@pve-node4:~#
  3. [SOLVED] Ghost monitor in CEPH cluster

    Not sure if it's somehow related, but I don't have any OSDs in my cluster at the moment root@pve-node4:~# systemctl | grep ceph- ● ceph-mon@pve-node4.service loaded failed failed Ceph cluster...
  4. [SOLVED] Ghost monitor in CEPH cluster

    Yep, the systemd service was enabled, but disabling it changes nothing. Ceph log on pve-node4 on mon start: Oct 04 13:41:25 pve-node4 systemd[1]: Started Ceph cluster monitor daemon. Oct 04 13:41:25 pve-node4 ceph-mon[436732]: 2019-10-04 13:41:25.495 7f5aed4ec440 -1 mon.pve-node4@-1(???) e14 not...
  5. [SOLVED] Ghost monitor in CEPH cluster

    After an update from 5.x to 6.x one of the CEPH monitors became a "ghost", with status "stopped" and address "unknown". It can be neither started, created nor deleted, with errors as below: create: monitor address '10.10.10.104' already in use (500) destroy: no such monitor id 'pve-node4' (500) I deleted...
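    This is not the confirmed fix from the thread, only a sketch of the usual manual cleanup for a leftover monitor entry (the node name pve-node4 is taken from the snippet; run only while the cluster still has quorum):

      # remove the stale monitor from the monmap
      ceph mon remove pve-node4
      # drop the matching [mon.pve-node4] section and its mon_host entry from the shared config
      nano /etc/pve/ceph.conf
      # make sure the local unit is stopped and no longer starts on boot
      systemctl disable --now ceph-mon@pve-node4.service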
  6. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    In my environment with libknet* 1.12-pve1 (from the no-subscription repo) the cluster has become much more stable (no "link down" and no corosync segfault so far, >48 hrs)
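    To confirm which libknet build a node is actually running (a trivial check, not quoted from the thread):

      # show installed libknet packages and their versions
      dpkg -l | grep libknet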
  7. [SOLVED] Why does KNET choose the ring with the higher priority instead of the lower one (as said in the manual)?

    Here is an answer... https://github.com/corosync/corosync/commit/0a323ff2ed0f2aff9cb691072906e69cb96ed662 The PVE wiki should get updated accordingly. Damn corosync...
  8. [SOLVED] Why does KNET choose the ring with the higher priority instead of the lower one (as said in the manual)?

    Could anyone explain why corosync (KNET) chooses the best link by the highest priority instead of the lowest one (as written in the PVE wiki)? Very confused with corosync3 indeed... quorum { provider: corosync_votequorum } totem { cluster_name: amarao-cluster config_version: 20 interface...
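    For reference, a minimal corosync.conf sketch of how knet link priorities are declared (link numbers and values are illustrative; per the commit linked in the previous result, with knet the link carrying the higher knet_link_priority value is the one preferred):

      totem {
        interface {
          linknumber: 0
          knet_link_priority: 10   # preferred link (higher value wins with knet)
        }
        interface {
          linknumber: 1
          knet_link_priority: 5    # fallback link
        }
      }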
  9. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    Another observation is that in my setups only nodes with no swap (ZFS as root and an NFS share as datastore) and vm.swappiness=0 in sysctl.conf are affected. I do remember the unresolved issue with PVE 5.x where swap was used by a PVE process even with vm.swappiness=0. Couldn't this be the case...
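    A minimal way to check what a node is actually doing with swap, and to persist the setting (not quoted from the thread):

      # effective swappiness, current swap devices and usage
      sysctl vm.swappiness
      swapon --show
      free -h
      # persist the setting, then reload
      echo 'vm.swappiness = 0' >> /etc/sysctl.conf
      sysctl -p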
  10. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    Another hang which breaks even the NFS connection, plus a Linux kernel trace
  11. [SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

    Could the problem be related to jumbo frames and/or a dual-ring configuration? I'm facing the same issue - corosync randomly hangs on different nodes. I've got two rings, 10GbE + 1GbE, with MTU = 9000 on both nets
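    One quick way to rule out an MTU mismatch between ring members (a sketch; the peer address and interface name are placeholders):

      # a 9000-byte MTU leaves 8972 bytes of ICMP payload; -M do forbids fragmentation
      ping -M do -s 8972 -c 3 10.10.10.104
      # confirm the MTU actually set on the ring interface
      ip link show dev ens1f0 | grep mtu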
  12. PVE 6 cluster nodes randomly hang (10GbE network down)

    Don't know how this could be related, but the following was observed during boot: [Wed Sep 11 04:37:27 2019] ACPI: Using IOAPIC for interrupt routing [Wed Sep 11 04:37:27 2019] HEST: Table parsing has been initialized. [Wed Sep 11 04:37:27 2019] PCI: Using host bridge windows from ACPI; if...
  13. PVE 6 cluster nodes randomly hang (10GbE network down)

    There was no unexpected activity on that node at the time of the hang
  14. PVE 6 cluster nodes randomly hang (10GbE network down)

    root@pve-node3:~# dmesg -T | grep Intel [Sun Sep 8 04:22:18 2019] Intel GenuineIntel [Sun Sep 8 04:22:19 2019] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz (family: 0x6, model: 0x2d, stepping: 0x7) [Sun Sep 8 04:22:19 2019] Performance Events: PEBS fmt1+, SandyBridge events...
  15. PVE 6 cluster nodes randomly hang (10GbE network down)

    root@pve-node3:~# lspci 00:00.0 Host bridge: Intel Corporation Xeon E5/Core i7 DMI2 (rev 07) 00:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07) 00:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a (rev 07) 00:03.0 PCI bridge...
  16. PVE 6 cluster nodes randomly hang (10GbE network down)

    root@pve-node3:~# uname -a Linux pve-node3 5.0.21-1-pve #1 SMP PVE 5.0.21-2 (Wed, 28 Aug 2019 15:12:18 +0200) x86_64 GNU/Linux root@pve-node3:~# pveversion -v proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve) pve-manager: 6.0-7 (running version: 6.0-7/28984024) pve-kernel-5.0: 6.0-7...
  17. PVE 6 cluster nodes randomly hang (10GbE network down)

    root@pve-node3:~# ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host...
  18. PVE 6 cluster nodes randomly hang (10GbE network down)

    [Sun Sep 8 04:23:20 2019] fwbr143i0: port 2(tap143i0) entered disabled state [Sun Sep 8 04:23:20 2019] fwbr143i0: port 2(tap143i0) entered blocking state [Sun Sep 8 04:23:20 2019] fwbr143i0: port 2(tap143i0) entered forwarding state [Sun Sep 8 07:25:56 2019] perf: interrupt took too long...
  19. PVE 6 cluster nodes randomly hang (10GbE network down)

    I've noticed that after installing a PVE 6.x cluster with a 10Gb net for intercluster and storage (NFS) communications, cluster nodes randomly hang - still available through the Ethernet (1GbE) network but NOT accessible via the main 10GbE, so neither cluster nor storage is available. Yesterday it happened...
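    When a node is still reachable over the 1GbE link, a few things worth capturing about the dead 10GbE port before rebooting (a sketch; the interface name ens1f0 is a placeholder):

      # carrier state plus error/drop counters of the 10GbE port
      ip -s link show dev ens1f0
      # link, ring and driver details as the NIC reports them
      ethtool ens1f0
      ethtool -i ens1f0
      # recent kernel messages around the time the link went down
      dmesg -T | tail -n 100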