[SOLVED] 5.0 Beta2 - Cluster and storage issues

D0peX

Member
May 5, 2017
Hi again,

I have Ceph working, however I cannot seem to get the cluster working properly at all.
The cluster is losing connection at random intervals (all nodes getting a red cross) and then going back to all green. I do not have a single storage volume available, even though I have an NFS share, ceph-vm, ceph-lxc, and LVM defined. None of them are available when restoring or creating VMs/containers.

This is a 3-node cluster with a 10 GbE mesh for Ceph, and 'just' the standard vmbr0 for PVE per the default installation.
- I do have Open vSwitch installed, but I am not using it.
- Tried this twice with clean installs on all hosts, including apt-get upgrades.
- After a reboot I briefly see NFS and local as storage options.
- The cluster summary shows incorrect information: each host reports 48 GB and 16 cores.

Screenshots below.

Even while nodes show as offline in the web GUI (which itself still works), there is still quorum, and the other hosts remain pingable.
Code:
root@hv1:~# pvecm status
Quorum information
------------------
Date:             Tue Jun 13 21:46:33 2017
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1/12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.10 (local)
0x00000002          1 192.168.1.11
0x00000003          1 192.168.1.12

Code:
root@hv1:~# pveversion
pve-manager/5.0-10/0d270679 (running kernel: 4.10.11-1-pve)

journalctl -xe
Code:
root@hv1:~# journalctl -xe
-- Subject: Unit rpc-statd-notify.service has finished start-up
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit rpc-statd-notify.service has finished starting up.
--
-- The start-up result is done.
Jun 13 21:25:20 hv1 systemd[1]: Started NFS status monitor for NFSv2/3 locking..
-- Subject: Unit rpc-statd.service has finished start-up
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit rpc-statd.service has finished starting up.
--
-- The start-up result is done.
Jun 13 21:25:20 hv1 kernel: FS-Cache: Loaded
Jun 13 21:25:20 hv1 kernel: FS-Cache: Netfs 'nfs' registered for caching
Jun 13 21:25:44 hv1 pveproxy[2038]: proxy detected vanished client connection
Jun 13 21:25:58 hv1 pveproxy[2036]: proxy detected vanished client connection
Jun 13 21:26:08 hv1 pveproxy[2035]: worker 2036 finished
Jun 13 21:26:08 hv1 pveproxy[2035]: starting 1 worker(s)
Jun 13 21:26:08 hv1 pveproxy[2035]: worker 11062 started
Jun 13 21:26:09 hv1 pveproxy[2037]: proxy detected vanished client connection
Jun 13 21:26:12 hv1 pveproxy[11061]: got inotify poll request in wrong process - disabling inotify
Jun 13 21:26:31 hv1 pveproxy[11061]: proxy detected vanished client connection
Jun 13 21:26:31 hv1 pveproxy[11061]: proxy detected vanished client connection
Jun 13 21:26:35 hv1 pveproxy[11061]: proxy detected vanished client connection
Jun 13 21:26:36 hv1 pveproxy[11061]: worker exit
Jun 13 21:26:38 hv1 pveproxy[2037]: proxy detected vanished client connection
Jun 13 21:26:39 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:26:59 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:27:09 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:27:39 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:28:43 hv1 pveproxy[11062]: Clearing outdated entries from certificate cache
Jun 13 21:29:01 hv1 kernel: perf: interrupt took too long (4329 > 4231), lowering kernel.perf_event_max_sample_
Jun 13 21:29:11 hv1 pveproxy[2037]: proxy detected vanished client connection
Jun 13 21:29:11 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:29:12 hv1 pmxcfs[2407]: [status] notice: received log
Jun 13 21:29:13 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:29:16 hv1 pvestatd[1716]: status update time (300.150 seconds)
Jun 13 21:29:17 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:34:22 hv1 rrdcached[1465]: flushing old values
Jun 13 21:34:22 hv1 rrdcached[1465]: rotating journals
Jun 13 21:34:22 hv1 rrdcached[1465]: started new journal /var/lib/rrdcached/journal/rrd.journal.1497382462.2513
Jun 13 21:38:28 hv1 pmxcfs[2407]: [dcdb] notice: data verification successful
Jun 13 21:39:16 hv1 pvestatd[1716]: status update time (600.133 seconds)
Jun 13 21:41:17 hv1 kernel: perf: interrupt took too long (5479 > 5411), lowering kernel.perf_event_max_sample_
Jun 13 21:44:09 hv1 pmxcfs[2407]: [status] notice: received log
Jun 13 21:45:31 hv1 pveproxy[2038]: proxy detected vanished client connection
Jun 13 21:46:01 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:46:12 hv1 pmxcfs[2407]: [status] notice: received log
Jun 13 21:49:16 hv1 pvestatd[1716]: status update time (600.152 seconds)
 

Attachments

  • chrome_2017-06-13_22-04-57.png
  • chrome_2017-06-13_22-05-32.png
  • chrome_2017-06-13_22-14-20.png
Jun 13 21:39:16 hv1 pvestatd[1716]: status update time (600.133 seconds)

You have some storage which is taking way too long to respond when doing the status / utilization checks.
 
Hi @fabian. Any suggestion on how I can pinpoint what is causing it? I find it quite strange, because with almost exactly the same setup on 4.4 I had no issues. (For some reason I cannot get 4.4 working on my hosts anymore; it cannot find the LVM volume after an upgrade.) Thanks
 
Check the system log for any obvious errors (network shares dropping out or similar), and check the storages one by one for slowness. For the last step (if you know a bit of Perl) you could modify the code pvestatd uses to get the storage status (/usr/share/perl5/PVE/Storage.pm -> storage_info sub) and log the time it takes for each storage to get updated. 'apt-get install --reinstall libpve-storage-perl' will revert back to the original package's code.
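
If you would rather not touch the Perl code, a rough way to get similar information from the shell is to time a status query per storage. This is only a sketch: the storage names below are placeholders taken from this thread, and it assumes your pvesm version supports the --storage filter.

Code:
# Time a status query for each storage individually; replace the names
# with whatever storages are actually defined on your cluster.
for s in nfs-share ceph-vm ceph-lxc lvm local; do
    echo "== $s =="
    time pvesm status --storage "$s"
done

Whichever storage takes an unusually long time here is the one holding up pvestatd's status updates.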
 
I'm not familiar with coding. However, I do find it very strange that local volumes and even Ceph are not working properly. The Ceph network is at around 0.3 ms latency, so that should not be the issue.
 
@fabian Could DNS be an issue? I added the hosts to the cluster by IP.
Code:
root@hv1:~# pveperf
CPU BOGOMIPS:      76798.88
REGEX/SECOND:      1500880
HD SIZE:           33.22 GB (/dev/mapper/pve-root)
BUFFERED READS:    75.47 MB/sec
AVERAGE SEEK TIME: 6.51 ms
FSYNCS/SECOND:     2537.55
DNS EXT:           85.78 ms
DNS INT:           79.47 ms (mgmt.REDACTED.TLD)

Edit: changed the DNS settings. Hosts resolve properly now, but that does not seem to fix the issue.
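
One way to take slow external DNS out of the picture is to pin the cluster node names in /etc/hosts on every node. A minimal sketch; only hv1 and the IPs appear earlier in this thread, the hv2/hv3 names and the domain are assumptions:

Code:
# /etc/hosts - static entries for the cluster nodes (hostnames for .11/.12 assumed)
127.0.0.1       localhost
192.168.1.10    hv1.mgmt.example.tld hv1
192.168.1.11    hv2.mgmt.example.tld hv2
192.168.1.12    hv3.mgmt.example.tld hv3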
 
@fabian So, I fixed the issue after MANY hours (as you can tell).
After several re-installs and tackling issues left and right, I managed to pull it off.

First off, I'd like to suggest an update of the wiki article Full Mesh Network For Ceph.
Adding static routes in '/etc/network/interfaces' has changed a bit, since net-tools no longer seems to be included in 5.0 (and it is OLD from what I've read). This means that 'ifconfig' and 'route' are no longer valid commands; they are replaced by 'ip a' and 'ip route'. Thus the following is the correct syntax for static routes on ifup and ifdown:

Code:
# Connected to Node2 (.51)
auto ens3
iface ens3 inet static
        address  10.15.15.50
        netmask  255.255.255.0
        up ip route add 10.15.15.51 dev ens3
        down ip route del 10.15.15.51 dev ens3
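
For a full three-node mesh, each node carries a second direct link configured the same way, with the node's own address and a host route to the other peer. A sketch of the counterpart stanza on the same node; the interface name ens4 and the .52 peer are assumptions following the addressing above:

Code:
# Connected to Node3 (.52) - interface name and peer address assumed
auto ens4
iface ens4 inet static
        address  10.15.15.50
        netmask  255.255.255.0
        up ip route add 10.15.15.52 dev ens4
        down ip route del 10.15.15.52 dev ens4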

Furthermore, losing cluster connections most likely had to do with me using the vmbr0 IP of the monitor nodes when adding the RBD storage. This obviously had to be the IP of the monitors on the Ceph network (the mesh network). Using 10.15.15.50(...) instead of the LAN IPs resolved this issue.
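
In other words, the monhost entries in /etc/pve/storage.cfg have to point at the mesh addresses. A sketch of roughly what the corrected entry looks like; only the monitor IPs and the ceph-vm storage name come from this thread, the pool name and other options are assumptions:

Code:
# /etc/pve/storage.cfg - RBD storage pointing at the Ceph mesh network
rbd: ceph-vm
        monhost 10.15.15.50 10.15.15.51 10.15.15.52
        pool rbd
        content images
        username admin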

DUH

Thank you for trying to help anyway!
D0peX
 
Yes, that article needs to be converted to iproute2! Thanks for the pointer.
 
