[SOLVED] 5.0 Beta2 - Cluster and storage issues

D0peX

Member
May 5, 2017
Hi again,

I have Ceph working, however I cannot seem to get the cluster working properly at all.
The cluster is losing connection at random intervals (all nodes getting a red cross) and then going back to all green. I do not have a single storage volume available, even though I have an NFS share, ceph-vm, ceph-lxc, and LVM defined. None of them are available when restoring or creating VMs/containers.

This is a 3-node cluster with a 10 GbE mesh for Ceph, and 'just' the standard vmbr0 for PVE per the default installation.
- I do have Open vSwitch installed, but I am not using it.
- Tried this twice with clean installs on all hosts, including apt-get upgrades.
- After a reboot I briefly see NFS and local as storage options.
- The cluster summary shows incorrect information: each host reports 48 GB and 16 cores.

Screenshots below.

Even while nodes show as offline in the web GUI (which itself still works), there is still quorum, and the other hosts remain pingable.
Code:
root@hv1:~# pvecm status
Quorum information
------------------
Date:             Tue Jun 13 21:46:33 2017
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1/12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.10 (local)
0x00000002          1 192.168.1.11
0x00000003          1 192.168.1.12

Code:
root@hv1:~# pveversion
pve-manager/5.0-10/0d270679 (running kernel: 4.10.11-1-pve)

journalctl -xe
Code:
root@hv1:~# journalctl -xe
-- Subject: Unit rpc-statd-notify.service has finished start-up
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit rpc-statd-notify.service has finished starting up.
--
-- The start-up result is done.
Jun 13 21:25:20 hv1 systemd[1]: Started NFS status monitor for NFSv2/3 locking..
-- Subject: Unit rpc-statd.service has finished start-up
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit rpc-statd.service has finished starting up.
--
-- The start-up result is done.
Jun 13 21:25:20 hv1 kernel: FS-Cache: Loaded
Jun 13 21:25:20 hv1 kernel: FS-Cache: Netfs 'nfs' registered for caching
Jun 13 21:25:44 hv1 pveproxy[2038]: proxy detected vanished client connection
Jun 13 21:25:58 hv1 pveproxy[2036]: proxy detected vanished client connection
Jun 13 21:26:08 hv1 pveproxy[2035]: worker 2036 finished
Jun 13 21:26:08 hv1 pveproxy[2035]: starting 1 worker(s)
Jun 13 21:26:08 hv1 pveproxy[2035]: worker 11062 started
Jun 13 21:26:09 hv1 pveproxy[2037]: proxy detected vanished client connection
Jun 13 21:26:12 hv1 pveproxy[11061]: got inotify poll request in wrong process - disabling inotify
Jun 13 21:26:31 hv1 pveproxy[11061]: proxy detected vanished client connection
Jun 13 21:26:31 hv1 pveproxy[11061]: proxy detected vanished client connection
Jun 13 21:26:35 hv1 pveproxy[11061]: proxy detected vanished client connection
Jun 13 21:26:36 hv1 pveproxy[11061]: worker exit
Jun 13 21:26:38 hv1 pveproxy[2037]: proxy detected vanished client connection
Jun 13 21:26:39 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:26:59 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:27:09 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:27:39 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:28:43 hv1 pveproxy[11062]: Clearing outdated entries from certificate cache
Jun 13 21:29:01 hv1 kernel: perf: interrupt took too long (4329 > 4231), lowering kernel.perf_event_max_sample_
Jun 13 21:29:11 hv1 pveproxy[2037]: proxy detected vanished client connection
Jun 13 21:29:11 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:29:12 hv1 pmxcfs[2407]: [status] notice: received log
Jun 13 21:29:13 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:29:16 hv1 pvestatd[1716]: status update time (300.150 seconds)
Jun 13 21:29:17 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:34:22 hv1 rrdcached[1465]: flushing old values
Jun 13 21:34:22 hv1 rrdcached[1465]: rotating journals
Jun 13 21:34:22 hv1 rrdcached[1465]: started new journal /var/lib/rrdcached/journal/rrd.journal.1497382462.2513
Jun 13 21:38:28 hv1 pmxcfs[2407]: [dcdb] notice: data verification successful
Jun 13 21:39:16 hv1 pvestatd[1716]: status update time (600.133 seconds)
Jun 13 21:41:17 hv1 kernel: perf: interrupt took too long (5479 > 5411), lowering kernel.perf_event_max_sample_
Jun 13 21:44:09 hv1 pmxcfs[2407]: [status] notice: received log
Jun 13 21:45:31 hv1 pveproxy[2038]: proxy detected vanished client connection
Jun 13 21:46:01 hv1 pveproxy[11062]: proxy detected vanished client connection
Jun 13 21:46:12 hv1 pmxcfs[2407]: [status] notice: received log
Jun 13 21:49:16 hv1 pvestatd[1716]: status update time (600.152 seconds)
 

Attachments

  • chrome_2017-06-13_22-04-57.png
  • chrome_2017-06-13_22-05-32.png
  • chrome_2017-06-13_22-14-20.png
Jun 13 21:39:16 hv1 pvestatd[1716]: status update time (600.133 seconds)

You have some storage which is taking way too long to respond when doing the status / utilization checks.
 
Hi @fabian. Any suggestion on how I can pinpoint what is causing it? I find it quite strange, because with almost exactly the same setup on 4.4 I had no issues. (For some reason I cannot get 4.4 working on my hosts anymore; it cannot find the LVM volume after an upgrade.) Thanks
 
Check the system log for any obvious errors (network shares dropping out or similar), and check the storages one by one for slowness. For the last step (if you know a bit of Perl) you could modify the code pvestatd uses to get the storage status (/usr/share/perl5/PVE/Storage.pm -> storage_info sub) and log the time it takes for each storage to get updated. 'apt-get install --reinstall libpve-storage-perl' will revert back to the original package's code.
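
If you would rather not touch the Perl code, a rough way to get similar information from the shell is to time a status query per storage. This is only a sketch: the storage names below are placeholders taken from this thread, and it assumes your pvesm version supports the --storage filter.

Code:
# Time a status query for each storage individually; replace the names
# with whatever storages are actually defined on your cluster.
for s in nfs-share ceph-vm ceph-lxc lvm local; do
    echo "== $s =="
    time pvesm status --storage "$s"
done

Whichever storage takes an unusually long time here is the one holding up pvestatd's status updates.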
 
I'm not familiar with coding. However, I do find it very strange that local volumes and even Ceph are not working properly. The Ceph network is at around 0.3 ms latency, so that should not be the issue.
 
@fabian Could DNS be an issue? I added the hosts to the cluster by IP.
Code:
root@hv1:~# pveperf
CPU BOGOMIPS:      76798.88
REGEX/SECOND:      1500880
HD SIZE:           33.22 GB (/dev/mapper/pve-root)
BUFFERED READS:    75.47 MB/sec
AVERAGE SEEK TIME: 6.51 ms
FSYNCS/SECOND:     2537.55
DNS EXT:           85.78 ms
DNS INT:           79.47 ms (mgmt.REDACTED.TLD)

Edit: changed the DNS settings. Hosts resolve properly now, but that does not seem to fix the issue.
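
One way to take slow external DNS out of the picture is to pin the cluster node names in /etc/hosts on every node. A minimal sketch; only hv1 and the IPs appear earlier in this thread, the hv2/hv3 names and the domain are assumptions:

Code:
# /etc/hosts - static entries for the cluster nodes (hostnames for .11/.12 assumed)
127.0.0.1       localhost
192.168.1.10    hv1.mgmt.example.tld hv1
192.168.1.11    hv2.mgmt.example.tld hv2
192.168.1.12    hv3.mgmt.example.tld hv3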
 
@fabian So, I fixed the issue after MANY hours (as you can tell).
After several re-installs and tackling issues left and right, I managed to pull it off.

First off, I'd like to suggest an update of the wiki article Full Mesh Network For Ceph.
Adding static routes in '/etc/network/interfaces' has changed a bit, since net-tools no longer seems to be included in 5.0 (and it is OLD from what I've read). This means that 'ifconfig' and 'route' are no longer valid commands; they are replaced by 'ip a' and 'ip route'. Thus the following is the correct syntax for static routes on ifup and ifdown:

Code:
# Connected to Node2 (.51)
auto ens3
iface ens3 inet static
        address  10.15.15.50
        netmask  255.255.255.0
        up ip route add 10.15.15.51 dev ens3
        down ip route del 10.15.15.51 dev ens3
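
For a full three-node mesh, each node carries a second direct link configured the same way, with the node's own address and a host route to the other peer. A sketch of the counterpart stanza on the same node; the interface name ens4 and the .52 peer are assumptions following the addressing above:

Code:
# Connected to Node3 (.52) - interface name and peer address assumed
auto ens4
iface ens4 inet static
        address  10.15.15.50
        netmask  255.255.255.0
        up ip route add 10.15.15.52 dev ens4
        down ip route del 10.15.15.52 dev ens4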

Furthermore, losing cluster connections most likely had to do with me using the vmbr0 IP of the monitor nodes when adding the RBD storage. This obviously had to be the IP of the monitors on the Ceph network (the mesh network). Using 10.15.15.50(...) instead of the LAN IPs resolved this issue.
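
In other words, the monhost entries in /etc/pve/storage.cfg have to point at the mesh addresses. A sketch of roughly what the corrected entry looks like; only the monitor IPs and the ceph-vm storage name come from this thread, the pool name and other options are assumptions:

Code:
# /etc/pve/storage.cfg - RBD storage pointing at the Ceph mesh network
rbd: ceph-vm
        monhost 10.15.15.50 10.15.15.51 10.15.15.52
        pool rbd
        content images
        username admin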

DUH

Thank you for trying to help anyway!
D0peX
 
Yes, that article needs to be converted to iproute2! Thanks for the pointer.
 
