[SOLVED] Adding New node Cluster - Ceph got timeout(500)

CyberGuy

Member
Dec 20, 2021
Hello guys,

After adding a new node to the cluster - which went fine - I ran into a problem installing and managing Ceph on the new node from the cluster. This is a production cluster.
The new node is not a monitor or a manager, and none of its drives are used as OSDs. When I try to open Datacenter -> Nodes -> Node[Number] -> Ceph in the interface,
I get a timeout, but when I open any of the other nodes that already exist in the cluster everything seems fine.
The new node has only been added to the cluster; it has not been used yet.

When I run ceph -s on the new node everything is reported as fine, but I noticed some strange issues:
- I get a timeout (500) on every Ceph sub-page
- The Configuration page displays the proper config
- Outdated OSDs are shown
- /etc/ceph/ceph.conf is not a symlink to /etc/pve/ceph.conf

Everything else works and I am not sure what is going on. Below are the installed packages for the new node and for a node that already exists in the cluster:

CLusterNodeAlreadyExist:~# apt list --installed | grep -i ceph

ceph-base/stable,now 15.2.14-pve1~bpo10 amd64 [installed,upgradable to: 15.2.15-pve1~bpo10]
ceph-common/stable,now 15.2.14-pve1~bpo10 amd64 [installed,upgradable to: 15.2.15-pve1~bpo10]
ceph-fuse/stable,now 15.2.14-pve1~bpo10 amd64 [installed,upgradable to: 15.2.15-pve1~bpo10]
ceph-mds/stable,now 15.2.14-pve1~bpo10 amd64 [installed,upgradable to: 15.2.15-pve1~bpo10]
ceph-mgr-modules-core/stable,now 15.2.14-pve1~bpo10 all [installed,upgradable to: 15.2.15-pve1~bpo10]
ceph-mgr/stable,now 15.2.14-pve1~bpo10 amd64 [installed,upgradable to: 15.2.15-pve1~bpo10]
ceph-mon/stable,now 15.2.14-pve1~bpo10 amd64 [installed,upgradable to: 15.2.15-pve1~bpo10]
ceph-osd/stable,now 15.2.14-pve1~bpo10 amd64 [installed,upgradable to: 15.2.15-pve1~bpo10]
ceph/stable,now 15.2.14-pve1~bpo10 amd64 [installed,upgradable to: 15.2.15-pve1~bpo10]
libcephfs2/stable,now 15.2.14-pve1~bpo10 amd64 [installed,upgradable to: 15.2.15-pve1~bpo10]
python-cephfs/oldstable,now 12.2.11+dfsg1-2.1+b1 amd64 [installed]
python3-ceph-argparse/stable,now 15.2.14-pve1~bpo10 all [installed,upgradable to: 15.2.15-pve1~bpo10]
python3-ceph-common/stable,now 15.2.14-pve1~bpo10 all [installed,upgradable to: 15.2.15-pve1~bpo10]
python3-cephfs/stable,now 15.2.14-pve1~bpo10 amd64 [installed,upgradable to: 15.2.15-pve1~bpo10]
CLusterNodeAlreadyExist:

NEWNODE:
~# apt list --installed | grep -i ceph

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

ceph-base/stable,now 15.2.15-pve1~bpo10 amd64 [installed,automatic]
ceph-common/stable,now 15.2.15-pve1~bpo10 amd64 [installed]
ceph-fuse/stable,now 15.2.15-pve1~bpo10 amd64 [installed]
ceph-mds/stable,now 15.2.15-pve1~bpo10 amd64 [installed]
ceph-mgr-modules-core/stable,now 15.2.15-pve1~bpo10 all [installed,automatic]
ceph-mgr/stable,now 15.2.15-pve1~bpo10 amd64 [installed,automatic]
ceph-mon/stable,now 15.2.15-pve1~bpo10 amd64 [installed,automatic]
ceph-osd/stable,now 15.2.15-pve1~bpo10 amd64 [installed,automatic]
ceph/stable,now 15.2.15-pve1~bpo10 amd64 [installed]
libcephfs2/stable,now 15.2.15-pve1~bpo10 amd64 [installed]
python-cephfs/oldstable,now 12.2.11+dfsg1-2.1+b1 amd64 [installed]
python3-ceph-argparse/stable,now 15.2.15-pve1~bpo10 all [installed,automatic]
python3-ceph-common/stable,now 15.2.15-pve1~bpo10 all [installed,automatic]
python3-cephfs/stable,now 15.2.15-pve1~bpo10 amd64 [installed,automatic]
NEWNODE:


* Node names have been renamed in the output above.


Has any of you run into something similar?
 
Additional details from /var/log/syslog:

Dec 20 09:58:58 pve8 pvedaemon[3613510]: got timeout
Dec 20 09:59:00 pve8 pvestatd[3255]: got timeout
Dec 20 09:59:00 pve8 pvedaemon[3613510]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:00 pve8 pvedaemon[3613510]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:00 pve8 pvestatd[3255]: status update time (5.469 seconds)
Dec 20 09:59:00 pve8 systemd[1]: Starting Proxmox VE replication runner...
Dec 20 09:59:01 pve8 systemd[1]: pvesr.service: Succeeded.
Dec 20 09:59:01 pve8 systemd[1]: Started Proxmox VE replication runner.
Dec 20 09:59:05 pve8 pvedaemon[3613510]: got timeout
Dec 20 09:59:05 pve8 pvestatd[3255]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:05 pve8 pvestatd[3255]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:05 pve8 pvedaemon[3763961]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:05 pve8 pvedaemon[3763961]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:10 pve8 pvestatd[3255]: got timeout
Dec 20 09:59:10 pve8 pvedaemon[3763961]: got timeout
Dec 20 09:59:11 pve8 pvestatd[3255]: status update time (5.472 seconds)
Dec 20 09:59:15 pve8 pvestatd[3255]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:18 pve8 pvedaemon[3763961]: got timeout
Dec 20 09:59:19 pve8 pvedaemon[3763961]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:19 pve8 pvedaemon[3763961]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:20 pve8 pvestatd[3255]: got timeout
Dec 20 09:59:20 pve8 pvestatd[3255]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:20 pve8 pvestatd[3255]: status update time (5.473 seconds)
Dec 20 09:59:25 pve8 pvestatd[3255]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:26 pve8 pvedaemon[3763961]: got timeout
Dec 20 09:59:26 pve8 pvedaemon[3763961]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:26 pve8 pvedaemon[3763961]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:30 pve8 pvestatd[3255]: got timeout
Dec 20 09:59:31 pve8 pvestatd[3255]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:31 pve8 pvestatd[3255]: status update time (5.455 seconds)
Dec 20 09:59:31 pve8 pvedaemon[3763961]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 20 09:59:31 pve8 pvedaemon[3763961]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
 
Check if the network works as expected.
Ping the previous nodes from the new node and vice versa on the network used for Corosync (Proxmox VE Cluster network) and the network(s) used for Ceph as well.
Do you have the firewall enabled?

Edit: And of course check all the other networks you have set up between the cluster nodes as well.
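
A rough sketch of what those checks could look like from a shell (the address is a placeholder for each existing node's Corosync/Ceph IP):

Code:
# run from the new node, then the reverse direction from an existing node
ping -c 3 <existing-node-ip>

# check whether the PVE firewall is active on this node
pve-firewall status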
 
Hi Aaron,

This is what I got from dmesg:
[91271.540614] perf: interrupt took too long (2512 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[104089.165187] Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
[104089.165226] MII link monitoring set to 100 ms
[104089.176168] bond0: (slave ibp130s0): enslaved VLAN challenged slave. Adding VLANs will be blocked as long as it is part of bond.
[104089.177374] bond0: (slave ibp130s0): The slave device specified does not support setting the MAC address
[104089.177420] bond0: (slave ibp130s0): Setting fail_over_mac to active for active-backup mode
[104089.180142] bond0: (slave ibp130s0): Enslaving as a backup interface with a down link
[104089.185719] bond0: (slave ibp130s0d1): enslaved VLAN challenged slave. Adding VLANs will be blocked as long as it is part of bond.
[104089.185765] bond0: (slave ibp130s0d1): The slave device specified does not support setting the MAC address
[104089.188510] bond0: (slave ibp130s0d1): Enslaving as a backup interface with a down link
[104089.194966] ibp130s0: mtu > 2044 will cause multicast packet drops.
[104089.195477] ibp130s0d1: mtu > 2044 will cause multicast packet drops.
[104089.196027] bond0: (slave ibp130s0): link status up, enabling it in 0 ms
[104089.196055] bond0: (slave ibp130s0d1): link status up, enabling it in 200 ms
[104089.196546] bond0: (slave ibp130s0): link status definitely up, 40000 Mbps full duplex
[104089.196631] bond0: (slave ibp130s0): making interface the new active one
[104089.196812] bond0: active interface up!
[104089.196883] bond0: (slave ibp130s0d1): invalid new link 3 on slave
[104089.196980] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
[104089.403546] bond0: (slave ibp130s0d1): link status definitely up, 40000 Mbps full duplex
[112509.889794] perf: interrupt took too long (3162 > 3140), lowering kernel.perf_event_max_sample_rate to 63250
[114348.838099] sctp: Hash tables configured (bind 2048/2048)
[114360.921141] FS-Cache: Loaded
[114360.927377] Key type ceph registered
[114360.927777] libceph: loaded (mon/osd proto 15/24)
[114360.932953] FS-Cache: Netfs 'ceph' registered for caching
[114360.932981] ceph: loaded (mds proto 32)
[114360.938916] libceph: mon3 (1)192.168.200.244:6789 session established
[114360.939362] libceph: mon3 (1)192.168.200.244:6789 socket closed (con state OPEN)
[114360.939390] libceph: mon3 (1)192.168.200.244:6789 session lost, hunting for new mon
[114360.941665] libceph: mon4 (1)192.168.200.245:6789 session established
[114360.958763] libceph: client110618835 fsid ad057332-18a6-49e9-a94e-1ae2a20bdc9f
[132998.404013] perf: interrupt took too long (3962 > 3952), lowering kernel.perf_event_max_sample_rate to 50250
[179452.288400] perf: interrupt took too long (4957 > 4952), lowering kernel.perf_event_max_sample_rate to 40250
[237410.176222] perf: interrupt took too long (6198 > 6196), lowering kernel.perf_event_max_sample_rate to 32250


Also, Corosync shows the exact same IP; we use 192.168.200.0/24 for both Corosync and the Ceph storage.

I am able to ping in both directions.
 
Also, Corosync shows the exact same IP; we use 192.168.200.0/24 for both Corosync and the Ceph storage.
How fast is that network?
Sharing it like that is usually not a good idea. Ideally, you give Corosync one physical network for itself. It doesn't need to be fast, but latency needs to be low; 1 Gbit is usually fine. You can then configure additional links for Corosync to switch to if the currently used network link becomes unusable.
If you cannot give Corosync its own dedicated network, make sure to configure more than one link for it so it can fall back to another one, which will hopefully allow Corosync to keep up a stable connection. This is especially important if you use the PVE HA stack (see the sketch below).
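
Purely as an illustration (node name, node ID and addresses are made up, not taken from your cluster), a second link in /etc/pve/corosync.conf would roughly look like this; remember to bump config_version whenever you edit the file:

Code:
nodelist {
  node {
    name: pve8
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.200.248   # current Corosync/Ceph network
    ring1_addr: 10.10.10.248      # hypothetical second link on a separate network
  }
  # ... one entry per node, each with a ring1_addr on the second network
}

totem {
  # ... existing totem settings stay as they are
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}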

Now back to the network problems.
Would you mind sharing your /etc/network/interfaces? If you have public IPs in there, mask them in a way that different ones can still be distinguished.

Do you use a larger MTU? If so, then there could be something up with that. If you use an MTU of 9000, you can test it with:
Code:
ping -M do -s 8000 <ip>

The -M do makes sure the packets are not fragmented into smaller ones, and a size of 8000 bytes is usually large enough to expose any issues while leaving enough overhead for the ICMP header and any potential VLAN tags and whatnot.
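
If you want to pin down the exact limit, the header math is straightforward: an unfragmented ICMP echo can carry the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header, so for an MTU of 9000:

Code:
ping -M do -s 8972 <ip>   # 9000 - 20 (IPv4) - 8 (ICMP) = 8972, should succeed
ping -M do -s 8973 <ip>   # one byte more, should fail with "message too long"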
 
Hi Aaron,
Thank you for looking into my problem.

The network we use is 40 Gbit/s (Mellanox). All previous nodes were added over the same link without issues.

root@pve8:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface enp7s0f0 inet manual

auto vmbr0
iface vmbr0 inet static
address XXX.XXX.XXX.XXX
netmask 255.255.255.0
gateway XXX.XXX.XXX.XXX
bridge_ports enp7s0f0
bridge_stp off
bridge_fd 0

auto enp7s0f1
iface enp7s0f1 inet static
address 12.12.12.248
netmask 255.255.255.0
mtu 1500


iface eno1 inet manual

iface eno2 inet manual


iface ibp130s0 inet manual

iface ibp130s0d1 inet manual

#used for ceph, crosssync
auto bond0
iface bond0 inet static
address 192.168.200.248
netmask 255.255.255.0
slaves ibp130s0 ibp130s0d1
bond_miimon 100
bond_mode active-backup
pre-up modprobe ib_ipoib
pre-up echo connected > /sys/class/net/ibp130s0/mode
pre-up echo connected > /sys/class/net/ibp130s0d1/mode
pre-up modprobe bond0
mtu 65520



Dec 19 23:59:39 pve8 pvestatd[3255]: status update time (5.461 seconds)
Dec 19 23:59:40 pve8 pvedaemon[2849395]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 19 23:59:40 pve8 pvedaemon[2849395]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 19 23:59:43 pve8 pvestatd[3255]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 19 23:59:46 pve8 pvedaemon[2849395]: got timeout
Dec 19 23:59:48 pve8 pvestatd[3255]: got timeout
Dec 19 23:59:48 pve8 pvestatd[3255]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 19 23:59:48 pve8 pvestatd[3255]: status update time (5.473 seconds)
Dec 19 23:59:54 pve8 pvestatd[3255]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 19 23:59:55 pve8 pvedaemon[2849395]: got timeout
Dec 19 23:59:55 pve8 pvedaemon[2849395]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 19 23:59:55 pve8 pvedaemon[2849395]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 19 23:59:56 pve8 pvedaemon[2849395]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 19 23:59:56 pve8 pvedaemon[2849395]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 19 23:59:59 pve8 pvestatd[3255]: got timeout
Dec 19 23:59:59 pve8 pvestatd[3255]: iscsi session scan failed: /usr/bin/iscsiadm: error while loading shared libraries: libisns-nocrypto.so.0: cannot open shared object file: No such file or directory
Dec 19 23:59:59 pve8 pvestatd[3255]: status update time (5.461 seconds)
Dec 20 00:00:00 pve8 systemd[1]: Starting Proxmox VE replication runner...
Dec 20 00:00:00 pve8 systemd[1]: Starting Daily man-db regeneration...
Dec 20 00:00:00 pve8 systemd[1]: Starting Rotate log files...
Dec 20 00:00:01 pve8 systemd[1]: Reloading PVE API Proxy Server.
Dec 20 00:00:01 pve8 systemd[1]: man-db.service: Succeeded.
Dec 20 00:00:01 pve8 systemd[1]: Started Daily man-db regeneration.
Dec 20 00:00:01 pve8 systemd[1]: pvesr.service: Succeeded.
Dec 20 00:00:01 pve8 systemd[1]: Started Proxmox VE replication runner.
Dec 20 00:00:01 pve8 pveproxy[3146062]: send HUP to 3293
Dec 20 00:00:01 pve8 pveproxy[3293]: received signal HUP
Dec 20 00:00:01 pve8 pveproxy[3293]: server closing
Dec 20 00:00:01 pve8 pveproxy[3293]: server shutdown (restart)
Dec 20 00:00:01 pve8 systemd[1]: Reloaded PVE API Proxy Server.
Dec 20 00:00:01 pve8 systemd[1]: Reloading PVE SPICE Proxy Server.
Dec 20 00:00:01 pve8 pvedaemon[2849395]: got timeout
Dec 20 00:00:02 pve8 spiceproxy[3146090]: send HUP to 3300
Dec 20 00:00:02 pve8 spiceproxy[3300]: received signal HUP
Dec 20 00:00:02 pve8 spiceproxy[3300]: server closing
Dec 20 00:00:02 pve8 spiceproxy[3300]: server shutdown (restart)
Dec 20 00:00:02 pve8 systemd[1]: Reloaded PVE SPICE Proxy Server.
Dec 20 00:00:02 pve8 pvefw-logger[40348]: received terminate request (signal)
Dec 20 00:00:02 pve8 pvefw-logger[40348]: stopping pvefw logger
Dec 20 00:00:02 pve8 systemd[1]: Stopping Proxmox VE firewall logger...
Dec 20 00:00:02 pve8 systemd[1]: pvefw-logger.service: Succeeded.
Dec 20 00:00:02 pve8 systemd[1]: Stopped Proxmox VE firewall logger.
Dec 20 00:00:02 pve8 systemd[1]: Starting Proxmox VE firewall logger...
Dec 20 00:00:02 pve8 pvefw-logger[3146099]: starting pvefw logger
Dec 20 00:00:02 pve8 systemd[1]: Started Proxmox VE firewall logger.
Dec 20 00:00:02 pve8 spiceproxy[3300]: restarting server
Dec 20 00:00:02 pve8 spiceproxy[3300]: starting 1 worker(s)
Dec 20 00:00:02 pve8 spiceproxy[3300]: worker 3146102 started
 
pre-up modprobe bond0
This line looks like it should not be there.
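
Just to rule it out, you could also check on each node whether connected mode and the large MTU are really in effect on that bond (interface names as in your config; the target address is a placeholder):

Code:
cat /sys/class/net/ibp130s0/mode            # expected: connected
cat /sys/class/net/ibp130s0d1/mode
ip link show bond0 | grep -o 'mtu [0-9]*'   # effective MTU of the bond
ping -M do -s 65000 <existing-node-ip>      # large unfragmented ping across the Ceph network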

The syslogs you attached don't show anything interesting, except that iscsiadm seems to have some issues.

Is your system up to date? That could be the cause of the pvestatd timeouts, but it should not have an effect on the Ceph issue (the timeout when you click on the Ceph panel).
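
A quick way to see whether anything relevant is still pending (the package names in the grep are only a guess at what might be involved):

Code:
apt update
apt list --upgradable 2>/dev/null | grep -Ei 'pve|ceph|iscsi|isns'
# if open-iscsi / libisns packages show up here, a full upgrade should pull them in
apt full-upgrade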
- /etc/ceph/ceph.conf is not a symlink to /etc/pve/ceph.conf
Sorry that I only noticed that just now... Are the contents of the separate /etc/ceph/ceph.conf file the same as those in /etc/pve/ceph.conf?

Was there anything during the installation that did not go exactly as expected? I don't understand how that file ended up not being a symlink to the Ceph config in the /etc/pve directory.
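
On a node where Ceph was installed through PVE, /etc/ceph/ceph.conf is normally just a symlink into /etc/pve, so comparing the two and, if they match, restoring the link would look roughly like this (a sketch, not something I have verified against your node):

Code:
ls -l /etc/ceph/ceph.conf                    # on a healthy node this points to /etc/pve/ceph.conf
diff /etc/ceph/ceph.conf /etc/pve/ceph.conf  # check whether the contents differ

# only if the local copy contains nothing that /etc/pve/ceph.conf is missing:
mv /etc/ceph/ceph.conf /etc/ceph/ceph.conf.bak
ln -s /etc/pve/ceph.conf /etc/ceph/ceph.conf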
 
Hi Aaron,

It seems that the config files are not linked properly, so the entire machine will be reinstalled; it would be better that way.

Thank you for the help, the solution has been found.
 