[SOLVED] Cluster formation: certificates missing

aChris

New Member
Jun 1, 2023
Hello,

I am currently in the process of setting up a 3-node cluster.
Node 1 has successfully created a cluster.
Node 2 was able to join the cluster, but the cluster's state is not OK.

On the surface, the GUI of each node shows the status of the other node with a grey question mark.

Under the hood, /etc/pve/.members contains the correct IP for the local node but not for the other one, even though /etc/hosts has the correct entries. The nodes can also connect to each other via SSH without problems.

/etc/pve/.members
{
"nodename": "pve2",
"version": 7,
"cluster": { "name": "<correct Clustername>", "version": 2, "nodes": 2, "quorate": 1 },
"nodelist": {
"pve1": { "id": 1, "online": 1},
"pve2": { "id": 2, "online": 1, "ip": "192.168.111.220"}
}
}

/etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.111.210 pve1.prox.mox pve1
192.168.111.220 pve2.prox.mox pve2
192.168.111.230 pve3.prox.mox pve3

# The following lines are desirable for IPv6 capable hosts

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

When trying to regenerate the certificates, the following error message was thrown:

root@pve2:/etc/ssh# pvecm updatecerts -f
(re)generate node files
generate new node certificate
Could not read private key from /etc/pve/nodes/pve2/pve-ssl.key
unable to generate pve certificate request:
command 'openssl req -batch -new -config /tmp/pvesslconf-10140.tmp -key /etc/pve/nodes/pve2/pve-ssl.key -out /tmp/pvecertreq-10140.tmp' failed: exit code 1

The reason is that /etc/pve/nodes/pve2/pve-ssl.key is completely empty.
Also, when Node 2 joined the cluster, /etc/pve/priv/known_hosts was not populated with Node 2's SSH keys.
As an attempted fix, I manually copied Node 2's SSH key from /etc/pve/nodes/pve2/ssh_known_hosts into /etc/pve/priv/known_hosts, but without success:

systemctl stop corosync
systemctl stop pve-cluster
pmxcfs -l
nano /etc/pve/priv/known_hosts
killall pmxcfs
systemctl start corosync
systemctl start pve-cluster

/etc/pve/nodes/pve2/pve-ssl.pem' does not exist! (500)

communication failure (0)

Whenever I try to read anything from /etc/pve or run "pvecm updatecerts -f", the CLI regularly freezes until corosync.service is restarted on one of the two nodes.
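
A hang like this presumably means the pmxcfs mount at /etc/pve is blocking while the cluster communication is broken. A rough way to confirm that (plain standard tools, nothing beyond the service names already used above) would be:

# processes stuck in uninterruptible sleep (state D), typical for a blocked /etc/pve
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

# state of the cluster filesystem and corosync on both nodes
systemctl status pve-cluster corosync

# kernel hung-task warnings, like the ones in the dmesg excerpt below
dmesg | grep -i "blocked for more than"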


pve-manager/8.2.7/3e0176e6bb2ade3b (running kernel: 6.8.12-2-pve)

Does anyone have any further suggestions?
 
dmesg excerpt:
[ 3196.135689] task:pvescheduler state:D stack:0 pid:15736 tgid:15736 ppid:3625 flags:0x00000006
[ 3196.135693] Call Trace:
[ 3196.135696] <TASK>
[ 3196.135700] __schedule+0x401/0x15e0
[ 3196.135710] schedule+0x33/0x110
[ 3196.135713] schedule_preempt_disabled+0x15/0x30
[ 3196.135716] rwsem_down_write_slowpath+0x392/0x6a0
[ 3196.135721] down_write+0x5c/0x80
[ 3196.135723] filename_create+0xaf/0x1b0
[ 3196.135728] do_mkdirat+0x59/0x180
[ 3196.135732] __x64_sys_mkdir+0x4a/0x70
[ 3196.135734] x64_sys_call+0x2e3/0x24b0
[ 3196.135738] do_syscall_64+0x81/0x170
[ 3196.135742] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3196.135745] ? do_syscall_64+0x8d/0x170
[ 3196.135748] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3196.135750] ? syscall_exit_to_user_mode_prepare+0x17b/0x1a0
[ 3196.135754] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3196.135756] ? syscall_exit_to_user_mode+0x89/0x260
[ 3196.135759] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3196.135761] ? do_syscall_64+0x8d/0x170
[ 3196.135763] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3196.135765] ? irqentry_exit+0x43/0x50
[ 3196.135768] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3196.135769] ? exc_page_fault+0x94/0x1b0
[ 3196.135773] entry_SYSCALL_64_after_hwframe+0x78/0x80
[ 3196.135776] RIP: 0033:0x747c05d04e27
[ 3196.135798] RSP: 002b:00007ffc80301f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[ 3196.135801] RAX: ffffffffffffffda RBX: 000059dc3c6e32a0 RCX: 0000747c05d04e27
[ 3196.135802] RDX: 000000000000001f RSI: 00000000000001ff RDI: 000059dc430deaa0
[ 3196.135803] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[ 3196.135805] R10: 0000000000000000 R11: 0000000000000246 R12: 000059dc3c6e8c88
[ 3196.135806] R13: 000059dc430deaa0 R14: 000059dc3e1487c0 R15: 00000000000001ff
[ 3196.135811] </TASK>
[ 3196.135813] INFO: task pveproxy:16023 blocked for more than 245 seconds.
[ 3196.135814] Tainted: P O 6.8.12-2-pve #1
[ 3196.135816] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3196.135816] task:pveproxy state:D stack:0 pid:16023 tgid:16023 ppid:1 flags:0x00000006
[ 3196.135820] Call Trace:
[ 3196.135821] <TASK>
[ 3196.135823] __schedule+0x401/0x15e0
[ 3196.135828] schedule+0x33/0x110
[ 3196.135831] schedule_preempt_disabled+0x15/0x30
[ 3196.135833] rwsem_down_write_slowpath+0x392/0x6a0
[ 3196.135838] down_write+0x5c/0x80
[ 3196.135840] filename_create+0xaf/0x1b0
[ 3196.135843] do_mkdirat+0x59/0x180
[ 3196.135847] __x64_sys_mkdir+0x4a/0x70
[ 3196.135849] x64_sys_call+0x2e3/0x24b0
[ 3196.135851] do_syscall_64+0x81/0x170
[ 3196.135854] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3196.135856] ? do_syscall_64+0x8d/0x170
[ 3196.135858] ? irqentry_exit+0x43/0x50
[ 3196.135860] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3196.135862] ? exc_page_fault+0x94/0x1b0
[ 3196.135865] entry_SYSCALL_64_after_hwframe+0x78/0x80
[ 3196.135868] RIP: 0033:0x75dd653c9e27
[ 3196.135870] RSP: 002b:00007ffc37fff768 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[ 3196.135872] RAX: ffffffffffffffda RBX: 00006166c7ea02a0 RCX: 000075dd653c9e27
[ 3196.135874] RDX: 00006166c659870f RSI: 00000000000001ff RDI: 00006166ccf01000
[ 3196.135875] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 3196.135876] R10: 00006166ccb10f68 R11: 0000000000000246 R12: 00006166c95e4d08
[ 3196.135877] R13: 00006166ccf01000 R14: 00006166ccd72b80 R15: 00000000000001ff
[ 3196.135881] </TASK>
[ 3196.135882] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings

Journal excerpt:
Apr 04 14:49:57 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:49:57 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:49:58 pve1 audit[19882]: NETFILTER_CFG table=filter family=7 entries=0 op=xt_replace pid=19882 subj=unconfined comm="ebtables-restor"
Apr 04 14:49:58 pve1 audit[19882]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=6031eebbce60 items=0 ppid=3566 pid=19882 auid=4294967295 uid=0 g>
Apr 04 14:49:58 pve1 audit: PROCTITLE proctitle="ebtables-restore"
Apr 04 14:49:58 pve1 pmxcfs[15488]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/pve1: -1
Apr 04 14:49:58 pve1 pmxcfs[15488]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-node/pve1: /var/lib/rrdcached/db/pve2-node/pve1: illegal attempt to updat>
Apr 04 14:49:58 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:49:58 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:49:58 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:49:58 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:49:58 pve1 pmxcfs[15488]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve1/local-zfs: -1
Apr 04 14:49:58 pve1 pmxcfs[15488]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/pve1/local-zfs: /var/lib/rrdcached/db/pve2-storage/pve1/local-zfs>
Apr 04 14:49:58 pve1 pmxcfs[15488]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve1/local: -1
Apr 04 14:49:58 pve1 pmxcfs[15488]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/pve1/local: /var/lib/rrdcached/db/pve2-storage/pve1/local: illega>
Apr 04 14:49:58 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:49:58 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:49:58 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:49:58 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:49:59 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:49:59 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:00 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:01 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:01 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:01 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:02 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:02 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:02 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:03 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:03 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:03 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:03 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:03 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:03 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:04 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:04 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:04 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:04 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:05 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:05 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:06 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:07 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:07 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:08 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:08 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:08 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:08 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:08 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:08 pve1 pmxcfs[15488]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/pve1: -1
Apr 04 14:50:08 pve1 pmxcfs[15488]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-node/pve1: /var/lib/rrdcached/db/pve2-node/pve1: illegal attempt to updat>
Apr 04 14:50:08 pve1 audit[19922]: NETFILTER_CFG table=filter family=7 entries=0 op=xt_replace pid=19922 subj=unconfined comm="ebtables-restor"
Apr 04 14:50:08 pve1 audit[19922]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=5925fad9fe60 items=0 ppid=3566 pid=19922 auid=4294967295 uid=0 g>
Apr 04 14:50:08 pve1 audit: PROCTITLE proctitle="ebtables-restore"
Apr 04 14:50:08 pve1 pmxcfs[15488]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve1/local: -1
Apr 04 14:50:08 pve1 pmxcfs[15488]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/pve1/local: /var/lib/rrdcached/db/pve2-storage/pve1/local: illega>
Apr 04 14:50:08 pve1 pmxcfs[15488]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve1/local-zfs: -1
Apr 04 14:50:08 pve1 pmxcfs[15488]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/pve1/local-zfs: /var/lib/rrdcached/db/pve2-storage/pve1/local-zfs>
Apr 04 14:50:08 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:08 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:08 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:08 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:09 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:09 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:10 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:11 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:11 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:11 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:11 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:11 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:12 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:12 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:13 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:13 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:13 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:13 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:13 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:13 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
Apr 04 14:50:13 pve1 corosync[15441]: [TOTEM ] Retransmit List: 16 17
 
Hello,
corosync[15441]: [TOTEM ] Retransmit List: 16 17 means that the communication between the nodes is not working. One node would like to report changes to the other but does not receive an acknowledgement.
What does the network configuration of the nodes look like? (cat /etc/network/interfaces)
What is the output of pvecm status and corosync-cfgtool -n?
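
To gather everything in one pass, something along these lines could be run on both nodes (the corosync-cfgtool -s and journalctl calls are just an additional suggestion):

cat /etc/network/interfaces
pvecm status
corosync-cfgtool -n
# local link status as seen by corosync
corosync-cfgtool -s
# recent corosync messages on this node
journalctl -u corosync --since "1 hour ago" --no-pager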
 
Hello,

Sorry for the late reply, I was out on sick leave.
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto enp65s0f0np0
iface enp65s0f0np0 inet manual

auto enp65s0f1np1
iface enp65s0f1np1 inet manual

auto enp129s0f0np0
iface enp129s0f0np0 inet manual

auto ens3f1np1
iface ens3f1np1 inet manual
ovs_type OVSPort
ovs_bridge vmbr2
ovs_mtu 9000
ovs_options vlan_mode=native-untagged other_config:rstp-port-admin-edge=false other_config:rstp-port-mcheck=true other_config:rstp-port-auto-edge=false other_config:rstp-enable=true other_config:rstp-path-cost=150

iface usb0 inet manual

auto enp129s0f1np1
iface enp129s0f1np1 inet manual

iface enxbe3af2b6059f inet manual

auto ens3f0np0
iface ens3f0np0 inet manual
ovs_type OVSPort
ovs_bridge vmbr2
ovs_mtu 9000
ovs_options vlan_mode=native-untagged other_config:rstp-port-admin-edge=false other_config:rstp-port-mcheck=true other_config:rstp-enable=true other_config:rstp-port-auto-edge=false other_config:rstp-path-cost=150

auto bond0
iface bond0 inet manual
bond-slaves enp129s0f0np0 enp129s0f1np1
bond-miimon 100
bond-mode active-backup
bond-primary enp129s0f0np0
#VM Network

auto vmbr0
iface vmbr0 inet static
address 192.168.111.220/24
gateway 192.168.111.254
bridge-ports enp65s0f0np0
bridge-stp off
bridge-fd 0
#GUI

auto vmbr1
iface vmbr1 inet manual
bridge-ports bond0
bridge-stp off
bridge-fd 0
#VM Network

auto vmbr2
iface vmbr2 inet static
address 192.168.120.220/24
ovs_type OVSBridge
ovs_ports ens3f0np0 ens3f1np1
ovs_mtu 9000
up ovs-vsctl set Bridge ${IFACE} rstp_enable=true other_config:rstp-priority=32768 other_config:rstp-forward-delay=4 other_config:rstp-max-age=6
post-up sleep 10
#CEPH

auto vmbr3
iface vmbr3 inet static
address 192.168.130.220/24
bridge-ports enp65s0f1np1
bridge-stp off
bridge-fd 0
#Corosync

The Ceph network is set up as a full mesh.
The remaining interfaces are connected via two switches.
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#/etc/network/interface_4
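
Since vmbr2 runs with an MTU of 9000 while the other bridges use the default, one thing that could be ruled out is fragmentation on the two corosync links. A quick check from pve2 (peer addresses taken from the outputs below; payload sizes assume 28 bytes of ICMP/IP overhead) might look like this:

# link 0, corosync bridge vmbr3 (default MTU 1500): 1472 + 28 = 1500
ping -M do -s 1472 -c 3 192.168.130.210
# link 1, Ceph mesh vmbr2 (MTU 9000): 8972 + 28 = 9000
ping -M do -s 8972 -c 3 192.168.120.210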

corosync-cfgtool -n
Local node ID 2, transport knet
nodeid: 1 reachable
LINK: 0 udp (192.168.130.220->192.168.130.210) enabled connected mtu: 1397
LINK: 1 udp (192.168.120.220->192.168.120.210) enabled connected mtu: 8885

pvecm status
Cluster information
-------------------
Name: Avasys
Config Version: 2
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed Apr 9 09:33:54 2025
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 1.32c
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.130.210
0x00000002 1 192.168.130.220 (local)

At some point within the last four days the two nodes apparently did manage to exchange data somehow, because today they had, for the first time, correctly entered each other's IP addresses in /etc/pve/.members, and the GUI showed everything as OK.
The underlying problem still exists, however: after a reboot of Node 1 the same communication problems reappear.
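
To narrow this down after the next reboot of Node 1, the state right after boot could be captured on both nodes with something like the following (standard commands only):

# corosync and pmxcfs messages since boot
journalctl -b -u corosync -u pve-cluster --no-pager
# link status and the membership as pmxcfs sees it
corosync-cfgtool -s
cat /etc/pve/.members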