3-Node Cluster: HA Fence During Backups

Frank bartels

Mar 6, 2018
Hello everyone,

we run a 3-node HA cluster with Ceph at our company. We keep seeing one of the nodes get fenced out of the cluster while vzdump is running. It does not happen every day, but roughly once every one to two weeks.
We are stumped.

The details: the time is identical on all servers, including the hardware clock, in case anyone asks.

Code:
proxmox-ve: 5.4-2 (running kernel: 4.15.18-20-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-8
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-19-pve: 4.15.18-45
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.15.15-1-pve: 4.15.15-6
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-54
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-6
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-40
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-54
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

Log messages from our node TOM just before the fence:

Code:
Sep 12 23:56:18 tom pvestatd[2728]: status update time (10.562 seconds)
Sep 12 23:56:18 tom pmxcfs[2340]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/tom/HAPVE01-Backup-122: -1
Sep 12 23:56:18 tom pmxcfs[2340]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/tom/local: -1
Sep 12 23:56:18 tom pmxcfs[2340]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/tom/HAPVE02: -1
Sep 12 23:56:36 tom pve-firewall[2714]: firewall update time (6.510 seconds)
Sep 12 23:56:36 tom pvestatd[2728]: status update time (7.377 seconds)
Sep 12 23:56:38 tom rrdcached[2261]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-storage/tom/HAPVE01-Backup-122) failed with status -1. (/var/lib/rrdcached/db/pve2-storage/tom/HAPVE01-Backup-122: illegal attempt to update using time 1568325108 when last update time is 1568325378 (minimum one second step))
Sep 12 23:56:38 tom rrdcached[2261]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-storage/tom/local) failed with status -1. (/var/lib/rrdcached/db/pve2-storage/tom/local: illegal attempt to update using time 1568325108 when last update time is 1568325378 (minimum one second step))
Sep 12 23:56:38 tom rrdcached[2261]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-storage/tom/HAPVE02) failed with status -1. (/var/lib/rrdcached/db/pve2-storage/tom/HAPVE02: illegal attempt to update using time 1568325108 when last update time is 1568325378 (minimum one second step))
Sep 12 23:57:03 tom pvestatd[2728]: status update time (5.106 seconds)
Sep 12 23:57:03 tom systemd[1]: Started Proxmox VE replication runner.
Sep 12 23:57:03 tom systemd[1]: Starting Proxmox VE replication runner...
Sep 12 23:57:16 tom systemd[1]: Started Proxmox VE replication runner.
Sep 12 23:57:16 tom vzdump[715910]: INFO: Finished Backup of VM 106 (00:02:12)
Sep 12 23:57:16 tom vzdump[715910]: INFO: Starting Backup of VM 108 (qemu)
Sep 12 23:57:17 tom qm[716608]: <root@pam> update VM 108: -lock backup
Sep 12 23:58:00 tom systemd[1]: Starting Proxmox VE replication runner...
Sep 12 23:58:09 tom pve-firewall[2714]: firewall update time (9.550 seconds)
Sep 12 23:58:09 tom pvestatd[2728]: status update time (11.183 seconds)
Sep 12 23:58:09 tom pmxcfs[2340]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/tom/HAPVE01-Backup-122: -1
Sep 12 23:58:09 tom pmxcfs[2340]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/tom/HAPVE02: -1
Sep 12 23:58:09 tom pmxcfs[2340]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/tom/local: -1

As mentioned, the time is identical on all nodes.
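
For reference, the two epoch values in the rrdcached message can be compared directly. This is only a minimal sketch: the variable and function names are made up for illustration, and only the two timestamps come from the log above.

```python
from datetime import datetime, timezone

# Both values are copied from the rrdcached error in the log above.
attempted_update = 1568325108  # timestamp rrdcached was asked to write
last_update = 1568325378       # timestamp already stored in the RRD file

# How far the rejected sample lies behind the stored one.
lag_seconds = last_update - attempted_update


def to_utc(epoch: int) -> str:
    """Render an epoch timestamp as a UTC wall-clock string."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


print("attempted:", to_utc(attempted_update))  # 2019-09-12 21:51:48
print("last     :", to_utc(last_update))       # 2019-09-12 21:56:18
print("lag      :", lag_seconds, "seconds")    # 270 seconds
```

So the rejected sample is 270 seconds older than the one already stored. Assuming the nodes run on UTC+2 (not stated above), the stored value corresponds to the 23:56:18 log line; whether the old sample comes from a delayed status update or from something else cannot be read from the log alone.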

Thanks a lot for looking through this and for any suggestions.

Best regards
Code:
ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)
Code:
time pvesm status:

Name                      Type     Status           Total            Used       Available        %
HAPVE01-Backup-122         nfs     active     11402145792      6987758080      4414387712   61.28%
HAPVE02                    rbd     active      3328851955      1935063667      1393788288   58.13%
local                      dir     active        57411424        10213428        44251948   17.79%

real    0m0,974s
user    0m0,712s
sys    0m0,178s
 
Does the cluster share its network with other traffic (backup/VMs/Ceph)? If so, the backup may simply be saturating the network to the point where the cluster nodes can no longer communicate with each other.
The best option is to set up a dedicated physical network for the cluster traffic (redundant, if possible).
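
One way to check whether the slowdowns line up with the backup window is to scan the syslog for slow status updates. The following is a rough sketch only: the function name `slow_updates` and the 10-second threshold are arbitrary choices for illustration, and the regular expression assumes the line format shown in the log above.

```python
import re

# Matches lines such as:
#   "Sep 12 23:56:18 tom pvestatd[2728]: status update time (10.562 seconds)"
UPDATE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"(?P<daemon>pvestatd|pve-firewall)\[\d+\]:\s+"
    r"(?:status|firewall) update time \((?P<secs>[\d.]+) seconds\)"
)


def slow_updates(lines, threshold=10.0):
    """Return (timestamp, daemon, seconds) for updates at or above threshold."""
    hits = []
    for line in lines:
        match = UPDATE_RE.match(line)
        if match and float(match.group("secs")) >= threshold:
            hits.append(
                (match.group("stamp"), match.group("daemon"), float(match.group("secs")))
            )
    return hits


# Sample lines taken from the log posted above.
sample = [
    "Sep 12 23:56:18 tom pvestatd[2728]: status update time (10.562 seconds)",
    "Sep 12 23:56:36 tom pve-firewall[2714]: firewall update time (6.510 seconds)",
    "Sep 12 23:57:16 tom vzdump[715910]: INFO: Starting Backup of VM 108 (qemu)",
    "Sep 12 23:58:09 tom pvestatd[2728]: status update time (11.183 seconds)",
]

for stamp, daemon, secs in slow_updates(sample):
    print(f"{stamp}  {daemon:<12}  {secs:.3f}s")
```

Running the same function over the full syslog of each node and comparing the hits against the vzdump start and end times would show whether the delays cluster around the backups. It says nothing about the cause by itself.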
 
