Hi all,
Hope everyone is keeping safe and sane.
We run a fairly busy production Proxmox cluster of 12 nodes. Each node hosts between 40 and 70 VMs on SSD RAIDZ2 pools, and has 72-80 cores (Xeon Scalable) with 640-768 GB of RAM.
We are in the process of migrating another 12 nodes across to Proxmox from SmartOS.
Last week, we suffered the following failure on one of our nodes (see the log excerpt below); it hasn't happened before. Restarting the cluster service on the affected node looks like it fixed the issue, but the initial dcdb failure also resulted in what appeared to be IO errors on a number of VMs, which hung and had to be restarted (the recovery commands we used are sketched just after the log excerpt):
May 07 07:44:11 pve-11 pmxcfs[38522]: [dcdb] notice: data verification successful
May 07 08:12:14 pve-11 pmxcfs[38522]: [dcdb] crit: serious internal error - stop cluster connection
May 07 08:12:16 pve-11 pmxcfs[38522]: [dcdb] crit: can't initialize service
May 07 08:28:38 pve-11 corosync[39311]: [CPG ] *** 0x562f56194180 can't mcast to group pve_dcdb_v1 state:1, error:12
May 07 08:35:23 pve-11 pmxcfs[68736]: [dcdb] crit: cpg_initialize failed: 2
May 07 08:35:23 pve-11 pmxcfs[68736]: [dcdb] crit: can't initialize service
May 07 08:35:29 pve-11 pmxcfs[68736]: [dcdb] crit: cpg_initialize failed: 2
May 07 08:35:30 pve-11 pmxcfs[69920]: [dcdb] crit: cpg_initialize failed: 2
May 07 08:35:30 pve-11 pmxcfs[69920]: [dcdb] crit: can't initialize service
May 07 08:35:36 pve-11 pmxcfs[69920]: [dcdb] crit: cpg_initialize failed: 2
May 07 08:35:37 pve-11 pmxcfs[71126]: [dcdb] crit: cpg_initialize failed: 2
May 07 08:35:37 pve-11 pmxcfs[71126]: [dcdb] crit: can't initialize service
May 07 08:35:43 pve-11 pmxcfs[71126]: [dcdb] notice: members: 11/71126
May 07 08:35:43 pve-11 pmxcfs[71126]: [dcdb] notice: all data is up to date
May 07 08:35:46 pve-11 pmxcfs[71126]: [dcdb] notice: members: 1/348122, 2/340314, 3/2738130, 4/2814404, 5/4166389, 6/1180260, 7/3551791, 8/1501512, 9/1288849, 10/7438, 11/71126, 12/39377
May 07 08:35:46 pve-11 pmxcfs[71126]: [dcdb] notice: starting data syncronisation
May 07 08:35:48 pve-11 pmxcfs[71126]: [dcdb] notice: received sync request (epoch 1/348122/00000003)
May 07 08:35:48 pve-11 pmxcfs[71126]: [dcdb] notice: received all states
May 07 08:35:48 pve-11 pmxcfs[71126]: [dcdb] notice: leader is 1/348122
May 07 08:35:48 pve-11 pmxcfs[71126]: [dcdb] notice: synced members: 1/348122, 2/340314, 3/2738130, 4/2814404, 5/4166389, 6/1180260, 7/3551791, 8/1501512, 9/1288849, 10/7438, 12/39377
May 07 08:35:48 pve-11 pmxcfs[71126]: [dcdb] notice: waiting for updates from leader
May 07 08:35:48 pve-11 pmxcfs[71126]: [dcdb] notice: dfsm_deliver_queue: queue length 12
May 07 08:35:48 pve-11 pmxcfs[71126]: [dcdb] notice: update complete - trying to commit (got 19 inode updates)
May 07 08:35:48 pve-11 pmxcfs[71126]: [dcdb] notice: all data is up to date
May 07 08:35:48 pve-11-dc5-r08-vox-teraco-jb1 pmxcfs[71126]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 15
May 07 08:35:48 pve-11-dc5-r08-vox-teraco-jb1 pmxcfs[71126]: [dcdb] notice: dfsm_deliver_queue: queue length 1
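For completeness, "restarting the cluster service" here means roughly the following; the commands are illustrative (reconstructed from memory) and <vmid> stands in for the IDs of the affected guests:
# systemctl restart pve-cluster          (restarts pmxcfs so it re-registers with corosync's CPG groups)
# systemctl status pve-cluster corosync  (check both services came back without errors)
# pvecm status                           (confirm the node rejoined and the cluster is quorate)
# qm stop <vmid> && qm start <vmid>      (repeated for each VM that had hung on IO errors)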
Also, I have attached additional logs covering the period from when the issue started (08h12) to when the cluster service was restarted (08h35).
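If it helps anyone reproduce the extraction, the attached excerpt was pulled with something along these lines (exact timestamps and output filename are just illustrative):
# journalctl -u pve-cluster -u corosync --since "2020-05-07 08:10:00" --until "2020-05-07 08:40:00" > pve-11-dcdb-incident.log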
As mentioned, we've not had this issue before, but it is concerning.
In addition, here is our version information:
# pveversion --verbose
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
Can this be entirely attributed to a cluster service failure on the affected node, or should we be looking for anything else?
Thanks and regards,
Angelo.