Hi there,
I have a 3-node cluster; 2 of the nodes have 4 x SSDs each and use ZFS with RAIDZ-1. These pools are at 13% and 22% capacity.
I've recently started converting some containers to QEMU VMs to take advantage of the near-realtime migration available with ZFS, and I have also enabled replication to help with this.
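For what it's worth, the capacity figures come from a quick check along these lines:
Code:
# pool usage at a glance; "cap" is the percent-used column
zpool list -o name,size,alloc,free,cap
# print only unhealthy pools, so silence means all pools are fine
zpool status -x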
Here's a high-level overview:
node1 has vm1
node2 has vm2
Replication runs every 15 minutes and copies vm2 to node1 and vm1 to node2 (rough config sketch below).
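The jobs look roughly like this in /etc/pve/replication.cfg (vm1 is VM 101 per the alert below; vm2's ID here is just a placeholder):
Code:
local: 101-0
        target node2
        schedule */15

local: 102-0
        target node1
        schedule */15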
The vm2 replication was set up first and has never had an issue.
vm1 has an intermittent issue; the email alert is fairly basic:
Code:
Subject: Replication Job: 101-0 failed
import failed: exit code 29
The log from the web interface shows this:
Code:
2019-05-13 10:00:01 101-0: start replication job
2019-05-13 10:00:01 101-0: guest => VM 101, running => 35788
2019-05-13 10:00:01 101-0: volumes => local-zfs:vm-101-disk-0
2019-05-13 10:00:02 101-0: freeze guest filesystem
2019-05-13 10:00:02 101-0: create snapshot '__replicate_101-0_1557741601__' on local-zfs:vm-101-disk-0
2019-05-13 10:00:02 101-0: thaw guest filesystem
2019-05-13 10:00:02 101-0: incremental sync 'local-zfs:vm-101-disk-0' (__replicate_101-0_1557740701__ => __replicate_101-0_1557741601__)
2019-05-13 10:00:04 101-0: delete previous replication snapshot '__replicate_101-0_1557741601__' on local-zfs:vm-101-disk-0
2019-05-13 10:00:04 101-0: end replication job with error: import failed: exit code 29
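My next step is probably to replay the failing incremental by hand to get a real error message instead of just "exit code 29". Something like this, assuming the default local-zfs dataset path of rpool/data and using the snapshot names from the log above:
Code:
# on node1: resend the incremental between the two replication snapshots,
# piped into a dry-run receive on node2 to surface the underlying zfs error
# (rpool/data is an assumption -- the default dataset for local-zfs storage)
zfs send -v -i rpool/data/vm-101-disk-0@__replicate_101-0_1557740701__ \
    rpool/data/vm-101-disk-0@__replicate_101-0_1557741601__ \
  | ssh root@node2 zfs receive -nFv rpool/data/vm-101-disk-0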
I've read some other threads on this, and they all hint at the storage being overloaded; however, there is no sign of that on these nodes:
CPU: ~1.5%
Load: ~0.4
RAM: ~18%
IO: essentially idle
I adjusted the schedule of the failing job (vm1) so that it wouldn't conflict with the incoming vm2 sync, but this has not helped.
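Concretely, I staggered it with something like this (if I've read the calendar syntax right, 5/15 fires at minutes 5, 20, 35 and 50, while the incoming vm2 job stays on */15):
Code:
# shift vm1's job so it no longer lines up with the incoming vm2 sync
pvesr update 101-0 --schedule '5/15'
# confirm both jobs and their next sync times
pvesr status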
I captured the output from iotop on node2 during a sync; it does not look overloaded at all:
Code:
Total DISK READ : 272.67 K/s | Total DISK WRITE : 386.59 K/s
Actual DISK READ: 422.08 K/s | Actual DISK WRITE: 24.41 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
929 be/4 root 0.00 B/s 0.00 B/s 0.00 % 8.47 % [txg_sync]
1514 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.44 % [kworker/u98:3]
43974 be/4 root 0.00 B/s 33.62 K/s 0.00 % 0.22 % kvm -id 100 -name db0~=300 -machine type=pc
42476 be/4 root 0.00 B/s 48.56 K/s 0.00 % 0.16 % kvm -id 100 -name db0~=300 -machine type=pc
42768 be/4 root 0.00 B/s 123.26 K/s 0.00 % 0.16 % kvm -id 100 -name db0~=300 -machine type=pc
43973 be/4 root 0.00 B/s 37.35 K/s 0.00 % 0.16 % kvm -id 100 -name db0~=300 -machine type=pc
42473 be/4 root 0.00 B/s 82.17 K/s 0.00 % 0.08 % kvm -id 100 -name db0~=300 -machine type=pc
2302 be/4 root 0.00 B/s 956.21 B/s 0.00 % 0.00 % rsyslogd -n [rs:main Q:Reg]
735 be/0 root 48.56 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_iss]
36951 be/0 root 115.79 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_iss]
36953 be/0 root 108.32 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_iss]
2849 be/4 root 0.00 B/s 60.70 K/s 0.00 % 0.00 % pmxcfs [cfs_loop]
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init
2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd]
22531 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/17:1]
4 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/0:0H]
43938 be/4 postfix 0.00 B/s 0.00 B/s 0.00 % 0.00 % qmgr -l -t unix -u
7 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [mm_percpu_wq]
8 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
9 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_sched]
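I also plan to watch the pool itself during a sync window with something like:
Code:
# per-vdev throughput on the receiving pool, sampled every 5 seconds
# ("rpool" is a placeholder for the actual pool name)
zpool iostat -v rpool 5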
The failures appear random; I cannot infer any pattern from them.
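One thing I still want to rule out is a stale replication snapshot on the receiving side, since a leftover __replicate_ snapshot on the target can break an incremental receive. A check like this on both nodes should show any leftovers:
Code:
# list replication snapshots with creation times; compare node1 vs node2
zfs list -t snapshot -o name,creation | grep __replicate_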
Any ideas much appreciated!
Thanks!
Package versions:
Code:
proxmox-ve: 5.4-1 (running kernel: 4.15.18-13-pve)
pve-manager: 5.4-5 (running version: 5.4-5/c6fdb264)
pve-kernel-4.15: 5.4-1
pve-kernel-4.15.18-13-pve: 4.15.18-37
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-51
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-26
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-20
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2