Hello
I have two nodes in a cluster running the latest version of PVE (7.4-3), and since the last update the pvescheduler service has stopped working.
The service starts, but as soon as it runs the first disk replication task to the second node, the worker process dies and is left defunct:
Code:
pve02 ~ # ps aux | grep pvescheduler
root 1309422 0.0 0.0 336696 109292 ? Ss 08:36 0:00 pvescheduler
root 1309423 0.0 0.0 0 0 ? Z 08:36 0:00 [pvescheduler] <defunct>
root 1313066 0.0 0.0 0 0 ? Z 08:37 0:00 [pvescheduler] <defunct>
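If it helps, the unit's journal can be followed around a scheduler tick to catch whatever the forked worker reports before it goes defunct; these are plain systemd commands, nothing Proxmox-specific:
Code:
pve02 ~ # systemctl status pvescheduler
pve02 ~ # journalctl -u pvescheduler --since "1 hour ago"
pve02 ~ # journalctl -u pvescheduler -f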
It happens on both nodes, and I have already tried reinstalling the pve-manager package and restarting both nodes. Attaching strace to the main process only shows it sleeping in its one-minute loop:
Code:
pve02 ~ # strace -yyttT -f -s 512 -p 1309422
strace: Process 1309422 attached
10:28:48.551399 restart_syscall(<... resuming interrupted read ...>) = 0 <11.599147>
10:29:00.150793 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=60, tv_nsec=0}, 0x7ffe7fcf8f90) = 0 <60.000169>
10:30:00.151229 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=60, tv_nsec=0}, 0x7ffe7fcf8f90) = 0 <60.000191>
10:31:00.151667 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=60, tv_nsec=0}, 0x7ffe7fcf8f90) = 0 <60.000171>
10:32:00.152079 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=60, tv_nsec=0}, 0x7ffe7fcf8f90) = 0 <60.000128>
10:33:00.152368 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=60, tv_nsec=0}, 0x7ffe7fcf8f90) = 0 <60.000178>
10:34:00.152781 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=60, tv_nsec=0},
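So the parent just keeps its one-minute sleep loop. To see how the forked worker actually exits, the trace can be narrowed to the process-management syscalls (a generic strace filter, reusing the parent PID from the ps output above); the wait4 return should show the worker's exit status:
Code:
pve02 ~ # strace -f -ttT -e trace=clone,wait4,exit_group -p 1309422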
More info:
Code:
pve02 ~ # pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.102-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.3-3
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-3
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
openvswitch-switch: 2.15.0+ds1-2+deb11u2.1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.6.3
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20221111-2
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.11-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
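Since this only started after the last update, I also want to rule out a partially applied upgrade; these are stock apt commands to check for packages that are still upgradable or held back (the last one only if something shows up):
Code:
pve02 ~ # apt update
pve02 ~ # apt list --upgradable
pve02 ~ # apt full-upgrade
The cluster itself is quorate and looks healthy: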
Code:
pve01 ~ # pvecm status
Cluster information
-------------------
Name: PVECluster
Config Version: 2
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Mon Mar 27 10:29:43 2023
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.c7
Quorate: Yes
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.101.111
0x00000002 1 192.168.101.112 (local)
The replication tasks remain stuck in the pending state, but if I run them manually they complete fine:
Code:
pve02 ~ # pvesr status
JobID Enabled Target LastSync NextSync Duration FailCount State
1002-0 Yes local/pve01 2023-03-26_16:08:29 pending 11.099929 0 OK
2002-0 Yes local/pve01 2023-03-27_08:36:53 pending 138.37906 0 OK
2102-0 Yes local/pve01 2023-03-26_16:33:45 pending 6.50503 0 OK
2104-0 Yes local/pve01 2023-03-26_16:10:57 2023-03-27_10:40:00 7.043028 0 OK
2106-0 Yes local/pve01 2023-03-26_16:12:34 2023-03-27_11:00:00 7.13516 0 OK
2108-0 Yes local/pve01 2023-03-26_16:15:32 2023-03-27_11:20:00 7.57505 0 OK
2112-0 Yes local/pve01 2023-03-26_16:20:49 2023-03-27_12:00:00 7.913801 0 OK
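For completeness, the job definitions live in the cluster-wide config, and each node keeps a local replication state file; the config path is the standard /etc/pve/replication.cfg, while the state file path below is an assumption on my part:
Code:
pve02 ~ # cat /etc/pve/replication.cfg
pve02 ~ # cat /var/lib/pve-manager/pve-replication-state.json    # assumed location of the local state file
A manual run of one of the pending jobs works without issues: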
Code:
pve02 ~ # pvesr run --id 2102-0 --verbose
start replication job
guest => VM 2102, running => 180321
volumes => DiscoSSD:vm-2102-disk-0
create snapshot '__replicate_2102-0_1679905976__' on DiscoSSD:vm-2102-disk-0
using secure transmission, rate limit: none
incremental sync 'DiscoSSD:vm-2102-disk-0' (__replicate_2102-0_1679841225__ => __replicate_2102-0_1679905976__)
send from @__replicate_2102-0_1679841225__ to DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__ estimated size is 655M
total estimated size is 655M
TIME SENT SNAPSHOT DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:01 37.6M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:02 136M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:03 230M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:04 318M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:05 407M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:06 499M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:07 578M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
successfully imported 'DiscoSSD:vm-2102-disk-0'
delete previous replication snapshot '__replicate_2102-0_1679841225__' on DiscoSSD:vm-2102-disk-0
(remote_finalize_local_job) delete stale replication snapshot '__replicate_2102-0_1679841225__' on DiscoSSD:vm-2102-disk-0
end replication job
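For what it's worth, restarting only the scheduler service (instead of rebooting the whole node) and re-checking a couple of minutes later needs nothing more than plain systemctl plus the pvesr status command shown above:
Code:
pve02 ~ # systemctl restart pvescheduler
pve02 ~ # sleep 120 && pvesr status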
Any ideas? Is it a bug?