Hello
I have two nodes in a cluster running the latest version of PVE (7.4-3), and since the last update the pvescheduler service has stopped working.
The service starts, but as soon as it runs the first disk replication task to the second node, the worker process dies and is left defunct:
Code:
pve02 ~ # ps aux | grep pvescheduler
root 1309422 0.0 0.0 336696 109292 ? Ss 08:36 0:00 pvescheduler
root 1309423 0.0 0.0 0 0 ? Z 08:36 0:00 [pvescheduler] <defunct>
root 1313066 0.0 0.0 0 0 ? Z 08:37 0:00 [pvescheduler] <defunct>
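If it helps, the unit's journal can be followed around a scheduler tick to catch whatever the forked worker reports before it goes defunct; these are plain systemd commands, nothing Proxmox-specific:
Code:
pve02 ~ # systemctl status pvescheduler
pve02 ~ # journalctl -u pvescheduler --since "1 hour ago"
pve02 ~ # journalctl -u pvescheduler -f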
It happens on both nodes, and I have already tried reinstalling the pve-manager package and restarting both nodes. Attaching strace to the main process only shows it sleeping in its one-minute loop:
Code:
pve02 ~ # strace -yyttT -f -s 512 -p 1309422
strace: Process 1309422 attached
10:28:48.551399 restart_syscall(<... resuming interrupted read ...>) = 0 <11.599147>
10:29:00.150793 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=60, tv_nsec=0}, 0x7ffe7fcf8f90) = 0 <60.000169>
10:30:00.151229 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=60, tv_nsec=0}, 0x7ffe7fcf8f90) = 0 <60.000191>
10:31:00.151667 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=60, tv_nsec=0}, 0x7ffe7fcf8f90) = 0 <60.000171>
10:32:00.152079 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=60, tv_nsec=0}, 0x7ffe7fcf8f90) = 0 <60.000128>
10:33:00.152368 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=60, tv_nsec=0}, 0x7ffe7fcf8f90) = 0 <60.000178>
10:34:00.152781 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=60, tv_nsec=0},
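So the parent just keeps its one-minute sleep loop. To see how the forked worker actually exits, the trace can be narrowed to the process-management syscalls (a generic strace filter, reusing the parent PID from the ps output above); the wait4 return should show the worker's exit status:
Code:
pve02 ~ # strace -f -ttT -e trace=clone,wait4,exit_group -p 1309422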
More info:
Code:
pve02 ~ # pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.102-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.3-3
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-3
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
openvswitch-switch: 2.15.0+ds1-2+deb11u2.1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.6.3
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20221111-2
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.11-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
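Since this only started after the last update, I also want to rule out a partially applied upgrade; these are stock apt commands to check for packages that are still upgradable or held back (the last one only if something shows up):
Code:
pve02 ~ # apt update
pve02 ~ # apt list --upgradable
pve02 ~ # apt full-upgrade
The cluster itself is quorate and looks healthy: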
Code:
pve01 ~ # pvecm status
Cluster information
-------------------
Name: PVECluster
Config Version: 2
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Mon Mar 27 10:29:43 2023
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.c7
Quorate: Yes
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.101.111
0x00000002 1 192.168.101.112 (local)
The replication tasks remain stuck in the pending state, but if I run them manually they complete fine:
Code:
pve02 ~ # pvesr status
JobID Enabled Target LastSync NextSync Duration FailCount State
1002-0 Yes local/pve01 2023-03-26_16:08:29 pending 11.099929 0 OK
2002-0 Yes local/pve01 2023-03-27_08:36:53 pending 138.37906 0 OK
2102-0 Yes local/pve01 2023-03-26_16:33:45 pending 6.50503 0 OK
2104-0 Yes local/pve01 2023-03-26_16:10:57 2023-03-27_10:40:00 7.043028 0 OK
2106-0 Yes local/pve01 2023-03-26_16:12:34 2023-03-27_11:00:00 7.13516 0 OK
2108-0 Yes local/pve01 2023-03-26_16:15:32 2023-03-27_11:20:00 7.57505 0 OK
2112-0 Yes local/pve01 2023-03-26_16:20:49 2023-03-27_12:00:00 7.913801 0 OK
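For completeness, the job definitions live in the cluster-wide config, and each node keeps a local replication state file; the config path is the standard /etc/pve/replication.cfg, while the state file path below is an assumption on my part:
Code:
pve02 ~ # cat /etc/pve/replication.cfg
pve02 ~ # cat /var/lib/pve-manager/pve-replication-state.json    # assumed location of the local state file
A manual run of one of the pending jobs works without issues: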
Code:
pve02 ~ # pvesr run --id 2102-0 --verbose
start replication job
guest => VM 2102, running => 180321
volumes => DiscoSSD:vm-2102-disk-0
create snapshot '__replicate_2102-0_1679905976__' on DiscoSSD:vm-2102-disk-0
using secure transmission, rate limit: none
incremental sync 'DiscoSSD:vm-2102-disk-0' (__replicate_2102-0_1679841225__ => __replicate_2102-0_1679905976__)
send from @__replicate_2102-0_1679841225__ to DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__ estimated size is 655M
total estimated size is 655M
TIME SENT SNAPSHOT DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:01 37.6M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:02 136M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:03 230M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:04 318M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:05 407M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:06 499M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
10:33:07 578M DiscoSSD/vm-2102-disk-0@__replicate_2102-0_1679905976__
successfully imported 'DiscoSSD:vm-2102-disk-0'
delete previous replication snapshot '__replicate_2102-0_1679841225__' on DiscoSSD:vm-2102-disk-0
(remote_finalize_local_job) delete stale replication snapshot '__replicate_2102-0_1679841225__' on DiscoSSD:vm-2102-disk-0
end replication job
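For what it's worth, restarting only the scheduler service (instead of rebooting the whole node) and re-checking a couple of minutes later needs nothing more than plain systemctl plus the pvesr status command shown above:
Code:
pve02 ~ # systemctl restart pvescheduler
pve02 ~ # sleep 120 && pvesr status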
Any ideas? Is it a bug?