Some backups are not saved in the namespace

did you reload/restart the services? the on-disk perl modules are not re-read all the time, just when starting the service. a package upgrade triggers a reload automatically, but maybe something went wrong there (should be visible in the logs of the respective services).
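a quick sketch for checking whether the services actually reloaded (adjust the unit list as needed):
Code:
journalctl -u pveproxy -u pvedaemon -u pvescheduler | grep -iE 'reload|restart|start'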
 
As written in #18, I did a
Code:
systemctl reload-or-restart pveproxy pvedaemon pvescheduler

It's getting worse. Now a few more VMs (also from other nodes) do not respect the namespace (they were OK the previous times):

[screenshot attached: 1658386448117.png]

And I can still see:
Code:
file /etc/pve/storage.cfg line 158 (section 'pm-pbs01-ednt') - unable to parse value of 'namespace': unexpected property 'namespace'
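For reference, the section in question looks roughly like this (a sketch with placeholder values, not my real config):
Code:
pbs: pm-pbs01-ednt
        datastore store1
        server pbs.example.com
        username backup@pbs
        namespace customer1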

From syslog of node 175:

Code:
Jul 20 23:00:01 pm-cpu-175 pvescheduler[2087097]: <root@pam> starting task UPID:pm-cpu-175:001FD8BD:949F8726:62D86CD0:vzdump::root@pam:
Jul 20 23:00:01 pm-cpu-175 pmxcfs[3180503]: [status] notice: received log
Jul 20 23:00:02 pm-cpu-175 pmxcfs[3180503]: [status] notice: received log
Jul 20 23:00:03 pm-cpu-175 pvescheduler[2087101]: file /etc/pve/storage.cfg line 158 (section 'pm-pbs01-ednt') - unable to parse value of 'namespace': unexpected property 'namespace'
Jul 20 23:00:03 pm-cpu-175 pmxcfs[3180503]: [status] notice: received log
Jul 20 23:00:05 pm-cpu-175 pmxcfs[3180503]: [status] notice: received log
Jul 20 23:00:05 pm-cpu-175 pmxcfs[3180503]: [status] notice: received log
Jul 20 23:00:09 pm-cpu-175 pvescheduler[2087101]: INFO: starting new backup job: vzdump 118 246 140 141 199 128 145 181 1241 185 186 100 107 225 227 231 228 233 238 234 236 237 201 240 254 275 123 170 130 180 148 232 230 229 142 143 1113 149 152 162 164 171 129 139 158 244 252 108 138 174 183 182 188 190 189 222 241 255 161 196 178 251 114 113 --mailto edv@ednt.de --mailnotification always --quiet 1 --storage pm-pbs01-ednt --mode snapshot
Jul 20 23:00:09 pm-cpu-175 pmxcfs[3180503]: [status] notice: received log
Jul 20 23:00:09 pm-cpu-175 pvescheduler[2087101]: INFO: Starting Backup of VM 114 (qemu)
Jul 20 23:00:19 pm-cpu-175 pvestatd[1268]: VM 114 qmp command failed - VM 114 qmp command 'query-proxmox-support' failed - got timeout
Jul 20 23:00:22 pm-cpu-175 pvestatd[1268]: status update time (8.589 seconds)
Jul 20 23:00:24 pm-cpu-175 pvedaemon[1256935]: worker exit
Jul 20 23:00:25 pm-cpu-175 pvedaemon[1291]: worker 1256935 finished
Jul 20 23:00:25 pm-cpu-175 pvedaemon[1291]: starting 1 worker(s)
Jul 20 23:00:25 pm-cpu-175 pvedaemon[1291]: worker 2088145 started
Jul 20 23:00:28 pm-cpu-175 systemd[1]: Starting CheckMK (192.168.254.62:58770)...
Jul 20 23:00:29 pm-cpu-175 pvestatd[1268]: VM 114 qmp command failed - VM 114 qmp command 'query-proxmox-support' failed - unable to connect to VM 114 qmp socket - timeout after 31 retries
Jul 20 23:00:31 pm-cpu-175 pvestatd[1268]: status update time (8.505 seconds)
Jul 20 23:00:41 pm-cpu-175 pvestatd[1268]: status update time (8.134 seconds)
Jul 20 23:03:04 pm-cpu-175 pveproxy[1309665]: worker exit
Jul 20 23:03:05 pm-cpu-175 pveproxy[1300]: worker 1309665 finished
Jul 20 23:03:05 pm-cpu-175 pveproxy[1300]: starting 1 worker(s)
Jul 20 23:03:05 pm-cpu-175 pveproxy[1300]: worker 2097654 started
Jul 20 23:04:34 pm-cpu-175 pmxcfs[3180503]: [status] notice: received log
Jul 20 23:04:45 pm-cpu-175 pvestatd[1268]: status update time (12.360 seconds)
Jul 20 23:08:45 pm-cpu-175 pvescheduler[2087101]: INFO: Finished Backup of VM 114 (00:08:36)
Jul 20 23:08:45 pm-cpu-175 pvescheduler[2087101]: INFO: Starting Backup of VM 225 (lxc)
 
please do a hard "restart" of pvescheduler while no backups or replication jobs are running: systemctl restart pvescheduler
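a sketch to verify afterwards (when idle, exactly one instance should be listed):
Code:
pgrep -a pvescheduler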
 
I did this
Code:
systemctl restart pvescheduler
on Friday.
Still the same message in the log of the backup job:
Code:
file /etc/pve/storage.cfg line 158 (section 'pm-pbs01-ednt') - unable to parse value of 'namespace': unexpected property 'namespace'
 
please post the full output of journalctl -b -u pvescheduler as well as sha256sum /usr/share/perl5/PVE/Storage/PBSPlugin.pm. thanks!
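as an extra cross-check (a sketch; assuming PBSPlugin.pm ships with libpve-storage-perl), dpkg can verify the on-disk files against the package's checksums:
Code:
dpkg -V libpve-storage-perl   # prints nothing if all files match the packaged versions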
 
journalctl (from the day when I restarted pvescheduler):
Code:
Jul 21 18:04:31 pm-cpu-175 systemd[1]: Stopping Proxmox VE scheduler...
Jul 21 18:04:32 pm-cpu-175 pvescheduler[3508258]: received signal TERM
Jul 21 18:04:32 pm-cpu-175 pvescheduler[3508258]: got shutdown request, signal running jobs to stop
Jul 21 18:04:32 pm-cpu-175 pvescheduler[3508258]: server stopped
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Succeeded.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Unit process 2916892 (pvescheduler) remains running after unit stopped.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Unit process 4092644 (pvescheduler) remains running after unit stopped.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Unit process 2382504 (pvescheduler) remains running after unit stopped.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Unit process 1875725 (pvescheduler) remains running after unit stopped.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Unit process 3305337 (pvescheduler) remains running after unit stopped.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Unit process 1723109 (pvescheduler) remains running after unit stopped.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Unit process 4072462 (pvescheduler) remains running after unit stopped.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Unit process 2780168 (pvescheduler) remains running after unit stopped.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Unit process 126214 (pvescheduler) remains running after unit stopped.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: Stopped Proxmox VE scheduler.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Consumed 12min 34.146s CPU time.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Found left-over process 2916892 (pvescheduler) in control group while starting unit. Ignoring.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Found left-over process 4092644 (pvescheduler) in control group while starting unit. Ignoring.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Found left-over process 2382504 (pvescheduler) in control group while starting unit. Ignoring.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Found left-over process 1875725 (pvescheduler) in control group while starting unit. Ignoring.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Found left-over process 3305337 (pvescheduler) in control group while starting unit. Ignoring.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Found left-over process 1723109 (pvescheduler) in control group while starting unit. Ignoring.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Found left-over process 4072462 (pvescheduler) in control group while starting unit. Ignoring.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Found left-over process 2780168 (pvescheduler) in control group while starting unit. Ignoring.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: pvescheduler.service: Found left-over process 126214 (pvescheduler) in control group while starting unit. Ignoring.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 21 18:04:38 pm-cpu-175 systemd[1]: Starting Proxmox VE scheduler...
Jul 21 18:04:39 pm-cpu-175 pvescheduler[1548108]: starting server
Jul 21 18:04:39 pm-cpu-175 systemd[1]: Started Proxmox VE scheduler.
Jul 21 23:00:02 pm-cpu-175 pvescheduler[2487276]: <root@pam> starting task UPID:pm-cpu-175:0025F431:95235D75:62D9BE51:vzdump::root@pam:
Jul 21 23:00:04 pm-cpu-175 pvescheduler[2487345]: file /etc/pve/storage.cfg line 158 (section 'pm-pbs01-ednt') - unable to parse value of 'namespace': unexpected property >
Jul 21 23:00:10 pm-cpu-175 pvescheduler[2487345]: INFO: starting new backup job: vzdump 118 246 140 141 199 128 145 181 1241 185 186 100 107 225 227 231 228 233 238 234 23>
Jul 21 23:00:10 pm-cpu-175 pvescheduler[2487345]: INFO: Starting Backup of VM 114 (qemu)
Jul 21 23:04:27 pm-cpu-175 pvescheduler[2487345]: INFO: Finished Backup of VM 114 (00:04:17)
Jul 21 23:04:28 pm-cpu-175 pvescheduler[2487345]: INFO: Starting Backup of VM 225 (lxc)
Jul 21 23:05:14 pm-cpu-175 pvescheduler[2487345]: INFO: Finished Backup of VM 225 (00:00:46)
Jul 21 23:05:14 pm-cpu-175 pvescheduler[2487345]: INFO: Starting Backup of VM 227 (qemu)
Jul 21 23:06:08 pm-cpu-175 pvescheduler[2487345]: INFO: Finished Backup of VM 227 (00:00:54)
Jul 21 23:06:08 pm-cpu-175 pvescheduler[2487345]: INFO: Starting Backup of VM 228 (qemu)
Jul 21 23:07:44 pm-cpu-175 pvescheduler[2487345]: INFO: Finished Backup of VM 228 (00:01:36)
Jul 21 23:07:44 pm-cpu-175 pvescheduler[2487345]: INFO: Starting Backup of VM 233 (lxc)
Jul 21 23:09:22 pm-cpu-175 pvescheduler[2487345]: INFO: Finished Backup of VM 233 (00:01:38)
Jul 21 23:09:22 pm-cpu-175 pvescheduler[2487345]: INFO: Starting Backup of VM 234 (qemu)
Jul 21 23:09:32 pm-cpu-175 pvescheduler[2487345]: INFO: Starting Backup of VM 236 (qemu)
Jul 21 23:09:44 pm-cpu-175 pvescheduler[2487345]: INFO: Finished Backup of VM 236 (00:00:12)
Jul 21 23:09:44 pm-cpu-175 pvescheduler[2487345]: INFO: Starting Backup of VM 237 (qemu)
Jul 21 23:10:08 pm-cpu-175 pvescheduler[2487345]: INFO: Finished Backup of VM 237 (00:00:24)
Jul 21 23:10:08 pm-cpu-175 pvescheduler[2487345]: INFO: Starting Backup of VM 238 (qemu)
Jul 21 23:10:18 pm-cpu-175 pvescheduler[2487345]: INFO: Finished Backup of VM 238 (00:00:10)
Jul 21 23:10:18 pm-cpu-175 pvescheduler[2487345]: INFO: Starting Backup of VM 240 (qemu)
Jul 21 23:10:45 pm-cpu-175 pvescheduler[2487345]: INFO: Finished Backup of VM 240 (00:00:27)
Jul 21 23:10:46 pm-cpu-175 pvescheduler[2487345]: INFO: Starting Backup of VM 275 (qemu)
Jul 21 23:14:40 pm-cpu-175 pvescheduler[2487345]: INFO: Finished Backup of VM 275 (00:03:54)
Jul 21 23:14:40 pm-cpu-175 pvescheduler[2487345]: INFO: Backup job finished successfully

But sometimes I can see:
Code:
Jul 22 20:35:14 pm-cpu-175 pvescheduler[2424789]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Jul 22 20:35:14 pm-cpu-175 pvescheduler[2424792]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
But those occur outside the backup window.

The sha256 and the file date:
Code:
root@pm-cpu-175:~# sha256sum /usr/share/perl5/PVE/Storage/PBSPlugin.pm
8191295b490f81a75c254c97d2147d6bb96af766ff791ca87549a362a4013a66  /usr/share/perl5/PVE/Storage/PBSPlugin.pm
root@pm-cpu-175:~# ls -al /usr/share/perl5/PVE/Storage/PBSPlugin.pm
-rw-r--r-- 1 root root 25331 Jul 15 13:36 /usr/share/perl5/PVE/Storage/PBSPlugin.pm
 
yeah, as you can see in the journal output *something* is preventing old pvescheduler processes from exiting, so those still have the old code loaded.

easiest way to get out of that is likely to just reboot the node. if that is not an option, kill the offending processes.
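for example (PIDs taken from the systemd messages in your journal output above):
Code:
kill 2916892 4092644 2382504 1875725 3305337 1723109 4072462 2780168 126214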

edit: if possible, it might be interesting to get the child processes of the outdated pvescheduler processes:

Code:
ps faxl > process_tree

and then either trim the process_tree file or post the full contents ;)
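a sketch for telling the systemd-managed instance apart from the stale ones:
Code:
systemctl show -p MainPID pvescheduler   # the PID systemd currently manages
pgrep -a pvescheduler                    # anything else is a left-over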
 
There is no tree, only single processes:
Code:
F   UID     PID    PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND


1     0 2916892       1  20   0 333232 19788 hrtime Ss   ?         39:05 pvescheduler
4     0  608389       1  20   0  15504  5396 do_epo Ss   ?          0:02 /lib/systemd/systemd --user
5     0  608390  608389  20   0 170704     4 do_sig S    ?          0:00  \_ (sd-pam)
1     0 4092644       1  20   0 333716 19596 hrtime Ss   ?         37:00 pvescheduler
1     0 2382504       1  20   0 333692 19664 hrtime Ss   ?         37:01 pvescheduler
4     0 3282638       1 -100  - 617752 224144 do_epo SLsl ?       14299:47 /usr/sbin/corosync -f
1     0 1875725       1  20   0 333756 19940 hrtime Ss   ?         33:47 pvescheduler
1     0 3305337       1  20   0 333668 19632 hrtime Ss   ?         31:56 pvescheduler
1     0 1723109       1  20   0 333740 19780 hrtime Ss   ?         32:41 pvescheduler
1     0 4072462       1  20   0 333740 19764 hrtime Ss   ?         29:12 pvescheduler
1     0 2780168       1  20   0 324768 20212 hrtime Ss   ?         21:13 pvescheduler
4     0  527267       1  20   0   4548  2228 do_epo Ss   ?         17:08 /usr/bin/lxc-start -F -n 109
4 100000 527348  527267  20   0 169576  5584 do_epo Ss   ?          0:48  \_ /sbin/init
4 100000 528079  527348  20   0  54664 15580 do_epo Ss   ?          0:24      \_ /lib/systemd/systemd-journald
4 100000 528115  527348  20   0   8508  1628 hrtime Ss   ?          0:15      \_ /usr/sbin/cron -f
4 100111 528119  527348  20   0   8416  2404 do_sys Ss   ?          1:25      \_ avahi-daemon: running [prog-64-deb10.local]



1     0  126214       1  20   0 325816 22768 hrtime Ss   ?          4:56 pvescheduler
4     0  139605       1  20   0   5784  1340 do_sel Ss+  tty1       0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux



1     0 1548108       1  20   0 325844 64328 hrtime Ss   ?          0:42 pvescheduler
5     0 3873339       1  20   0  80140  2088 do_sys Ssl  ?          0:04 /usr/sbin/pvefw-logger

I will now kill all pvescheduler processes.
 
thanks for the output! rather strange..
 
stop does nearly nothing:
Code:
root@pm-cpu-175:~# systemctl stop pvescheduler
root@pm-cpu-175:~# ps ax|grep pvesched
 126214 ?        Ss     4:56 pvescheduler
1723109 ?        Ss    32:41 pvescheduler
1875725 ?        Ss    33:47 pvescheduler
2224202 ?        S      0:00 pvescheduler
2224204 pts/0    S+     0:00 grep pvesched
2382504 ?        Ss    37:01 pvescheduler
2780168 ?        Ss    21:13 pvescheduler
2916892 ?        Ss    39:05 pvescheduler
3305337 ?        Ss    31:56 pvescheduler
4072462 ?        Ss    29:12 pvescheduler
4092644 ?        Ss    37:00 pvescheduler

kill -9 did it:
Code:
root@pm-cpu-175:~# kill -9 126214
root@pm-cpu-175:~# kill -9 1723109
root@pm-cpu-175:~# kill -9 1875725
root@pm-cpu-175:~# kill -9 2382504
root@pm-cpu-175:~# kill -9 2780168
root@pm-cpu-175:~# kill -9 2916892
root@pm-cpu-175:~# kill -9 3305337
root@pm-cpu-175:~# kill -9 4072462
root@pm-cpu-175:~# kill -9 4092644
root@pm-cpu-175:~# ps ax|grep pvesched
2234228 pts/0    S+     0:00 grep pvesched
root@pm-cpu-175:~#

Now only 1 process is running:
Code:
root@pm-cpu-175:~# systemctl start pvescheduler
root@pm-cpu-175:~# ps ax|grep pvesched
2238997 ?        Ss     0:00 pvescheduler
2239367 pts/0    S+     0:00 grep pvesched
So I'll wait until tomorrow.
 
It's working now as expected.
Bash:
kill -9 xxxx
was not needed.
Bash:
killall pvescheduler
terminates all processes.

I found more PVE nodes where multiple instances of pvescheduler were 'running'.

How can this happen?
And why doesn't starting pvescheduler simply terminate all the old running instances?

Now I will try to move the second customer's servers to another namespace.

Thank you for the continuous help and for tracking down the problems.
 
it obviously shouldn't ;) could you give the full output of journalctl -b -u pvescheduler on one of the affected nodes (if possible, one where the journal is still complete for the whole boot and not already partly rotated/gone). thanks!
 
You are lucky, I found a node where this is still the case:
Code:
root@pm-cpu-173:~# ps ax|grep pvesche
 390618 ?        Ss    61:21 pvescheduler
1271153 ?        Ss    29:05 pvescheduler
2016484 ?        Ss    50:23 pvescheduler
2711818 ?        Ss     6:58 pvescheduler
3602118 ?        Ss    43:26 pvescheduler
3946147 ?        Ss    58:57 pvescheduler
4010062 ?        Ss    44:56 pvescheduler
4076548 ?        Ss    45:45 pvescheduler
4125167 ?        Ss    51:27 pvescheduler
I attached the journal entries.

At the moment the following machines are on this node:
Code:
vm/180
vm/241
vm/247
vm/304
vm/305
vm/312
vm/313
vm/1113
vm/3303
vm/3310

You should also allow tgz files as attachments.

The attached journal does not go back as far as the last reboot.
But I hope it is helpful.
 

Attachments

  • pvescheduler.zip (95 KB)
thanks! if all your nodes have been running this long without a reboot, this might be the culprit - see pve-manager changelog for 7.1-6/-7 from November 2021:

Code:
pve-manager (7.1-7) bullseye; urgency=medium

  * pvescheduler: fix potential stall on full shutdown

  * api: journal: stream the syslog (journal) data directly to the client to
    avoid loading a potentially big amount of text first in memory

 -- Proxmox Support Team <support@proxmox.com>  Wed, 24 Nov 2021 18:32:51 +0100

pve-manager (7.1-6) bullseye; urgency=medium

  * ui: dc/OpenID Connect: make username-claim field editable for arbitrary
    values

  * ui: dc/OpenID Connect realm: allow one to edit scopes and prompt

  * pvescheduler: improve error handling with some forked worker

  * pvescheduler: reworking child pid tracking and do not delay reload
    artificially

  * pvescheduler: implement graceful reloading

 -- Proxmox Support Team <support@proxmox.com>  Mon, 22 Nov 2021 20:32:21 +0100
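
to check which pve-manager version a node is actually running:
Code:
pveversion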
 
I've had the same issue with some VMs not backing up, which bugged me for several days until I noticed on the status screen that the pvescheduler service was dead. I tried starting it from both the web GUI and the CLI. I also noticed that replication had stopped working. Since I figured I needed to reboot the server anyway, I did a bulk shutdown of the VMs/CTs, and as soon as I did that, replication started working again. So it makes me wonder: was there a lock on one of the VMs that didn't clear during replication? Nothing in the logs indicates that.

I rebooted the server anyway just to "reset" things. Wouldn't hurt.

Wanted to share my experience with this issue.
 
Since I had no replication running,

it was enough to run killall pvescheduler until no pvescheduler process was left,
then service pvescheduler start,
and finally migrate the VMs away and back again, so that the new libpbs gets used.

A reboot was not needed in my case.
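
In short, the recovery sequence that worked here (a sketch; only do this while no backup or replication jobs are running):
Bash:
killall pvescheduler                                # repeat until pgrep finds nothing
pgrep pvescheduler || systemctl start pvescheduler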
 
