Proxmox datacenter backup job repeatedly attempting to backup VMs that literally have never existed at any point in history

Maeve

Hey all, I'm at my wit's end.
A few days ago my node stopped backing up two of my containers. I had to update and reboot it to get them to cooperate. Pasted below is a list of all my VMs, CTs, and their IDs.

1684107349358.png

After I rebooted and verified that a single backup cooperated, I ran the backup job and it failed, saying it could not find VM 200. This makes total sense as... there was no VM 200. There had never been a VM 200. It's theoretically possible that I tried to create a VM or CT called 200 at some point but didn't finish it, but there is absolutely no actual VM 200.

It was fine for two days and then today's job did it again, with VM 152. There is no VM 152. When it was VM 200 I looked in the Backup Job scheduler and found nothing. This time it had an entry for VM 152 but it just said unknown. ???

I'm running a SMART check just to be sure, but like, what the hell? Does anyone have any idea why this is happening? Is it from VMs I went to create but didn't? Why would they show up in the backup job? Is it corruption? If so, how is it doing this specifically??
 
Hi,
please post the output of the following:
Bash:
pveversion -v
cat /etc/pve/jobs.cfg
journalctl -b -u pvescheduler.service
 
Interesting.

You did specify the backup jobs in the GUI via Datacenter --> Backups? Are you sure those jobs do not mention 152 or 200?

Go to arachna-pve --> Shell and search for them:

Code:
~# grep 152 /etc/pve/jobs.cfg
~# grep 152 /etc/pve/vzdump.cron
~# grep 200 /etc/pve/jobs.cfg
~# grep 200 /etc/pve/vzdump.cron

"jobs" is the new mechanism, "vzdump" is legacy. Maybe there is something...


Good luck
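
A plain grep 152 also matches longer numbers that merely contain those digits (1152, a timestamp, and so on). Below is a minimal sketch of a whole-word search over the same two files. The helper name find_vmid is made up for illustration, and the sketch assumes GNU grep as shipped with Debian; files that are missing or unreadable are skipped silently.

```bash
#!/usr/bin/env bash
# find_vmid ID FILE... (hypothetical helper, not part of Proxmox)
# Prints "file:line:text" for every line that contains ID as a whole word.
find_vmid() {
    local id=$1
    shift
    local f
    for f in "$@"; do
        # The legacy vzdump.cron may not exist on newer installs: skip it.
        [ -r "$f" ] || continue
        # -H file name, -n line number, -w whole-word match only.
        # "|| true" keeps a no-match result from counting as a failure.
        grep -Hnw -- "$id" "$f" || true
    done
}

find_vmid 152 /etc/pve/jobs.cfg /etc/pve/vzdump.cron
find_vmid 200 /etc/pve/jobs.cfg /etc/pve/vzdump.cron
```

No output means neither ID appears as a standalone number in either file.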
 
pveversion -v
Bash:
proxmox-ve: 7.4-1 (running kernel: 5.15.107-2-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.4-3
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.6
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-1
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

jobs.cfg:
1684177708522.png

journalctl command:

1684177773684.png

Not sure why 199 failed, but that's a real VM. It did back up, so idk what it's going on about there. 152 is the problem child.
 
Confirmed, nothing shows up from those greps. Weird.
 
Try to remove the backup job from the config and recreate it. Does the issue persist? ID 152 is showing up in the task log while it is not in the config, so it seems that something is out of sync here.
 
Hi,
can you also check the output of ps aux | grep pvescheduler and debsums -s pve-manager (you likely need to install debsums first)?
 
I removed the job and recreated it after the first incident. That fixed it until I eventually had to power cycle my machine, and I think it started again a day after that. I recently power cycled it again and it hasn't done it since, so my fingers are crossed, but I just wanna get to the bottom of this bizarre behavior in case it's indicative of something else.
 
Okay, looking through the code, I could not find any hint of what might cause such an issue, as the job config and guest IDs are read directly from their corresponding locations under the mounted pmxcfs at /etc/pve.
Could you please check the output of journalctl -u pve-cluster.service and the logs in general for any further hint?
 
I notice your mystery machine of ID 152 comes immediately after VM ID 150. Might I suggest what you are looking at is this:
10010110 -> Mysterious energetic bit flip -> 10011000

VM 199 to VM 200:
11000111 -> Mysterious energetic bit flip -> 11001000

Do you see my point? In each case the change is confined to the last 4 bits (the rightmost group), which end up as 1000.

So, the critical question, have you tested the memory with memtest86?
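
For anyone who wants to check the arithmetic, here is a small sketch that counts how many bits actually differ between two IDs. The helper name bits_changed is made up for illustration; it only uses bash's built-in integer arithmetic.

```bash
#!/usr/bin/env bash
# bits_changed A B (hypothetical helper): number of bit positions
# in which the integers A and B differ.
bits_changed() {
    local diff=$(( $1 ^ $2 ))   # XOR leaves a 1 wherever the bits differ
    local count=0
    while (( diff > 0 )); do
        count=$(( count + (diff & 1) ))   # add the lowest bit
        diff=$(( diff >> 1 ))             # then shift it out
    done
    echo "$count"
}

bits_changed 150 152   # 10010110 vs 10011000, prints 3
bits_changed 199 200   # 11000111 vs 11001000, prints 4
```

So going from 150 to 152 would take three flipped bits and 199 to 200 would take four, all inside the low nibble, rather than a single flipped bit in either case.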
 
Of course checking memory never hurts, but if it were a bit flip, you wouldn't have both 150 and 152 in the job, but only the flipped one. Note that all of the IDs, including the supposedly never existing 152, are already present in the starting new backup job log line.

EDIT: fixed, because the 200 is not actually there in the log above.
 
I only mentioned 200 because it's in OP's opening message. The similarity seemed interesting.