[SOLVED] Containers and VM .conf files keep getting reverted

Jan 24, 2020
I recently upgraded to Proxmox VE 8.0 and noticed that after rebooting, my LXC containers and VMs had been reverted to older versions of their configurations. I'm not sure of the details, but things like mount points had changed, and some instances wouldn't start because the root disk pointed to old storage locations that no longer exist. After updating the configs manually I was able to get things working, but every time I reboot they revert back. Even stranger, I rebooted yesterday and restored my configurations along with /etc/pve/storage.cfg, and this morning, on the running system, all of these configs had been reverted again.

Old VMs and LXC containers that I had recently deleted have also come back, powered off.

Can anyone please help me identify what I'm doing wrong? Maybe there's something wrong with my storage config, or something with PBS which I have scheduled for backups but not for restores. Any ideas what I should be looking for, or which logs to check?
 
Hi,

as a first starting point, you should check the pve-cluster service, since it provides the Proxmox cluster filesystem [0], where all of the config files are stored. You can run systemctl status pve-cluster.service to check the service state and journalctl -b -u pve-cluster.service to get the systemd journal entries for that service.

Is this node part of a cluster? If so, please also share the output of pvecm status from that node.

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pmxcfs
 
Thanks. This is just a single PVE node, not part of any cluster. The service status shows it as active/running, but only since this morning at 8:25 AM. The output of journalctl -b -u pve-cluster.service shows that it was started yesterday around 4:20 PM, which is when I last rebooted.

It seems like something triggered a restart of the pve-cluster service; could that be related to the issue I'm seeing?
 
Please share your outputs (ideally in code tags for readability) so we can catch potential errors/issues. You can attach the journal since boot by running journalctl -b > journal.txt and adding the file as an attachment.
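If you want to pre-scan such a dump yourself before posting it, a small helper along these lines can surface pmxcfs-related problems. Note that scan_journal and its keyword list are just my own illustration, not an official tool:

```shell
#!/bin/sh
# Sketch: scan a saved journal dump for pmxcfs/pve-cluster problems.
# Assumes a dump created with: journalctl -b > journal.txt
# The function name and keyword list are illustrative only.
scan_journal() {
    # Keep only pmxcfs/pve-cluster lines, then look for error-level keywords.
    grep -E 'pmxcfs|pve-cluster' "$1" | grep -iE 'error|fail|crit' \
        || echo "no pmxcfs errors found in $1"
}
```

Usage: scan_journal journal.txt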
 
Sorry, I should have just posted the full output.

Code:
root@pve:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Mon 2023-10-02 08:25:39 EDT; 1h 43min ago
    Process: 3531782 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 3531801 (pmxcfs)
      Tasks: 7 (limit: 309370)
     Memory: 46.6M
        CPU: 12.466s
     CGroup: /system.slice/pve-cluster.service
             └─3531801 /usr/bin/pmxcfs

Code:
root@pve:~# journalctl -b -u pve-cluster.service
Oct 01 16:20:17 pve systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Oct 01 16:20:17 pve pmxcfs[6072]: [main] notice: resolved node name 'pve' to '192.168.1.15' for default node IP address
Oct 01 16:20:17 pve pmxcfs[6072]: [main] notice: resolved node name 'pve' to '192.168.1.15' for default node IP address
Oct 01 16:20:18 pve systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
 

Attachments

  • journalctl.txt
    189.3 KB
So there is nothing in the logs indicating an issue with the Proxmox cluster filesystem, although they seem to be incomplete; I only see entries up until Oct 01 19:23:03. Are these all the entries, or did the rest get cut off somehow?

Does the issue persist? I suspect that the cluster filesystem was not mounted correctly at some point and config files were written to the underlying filesystem, although in that case I would expect the pmxcfs mount to fail. Did you encounter any issues/errors during the upgrade?

Edit: Please also provide the output of mount | grep /etc/pve
 
Those are all the logs from the output of journalctl -b > journalctl.txt

The issue seems to occur any time I reboot, which has happened 3 times, including after the upgrade a couple of weeks ago. I don't recall any issues during the upgrade, other than noticing this problem and resolving it manually afterward.

Code:
root@pve:~# mount | grep /etc/pve
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
 
So everything looks okay; the pmxcfs is mounted correctly. You can try the following:
  • Enable debug mode by running echo "1" > /etc/pve/.debug, which produces quite a bit of debug output in the journal
  • Modify a config file under /etc/pve
  • Disable debug mode again with echo "0" > /etc/pve/.debug
  • Generate an up-to-date systemd journal dump including the debug logs from above: journalctl -b > journalctl.txt
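The steps above can be sketched as a small script. DEBUG_FLAG is parameterized here only so the helpers can be tried outside a PVE node; on the real system it is /etc/pve/.debug:

```shell
#!/bin/sh
# Sketch of the debug-capture procedure; not an official tool.
# DEBUG_FLAG defaults to the real pmxcfs debug flag file; it is a
# variable here only so the helpers can be exercised elsewhere.
DEBUG_FLAG="${DEBUG_FLAG:-/etc/pve/.debug}"

enable_pmxcfs_debug()  { echo "1" > "$DEBUG_FLAG"; }
disable_pmxcfs_debug() { echo "0" > "$DEBUG_FLAG"; }

# Typical session on the node itself:
#   enable_pmxcfs_debug
#   ...modify a config file under /etc/pve...
#   disable_pmxcfs_debug
#   journalctl -b > journalctl.txt
```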
 
A few more suggestions from a colleague on where to look further:
  • Do you use ZFS for the root filesystem and have custom snapshot scripts which might interfere?
  • Do you use Ansible/Terraform to provision VMs and might have a script overwriting the configs?
  • Do you get different output from stat before/after the reboot for the pmxcfs database at /var/lib/pve-cluster/config.db and/or the config files under /etc/pve? You can monitor the output of stat /var/lib/pve-cluster/config.db.
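For the stat comparison, one way is to snapshot the interesting fields to a file before the reboot and diff afterwards. snapshot_stat is just an illustrative name, and GNU stat is assumed:

```shell
#!/bin/sh
# Sketch: record modification time, size and inode of a file so it can
# be compared across a reboot. Illustrative only, not an official tool.
snapshot_stat() {
    # GNU stat format: %y = last modification, %s = size in bytes, %i = inode
    stat -c '%y %s %i' "$1"
}

# Typical use on the node:
#   snapshot_stat /var/lib/pve-cluster/config.db > /root/config.db.stat.before
#   ...reboot...
#   snapshot_stat /var/lib/pve-cluster/config.db | diff /root/config.db.stat.before -
```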
 
And another suggestion from a colleague:
  • Do you maybe have multiple PVE installations on different disks? It might be that the system boots from the wrong one. Please post the output of lsblk -o +FSTYPE as well as cat /etc/fstab. Do you use a RAID controller?
 
I enabled debugging and modified a config under /etc/pve and attached the resulting journalctl.txt file. It does seem to include up to date information this time.
I don't use ZFS on the root filesystem, it's EXT4.
I don't use Ansible or any other system to automate provisioning of configurations at this point.
I haven't had a chance to try rebooting again so I can't currently answer the question about the pmxcfs database.
I don't have more than one installation of proxmox and I don't use a RAID controller.

Code:
root@pve:~# cat /etc/fstab
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/pve/root / ext4 errors=remount-ro 0 1
/tank/vzdump /mnt/vzdump none defaults,bind 0 0
/dev/pve/swap none swap sw 0 0
proc /proc proc defaults 0 0
/tank/pve-var-log /var/log none defaults,bind 0 0
 

Attachments

  • journalctl.txt
    193 KB
/tank/pve-var-log /var/log none defaults,bind 0 0
Hmm, I'm not completely sure, but this might cause issues if tank (a ZFS pool, I assume) is mounted after some systemd services have already opened files on the underlying filesystem. This might also explain why the logs are not up to date: you are reading them from the filesystem mounted on top, I assume, while systemd holds a handle to the one below. And it would explain why the first line in the logs after what you posted before is systemd[1]: var-log.mount: Deactivated successfully.

It would also explain why there are no entries from the pmxcfs at all, even in debug mode.

Try to unmount this and remove the entry from the fstab, then reboot and check again, including an up-to-date journal dump and the output of mount.

Also, I found var-lib-pve\x2dcluster.mount: Deactivated successfully and var-folder2ram-var-lib-pve\x2dcluster.mount: Deactivated successfully. in the logs... Do you mount some ramfs on /var? That will of course not work, as the pmxcfs backing database is stored there. So remove this as well and you should be fine again.
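To see whether anything is mounted on top of those paths, findmnt can help; on a default single-disk install, /var/log and /var/lib/pve-cluster should resolve to the root filesystem (e.g. /dev/mapper/pve-root), not a ramfs/tmpfs or folder2ram mount. check_shadow_mounts is just an illustrative wrapper:

```shell
#!/bin/sh
# Sketch: show which mount each given path actually lives on.
# Illustrative wrapper around findmnt, not an official tool.
check_shadow_mounts() {
    for p in "$@"; do
        # -n: no header, -T: report the mount containing the given path
        findmnt -n -T "$p"
    done
}

# Typical use on the node:
#   check_shadow_mounts /var/log /var/lib/pve-cluster
```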
 
Ok, I removed the /var/log mount in /etc/fstab and uninstalled the folder2ram service. I had been trying to get Proxmox to write logs somewhere besides the SSD it's installed on to reduce write wear, but clearly wasn't doing it correctly. After rebooting, I ran journalctl -b > journalctl.txt again (attached).

Code:
root@pve:~# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=131998008k,nr_inodes=32999502,mode=755,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=26406672k,mode=755,inode64)
/dev/mapper/pve-root on / type ext4 (rw,relatime,errors=remount-ro)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k,inode64)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=43478)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
ramfs on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
ramfs on /run/credentials/systemd-tmpfiles-setup-dev.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
nfsd on /proc/fs/nfsd type nfsd (rw,relatime)
/dev/mapper/pve-root on /mnt/vzdump type ext4 (rw,relatime,errors=remount-ro)
ramfs on /run/credentials/systemd-sysctl.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
nvme1tb on /nvme1tb type zfs (rw,noatime,xattr,noacl)
nvme1tb/subvol-1001-disk-0 on /nvme1tb/subvol-1001-disk-0 type zfs (rw,noatime,xattr,posixacl)
nvme1tb/subvol-303-disk-0 on /nvme1tb/subvol-303-disk-0 type zfs (rw,noatime,xattr,posixacl)
nvme1tb/subvol-202-disk-0 on /nvme1tb/subvol-202-disk-0 type zfs (rw,noatime,xattr,posixacl)
nvme1tb/subvol-304-disk-0 on /nvme1tb/subvol-304-disk-0 type zfs (rw,noatime,xattr,posixacl)
nvme1tb/subvol-100-disk-0 on /nvme1tb/subvol-100-disk-0 type zfs (rw,noatime,xattr,posixacl)
nvme1tb/subvol-302-disk-0 on /nvme1tb/subvol-302-disk-0 type zfs (rw,noatime,xattr,posixacl)
backup1 on /backup1 type zfs (rw,xattr,noacl)
backup2 on /backup2 type zfs (rw,xattr,noacl)
usbhotbackup on /usbhotbackup type zfs (rw,xattr,noacl)
tank on /tank type zfs (rw,noatime,xattr,noacl)
tank/pve-var-log on /tank/pve-var-log type zfs (rw,noatime,xattr,noacl)
tank/iso on /tank/isos type zfs (rw,noatime,xattr,noacl)
tank/diskimages on /tank/diskimages type zfs (rw,noatime,xattr,noacl)
tank/k8snfs on /tank/k8snfs type zfs (rw,noatime,xattr,noacl)
tank/nextcloud on /tank/nextcloud type zfs (rw,noatime,xattr,noacl)
tank/cloudstorage on /tank/cloudstorage type zfs (rw,noatime,xattr,noacl)
tank/fileserver on /tank/fileserver type zfs (rw,noatime,xattr,noacl)
tank/smbshare on /tank/smbshare type zfs (rw,noatime,xattr,noacl)
tank/vzdump on /tank/vzdump type zfs (rw,noatime,xattr,noacl)
tank/pbs on /tank/pbs type zfs (rw,noatime,xattr,noacl)
ramfs on /run/credentials/systemd-tmpfiles-setup.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=26406668k,nr_inodes=6601667,mode=700,inode64)

After the reboot, a couple of my VMs failed to start with this message
TASK ERROR: timeout: no zvol device link for 'vm-128-disk-0' found after 300 sec found.
I ran zvol_wait manually and it looks OK:
Code:
root@pve:~# zvol_wait
Testing 4 zvol links
All zvol links are now present.

I checked systemctl list-units --type=service and saw this:
zfs-volume-wait.service loaded failed failed Wait for ZFS Volume (zvol) links in /dev

I was able to restore the 2 VMs from backup and they are now working as expected.
 

Attachments

  • journalctl.txt
    292.8 KB
Ok, I removed the /var/log mount in /etc/fstab and uninstalled the folder2ram service.
So this was indeed the root cause of your issues, and the systemd journal looks complete now, too. Running all of /var in RAM will cause issues, as many files located there need to be persistent, one of them being the pmxcfs backing SQLite database.
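As an aside: if the goal was to reduce SSD writes from logging, a less invasive approach than remounting /var is to constrain the systemd journal itself via /etc/systemd/journald.conf. The values below are just an illustration, and this only covers the journal, not other writers to /var/log such as rsyslog:

```ini
# /etc/systemd/journald.conf -- example fragment
[Journal]
# Cap the disk space the persistent journal may use
SystemMaxUse=100M
# Alternatively, keep the journal in RAM only (lost on every reboot):
#Storage=volatile
```

Apply with systemctl restart systemd-journald afterwards.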
 
