Is PBS killing Windows or is Windows killing PBS?

approximater

New Member
Aug 26, 2025
We have a Proxmox server running one Windows Server 2022 VM and one VM running PBS. The Proxmox server occasionally filled up its 32GB disk with log files, so I set up a datastore on the Proxmox host and shared it out to PBS via NFS. At the same time I set up another NFS share on the Proxmox host for backup storage, as backing up to an external Synology NAS (also via NFS) was glacial.
About 10 days ago the hard drive filled up on the Proxmox host, and not only did the backup fail, the Windows server became corrupted and wouldn't boot, saying
C:\Windows\System32\config\SYSTEM was corrupted. Obviously we couldn't locate the file because it was hidden, and who would ever need to restore a system file, right? If Windows marks it as hidden I guess the smart thing to do is not make it findable in the backups. So we did a full restore from the NAS and 30 hours later had our Windows server running again. I changed the backup to only store 1 copy on the local drive and sync that to the NAS, which stores the last 4 backups, and then last night it happened again.
Only this time the single backup was taking just ~600GB of space and apparently completed successfully at 10pm, but the PBS logs share had a 300GB+ directory in it that filled the drive sometime between midnight and 7am this morning, and once again the Windows server is refusing to boot with a missing/corrupted C:\Windows\System32\config\SYSTEM.
Restoring from the hard drive only took 49 minutes this time, but this is about the 4th time an errant PBS process has completely filled the hard drive with gigabytes of logs and the second time it has destroyed a VM. At this point it's getting embarrassing, as I was the one that recommended PBS as suitable for production and it's clearly not. Unfortunately I haven't been able to study the logs when they fill up, as getting the server working has been the priority and I've had to delete them. This time I've set up a quota of 20GB on the logs dataset and will be able to give it a fresh dataset without deleting next time it happens, but I'm wondering if anybody knows why PBS would generate 300GB of logs in one night and/or why running out of disk space ends up corrupting files on a VM?

Additional information that may or may not be pertinent: the Windows server completed several Windows updates when it was restored from last night's backup, and the server is blocked from sending SMTP, so both Proxmox and PBS have been configured to send notifications via webhooks. These have been successfully sending notifications about backups, syncs and garbage collection with no indication of impending disaster.
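Since the backup, sync and garbage collection notifications gave no warning, a separate free-space check is one way to get an early signal. The following is only a minimal sketch: the function name, the 15% threshold and the "/" path are assumptions for illustration, and the resulting message would still have to be wired into whatever webhook is already in use.

```python
import shutil


def free_space_warning(path, min_free_fraction=0.15):
    """Return a warning string if free space on `path` is below the threshold, else None.

    The 15% default is an arbitrary illustrative threshold, not a Proxmox recommendation.
    """
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    if free_fraction < min_free_fraction:
        return (
            f"LOW DISK SPACE on {path}: "
            f"{usage.free / 1e9:.1f} GB free of {usage.total / 1e9:.1f} GB "
            f"({free_fraction:.0%})"
        )
    return None


if __name__ == "__main__":
    # "/" is a placeholder; point this at the filesystem that holds the VM disks.
    message = free_space_warning("/")
    print(message or "disk space OK")
```

Run from cron every few minutes, a check like this would flag a filling disk before the VM storage runs out, independently of whether any backup job reports an error.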
 
Hi,
We have a Proxmox server running one Windows Server 2022 VM and one VM running PBS. The Proxmox server occasionally filled up its 32GB disk with log files, so I set up a datastore on the Proxmox host and shared it out to PBS via NFS. At the same time I set up another NFS share on the Proxmox host for backup storage, as backing up to an external Synology NAS (also via NFS) was glacial.
What version of PBS are you running, and did you check which log files fill up the disk? Also, please note that running PBS with a datastore located on an NFS share is problematic for performance. Further, I assume your NFS share is backed by spinning disks? Neither copes well with PBS, as IO will be your main bottleneck.
About 10 days ago the hard drive filled up on the Proxmox host, and not only did the backup fail, the Windows server became corrupted and wouldn't boot, saying
C:\Windows\System32\config\SYSTEM was corrupted.
Are the Windows VM and the PBS datastore located on the same disk/storage? Could it be that when the disk fills up, the Windows VM can no longer write to disk?
Do note that it is not recommended at all to use the same storage for both VM disks and backups: if that disk fails, your VMs AND backups are gone!
Obviously we couldn't locate the file because it was hidden, and who would ever need to restore a system file, right? If Windows marks it as hidden I guess the smart thing to do is not make it findable in the backups. So we did a full restore from the NAS and 30 hours later had our Windows server running again. I changed the backup to only store 1 copy on the local drive and sync that to the NAS, which stores the last 4 backups, and then last night it happened again.
Only this time the single backup was taking just ~600GB of space and apparently completed successfully at 10pm, but the PBS logs share had a 300GB+ directory in it that filled the drive sometime between midnight and 7am this morning, and once again the Windows server is refusing to boot with a missing/corrupted C:\Windows\System32\config\SYSTEM.
Restoring from the hard drive only took 49 minutes this time, but this is about the 4th time an errant PBS process has completely filled the hard drive with gigabytes of logs and the second time it has destroyed a VM.
Again, do not use the same drive for backups. And identify what logs are filling up your system. Older versions of PBS were rather verbose in the backup task log; that is no longer the case for an up-to-date one.
At this point it's getting embarrassing, as I was the one that recommended PBS as suitable for production and it's clearly not.
Your current setup is, however, very much not recommended; see https://pbs.proxmox.com/docs/installation.html#recommended-server-system-requirements

 
Hi,

What version of PBS are you running, and did you check which log files fill up the disk? Also, please note that running PBS with a datastore located on an NFS share is problematic for performance. Further, I assume your NFS share is backed by spinning disks? Neither copes well with PBS, as IO will be your main bottleneck.
Running PBS 9.0. I'm aware the NFS and spinning disks are not ideal, but it's a case of using the hardware I have, not the hardware I wish I had. Basically 1 PC with two 2TB drives, a Synology NAS and 2 USB drives for offsite backups. I'd like to pull the drives out of the Synology and put them in the server rather than using them over NFS, but that's not possible unfortunately.
Are the Windows VM and the PBS datastore located on the same disk/storage? Could it be that when the disk fills up, the Windows VM can no longer write to disk?
Yes, that's no doubt the cause of the Windows corruption.
Do note that it is not recommended at all to use the same storage for both VM disks and backups: if that disk fails, your VMs AND backups are gone!
Duly noted. That's why, after a backup to the local disk at 10pm, it is then synced to the NAS in the early morning and synced again to an external USB drive later in the morning when one is plugged in. Initially backups were going straight to the NAS over NFS, but that put the VM in a degraded state for several hours.
Again, do not use the same drive for backups. And identify what logs are filling up your system. Older versions of PBS were rather verbose in the backup task log; that is no longer the case for an up-to-date one.

Your current setup is, however, very much not recommended; see https://pbs.proxmox.com/docs/installation.html#pbsrecommended-server-system-requirements
Next time it happens I'll be able to preserve the logs and be back with more information, but given PBS has filled its available storage to capacity 3 times now, and the latest effort was generating 300GB worth of logs between 12am and 7am after a successful backup, I have to wonder what kind of setup it takes to reliably run PBS. 300TB?
 
Running PBS 9.0
There is no PBS version 9.0, I assume you are referring to Proxmox VE running on the host here. Please post the output of proxmox-backup-manager version --verbose run on the PBS host.

Next time it happens I'll be able to preserve the logs and be back with more information, but given PBS has filled its available storage to capacity 3 times now, and the latest effort was generating 300GB worth of logs between 12am and 7am after a successful backup, I have to wonder what kind of setup it takes to reliably run PBS. 300TB?
No, something seems wrong. As already stated, you will have to check what logs are generated... Without more details it will not be possible to help. Did you already try to identify what is filling up the space with tools like ncdu? Do you see errors in the systemd journal?
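If ncdu is not installed, a rough equivalent can be sketched in a few lines of Python. This is a minimal sketch rather than a replacement for ncdu: the function names are made up for illustration, the root directory is a placeholder, and it sums apparent file sizes (st_size), which can differ from blocks actually used on disk.

```python
import os


def dir_sizes(root):
    """Map each immediate subdirectory of `root` to its total apparent size in bytes."""
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        sizes[entry.path] = total
    return sizes


def largest(root, count=10):
    """Return the `count` biggest subdirectories as (path, bytes), largest first."""
    ranked = sorted(dir_sizes(root).items(), key=lambda item: item[1], reverse=True)
    return ranked[:count]


if __name__ == "__main__":
    ROOT = "."  # placeholder; use the mount point of the dataset that fills up
    for path, size in largest(ROOT):
        print(f"{size / 1e9:10.2f} GB  {path}")
```

Running it on the logs dataset before deleting anything would at least show which directory holds the bulk of the data.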

I'm aware the NFS and spinning disks are not ideal, but it's a case of using the hardware I have, not the hardware I wish I had. Basically 1 PC with two 2TB drives, a Synology NAS and 2 USB drives for offsite backups. I'd like to pull the drives out of the Synology and put them in the server rather than using them over NFS, but that's not possible unfortunately.
But you should not complain about the poor performance either then ;)
 
No, something seems wrong. As already stated, you will have to check what logs are generated... Without more details it will not be possible to help. Did you already try to identify what is filling up the space with tools like ncdu? Do you see errors in the systemd journal?
OK, so it happened again, and this time I was able to preserve the logs before deleting them. Basically an external USB drive was unplugged and PBS went crazy writing IO errors to the log many times a second...


2025-12-23T11:46:14+08:00: Starting datastore sync job 'this-server:nas-nfs:usb2::s-19e3c46a-6986'
2025-12-23T11:46:14+08:00: sync datastore 'usb2' from 'this-server/nas-nfs'
2025-12-23T11:46:14+08:00: ----
2025-12-23T11:46:14+08:00: Syncing datastore 'nas-nfs', root namespace into datastore 'usb2', root namespace
2025-12-23T11:46:14+08:00: found 1 groups to sync (out of 1 total)
2025-12-23T11:46:14+08:00: sync snapshot vm/103/2025-12-18T16:45:05Z
2025-12-23T11:46:14+08:00: sync archive qemu-server.conf.blob
2025-12-23T11:46:14+08:00: sync archive drive-virtio1.img.fidx
2025-12-24T07:38:01+08:00: removing backup snapshot "/mnt/datastore/usb2/vm/103/2025-12-18T16:45:05Z"
2025-12-24T07:38:01+08:00: cleanup error - removing backup snapshot "/mnt/datastore/usb2/vm/103/2025-12-18T16:45:05Z" failed - Input/output error (os error 5)
2025-12-24T07:38:01+08:00: percentage done: 25.00% (1/4 snapshots)
2025-12-24T07:38:01+08:00: sync group vm/103 failed - inserting chunk on store 'usb2' failed for c9d1d3105b9726f735a9762ffabb4593ec64753a4f24ef171dfe0bd8a4463973 - Atomic rename failed for file "/mnt/datastore/usb2/.chunks/c9d1/c9d1d3105b9726f735a9762ffabb4593ec64753a4f24ef171dfe0bd8a4463973" - Input/output error (os error 5)
2025-12-24T07:38:01+08:00: error during cleanup: EIO: I/O error
2025-12-24T07:38:01+08:00: Finished syncing root namespace, current progress: 0 groups, 1 snapshots
2025-12-24T07:38:01+08:00: list groups error on datastore usb2 - EIO: I/O error
(Repeats for 20GB until the disk fills...)
2025-12-25T22:46:27+08:00: list groups error on datastore usb2 - EIO: I/O error

Now what's happening here is that every morning between 7 and 8am the client swaps over an external USB drive, which begins syncing from the NFS share on the Synology NAS, and that can take close to 22 hours. At 3am a cron script unmounts all USB datastores, which in theory finishes up just in time for the next USB sync to start. Apparently the previous day they swapped it at 11:46am instead of 7, so the sync wasn't finished when they swapped drives the next morning.
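The timing can be checked against the log timestamps. This is a small sketch under two assumptions: the sync really needs the full 22 hours, and the first I/O error at 07:38:01 marks the moment the drive was pulled. The function name is illustrative only.

```python
from datetime import datetime, timedelta


def projected_finish(start, duration, deadline):
    """Return (finish time, True if the job finishes by the deadline)."""
    finish = start + duration
    return finish, finish <= deadline


if __name__ == "__main__":
    # Timestamps taken from the task log above.
    start = datetime.fromisoformat("2025-12-23T11:46:14+08:00")
    unplugged = datetime.fromisoformat("2025-12-24T07:38:01+08:00")

    finish, in_time = projected_finish(start, timedelta(hours=22), unplugged)
    print("projected finish:", finish.isoformat())
    print("finished before the swap:", in_time)
    if not in_time:
        print("still needed:", finish - unplugged)
```

With those assumptions the sync would have run until about 09:46 on the 24th, roughly two hours after the drive was removed.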
Yes, the client is a doofus who needs to buy faster drives, but if Proxmox want to use words like 'reliable', 'enterprise' and 'professional' in its promos, I humbly submit that an unplugged USB drive shouldn't bring the entire system to its knees. If not a spam filter on logging IO errors, then maybe logrotate should come standard. I've just spotted a logrotate script for Proxmox so I'll look into that for now.
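To illustrate what a "spam filter" for repeated errors could look like, here is a minimal sketch that collapses runs of identical messages in a saved task log into one summary line each. It is not how PBS logs internally; the function names are made up, and it assumes every line has the "TIMESTAMP: message" shape seen in the excerpt above.

```python
from itertools import groupby


def split_entry(line):
    """Split 'TIMESTAMP: message' into (timestamp, message)."""
    timestamp, _, message = line.rstrip("\n").partition(": ")
    return timestamp, message


def collapse_repeats(lines):
    """Yield one line per run of consecutive identical messages.

    Works as a stream, so a multi-gigabyte log never has to fit in memory.
    """
    entries = (split_entry(line) for line in lines if line.strip())
    for message, run in groupby(entries, key=lambda entry: entry[1]):
        first_ts, _ = next(run)
        last_ts, count = first_ts, 1
        for timestamp, _ in run:
            last_ts, count = timestamp, count + 1
        if count == 1:
            yield f"{first_ts}: {message}"
        else:
            yield f"{first_ts} .. {last_ts}: {message} (repeated {count} times)"


if __name__ == "__main__":
    # Three lines copied from the log excerpt above.
    sample = [
        "2025-12-24T07:38:01+08:00: error during cleanup: EIO: I/O error",
        "2025-12-24T07:38:01+08:00: list groups error on datastore usb2 - EIO: I/O error",
        "2025-12-25T22:46:27+08:00: list groups error on datastore usb2 - EIO: I/O error",
    ]
    for line in collapse_repeats(sample):
        print(line)
```

Passing an open file object instead of the sample list would reduce the preserved log to a handful of lines, which makes it much easier to post or attach.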



But you should not complain about the poor performance either then ;)
Turns out they do have an enterprise SSD in the server, which is why local backups/restores take < 1 hour instead of several hours with the NAS. That's why I went the local NFS route to begin with. The problem was I didn't realise the VM drive files weren't pre-allocated, so I had 900GB of quotas trying to fit into 500GB of actual disk capacity, which was the other source of crashes. They've finally agreed to get another hard drive for backups, so you'd think I could finally rest easy, maybe use that spare space for ZFS snapshots, right? Nope, they want to fill it with a Windows desktop they can RDP into. :rolleyes:
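The overcommit is easy to quantify. A tiny sketch, using only the two totals from the post (900GB of limits, 500GB of disk); the function name is illustrative and a real check would pass in the individual disk and quota sizes:

```python
def overcommit_report(provisioned_gb, capacity_gb):
    """Return (total provisioned, shortfall, overcommit ratio) for thin-provisioned storage."""
    total = sum(provisioned_gb)
    shortfall = max(0, total - capacity_gb)
    return total, shortfall, total / capacity_gb


if __name__ == "__main__":
    # Only the totals are known from the post, so the list has a single entry.
    total, shortfall, ratio = overcommit_report([900], 500)
    print(f"provisioned {total}GB on 500GB: {ratio:.1f}x overcommitted, {shortfall}GB short")
```

Anything above a ratio of 1.0 means the guests and datasets can collectively ask for more space than physically exists, which is exactly the situation where a full disk takes a running VM down with it.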
 