[SOLVED] PBS reboots during backup "backup write data failed"

AX_florian

Greetings, my company is currently testing PBS. Our infrastructure consists of multiple Proxmox VEs with both CTs and VMs.
For one specific VM, the PBS backups never finish and the PBS server eventually reboots, causing the error "backup write data failed".
Can anyone explain what we are seeing? Journalctl suggests an issue on the PBS server, but why does only this VM trigger it?

PBS Version:
proxmox-backup-server 1.1.10-1 running version: 1.1.5

PVE Task:
Code:
Proxmox Virtual Environment 6.4-5
Virtual Machine 105 (susi) on node 'le-srv02'
INFO: starting new backup job: vzdump 105 --storage data01-sms --node le-srv02 --mode snapshot --remove 0
INFO: Starting Backup of VM 105 (qemu)
INFO: Backup started at 2021-06-29 14:42:48
INFO: status = running
INFO: VM Name: susi
INFO: include disk 'scsi0' 'local-lvm:vm-105-disk-1' 600G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/105/2021-06-29T12:42:48Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '73d88055-17cc-415c-bb91-a7af470ea882'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO:   0% (184.0 MiB of 600.0 GiB) in 3s, read: 61.3 MiB/s, write: 50.7 MiB/s
INFO:   1% (6.0 GiB of 600.0 GiB) in 5m 22s, read: 18.7 MiB/s, write: 18.7 MiB/s
INFO:   2% (12.0 GiB of 600.0 GiB) in 10m 38s, read: 19.5 MiB/s, write: 19.5 MiB/s
INFO:   2% (16.2 GiB of 600.0 GiB) in 18m 43s, read: 8.9 MiB/s, write: 8.9 MiB/s
ERROR: backup write data failed: command error: protocol canceled
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 105 failed - backup write data failed: command error: protocol canceled
INFO: Failed at 2021-06-29 15:01:32
INFO: Backup job finished with errors
TASK ERROR: job errors
 

Attachments

  • pveversion.txt (1.5 KB)
  • storage_cfg.txt (406 bytes)
  • pvesm_list.txt (4.2 KB)
  • journalctl.txt (18.7 KB)
  • vm_conf.txt (307 bytes)
What kind of filesystem is in the folder '/mnt/datastore/le-srv02/'? I.e., how did you configure the datastore?
How much memory does the PBS server have? Is it bare-metal or a VM? Did you already rule out hardware issues (memory, PSU, etc.)?
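For reference, something like this should show the datastore configuration and the state of the pool behind it (just a sketch; the datastore name 'le-srv02' is taken from your task log):
Code:
# list configured datastores and the paths backing them
proxmox-backup-manager datastore list

# if the datastore lives on ZFS, check pool health and dataset usage
zpool status -v
zfs list -o name,used,avail,mountpoint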
 
The filesystem is ZFS.
Code:
root@data01:~# mount | grep 'datastore' | column -t
le-srv02              on  /mnt/datastore/le-srv02          type  zfs         (rw,xattr,noacl)
le-srv03              on  /mnt/datastore/le-srv03          type  zfs         (rw,xattr,noacl)

Memory is 45GiB.
Code:
root@data01:~# free -h
              total        used        free      shared  buff/cache   available
Mem:           45Gi        25Gi        19Gi       1.0Mi       748Mi        19Gi
Swap:         8.0Gi          0B       8.0Gi

PBS is a bare-metal installation on fresh server hardware. Memory checks were performed both by the vendor and by us. Hardware issues seem unlikely, since the reboots are only triggered by this specific VM, but I'll check anyway.
I'm not entirely sure what you mean by "how did you configure the datastore?", but I can tell you that we used two physical disks and did the rest via the GUI.
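(For what it's worth, most of the 25 GiB shown as "used" above is probably the ZFS ARC rather than the PBS services themselves; a minimal check, assuming the usual OpenZFS arcstats interface:)
Code:
# print the current ZFS ARC size and its configured maximum
awk '$1 == "size" || $1 == "c_max" {printf "%s: %.1f GiB\n", $1, $3 / (1024*1024*1024)}' /proc/spl/kstat/zfs/arcstats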
 
Can you check and post the related backup task log from the PBS side too?

Can you let 'journalctl -f' run during such a backup, e.g. in an SSH session or on the server itself? Maybe the journal does not contain any info because it can no longer be written to disk (is there enough disk space?).

A 'dmesg -w' during such a backup would also be interesting.
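For example, running both from another machine so the output survives even if the PBS host dies (the hostname is taken from your shell prompt above; the file names are just placeholders):
Code:
# capture the journal and kernel log of the PBS host during the backup
ssh root@data01 'journalctl -f' | tee journal_during_backup.txt
ssh root@data01 'dmesg -w -T'   | tee dmesg_during_backup.txt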
 
PBS Task:
Code:
2021-07-01T09:31:38+02:00: starting new backup on datastore 'le-srv02': "vm/105/2021-07-01T07:31:37Z"
2021-07-01T09:31:38+02:00: GET /previous: 400 Bad Request: no valid previous backup
2021-07-01T09:31:38+02:00: created new fixed index 1 ("vm/105/2021-07-01T07:31:37Z/drive-scsi0.img.fidx")
2021-07-01T09:31:38+02:00: add blob "/mnt/datastore/le-srv02/vm/105/2021-07-01T07:31:37Z/qemu-server.conf.blob" (277 bytes, comp: 277)

There is plenty of space for the journal; it just doesn't show any useful output. Is there any way to increase the log level for PBS?
 

Attachments

  • journalctl_follow.txt (1.5 KB)
  • dmesg.txt (81.5 KB)
It seems like the server simply shuts off, with no errors, no warnings, etc.
It is very unlikely that the PBS services have such an effect; I'd rather suspect faulty hardware that resets the server, or maybe a kernel issue, but even that would print *something* into the dmesg.

Does that server have some kind of management interface (iLO, iDRAC, IPMI, etc.) where the machine reset might get logged?
If the hardware encountered an error, it will probably be logged there.
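A quick way to dump the BMC's event log from the OS, assuming ipmitool is installed and the BMC is reachable (the remote parameters are placeholders):
Code:
# list the IPMI system event log with readable timestamps
ipmitool sel elist

# or query a remote BMC
ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> sel elist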

Also, something that just comes to mind:

What exactly does the VM you want to back up do?
Anything that PBS relies on (e.g. iSCSI LUNs)?
If yes, this cannot work, since the VM will be paused briefly when the backup starts.
 
IPMI did indeed show a lot of CPU errors. We are currently checking with our vendor for possible causes.
Still, it's strange that only one specific VM causes this behaviour. I'll update this thread when we reach a solution.

Thank you for your time and effort.

Edit: The VM is completely unrelated to PBS; it does not provide anything PBS relies on.
 
Completely replacing the CPU and RAM seems to have fixed it: the backup was successful and the system didn't crash last night. However, we got the following error during the scheduled verification.
Code:
2021-07-07T20:00:01+02:00: verify group le-srv02:vm/105 (1 snapshots)
2021-07-07T20:00:01+02:00: verify le-srv02:vm/105/2021-07-07T10:19:56Z
2021-07-07T20:00:01+02:00:   check qemu-server.conf.blob
2021-07-07T20:00:01+02:00:   check drive-scsi0.img.fidx
2021-07-07T20:00:01+02:00: can't verify chunk, stat failed - No such file or directory (os error 2)
2021-07-07T21:45:07+02:00:   verified 454384.39/455488.00 MiB in 6306.64 seconds, speed 72.05/72.22 MiB/s (1 errors)
2021-07-07T21:45:07+02:00: verify le-srv02:vm/105/2021-07-07T10:19:56Z/drive-scsi0.img.fidx failed: chunks could not be verified

The only reference to this error I could find was this forum thread, which indicates a problem with the "/mnt/datastore/le-srv02/.chunks" directory. Since it's the same VM as before, the issues are likely related. Is there a way to delete all references to a VM and start from scratch?
 

Attachments

  • successfull_backup_pve.txt (9.4 KB)
  • successfull_backup_pbs.txt (1.2 KB)
Is there a way to delete all references to a VM and start from scratch?
First, I would try deleting only the affected snapshots; in the case you posted, that is "vm/105/2021-07-07T10:19:56Z".
If all snapshots are affected, simply remove the whole backup group for 'vm/105' (you could not have restored from it anyway with chunks missing).
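For the single-snapshot case, something along these lines should work from the PBS host itself (the repository string is an assumption based on the datastore name above):
Code:
# forget (delete) one broken snapshot from the 'le-srv02' datastore
proxmox-backup-client forget "vm/105/2021-07-07T10:19:56Z" --repository root@pam@localhost:le-srv02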
 
First I tried deleting only the failed snapshot, but the next snapshot failed in the same way.
I could not find a way to remove the group via the GUI, but deleting the directory "/mnt/datastore/le-srv02/vm/105" did it. Incremental backups are now created as expected.
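(Note: after removing snapshot directories by hand, the chunks they referenced stay in '.chunks' until the next garbage collection; if needed, a run can be started manually, e.g.:)
Code:
# reclaim chunks that no longer belong to any snapshot
proxmox-backup-manager garbage-collection start le-srv02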

Since the hardware issue is fixed and snapshots are working now, this thread is solved.
 
