[SOLVED] Certain VMs from a cluster cannot be backed up and managed

For me it happens on all guests on one host. I have ballooning enabled everywhere, though, so I'll look into it. Two days ago I restarted all VMs after hours and everything worked for a while, but today backups failed again and I cannot use the console. I haven't experienced any guest errors or hangs, though, so I'm not sure if it's the same error you guys are having trouble with. I started tracing a few VMs, so I'll see if it catches anything; a rough sketch of how I'm tracing is below.
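In case anyone wants to try the same, this is roughly how I attach the trace (a minimal sketch; VM ID 107 is just an example, and I'm assuming the usual PVE PID file location):
Code:
# PVE keeps the KVM process PID in /var/run/qemu-server/<vmid>.pid
VMID=107
PID=$(cat /var/run/qemu-server/${VMID}.pid)
# Follow all threads, timestamp every syscall, write to a log file;
# nohup so the trace survives the SSH session
nohup strace -f -tt -o /root/qemu_trace_${VMID} -p "$PID" &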
 
I'm not sure if this helps, but I had the same issue as you, Kalzaus; only VMs on one node had problems. I rebooted that node after hours and the VMs are running fine, including backups.
 
I haven't tried rebooting the host itself, only all of the VMs on it; that worked fine for two days, but then the console and backups started hanging again.
I'll probably try rebooting the host on the weekend.
Anyway, one of the machines I started tracing failed a backup yesterday at around 23:20 CEST, so I attached the trace log to this message. I also noticed that the console issues started before the backup issues: Wednesday night's backup was successful, but the console already wasn't working.
 


Here is another trace if it helps. On one of our nodes the VMs seem to be stable after rebooting them; on the other node a reboot lasts only the day, and the VMs catch the "management flu" every night when we back them up.
 


I'm tracing too, but the trace freezes around the time the scheduled backup starts backing up the VM.
File: qemu_trace_107
Size: 12045941 Blocks: 23536 IO Block: 4096 regular file
Device: 801h/2049d Inode: 4719482 Links: 1
Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2019-09-29 13:33:15.456640359 +0200
Modify: 2019-09-27 22:04:37.074866718 +0200
Change: 2019-09-27 22:04:37.074866718 +0200

syslog
Sep 27 22:03:41 prx1 pvestatd[2142]: status update time (6.349 seconds)
Sep 27 22:03:47 prx1 ntpd[43469]: Soliciting pool server 89.221.210.188
Sep 27 22:03:55 prx1 ntpd[43469]: Soliciting pool server 212.111.30.190
Sep 27 22:03:56 prx1 ntpd[43469]: Soliciting pool server 147.231.100.5
Sep 27 22:04:00 prx1 systemd[1]: Starting Proxmox VE replication runner...
Sep 27 22:04:00 prx1 systemd[1]: pvesr.service: Succeeded.
Sep 27 22:04:00 prx1 systemd[1]: Started Proxmox VE replication runner.

message
Sep 27 21:32:04 prx1 kernel: [3841696.826134] vmbr0: port 15(tap103i0) entered forwarding state
Sep 27 21:44:50 prx1 kernel: [3842462.842299] vmbr0: port 15(tap103i0) entered disabled state
Sep 27 21:44:51 prx1 qmeventd[1363]: Starting cleanup for 103
Sep 27 21:44:51 prx1 qmeventd[1363]: Finished cleanup for 103
Sep 27 22:25:41 prx1 kernel: [3844914.051183] device tap108i0 entered promiscuous mode
Sep 27 22:25:41 prx1 kernel: [3844914.068173] vmbr0: port 15(tap108i0) entered blocking state
Sep 27 22:25:41 prx1 kernel: [3844914.068176] vmbr0: port 15(tap108i0) entered disabled state
Sep 27 22:25:41 prx1 kernel: [3844914.068457] vmbr0: port 15(tap108i0) entered blocking state
Sep 27 22:25:41 prx1 kernel: [3844914.068459] vmbr0: port 15(tap108i0) entered forwarding state
Sep 27 22:28:01 prx1 kernel: [3845053.549757] vmbr0: port 15(tap108i0) entered disabled state

backup log
105: 2019-09-27 21:55:16 INFO: aborting backup job
105: 2019-09-27 22:05:16 ERROR: VM 105 qmp command 'backup-cancel' failed - got timeout
105: 2019-09-27 22:05:17 ERROR: Backup of VM 105 failed - got timeout

106: 2019-09-27 22:05:17 INFO: Starting Backup of VM 106 (qemu)
106: 2019-09-27 22:05:17 INFO: status = running
106: 2019-09-27 22:05:18 INFO: update VM 106: -lock backup
106: 2019-09-27 22:05:19 INFO: VM Name: simkod
106: 2019-09-27 22:05:19 INFO: include disk 'scsi0' 'vm-thinpool:vm-106-disk-1' 120G
106: 2019-09-27 22:05:19 INFO: backup mode: snapshot
106: 2019-09-27 22:05:19 INFO: ionice priority: 7
106: 2019-09-27 22:05:19 INFO: creating archive '/mnt/autofs/cesnet/dump/vzdump-qemu-106-2019_09_27-22_05_17.vma.lzo'
106: 2019-09-27 22:05:32 ERROR: got timeout
106: 2019-09-27 22:05:32 INFO: aborting backup job
106: 2019-09-27 22:15:32 ERROR: VM 106 qmp command 'backup-cancel' failed - got timeout
106: 2019-09-27 22:15:33 ERROR: Backup of VM 106 failed - got timeout

107: 2019-09-27 22:15:33 INFO: Starting Backup of VM 107 (qemu)
107: 2019-09-27 22:15:33 INFO: status = running
107: 2019-09-27 22:15:34 INFO: update VM 107: -lock backup
107: 2019-09-27 22:15:34 INFO: VM Name: edusql
107: 2019-09-27 22:15:34 INFO: include disk 'scsi0' 'vm-thinpool:vm-107-disk-1' 32G
107: 2019-09-27 22:15:34 INFO: backup mode: snapshot
107: 2019-09-27 22:15:34 INFO: ionice priority: 7
107: 2019-09-27 22:15:34 INFO: creating archive '/mnt/autofs/cesnet/dump/vzdump-qemu-107-2019_09_27-22_15_33.vma.lzo'
107: 2019-09-27 22:15:38 ERROR: got timeout
107: 2019-09-27 22:15:38 INFO: aborting backup job
107: 2019-09-27 22:25:38 ERROR: VM 107 qmp command 'backup-cancel' failed - got timeout
107: 2019-09-27 22:25:39 ERROR: Backup of VM 107 failed - got timeout

So there may be some relation to the services started at this time.
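If it helps anyone confirm the same symptom, here is a quick check of whether a VM's QMP monitor socket still answers (a rough sketch: VM 107 is the example, /var/run/qemu-server/<vmid>.qmp is the usual PVE socket path, and socat must be installed):
Code:
# QMP requires a capabilities handshake before it accepts other commands
printf '%s\n%s\n' '{"execute":"qmp_capabilities"}' '{"execute":"query-status"}' \
    | socat -t 5 - UNIX-CONNECT:/var/run/qemu-server/107.qmp
# A healthy VM answers within a second; on an affected VM this hangs,
# which matches the 'backup-cancel' timeouts in the log above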
 
Same problem here now :-(
4-node cluster with NFS shared storage.
Only one node affected.
 
The error occurs here for me too, after an upgrade from Proxmox 5 to 6.
One node of a 3-node cluster with Ceph storage is affected.
Backup, console access, snapshot, and migration all get timeouts for the VMs.
After shutting a VM down and restarting it on the same node, the problem soon occurs again.
After migrating a VM to one of the other nodes, the problem doesn't occur.
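For reference, the workaround in command form (a rough sketch; the VM ID 100 and the node name 'node2' are placeholders):
Code:
# Live-migrate the affected guest to a node that is still behaving
qm migrate 100 node2 --online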
 
We have identified a likely cause; a patched pve-qemu-kvm package (pve-qemu-kvm 4.0.0-6) should be available on pvetest once it's gone through internal testing.
 

Available on pvetest now.
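For anyone testing: a minimal sketch of pulling the package from pvetest on PVE 6 (which is based on Debian Buster); don't forget to remove the repository entry again after testing:
Code:
# Temporarily enable the pvetest repository
echo "deb http://download.proxmox.com/debian/pve buster pvetest" \
    > /etc/apt/sources.list.d/pvetest.list
apt update && apt install pve-qemu-kvm
# Running VMs keep the old QEMU binary until they are stopped and started again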
 
Same problem with DRBD storage, after a cluster migration... No more hints from debug/log/syslog than the other people in this thread have found. I tried playing with AppArmor without results. A shutdown/stop followed by a start corrects all the migration/backup/console problems. I can't see the relation, but it seems that the presence of the QEMU guest agent gives better resilience against this bug on my Linux and Windows guests. Sadly, my OpenBSD guests have no guest agent and still suffer from this problem. If I move a VM to a 5.4 PVE cluster, everything is fine.
 

Please test the pve-qemu-kvm package from pvetest (see my previous post).
 
I'd prefer to wait until others have tested it: my PVE 6.0 cluster has some critical VMs that I can't move to my 5.4 cluster :-(
 
I tried the new version and the error does not occur in the same way. Now the backups do start but get interrupted:
Code:
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2019-10-04 00:00:02
INFO: status = running
INFO: update VM 100: -lock backup
INFO: VM Name: ABC
INFO: include disk 'virtio0' 'vm01-sata:vm-100-disk-0' 200G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/opt/backup//dump/vzdump-qemu-100-2019_10_04-00_00_02.vma.lzo'
INFO: started backup task '98205932-cda7-4400-acd8-11544b5ebdee'
INFO: status: 0% (36306944/214748364800), sparse 0% (7999488), duration 3, read/write 12/9 MB/s
INFO: status: 1% (2149187584/214748364800), sparse 0% (30916608), duration 23, read/write 105/104 MB/s
INFO: status: 2% (4314497024/214748364800), sparse 0% (39452672), duration 59, read/write 60/59 MB/s
INFO: status: 3% (6494552064/214748364800), sparse 0% (62484480), duration 110, read/write 42/42 MB/s
INFO: status: 4% (8638889984/214748364800), sparse 0% (76939264), duration 151, read/write 52/51 MB/s
INFO: status: 5% (10825498624/214748364800), sparse 0% (86171648), duration 169, read/write 121/120 MB/s
INFO: status: 6% (12974882816/214748364800), sparse 0% (89653248), duration 228, read/write 36/36 MB/s
INFO: status: 7% (15077474304/214748364800), sparse 0% (95498240), duration 246, read/write 116/116 MB/s
INFO: status: 8% (17197694976/214748364800), sparse 0% (97767424), duration 310, read/write 33/33 MB/s
INFO: status: 9% (19414843392/214748364800), sparse 0% (102793216), duration 382, read/write 30/30 MB/s
INFO: status: 10% (21562982400/214748364800), sparse 0% (109240320), duration 405, read/write 93/93 MB/s
INFO: status: 11% (23722590208/214748364800), sparse 0% (116121600), duration 466, read/write 35/35 MB/s
INFO: status: 12% (25776816128/214748364800), sparse 0% (120967168), duration 498, read/write 64/64 MB/s
INFO: status: 13% (27987410944/214748364800), sparse 0% (132014080), duration 517, read/write 116/115 MB/s
INFO: status: 14% (30088691712/214748364800), sparse 0% (136548352), duration 581, read/write 32/32 MB/s
ERROR: interrupted by signal
INFO: aborting backup job
ERROR: Backup of VM 100 failed - interrupted by signal
INFO: Failed at 2019-10-04 00:10:54
ERROR: Backup job failed - interrupted by signal

TASK ERROR: interrupted by signal

The system logs did not show anything interesting during that time. I wonder what is aborting the backup?
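For completeness, this is roughly the window I checked (a minimal sketch; the timestamps match the backup log above, and /var/log/pve/tasks is where PVE keeps its task logs):
Code:
# Journal entries around the time the backup died
journalctl --since "2019-10-04 00:00" --until "2019-10-04 00:15"
# The vzdump task logs themselves end up under the PVE task log directory
grep -r "interrupted by signal" /var/log/pve/tasks/ | tail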

Regards,
Sebastian
 
Had the same issue - "Got timeout" during a nightly snapshot backup.

Upgraded to pve-manager/6.0-7/28984024 last weekend - no timeouts since :)
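If you want to double-check what you're actually running, pveversion lists the installed PVE-related package versions:
Code:
# Confirm the installed versions of the relevant packages
pveversion -v | grep -E 'pve-manager|pve-qemu-kvm|qemu-server'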
 
Just a small update on the pve-qemu-kvm 4.0.0-6 testing: I had to roll back to pve-qemu-kvm 4.0.0-5, as VMs were randomly dying during a backup run. They were not even being actively backed up at the time, so the updated package seems a bit unstable.
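For anyone else who needs to do the same, the rollback itself is just an apt version pin (a sketch; it assumes 4.0.0-5 is still available in your configured repositories or the local apt cache):
Code:
# Install the previous known-good version explicitly
apt install pve-qemu-kvm=4.0.0-5
# Already-running VMs keep whichever QEMU binary they were started with;
# each VM needs a stop/start (not a guest-internal reboot) to switch versions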
 
Our first post:

Same error message after the pve-qemu-kvm 4.0.0-6 testing update.
The problem comes back during nightly backups on one VM, and we have therefore decided to roll back to the stable version 5.4.

  1. The backup started at 03:30:02 and we found it frozen in the morning at 07:52:01.
    We then stopped the backup task manually (see the line "interrupted by signal" below).
  2. The VM disks are on an NFS share.
  3. After stopping the backup task, our VM was not available, and therefore a hard stop with the command qm stop [id] was required.
    We then tried to restart the VM, and this also failed with the message: "TASK ERROR: timeout waiting on systemd"
  4. We got this fixed after rebooting the node (a possibly lighter-weight recovery is sketched after this list).
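The lighter-weight recovery mentioned in point 4, as far as I understand it (a sketch, not verified on our side; the scope name is an assumption based on PVE 6 running each VM in a systemd scope named after its VMID):
Code:
# Hard-stop the guest, clear the stale systemd scope behind
# "timeout waiting on systemd", then start the VM again
qm stop 200
systemctl stop 200.scope   # assumption: PVE 6 names the VM scope <vmid>.scope
qm start 200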

Code:
INFO: starting new backup job: vzdump 200 --storage backup --mailnotification failure --quiet 1 --mode stop --compress lzo
INFO: Starting Backup of VM 200 (qemu)
INFO: Backup started at 2019-10-07 03:30:02
INFO: status = running
INFO: update VM 200: -lock backup
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: abmsrv2
INFO: include disk 'scsi0' 'Buffalo_NFS:200/vm-200-disk-0.raw' 100G
interrupted by signal
could not parse qemu-img info command output for '/mnt/pve/Buffalo_NFS/images/200/vm-200-disk-0.raw'
INFO: stopping vm
TASK ERROR: got unexpected control message: INFO: stopping vm
 