Backup failed for VM running on Ceph

PiotrD

Active Member
Apr 10, 2014
Hi,
I noticed that during the backup of my VMs, one backup fails. It has happened twice in the last two weeks, only for one VM, and at almost exactly the same time.

138: Dec 05 01:41:49 INFO: Starting Backup of VM 138 (qemu)
138: Dec 05 01:41:49 INFO: status = running
138: Dec 05 01:41:50 INFO: update VM 138: -lock backup
138: Dec 05 01:41:50 INFO: backup mode: snapshot
138: Dec 05 01:41:50 INFO: ionice priority: 7
138: Dec 05 01:41:50 INFO: snapshots found (not included into backup)
138: Dec 05 01:41:50 INFO: creating archive '/mnt/pve/backup/dump/vzdump-qemu-138-2014_12_05-01_41_49.vma.lzo'
138: Dec 05 01:41:50 INFO: started backup task '5af639e8-ae33-4d25-a98a-25a52a7c6ca3'
138: Dec 05 01:41:53 INFO: status: 0% (114753536/201863462912), sparse 0% (22028288), duration 3, 38/30 MB/s
138: Dec 05 01:42:26 INFO: status: 1% (2039087104/201863462912), sparse 0% (1409708032), duration 36, 58/16 MB/s
138: Dec 05 01:42:56 INFO: status: 2% (4072734720/201863462912), sparse 1% (3144876032), duration 66, 67/9 MB/s
138: Dec 05 01:43:25 INFO: status: 3% (6077087744/201863462912), sparse 2% (4887990272), duration 95, 69/9 MB/s
138: Dec 05 01:43:54 INFO: status: 4% (8156676096/201863462912), sparse 3% (6710759424), duration 124, 71/8 MB/s
138: Dec 05 01:44:19 INFO: status: 5% (10167058432/201863462912), sparse 4% (8718745600), duration 149, 80/0 MB/s
138: Dec 05 01:44:44 INFO: status: 6% (12178423808/201863462912), sparse 5% (10725748736), duration 174, 80/0 MB/s
138: Dec 05 01:45:18 INFO: status: 7% (14163968000/201863462912), sparse 6% (12200509440), duration 208, 58/15 MB/s
138: Dec 05 01:45:45 INFO: status: 8% (16198205440/201863462912), sparse 6% (14013587456), duration 235, 75/8 MB/s
138: Dec 05 01:46:26 INFO: status: 9% (18205179904/201863462912), sparse 7% (15217860608), duration 276, 48/19 MB/s
138: Dec 05 01:46:42 ERROR: VM 138 not running
138: Dec 05 01:46:42 INFO: aborting backup job
138: Dec 05 01:46:42 ERROR: VM 138 not running
138: Dec 05 01:46:44 ERROR: Backup of VM 138 failed - VM 138 not running

As it turns out, the VM died during the backup process. I am not sure whether it went down immediately or only after some freeze; if there was a freeze, it lasted only a few minutes - I just received the notification from Nagios that the system went down.
I checked all the logs on the compute node and on all OSDs and there were no errors (no PG errors and so on). The only error I found, apart from the failed backup itself, was inside the VM's own log.
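For completeness, the Ceph-side checks I ran were roughly the following (just a sketch; the OSD log path is the default one and may differ on other setups):

# ceph -s                                      # overall cluster state around the time of the crash
# ceph health detail                           # PG/OSD warnings with details
# grep -iE 'error|slow request' /var/log/ceph/ceph-osd.*.log    # per-OSD logs on each storage node

The VM's own syslog looked like this: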

Dec 5 01:46:01 *-srv01 systemd: serial-getty@ttyS0.service holdoff time over, scheduling restart.
Dec 5 01:46:01 *-srv01 systemd: Stopping Serial Getty on ttyS0...
Dec 5 01:46:01 *-srv01 systemd: Starting Serial Getty on ttyS0...
Dec 5 01:46:01 *-srv01 systemd: Started Serial Getty on ttyS0.
Dec 5 01:46:11 *-srv01 systemd: serial-getty@ttyS0.service holdoff time over, scheduling restart.
Dec 5 01:46:11 *-srv01 systemd: Stopping Serial Getty on ttyS0...
Dec 5 01:46:11 *-srv01 systemd: Starting Serial Getty on ttyS0...
Dec 5 01:46:11 *-srv01 systemd: Started Serial Getty on ttyS0.
^@^@^@^@^@^@^@^@ [... long run of NUL bytes (^@) where the log breaks off abruptly ...] ^@^@^@^@^@^@^@^@
Dec 5 03:59:36 *-srv01 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="759" x-info="http://www.rsyslog.com"] start
Dec 5 03:59:27 *-srv01 journal: Runtime journal is using 8.0M (max 399.2M, leaving 598.8M of free 3.8G, current limit 399.2M).
Dec 5 03:59:27 *-srv01 kernel: Initializing cgroup subsys cpuset
Dec 5 03:59:27 *-srv01 kernel: Initializing cgroup subsys cpu
Dec 5 03:59:27 *-srv01 kernel: Initializing cgroup subsys cpuacct
Dec 5 03:59:27 *-srv01 kernel: Linux version 3.17.4-2.el7.elrepo.x86_64 (mockbuild@Build64R7) (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) ) #1 SMP Sun Nov

It looks like there was some issue with the storage, but it is hard to find any information about it in the logs.

What could the problem be? Is there any way to log information from the running KVM process so that I could see why it crashed? Or maybe it is a known issue and I should just upgrade?
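One thing I was considering is passing extra logging flags to KVM through the args option of the VM config (just a sketch; the log path is only an example, -d guest_errors only catches invalid guest accesses, and the flags only take effect after the VM is restarted):

# qm set 138 --args '-D /var/log/qemu-vm-138.log -d guest_errors'

which ends up as the following line in /etc/pve/qemu-server/138.conf:

args: -D /var/log/qemu-vm-138.log -d guest_errors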

I am using the following versions of Proxmox and Ceph:
# pveversion
pve-manager/3.2-4/e24a91c1 (running kernel: 2.6.32-31-pve)

# ceph -v
ceph version 0.80.5

Kind regards,
Piotr D
 
Last night the same thing happened again, this time for a different VM and on a different compute node. What could cause that?
 
Bump.
Does anyone have an idea what could be wrong here?
I will upgrade the system as soon as possible, but I am not sure whether that would help.
 
I use VM backups (virtual disks on Ceph) daily (about 1 TB in total), but I do not see this.

Currently I run the latest 3.3 with the 2.6.32 kernel; Ceph (Giant) is running on 3 Proxmox VE hosts with 12 OSDs (SSD only).
 
Hi,
Hmm, so the only thing I have in mind right now is some kind of network issue with the switches we are currently using.
OK, and what about the Ceph Giant release (I am running Firefly)? I know it is the latest stable Ceph release, but do you recommend it?
If yes, is the right way to upgrade simply to set up the new repository and run apt-get upgrade?
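The kind of repository switch I have in mind would look roughly like this (just a sketch, assuming Ceph was installed from the ceph.com Debian repo in /etc/apt/sources.list.d/ceph.list and uses the sysvinit service names; mon.0 and osd.0 are only example daemon IDs, and I would restart one node at a time):

# sed -i 's/debian-firefly/debian-giant/' /etc/apt/sources.list.d/ceph.list
# apt-get update && apt-get dist-upgrade
# service ceph restart mon.0      # restart monitors first
# service ceph restart osd.0      # then the OSDs, one after another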

Kind regards,
Piotr D
 
