Backup failed for VM running on Ceph

PiotrD

Active Member
Apr 10, 2014
Hi,
I noticed that during the backup of my VMs, one backup fails. It has happened twice in the last two weeks, only for one VM, and at almost exactly the same time.

138: Dec 05 01:41:49 INFO: Starting Backup of VM 138 (qemu)
138: Dec 05 01:41:49 INFO: status = running
138: Dec 05 01:41:50 INFO: update VM 138: -lock backup
138: Dec 05 01:41:50 INFO: backup mode: snapshot
138: Dec 05 01:41:50 INFO: ionice priority: 7
138: Dec 05 01:41:50 INFO: snapshots found (not included into backup)
138: Dec 05 01:41:50 INFO: creating archive '/mnt/pve/backup/dump/vzdump-qemu-138-2014_12_05-01_41_49.vma.lzo'
138: Dec 05 01:41:50 INFO: started backup task '5af639e8-ae33-4d25-a98a-25a52a7c6ca3'
138: Dec 05 01:41:53 INFO: status: 0% (114753536/201863462912), sparse 0% (22028288), duration 3, 38/30 MB/s
138: Dec 05 01:42:26 INFO: status: 1% (2039087104/201863462912), sparse 0% (1409708032), duration 36, 58/16 MB/s
138: Dec 05 01:42:56 INFO: status: 2% (4072734720/201863462912), sparse 1% (3144876032), duration 66, 67/9 MB/s
138: Dec 05 01:43:25 INFO: status: 3% (6077087744/201863462912), sparse 2% (4887990272), duration 95, 69/9 MB/s
138: Dec 05 01:43:54 INFO: status: 4% (8156676096/201863462912), sparse 3% (6710759424), duration 124, 71/8 MB/s
138: Dec 05 01:44:19 INFO: status: 5% (10167058432/201863462912), sparse 4% (8718745600), duration 149, 80/0 MB/s
138: Dec 05 01:44:44 INFO: status: 6% (12178423808/201863462912), sparse 5% (10725748736), duration 174, 80/0 MB/s
138: Dec 05 01:45:18 INFO: status: 7% (14163968000/201863462912), sparse 6% (12200509440), duration 208, 58/15 MB/s
138: Dec 05 01:45:45 INFO: status: 8% (16198205440/201863462912), sparse 6% (14013587456), duration 235, 75/8 MB/s
138: Dec 05 01:46:26 INFO: status: 9% (18205179904/201863462912), sparse 7% (15217860608), duration 276, 48/19 MB/s
138: Dec 05 01:46:42 ERROR: VM 138 not running
138: Dec 05 01:46:42 INFO: aborting backup job
138: Dec 05 01:46:42 ERROR: VM 138 not running
138: Dec 05 01:46:44 ERROR: Backup of VM 138 failed - VM 138 not running

As it turns out, the VM died during the backup process. I am not sure whether it went down immediately or only after some freeze; if there was a freeze, it lasted only a few minutes - I just received the notification from Nagios that the system went down.
I checked all the logs on the compute node and on all OSDs and there were no errors (no PG errors and so on). The only error I found, apart from the failed backup itself, was inside the VM's own log.
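For completeness, the Ceph-side checks I ran were roughly the following (just a sketch; the OSD log path is the default one and may differ on other setups):

# ceph -s                                      # overall cluster state around the time of the crash
# ceph health detail                           # PG/OSD warnings with details
# grep -iE 'error|slow request' /var/log/ceph/ceph-osd.*.log    # per-OSD logs on each storage node

The VM's own syslog looked like this: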

Dec 5 01:46:01 *-srv01 systemd: serial-getty@ttyS0.service holdoff time over, scheduling restart.
Dec 5 01:46:01 *-srv01 systemd: Stopping Serial Getty on ttyS0...
Dec 5 01:46:01 *-srv01 systemd: Starting Serial Getty on ttyS0...
Dec 5 01:46:01 *-srv01 systemd: Started Serial Getty on ttyS0.
Dec 5 01:46:11 *-srv01 systemd: serial-getty@ttyS0.service holdoff time over, scheduling restart.
Dec 5 01:46:11 *-srv01 systemd: Stopping Serial Getty on ttyS0...
Dec 5 01:46:11 *-srv01 systemd: Starting Serial Getty on ttyS0...
Dec 5 01:46:11 *-srv01 systemd: Started Serial Getty on ttyS0.
^@^@^@^@^@^@^@^@ [... long run of NUL bytes (^@) where the log breaks off abruptly ...] ^@^@^@^@^@^@^@^@
Dec 5 03:59:36 *-srv01 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="759" x-info="http://www.rsyslog.com"] start
Dec 5 03:59:27 *-srv01 journal: Runtime journal is using 8.0M (max 399.2M, leaving 598.8M of free 3.8G, current limit 399.2M).
Dec 5 03:59:27 *-srv01 kernel: Initializing cgroup subsys cpuset
Dec 5 03:59:27 *-srv01 kernel: Initializing cgroup subsys cpu
Dec 5 03:59:27 *-srv01 kernel: Initializing cgroup subsys cpuacct
Dec 5 03:59:27 *-srv01 kernel: Linux version 3.17.4-2.el7.elrepo.x86_64 (mockbuild@Build64R7) (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) ) #1 SMP Sun Nov

It looks like there was some issue with the storage, but it is hard to find any information about it in the logs.

What could the problem be? Is there any way to log information from the running KVM process so that I could see why it crashed? Or maybe it is a known issue and I should just upgrade?
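One thing I was considering is passing extra logging flags to KVM through the args option of the VM config (just a sketch; the log path is only an example, -d guest_errors only catches invalid guest accesses, and the flags only take effect after the VM is restarted):

# qm set 138 --args '-D /var/log/qemu-vm-138.log -d guest_errors'

which ends up as the following line in /etc/pve/qemu-server/138.conf:

args: -D /var/log/qemu-vm-138.log -d guest_errors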

I am using the following versions of Proxmox and Ceph:
# pveversion
pve-manager/3.2-4/e24a91c1 (running kernel: 2.6.32-31-pve)

# ceph -v
ceph version 0.80.5

Kind regards,
Piotr D
 
Last night the same thing happened again, this time for a different VM and on a different compute node. What could cause that?
 
Bump.
Does anyone have an idea what could be wrong here?
I will upgrade the system as soon as possible, but I am not sure whether that would help.
 
I use VM backups (virtual disks on Ceph) daily (about 1 TB in total), but I do not see this.

Currently I run the latest 3.3 with the 2.6.32 kernel; Ceph (Giant) is running on 3 Proxmox VE hosts with 12 OSDs (SSD only).
 
Hi,
Hmm, so the only thing I have in mind right now is some kind of network issue with the switches we are currently using.
OK, and what about the Ceph Giant release (I am running Firefly)? I know it is the latest stable Ceph release, but do you recommend it?
If yes, is the right way to upgrade simply to set up the new repository and run apt-get upgrade?
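The kind of repository switch I have in mind would look roughly like this (just a sketch, assuming Ceph was installed from the ceph.com Debian repo in /etc/apt/sources.list.d/ceph.list and uses the sysvinit service names; mon.0 and osd.0 are only example daemon IDs, and I would restart one node at a time):

# sed -i 's/debian-firefly/debian-giant/' /etc/apt/sources.list.d/ceph.list
# apt-get update && apt-get dist-upgrade
# service ceph restart mon.0      # restart monitors first
# service ceph restart osd.0      # then the OSDs, one after another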

Kind regards,
Piotr D
 
