[SOLVED] VM Backup hangs and affects GUI (sporadic)

DJSnoopy

New Member
Aug 8, 2011
9
0
1
Hi Proxmox team.

Since update to 3.4 we got problems with our backup.
Sounds like the problem mentioned here http://forum.proxmox.com/threads/20541-vm-hang-up-on-backup-and-sporadic

System runs in a cluster of three nodes connected to a NFS storage server for backups.
The last 3 days the backups went fine and this night they got stucked again.

We´ve already splitted up the backups of everey server to different times to test but same error occurs.

Code:
# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-37-pve: 2.6.32-147
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

/var/log/syslog
Code:
Mar  6 03:00:01 server14 /USR/SBIN/CRON[553131]: (root) CMD (vzdump --quiet 1 --mailnotification always --mode snapshot --mailto xxx@xxx.xx --all 1 --node server14 --compress lzo --storage server16)
Mar  6 03:00:01 server14 vzdump[553132]: <root@pam> starting task UPID:server14:000870AE:019E89F8:54F90A21:vzdump::root@pam:
Mar  6 03:00:01 server14 vzdump[553134]: INFO: starting new backup job: vzdump --mailnotification always --quiet 1 --mailto xxx@xxx.xx --mode snapshot --node server14 --all 1 --compress lzo --storage server16
Mar  6 03:00:01 server14 vzdump[553134]: INFO: Starting Backup of VM 101 (qemu)
Mar  6 03:00:02 server14 qm[553139]: <root@pam> update VM 101: -lock backup
Mar  6 03:17:01 server14 /USR/SBIN/CRON[555185]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Mar  6 04:00:01 server14 /USR/SBIN/CRON[560388]: (root) CMD (vzdump --quiet 1 --mailnotification always --mode snapshot --mailto xxx@xxx.xx --all 1 --node server15 --compress lzo --storage server16)
Mar  6 04:17:01 server14 /USR/SBIN/CRON[562442]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Mar  6 05:17:01 server14 /USR/SBIN/CRON[569665]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Mar  6 06:17:02 server14 /USR/SBIN/CRON[576945]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Mar  6 06:25:01 server14 /USR/SBIN/CRON[577910]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
Mar  6 06:25:02 server14 pveproxy[577967]: send HUP to 3065
Mar  6 06:25:02 server14 pveproxy[3065]: received signal HUP
Mar  6 06:25:02 server14 pveproxy[3065]: server closing
Mar  6 06:25:02 server14 pveproxy[3065]: server shutdown (restart)
Mar  6 06:25:02 server14 pveproxy[403360]: worker exit
Mar  6 06:25:02 server14 pveproxy[403359]: worker exit
Mar  6 06:25:02 server14 pveproxy[403358]: worker exit
Mar  6 06:25:02 server14 pvepw-logger[403316]: received terminate request (signal)
Mar  6 06:25:02 server14 pvepw-logger[403316]: stopping pvefw logger
Mar  6 06:25:02 server14 pvepw-logger[577992]: starting pvefw logger
Mar  6 06:25:02 server14 rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2376" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Mar  6 06:25:02 server14 pveproxy[3065]: restarting server
Mar  6 06:25:02 server14 pveproxy[3065]: worker 403359 finished
Mar  6 06:25:02 server14 pveproxy[3065]: worker 403358 finished
Mar  6 06:25:02 server14 pveproxy[3065]: worker 403360 finished
Mar  6 06:25:02 server14 pveproxy[3065]: starting 3 worker(s)
Mar  6 06:25:02 server14 pveproxy[3065]: worker 578046 started
Mar  6 06:25:02 server14 pveproxy[3065]: worker 578047 started
Mar  6 06:25:02 server14 pveproxy[3065]: worker 578048 started

Output for one node in GUI is animated as running, but status detail: stopped: OK
Code:
INFO: starting new backup job: vzdump --quiet 1 --mailnotification always --mailto xxx@xxx.xx --mode snapshot --all 1 --node server13 --compress lzo --storage server16
INFO: Starting Backup of VM 112 (openvz)
INFO: CTID 112 exist mounted running
INFO: status = running
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating lvm snapshot of /dev/mapper/pve-data ('/dev/pve/vzsnap-server13-0')
INFO: Logical volume "vzsnap-server13-0" created
INFO: creating archive '/mnt/pve/server16/dump/vzdump-openvz-112-2015_03_06-02_00_02.tar.lzo'
INFO: Total bytes written: 6734008320 (6.3GiB, 19MiB/s)
INFO: archive file size: 3.53GB
INFO: delete old backup '/mnt/pve/server16/dump/vzdump-openvz-112-2015_02_26-02_00_02.tar.lzo'
INFO: Finished Backup of VM 112 (00:05:53)
INFO: Starting Backup of VM 105 (qemu)
INFO: status = running
INFO: update VM 105: -lock backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/server16/dump/vzdump-qemu-105-2015_03_06-02_05_55.vma.lzo'
INFO: started backup task '298e2f02-d9d1-4314-98fe-d5ebc81a0f2e'
INFO: status: 0% (116391936/107374182400), sparse 0% (10571776), duration 3, 38/35 MB/s
INFO: status: 1% (1092878336/107374182400), sparse 0% (237723648), duration 24, 46/35 MB/s
INFO: status: 2% (2160721920/107374182400), sparse 0% (270192640), duration 58, 31/30 MB/s
INFO: status: 3% (3231907840/107374182400), sparse 0% (346353664), duration 85, 39/36 MB/s
INFO: status: 4% (4297195520/107374182400), sparse 0% (354697216), duration 114, 36/36 MB/s
INFO: status: 5% (5382275072/107374182400), sparse 0% (419848192), duration 143, 37/35 MB/s
INFO: status: 6% (6477053952/107374182400), sparse 0% (538894336), duration 173, 36/32 MB/s
INFO: status: 7% (7524777984/107374182400), sparse 0% (685649920), duration 201, 37/32 MB/s
INFO: status: 8% (8596357120/107374182400), sparse 0% (764256256), duration 229, 38/35 MB/s
INFO: status: 9% (9666166784/107374182400), sparse 0% (804245504), duration 258, 36/35 MB/s
INFO: status: 10% (10759831552/107374182400), sparse 0% (805789696), duration 290, 34/34 MB/s
INFO: status: 11% (11821056000/107374182400), sparse 0% (871227392), duration 316, 40/38 MB/s
INFO: status: 12% (12903120896/107374182400), sparse 0% (878071808), duration 348, 33/33 MB/s
INFO: status: 13% (13995540480/107374182400), sparse 0% (924262400), duration 380, 34/32 MB/s
INFO: status: 14% (15071576064/107374182400), sparse 0% (935993344), duration 408, 38/38 MB/s
INFO: status: 15% (16172580864/107374182400), sparse 1% (1173659648), duration 431, 47/37 MB/s
INFO: status: 16% (17181245440/107374182400), sparse 1% (1324769280), duration 453, 45/38 MB/s
INFO: status: 17% (18284937216/107374182400), sparse 1% (1402331136), duration 483, 36/34 MB/s
INFO: status: 18% (19339935744/107374182400), sparse 1% (1403027456), duration 513, 35/35 MB/s
INFO: status: 19% (20401291264/107374182400), sparse 1% (1438769152), duration 543, 35/34 MB/s
INFO: status: 20% (21510356992/107374182400), sparse 1% (1438769152), duration 574, 35/35 MB/s
INFO: status: 21% (22584754176/107374182400), sparse 1% (1501061120), duration 605, 34/32 MB/s
INFO: status: 22% (23623958528/107374182400), sparse 1% (1504280576), duration 633, 37/36 MB/s
INFO: status: 23% (24712052736/107374182400), sparse 1% (1548492800), duration 662, 37/35 MB/s
INFO: status: 24% (25782714368/107374182400), sparse 1% (1550688256), duration 692, 35/35 MB/s
INFO: status: 25% (26872381440/107374182400), sparse 1% (1610268672), duration 717, 43/41 MB/s
INFO: status: 26% (27944550400/107374182400), sparse 1% (1611427840), duration 746, 36/36 MB/s
INFO: status: 27% (29002563584/107374182400), sparse 1% (1656954880), duration 778, 33/31 MB/s
INFO: status: 28% (30067064832/107374182400), sparse 1% (1665400832), duration 809, 34/34 MB/s
INFO: status: 29% (31165120512/107374182400), sparse 1% (1780113408), duration 839, 36/32 MB/s
INFO: status: 30% (32248430592/107374182400), sparse 1% (1988784128), duration 864, 43/34 MB/s
INFO: status: 31% (33316143104/107374182400), sparse 1% (2125041664), duration 893, 36/32 MB/s
INFO: status: 32% (34388180992/107374182400), sparse 2% (2246934528), duration 918, 42/38 MB/s
INFO: status: 33% (35460349952/107374182400), sparse 2% (2394128384), duration 944, 41/35 MB/s
INFO: status: 34% (36523802624/107374182400), sparse 2% (2542723072), duration 969, 42/36 MB/s
INFO: status: 35% (37591384064/107374182400), sparse 2% (2680578048), duration 994, 42/37 MB/s
INFO: status: 36% (38699008000/107374182400), sparse 2% (2798108672), duration 1021, 41/36 MB/s
INFO: status: 37% (39740375040/107374182400), sparse 2% (2952597504), duration 1044, 45/38 MB/s
INFO: status: 38% (40808939520/107374182400), sparse 2% (2954043392), duration 1076, 33/33 MB/s
INFO: status: 39% (41910009856/107374182400), sparse 2% (2996973568), duration 1107, 35/34 MB/s
INFO: status: 40% (42960551936/107374182400), sparse 2% (2998431744), duration 1141, 30/30 MB/s
INFO: status: 41% (44028002304/107374182400), sparse 2% (3042590720), duration 1172, 34/33 MB/s
INFO: status: 42% (45116948480/107374182400), sparse 2% (3100221440), duration 1203, 35/33 MB/s
INFO: status: 43% (46197047296/107374182400), sparse 3% (3669098496), duration 1220, 63/30 MB/s
INFO: status: 44% (47262793728/107374182400), sparse 3% (3772997632), duration 1246, 40/36 MB/s
INFO: status: 45% (48338567168/107374182400), sparse 3% (3812941824), duration 1275, 37/35 MB/s
INFO: status: 46% (49403854848/107374182400), sparse 3% (3812941824), duration 1307, 33/33 MB/s
INFO: status: 47% (50476285952/107374182400), sparse 3% (3850866688), duration 1333, 41/39 MB/s
INFO: status: 48% (51563724800/107374182400), sparse 3% (3860537344), duration 1363, 36/35 MB/s
INFO: status: 49% (52629864448/107374182400), sparse 3% (3902955520), duration 1391, 38/36 MB/s
INFO: status: 50% (53688270848/107374182400), sparse 3% (3907231744), duration 1416, 42/42 MB/s
INFO: status: 51% (54772760576/107374182400), sparse 3% (3948109824), duration 1447, 34/33 MB/s
INFO: status: 52% (55843487744/107374182400), sparse 3% (3948109824), duration 1476, 36/36 MB/s
INFO: status: 53% (56917753856/107374182400), sparse 3% (4082229248), duration 1501, 42/37 MB/s
INFO: status: 54% (57983303680/107374182400), sparse 3% (4082229248), duration 1530, 36/36 MB/s
INFO: status: 55% (59089092608/107374182400), sparse 3% (4212383744), duration 1560, 36/32 MB/s
INFO: status: 56% (60131180544/107374182400), sparse 3% (4219883520), duration 1587, 38/38 MB/s
INFO: status: 57% (61208788992/107374182400), sparse 4% (4341141504), duration 1613, 41/36 MB/s
INFO: status: 58% (62308417536/107374182400), sparse 4% (4358160384), duration 1642, 37/37 MB/s
INFO: status: 59% (63376261120/107374182400), sparse 4% (4394188800), duration 1669, 39/38 MB/s
INFO: status: 60% (64428441600/107374182400), sparse 4% (4398039040), duration 1693, 43/43 MB/s
INFO: status: 61% (65512013824/107374182400), sparse 4% (4485931008), duration 1722, 37/34 MB/s
INFO: status: 62% (66576646144/107374182400), sparse 4% (4494209024), duration 1749, 39/39 MB/s
INFO: status: 63% (67731587072/107374182400), sparse 4% (5028589568), duration 1766, 67/36 MB/s
INFO: status: 64% (68848844800/107374182400), sparse 5% (6102593536), duration 1769, 372/14 MB/s
INFO: status: 65% (69819105280/107374182400), sparse 5% (6157160448), duration 1794, 38/36 MB/s
INFO: status: 66% (70868598784/107374182400), sparse 5% (6162382848), duration 1820, 40/40 MB/s
INFO: status: 67% (71962198016/107374182400), sparse 5% (6193143808), duration 1847, 40/39 MB/s
INFO: status: 68% (73018769408/107374182400), sparse 5% (6213013504), duration 1873, 40/39 MB/s
INFO: status: 69% (74089365504/107374182400), sparse 5% (6340018176), duration 1897, 44/39 MB/s
INFO: status: 70% (75198758912/107374182400), sparse 6% (6556942336), duration 1923, 42/34 MB/s
INFO: status: 71% (76253626368/107374182400), sparse 6% (6660853760), duration 1950, 39/35 MB/s
INFO: status: 72% (77340213248/107374182400), sparse 6% (6662164480), duration 1981, 35/35 MB/s
INFO: status: 73% (78401241088/107374182400), sparse 6% (6789074944), duration 2007, 40/35 MB/s
INFO: status: 74% (79465086976/107374182400), sparse 6% (6789074944), duration 2034, 39/39 MB/s
INFO: status: 75% (80534896640/107374182400), sparse 6% (6827339776), duration 2061, 39/38 MB/s
INFO: status: 76% (81643307008/107374182400), sparse 6% (6830104576), duration 2090, 38/38 MB/s
INFO: status: 77% (82703876096/107374182400), sparse 6% (6923538432), duration 2116, 40/37 MB/s
INFO: status: 78% (83761889280/107374182400), sparse 6% (6927667200), duration 2145, 36/36 MB/s
INFO: status: 79% (84857978880/107374182400), sparse 6% (7061225472), duration 2170, 43/38 MB/s
INFO: status: 80% (85912190976/107374182400), sparse 6% (7067545600), duration 2197, 39/38 MB/s
INFO: status: 81% (86978461696/107374182400), sparse 6% (7103467520), duration 2225, 38/36 MB/s
INFO: status: 82% (88081825792/107374182400), sparse 6% (7121428480), duration 2253, 39/38 MB/s
INFO: status: 83% (89150390272/107374182400), sparse 6% (7165394944), duration 2277, 44/42 MB/s
INFO: status: 84% (90196017152/107374182400), sparse 6% (7179550720), duration 2302, 41/41 MB/s
INFO: status: 85% (91298332672/107374182400), sparse 6% (7223967744), duration 2333, 35/34 MB/s
INFO: status: 86% (92377841664/107374182400), sparse 6% (7223967744), duration 2357, 44/44 MB/s
INFO: status: 87% (93455515648/107374182400), sparse 6% (7362150400), duration 2379, 48/42 MB/s
INFO: status: 88% (94513725440/107374182400), sparse 6% (7370977280), duration 2404, 42/41 MB/s
INFO: status: 89% (95582158848/107374182400), sparse 6% (7410032640), duration 2431, 39/38 MB/s
INFO: status: 90% (96639909888/107374182400), sparse 6% (7410089984), duration 2454, 45/45 MB/s
INFO: status: 91% (97733378048/107374182400), sparse 6% (7438331904), duration 2480, 42/40 MB/s
INFO: status: 92% (98793357312/107374182400), sparse 6% (7443714048), duration 2504, 44/43 MB/s
INFO: status: 93% (99864084480/107374182400), sparse 6% (7451799552), duration 2530, 41/40 MB/s
INFO: status: 94% (100968759296/107374182400), sparse 6% (7452610560), duration 2556, 42/42 MB/s
INFO: status: 95% (102025527296/107374182400), sparse 6% (7452778496), duration 2579, 45/45 MB/s
INFO: status: 96% (103125483520/107374182400), sparse 6% (7453405184), duration 2603, 45/45 MB/s
INFO: status: 97% (104173731840/107374182400), sparse 6% (7454015488), duration 2625, 47/47 MB/s
INFO: status: 98% (105249243136/107374182400), sparse 6% (7454547968), duration 2648, 46/46 MB/s
INFO: status: 99% (106349133824/107374182400), sparse 6% (7455825920), duration 2672, 45/45 MB/s
INFO: status: 100% (107374182400/107374182400), sparse 6% (7457972224), duration 2698, 39/39 MB/s
INFO: transferred 107374 MB in 2698 seconds (39 MB/s)
INFO: archive file size: 76.24GB
INFO: delete old backup '/mnt/pve/server16/dump/vzdump-qemu-105-2015_02_26-02_06_36.vma.lzo'
INFO: Finished Backup of VM 105 (00:45:11)
INFO: Backup job finished successfully
TASK OK


Backup of other node animated as running BUT detail status still marked as running:
Code:
INFO: starting new backup job: vzdump --mailnotification always --quiet 1 --mailto xxx@xxx.xx --mode snapshot --node server14 --all 1 --compress lzo --storage server16
INFO: Starting Backup of VM 101 (qemu)
INFO: status = stopped

So the backup hang, task is not marked as finished and the GUI losses informations like VM name etc.
Only way to get rid of this status to rebooot the server. But this is no permanent solution for us.
So somethin here produces a lock and all other application parts get stucked.
Local backup can be used for testing, but again a non permanent solution.

If you need any further informations just give me a hint.

kind regards
Michael
 
Last edited:
Hi spirit.

The stroage server got an Dual-Core AMD Opteron and 4 x HDDs with Adaptec Raid Controller. The backups are timed to get only one concurrent writing backup.
I´ve updated the kernel and rebooted the storage server. I´ll have a look the next days due the hang is only sprodaic.

Thanks again and i´ll update this thread with the results the next days.

kind regards
Michael
 
Hi spirit.

After a kernel upgrade and a reboot of the storage server all works fine again since my last post.
Thanks again for your attention.

kind regards
Michael