Proxmox 3.0 OpenVZ Snapshot Errors NFS

Voleatech

Mar 22, 2013
Hi,

since I upgraded to Proxmox 3.0, the backups of my OpenVZ machines have been causing huge problems.
The nightly snapshots break the entire server. There also seems to be a problem with backing up to NFS storage.
The backup starts the snapshot but hangs on the first machine. All VMs are unresponsive afterwards and I have to restart the server.

The backup also triggers out-of-memory errors inside the OpenVZ containers while it runs.
Here is the relevant part of the log file:

Jun 2 01:00:01 proxmox /USR/SBIN/CRON[636264]: (root) CMD (vzdump 100 101 102 103 104 105 106 107 108 109 110 111 112 113 115 --quiet 1 --mode snapshot --mailto sven@voleatech.com --compress gzip --storage backup)
Jun 2 01:00:02 proxmox vzdump[636265]: <root@pam> starting task UPID:proxmox:0009B56C:0032F5AE:51AA7CF2:vzdump::root@pam:
Jun 2 01:00:02 proxmox vzdump[636268]: INFO: starting new backup job: vzdump 100 101 102 103 104 105 106 107 108 109 110 111 112 113 115 --quiet 1 --mailto sven@voleatech.com --mode snapshot --compress gzip --storage backup
Jun 2 01:00:07 proxmox pvestatd[3277]: WARNING: command 'df -P -B 1 /media/backup/eku' failed: got timeout (THIS IS THE NFS DRIVE)
Jun 2 01:00:16 proxmox vzdump[636268]: INFO: Starting Backup of VM 100 (openvz)
Jun 2 01:00:16 proxmox pvestatd[3277]: status update time (10.948 seconds)
Jun 2 01:00:16 proxmox kernel: EXT3-fs: barriers disabled
Jun 2 01:00:16 proxmox kernel: kjournald starting. Commit interval 5 seconds
Jun 2 01:00:16 proxmox kernel: EXT3-fs (dm-4): using internal journal
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51004945
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51004703
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51004942
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51128245
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51128110
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 1261584
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 1261583
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 1261582
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 1261581
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 1261580
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 1354024
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51020207
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51020205
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51020202
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51020201
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51020198
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51020181
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51019929
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51019928
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51019927
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 51019882
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 147491
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 147489
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 147488
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 147480
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 147477
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 50889059
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 50889050
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 50889024
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 50889015
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 50888997
Jun 2 01:00:16 proxmox kernel: ext3_orphan_cleanup: deleting unreferenced inode 50995205
Jun 2 01:00:16 proxmox kernel: EXT3-fs (dm-4): 32 orphan inodes deleted
Jun 2 01:00:16 proxmox kernel: EXT3-fs (dm-4): recovery complete
Jun 2 01:00:16 proxmox kernel: EXT3-fs (dm-4): mounted filesystem with ordered data mode

Jun 2 01:05:25 proxmox kernel: INFO: task flush-253:3:1863 blocked for more than 120 seconds.
Jun 2 01:05:25 proxmox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 2 01:05:25 proxmox kernel: flush-253:3 D ffff88042c42f3a0 0 1863 2 0 0x00000000
Jun 2 01:05:25 proxmox kernel: ffff88042c431980 0000000000000046 0000000000000000 ffffffff8140b5ec
Jun 2 01:05:25 proxmox kernel: ffff88042c4319b0 ffffffff811c92f0 ffff88042c431d18 0000000000000282
Jun 2 01:05:25 proxmox kernel: ffff88042c431930 0000000101fb484b ffff88042c431fd8 ffff88042c431fd8
Jun 2 01:05:25 proxmox kernel: Call Trace:

Jun 2 02:47:17 proxmox kernel: ct0 nfs: server NAS not responding, still trying
Jun 2 03:02:43 proxmox kernel: >>> 100 oom generation 0 starts
Jun 2 03:02:43 proxmox kernel: 754037 (smbd) invoked loc oom-killer: gfp 0x200d2 order 0 oomkilladj=0
Jun 2 03:02:43 proxmox kernel: UB-100-Mem-Info:
Jun 2 03:02:43 proxmox kernel: Node 0 DMA prio:41 portion:486 scan:0 a_anon:0 40726492ms i_anon:0 40726492ms a_file:0 40726492ms i_file:0 40726492ms unevictable:0 reclaim_stat: 0 0 0 0
Jun 2 03:02:43 proxmox kernel: Node 0 DMA/shadow prio:41 portion:0 scan:0 a_anon:0 40726492ms i_anon:0 40726492ms a_file:0 40726492ms i_file:0 40726492ms unevictable:0 reclaim_stat: 0 0 0 0
Jun 2 03:02:43 proxmox kernel: Node 0 DMA32 prio:41 portion:104316 scan:0 a_anon:0 40726492ms i_anon:0 40726492ms a_file:0 40726492ms i_file:0 40726492ms unevictable:0 reclaim_stat: 0 0 0 0
Jun 2 03:02:43 proxmox kernel: Node 0 DMA32/shadow prio:41 portion:0 scan:0 a_anon:0 40726492ms i_anon:0 40726492ms a_file:0 40726492ms i_file:0 40726492ms unevictable:0 reclaim_stat: 0 0 0 0
Jun 2 03:02:43 proxmox kernel: Node 0 Normal prio:16 portion:418542 scan:91 a_anon:175408 1882835ms i_anon:91217 622835ms a_file:16 22835ms i_file:12 22835ms unevictable:0 reclaim_stat: 64669 41461 7908 2463
Jun 2 03:02:43 proxmox kernel: Node 0 Normal/shadow prio:41 portion:0 scan:0 a_anon:0 40726492ms i_anon:291902 40726492ms a_file:0 40726492ms i_file:61910 40726492ms unevictable:0 reclaim_stat: 0 0 0 0
Jun 2 03:02:43 proxmox kernel: Node 1 Normal prio:17 portion:525230 scan:46 a_anon:278093 2543437ms i_anon:49914 383437ms a_file:27 23437ms i_file:0 23437ms unevictable:0 reclaim_stat: 49585 33368 6403 1983
Jun 2 03:02:43 proxmox kernel: Node 1 Normal/shadow prio:41 portion:0 scan:0 a_anon:0 40726492ms i_anon:231572 40726492ms a_file:0 40726492ms i_file:27364 40726492ms unevictable:0 reclaim_stat: 0 0 0 0
Jun 2 03:02:43 proxmox kernel: Total 1207435 anon:1118106 file:89329 a_anon:453501 i_anon:664605 a_file:43 i_file:89286 unevictable:0
Jun 2 03:02:43 proxmox kernel: RAM: 1048576 / 1048576 [1] SWAP: 524288 / 524288 [1] KMEM: 1852403712 / 2147483648 [0] DCSZ: 43947787 / 1073741824 [0] OOMG: 911107 / inf [0] Dirty 0 Wback 0 Dche 46475 Prnd 13755
Jun 2 03:02:43 proxmox kernel: Out of memory in UB: Kill process 3668 (named) score -992 or sacrifice child
.....

Any suggestions?

Thanks
Sven
 
Looks like two different problems to me. First, the NFS server is unresponsive; second, you run out of memory inside the CT.
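One thing worth checking (a sketch, not Proxmox-specific advice): how the NFS storage is mounted. With the NFS default of a hard mount, any I/O blocks indefinitely while the server is unreachable, which would match the hung flush/vzdump tasks in the log.

```shell
# Inspect the options the NFS backup storage is mounted with.
# A "hard" mount (the NFS default) makes processes block in
# uninterruptible sleep (state D) when the server stops responding.
grep ' nfs' /proc/mounts || echo "no NFS mounts found"
```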
 
I have been running Proxmox since 1.8 and never had any issues with this setup before, which makes me assume that something went wrong with the 3.0 update.
I am a programmer myself and have some experience with kernel development too.
The NFS unresponsiveness is mysterious, because the mount works just fine when I test it on the shell and when I run snapshot backups manually.
The second "nfs server not responding" error occurs because the DNS server runs inside a VM and Proxmox is blocked completely after the failed backup.
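When testing the mount from the shell, it may help to wrap the same `df` call that pvestatd timed out on in a hard timeout, so the test itself cannot hang on a dead mount (a sketch; the path is the NFS mount from the log above):

```shell
# Run the df call pvestatd uses, but bounded by timeout(1) so a hung
# NFS mount cannot block the shell. Adjust MOUNT to your NFS path.
MOUNT=${MOUNT:-/media/backup/eku}
if timeout 5 df -P -B 1 "$MOUNT" >/dev/null 2>&1; then
    echo "responsive: $MOUNT"
else
    echo "hung or unreachable: $MOUNT"
fi
```

timeout(1) exits with status 124 when the command is killed, so the else branch also catches a mount that merely stalls rather than erroring out.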

The OOM error is a symptom of that as well. The CT has enough memory; it looks to me like the memory management breaks along with the backup error and memory slowly fills up.
None of these errors occurred before 3.0. I am testing the old kernel 2.6.32-19 with the cron job tonight.
It looks to me like a deadlock on the filesystem, but I am not sure.
I will let you know tomorrow whether it was the new kernel.
 
Everything worked fine with the old kernel 2.6.32-19 last night.
All backups/snapshots went through.
I think there is a bug in the new kernel.
 
Hi,

I have the same problem with snapshot backups on kernel 2.6.32-20. Lots of

EXT3-fs error (device dm-4): ext3_find_entry errors,

and some

BUG: soft lockup - CPU#6 stuck for 79s!

In the backup log, lots of:

Read error at byte 0, while reading 7922 bytes: Input/output error...

(but I can read these files perfectly manually). And sometimes the whole machine locks up.

I don't know whether this happened with a previous kernel as well, because I only recently started using snapshot backups.

Falling back to rsync & burp (http://burp.grke.org/) backups for now.

Wessel
 
The kernel change didn't fix it for me; it worked for a few days and then the error came back.
I turned off snapshots and am using suspend backups now, which is a workaround for the time being.
I think it is an LVM problem, the same as in http://forum.proxmox.com/threads/9097-vzdump-using-lvm-snapshot-kills-the-box
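For reference, the suspend workaround can also be made the default for the nightly cron job via /etc/vzdump.conf, instead of passing `--mode suspend` on every invocation (a sketch of the relevant option):

```
# /etc/vzdump.conf -- switch the default backup mode from snapshot
# to suspend so the LVM snapshot path is skipped entirely
mode: suspend
```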

Thanks for this thread.
I think my HP DL385 has a P400 controller also. I'm considering a support contract for this issue. If I get any results I will post them here.

Wessel
 
The Proxmox team advised me the following:

> root@pinsir:~# cat /etc/vzdump.conf


As you use the defaults (1024 MB), I assume the snapshot runs out of space. Increase the snapshot size, e.g. to 8192.


> size: 8192

and

> root@pinsir:~# pveperf
> CPU BOGOMIPS: 52803.18
> REGEX/SECOND: 814149
> HD SIZE: 94.49 GB (/dev/mapper/pve-root)
> BUFFERED READS: 193.68 MB/sec
> AVERAGE SEEK TIME: 6.61 ms
> FSYNCS/SECOND: 25.66


Your FSYNCS/SECOND value is terribly slow (it should be at least 1000); it looks like you do not use the write cache on your controller. You need to set this to write-back, but you need a BBU for that. Ask your HP vendor for details.
This will speed up your server considerably.


I did both: upgraded the server with a BBU (P/N 578882-001 + 587324-001 for my HP DL385 with P410 controller) and increased the vzdump.conf size parameter.

FSYNCS/SECOND: 25.66 is now FSYNCS/SECOND: 2120.50, and the snapshot backup went fine on the instance that had problems (35 GB in size).
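For reference, a rough way to sanity-check whether the default 1024 MB snapshot size is enough: the snapshot has to absorb every block the CT rewrites while the backup runs. The numbers below are assumptions for illustration, not measurements from this thread:

```shell
# Back-of-the-envelope snapshot sizing: sustained write rate inside
# the CT multiplied by the backup duration.
write_rate_mb=2      # assumed MB/s written inside the CT
backup_minutes=40    # assumed duration of the 35 GB backup
needed_mb=$((write_rate_mb * 60 * backup_minutes))
echo "snapshot should be at least ${needed_mb} MB (default: 1024)"
```

With these assumed numbers the snapshot would need roughly 4800 MB, well above the 1024 MB default, which is consistent with the "increase to 8192" advice above.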
 
