Serious vzdump IO problem

gkovacs

We have two PVE servers, P1 and P2, running OpenVZ only. P1 hosts a lot of Apache-type containers, while P2 has two MySQL containers and a few small Apache containers. Both are set up identically (Core2 Quad CPU, 8GB RAM, Adaptec SATA RAID, PVE 1.9, kernel 2.6.32-4). P1 starts the daily vzdump backup routine at 10pm, P2 at 2am; both back up to an NFS share hosted on a third server.

We have been told that during these backups the sites slow down and even time out for a couple of minutes, and sometimes they can't reach the MySQL server. After some investigation we found that right after the snapshot backup starts at 10pm, the sites on P1 become so unresponsive that they use up all the available connections on the MySQL servers on P2.

PVE reports an IO delay of around 30% while the vzdump backups are running, but the load on the containers is between 40 and 160. Network performance is sluggish as well; ssh logins take minutes. Basically, vzdump kills the entire IO subsystem of these servers.
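
To cross-check the IO delay figure from the host itself, the iowait share can be sampled from /proc/stat around the time the backup starts. This is only a rough sketch: the helper names are made up for illustration, and it is an assumption that the result is comparable to what the PVE interface displays.

```python
import time


def read_cpu_times(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    # Only the first eight counters (user .. steal) are summed, because
    # guest time is already included in the user counter.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Take two samples 'interval' seconds apart and return the iowait share."""
    with open(path) as f:
        before = read_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = read_cpu_times(f.read())
    return iowait_percent(before, after)
```

Calling sample_iowait() in a loop from shortly before 10pm and logging the values would show whether the spike coincides with the start of the backup.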

We have already changed the default IO scheduler from CFQ to deadline, because:
- we read that CFQ is not aware of the IO queue the Adaptec RAID uses so it's unnecessary
- servers were completely unusable with CFQ during backups
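
For reference, the active scheduler of a block device can be read from and written to sysfs at runtime. The sketch below assumes the RAID volume shows up as sda (an assumption, the device name may differ) and uses hypothetical helper names. A change made this way needs root and does not survive a reboot.

```python
def parse_schedulers(text):
    """Split e.g. 'noop deadline [cfq]' into (active, available)."""
    active = None
    available = []
    for name in text.split():
        # The kernel marks the active scheduler with square brackets.
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device="sda"):
    """Return (active, available) schedulers for the given block device."""
    path = "/sys/block/%s/queue/scheduler" % device
    with open(path) as f:
        return parse_schedulers(f.read())


def set_scheduler(device, name):
    """Switch the scheduler at runtime (root only, not persistent)."""
    active, available = read_scheduler(device)
    if name not in available:
        raise ValueError("%s is not offered by this kernel: %s" % (name, available))
    path = "/sys/block/%s/queue/scheduler" % device
    with open(path, "w") as f:
        f.write(name)
```

For example, read_scheduler("sda") shows which scheduler is currently in effect, and set_scheduler("sda", "deadline") switches it without a reboot.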

Any ideas how to run vzdump so it doesn't kill the IO of the server?
 
The complete hang of the server happens in the first 5 minutes of a 3-hour container backup, well before the find and tar commands are executed, so I don't think limiting bandwidth would help (bwlimit is already set to 32768). We suspect it happens during the snapshot creation phase, but unfortunately we can't see exactly what vzdump is doing.
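
One rough way to get some visibility during those first minutes is to list the processes stuck in uninterruptible sleep (state D), since these are typically the ones waiting on IO and they count toward the load average. The sketch below only reads /proc on the host. It does not reproduce anything vzdump does internally, and the helper names are made up for illustration.

```python
import os


def parse_stat(stat_text):
    """Extract (pid, comm, state) from the content of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were looking
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Running blocked_processes() every few seconds right after the backup starts would show which commands are blocked while the server is unresponsive.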

Additionally, we have switched from the deadline scheduler to noop, and while it did not solve the above problem, it improved performance somewhat. Vzdump is a couple of percent faster, and the server is much more usable during backups. It looks like the Adaptec RAID processor is significantly more effective at managing reads and writes on a multi-disk array than the kernel, which only sees a single block device.
 
Hi,
perhaps your IO system is not fast enough to handle the snapshot creation smoothly?
What kind of RAID do you use (RAID level, disks)? Do you have different RAID volumes? Do you start simultaneous backups from different volume groups?
Are all PVs of a volume group on the same (fast) RAID volume?

Udo