Thanks Tom / Dietmar,
I understand you guys haven't been able to reproduce it, but judging by all the related threads it's impacting a lot of users on varied setups. I'll try to detail my particular situation in case it's useful.
I have been running a Proxmox 1.9 cluster on 5 servers, all identical Dell machines; the service tag / hardware inventory can be seen here:
http://www.dell.com/support/troubleshooting/us/en/04/Servicetag/857JSR1 - I was wrong about the age of the machines; they are almost 3 years old now, but still running fine in every respect except for this weird backup issue.
With the release of 3.0 I decided it was time to upgrade the Proxmox versions on those machines, using one of the five as my test deployment platform. I migrated any running VMs off this box to other cluster members and, using an ISO of the 3.0 release, did a clean install. After the install I set up our iSCSI mount, which is where we run all our VMs from (a Dell EqualLogic SAN). Once this was up and running, I moved the VMs back to the box one at a time, doing a vzrestore or qmrestore as needed from the command line until all VMs were back online.
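For reference, the restores were done roughly like this from the command line (the VMIDs, archive names, and storage name below are just placeholders, not my exact values):

    # OpenVZ container from a vzdump archive
    vzrestore /mnt/backups/vzdump-openvz-101.tar.gz 101
    # KVM guest, restored onto the iSCSI-backed storage
    qmrestore /mnt/backups/vzdump-qemu-201.tgz 201 --storage san-lvm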
I created a scheduled backup covering all running VMs (mostly OpenVZ containers and 2-3 KVM VMs). When the scheduled backup runs, it seems to complete 1-3 backups and then just hangs: the load on the server skyrockets to over 1000, VMs stop responding, and they cannot be stopped, nor can the cause of the load be pinpointed. System logs show a lot of non-specific I/O / wait / filesystem messages (several log dumps, mount details, etc. have been posted by me and others in several threads). What's odd is that I can still easily SSH into the server or access the web GUI, but I am unable to do anything with the box until I physically power it down and reboot it.
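For what it's worth, the scheduled job from the GUI ends up as a vzdump line in /etc/pve/vzdump.cron that looks more or less like this (the time, storage name, and options here are illustrative, not copied from my actual config):

    0 1 * * *    root vzdump --all --quiet 1 --mode snapshot --compress lzo --storage backup-store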
At first I assumed it was an NFS issue, so I tried local filesystem backups, with no success. I've tried EXT3 and EXT4, and done the kernel upgrades and scheduler changes you guys have recommended. I have tried backing up only OpenVZ containers (I thought maybe the KVM VMs were the issue), with no luck. I have also attempted a single backup of one OpenVZ container from the command line, only to have it lock the box up solid. All of our VMs are running various versions of CentOS (5 / 6), Ubuntu 12.04, or Ubuntu 12.10.
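In case it helps pin down what I already tried: the scheduler change was the usual runtime switch, and the single-container test was a plain vzdump from the shell (the device name, VMID, and dump directory below are placeholders, not my exact values):

    # switch the I/O scheduler on the disk backing the backup target
    echo deadline > /sys/block/sda/queue/scheduler
    cat /sys/block/sda/queue/scheduler
    # back up a single OpenVZ container to a local directory
    vzdump 101 --mode suspend --compress lzo --dumpdir /var/lib/vz/dump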
Just a couple of months ago I purchased six new Dell R620 servers with 32GB of RAM, local RAID storage, etc. I built an entirely new Proxmox 3.0 cluster with these servers; each server is identical and runs only 4-6 OpenVZ containers (Ubuntu 12.10). I set up recurring backups on these as well, and they seemed to run fine for the first month or so I had them in production. Recently, the same problem I described above started happening on two of these new servers. The system inventory / service tag can be viewed here:
http://www.dell.com/support/troubleshooting/us/en/04/Servicetag/BBM6FX1
There isn't much more detail than that; step by step, it was a simple clean install and migration, or a clean install and new VMs. But I am happy to provide as much specific information as I can if it makes clearing this up any easier. If remote access to an affected system would be helpful, I can provide that as well, keeping in mind that these are production servers, so any testing that might force me to reboot the box after a failure is done at night, when it impacts our users less.
Have a great day and let me know if I can be more helpful.
Cheers,
Joe Jenkins