Is Proxmox 1.9 more reliable than Proxmox 2.3?

Cayuga

Last night we saw about 14 out of 80+ machines hang. Most had the "unable to connect to VM XXX socket - timeout after 31 retries" problem when we went to see what was wrong with them, but a few had the "hda: lost interrupt" problem (when VMs crash here, it's almost always one of these two errors).

These are all systems that ran for months at a time under Proxmox 1.9, yet in three months on Proxmox 2.x (most recently 2.3) none of them has managed to run for a whole month. We've tried both local and shared storage without any noticeable difference.

Are we doing something wrong? Is Proxmox 1.9 more stable than Proxmox 2.3?

Where should I be looking to track this down?

Thanks!
 
1.9 is EOL already. If you have any issue with the latest 2.x, you need to analyze it.

If you can't find the cause, our support team can assist (if you have a valid subscription).
 
A couple of questions come to mind. Are these machines running hardware or software RAID? What type of drives, SAS or just SATA? The hda disconnection points to a problem with the drive, the drive controller, the driver, or an interaction of all of the above, possibly compounded by memory issues. If you are losing hda, it will not make any difference whether you are on shared storage, since the local OS and configuration information is stored on the local machine.

Since it was a subset, were the ones that went down KVM or CT, and which ones showed the hda problem? There could be a marginal sector developing on a drive.

Depending on the make and vendor: I know HP has been issuing lots of firmware updates for RAID controllers and various SAS and SATA drives, to keep heads from parking or going places you do not want them to go and causing bad things to happen. Last month HP released firmware updates for a very large portion of its server drives, both Seagate and WD.

Is all firmware up to date for the BIOS, RAID controllers and drives, along with the drivers?

Run some good disk diagnostic utilities against the drives and run memtest to rule out any potential memory issues. I know there have been threads where people had trouble with 2.3, for example with backups, and it all came down to the memory in the machine not being up to the job.
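For reference, a couple of quick checks along those lines (assuming smartmontools is installed on the node; /dev/sda is just an example device, and behind a hardware RAID controller you may need the controller's own tools or smartctl's -d option instead):

# SMART health, attributes and error log for one drive
smartctl -a /dev/sda
# kick off a long self-test in the background, check results later with -a
smartctl -t long /dev/sda

For memory, booting the node into memtest86+ for a few passes during a maintenance window is the usual approach (on Debian-based installs it shows up as a GRUB menu entry if the package is present).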
 
Thanks for the suggestions. The hardware is a mixture of IBM BladeCenter blades and Dell blade-server blades. Local storage is all SAS, mostly individual drives but some RAID 1. Network storage is a Ceph cluster. All of the machines that died last night were KVM. Thinking back over the problems we've had in the last 6-8 weeks, I don't remember any CT hangs, but I'll keep a close eye on that going forward.

FYI - this is a six-node PVE cluster and we saw failures on at least four nodes. My experience has been that we'll be fine for 4-10 days, then have a bad night (like last night), and then be fine again for days at a time. While we have had the occasional hda and/or socket problem during the day, we've never had a bunch of them during the day, only at night, so it might be related somehow to backups, as those run at night.

Thanks again, and I'll report back as I learn more.
 
What type of switch do you have? Single or multiple interfaces from the host node to the switch? LACP or something else? What do the switch counters indicate for errors or other packet problems on those interfaces? All Cat 5e or Cat 6 cable? Is the switch fully patched? There was a thread about someone having switch issues with network-attached storage.
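The error counters are easy to check from the node side as well (eth0 below is just an example interface name):

# per-interface RX/TX error and drop counters
ip -s link show eth0
# NIC/driver-level counters, where the driver exposes them
ethtool -S eth0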

I don't know how many hosts you have to back up, but what I have been doing lately is staggering the backups. I run a manual snapshot backup to see how long it takes, then offset the start time and back up the next one; it would be nice if the scheduler allowed 5-minute increments, but this works. The thinking is reduced disk I/O overhead: if a job finishes quickly, there is plenty of time for it to complete, or nearly complete, before the next one starts. I learned with Backup Exec that you could run multiple backups, but running them concurrently seemed to create disk I/O problems and really bog the network down. You can even take four small CTs, get them done in a minute or two, and then give more time to the larger KVMs. I run only CTs at this time, and with the new backup strategy the downtime is really not noticeable. How Microsoft SQL or an AD controller would behave under that load, I don't know.
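As a rough illustration of the staggering (the VMIDs, times and storage name below are made up; on PVE 2.x the GUI-defined jobs end up in /etc/cron.d/vzdump in roughly this form):

# /etc/cron.d/vzdump - one backup job per line, started an hour apart
0 1 * * * root vzdump 101 102 --quiet 1 --mode snapshot --compress lzo --storage backup-store
0 2 * * * root vzdump 201 202 --quiet 1 --mode snapshot --compress lzo --storage backup-store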
 
We're having this problem - it only affects CentOS 5.x KVM VPSs, all of which were set up with IDE drives. It does not affect CentOS 6, Ubuntu or Debian, all of which use virtio, and I think this may be a clue. It hits the CentOS 5 KVM guest that's being backed up. Hardware is Dell R710 and R720 servers, some RAID 10, some RAID 6. All image storage is local, no NFS or iSCSI for images. Sometimes the KVM guest recovers; sometimes it hangs, and when that happens it's messy: we have to stop the backup, then stop and restart the KVM guest (we can't restart it while the backup is running because the guest is locked, so even if we manage to kill an unresponsive guest during its backup, we still have to stop the backup before we can restart it). The KVM guest image format is qcow2 for these guests.

The backup logs look normal - no errors noted. The problems are happening inside the CentOS 5 KVM guests.

I've been able to reduce the frequency of the problem a bit by editing /etc/vzdump.conf and adding bwlimit: 30000 to throttle the disk I/O. This makes the backups take noticeably longer, but fewer guests now hang badly enough that we have to intervene manually.
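For reference, that's a one-line change (the value is in KB/s, so 30000 is roughly 30 MB/s; tune to taste):

# /etc/vzdump.conf
bwlimit: 30000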

I don't know whether this is a CentOS 5 problem or an IDE driver problem, or a combination, but it only affects our CentOS 5 KVM guests, which all happened to be set up using IDE, and it only affects them while they are being backed up. The CentOS 5 guests don't get the "hda: lost interrupt" problem while different guests on the same physical machine are being backed up, and purposely creating heavy disk I/O on the physical servers doesn't seem to trigger it.
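By "purposely creating heavy disk I/O" I mean something along these lines, run on the host and in other guests (purely an example; the path and sizes are arbitrary):

# sequential write load that bypasses the page cache
dd if=/dev/zero of=/root/ioload.tmp bs=1M count=4096 oflag=direct
rm /root/ioload.tmp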

When one of these CentOS 5 guests is being backed up, we notice two things in the guest itself: we start seeing errors that look like:

hda: irq timeout: status=0xd0 { Busy }
ide: failed opcode was: unknown
ide0: reset: success
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt
hda: task_out_intr: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown

and disk I/O gets very slow, often only 1-2 MB/s read speed according to hdparm. I/O speed on the other KVM guests on the same physical server is normal at this time, and when the backup of the CentOS 5 guest showing the "lost interrupt" problem finishes, its I/O goes back to normal.
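For context, that read figure comes from a simple buffered-read timing inside the guest (/dev/hda being the guest's IDE disk):

hdparm -t /dev/hda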

I love the new backup mechanism, but we're losing a lot of sleep tending CentOS 5 KVM guests that hang in the middle of the night during their backups, and for us Proxmox 2.3 is definitely less reliable than previous versions because of this.
 
After a weekend chained to the computer, I think I've worked this out. We converted both the test and production clusters to Proxmox 2.3, and since doing that, on the production cluster, some KVM guests have been hanging in the middle of the night - not always, maybe 2-3 times a week. The common element: the troubled guests are all CentOS KVM guests that use the IDE driver. I've looked at every guest, and while any guest (Debian/Ubuntu, CentOS or RHEL in our environment) that uses IDE for its disks is likely to report timeouts during backups under 2.3, the CentOS/RHEL ones are the worst. Every one of them showed "hda: lost interrupt" during backups under 2.3, and the busiest (mail servers, mainly) hung when doing heavy I/O while being backed up, causing much middle-of-the-night unhappiness. This didn't happen prior to 2.3, but it has been chronic since.

Found this writeup on how to convert CentOS 5 guests from IDE to virtio: http://blog.rackcorp.com/?p=213. I've applied it to the CentOS 5 guests that use IDE, run backups, and so far no more problems. IDE = bad, virtio = good.
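For anyone heading down the same path, the gist is roughly this (a sketch from memory rather than the exact writeup; VMID 101 and the volume name are made up, and take a backup first):

# inside the CentOS 5 guest: make sure the initrd includes the virtio drivers
mkinitrd --with=virtio_pci --with=virtio_blk -f /boot/initrd-$(uname -r).img $(uname -r)

# on the Proxmox node that runs the VM: switch the disk bus in the VM config
# (/etc/pve/qemu-server/101.conf), or do the equivalent via the GUI
#   before:  ide0: local:101/vm-101-disk-1.qcow2      bootdisk: ide0
#   after:   virtio0: local:101/vm-101-disk-1.qcow2   bootdisk: virtio0

Then power the guest fully off and start it again; a reboot inside the guest doesn't pick up the bus change. If the guest's fstab or grub.conf references /dev/hdaX directly rather than labels or LVM, those entries need to be changed to /dev/vdaX as well.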

We have some Windows Server 2008 guests that are using IDE, but I don't think I'm going to mess with them - they'd be faster with virtio, but Windows seems tolerant of the I/O slowness and isn't complaining as far as I can tell. For Linux, the Debian/Ubuntu and CentOS 6.x KVM guests here all use virtio, and they're happy and don't hang during backups.
 
Thanks for the information. We have a mixture of CentOS, BSD and Solaris VMs using IDE. I've changed the CentOS machines to use virtio and we'll see if that helps. We're also upgrading to the new kernel and KVM, and with luck, between the two, things will improve.
 
I'm happy to report that this made a GIANT difference here. Not only is disk I/O much better on the CentOS 5 VPSs I converted from IDE to virtio, but backups work as expected with no complaints from CentOS about IDE timeouts. Big relief - this was killing us. We have some old FreeBSD KVM guests that use IDE, but they aren't complaining, so I've left those alone. Same with Windows Server 2008. In my setup it was only the CentOS 5 KVM guests with IDE that were choking during backups (and otherwise being slow). Last night's backups ran perfectly, with no throttling and no complaints or signs of trouble from the now-virtio CentOS 5 KVM guests.
 
