So cman does not stop during normal operation if quorum or the connection is lost (that is news to me).
The forum lost my last two posts... so forgive me if one of them shows up again and this turns out to be a duplicate.
Here's what it looked like yesterday:
Code:
root@bcvm2:~# clustat
Could not connect to CMAN: Connection refused
root@bcvm2:~# /etc/init.d/cman status
Found stale pid file
The first node that dropped out had this in its log:
Code:
# gunzip -c /var/log/cluster/corosync.log.1.gz | less
[...]
Aug 29 19:19:32 corosync [TOTEM ] FAILED TO RECEIVE
[...]
So I believe that some random packet got lost, the node's cluster communication failed, and then rather than retrying, cman either crashed or shut down intentionally. Based on your post, I guess it wasn't intentional.
And here is what happened when I tried to restart cman on another node (to regain quorum for the first node, which was still connected):
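In case it helps anyone else hitting this, here is how I'd scan the current and rotated corosync logs for that same TOTEM marker (a sketch; the log path matches my paste above, and the patterns are just the ones I'd look for):

```shell
# Sketch: scan current + rotated (gzipped) corosync logs for TOTEM
# trouble markers. LOGDIR and the patterns are my assumptions; adjust.
LOGDIR=${LOGDIR:-/var/log/cluster}
for f in "$LOGDIR"/corosync.log*; do
  [ -e "$f" ] || continue
  case "$f" in
    *.gz) zgrep -E -H 'FAILED TO RECEIVE|Retransmit' "$f" ;;
    *)    grep  -E -H 'FAILED TO RECEIVE|Retransmit' "$f" ;;
  esac
done
```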
Code:
root@bcvm2:~# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Starting qdiskd... [ OK ]
Waiting for quorum... Timed-out waiting for cluster
[FAILED]
root@bcvm2:~# clustat
Cluster Status for bcproxmox1 @ Thu Aug 30 10:27:40 2012
Member Status: Inquorate
 Member Name                        ID   Status
 ------ ----                        ---- ------
 bcvm2                                 1 Online, Local
 bcvm3                                 2 Offline
 bcvm1                                 3 Offline
 /dev/loop1                            0 Offline, Quorum Disk
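For completeness, here is one way I understand a single surviving node can be made quorate again (a sketch from my reading of the cman docs, not something I ran during this incident; the guard makes it a no-op on machines without cman):

```shell
# Sketch: on the one node still up, lower the expected vote count so it
# can become quorate alone. Vote numbers depend on your cluster.conf.
if command -v cman_tool >/dev/null 2>&1; then
  cman_tool expected -e 1   # tell cman to expect only 1 vote
  clustat                   # Member Status should now read "Quorate"
fi
```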
(And I'm guessing you'll have something to say about using a loop device over NFS as the qdisk, but it seems free of any side effects, and I can't use iSCSI without adding a new server. Besides, the first time I had this exact same problem, there was no qdisk or loop device at all.)
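For reference, a loop-backed qdisk like mine gets built roughly like this (a sketch; the file path and label below are placeholders, not my actual ones, and in real use the backing file sits on the NFS mount):

```shell
# Sketch: create a backing file, attach it as a loop device, and write
# the quorum-disk label. Path and label here are placeholders.
QDISK_IMG=${QDISK_IMG:-/tmp/qdisk.img}
dd if=/dev/zero of="$QDISK_IMG" bs=1M count=10 2>/dev/null  # 10 MB backing file
# then, as root on the node:
#   losetup /dev/loop1 "$QDISK_IMG"    # expose the file as a block device
#   mkqdisk -c /dev/loop1 -l bcqdisk   # write the quorum-disk label
```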
And I have lots more logs to share if you'd like to dig into this in another thread.