I am running a PVE 1.8 cluster of three hosts; only Linux KVM guests are deployed. The guests are running Debian 5 Lenny 64-bit with virtio network cards. The hosts run kernel 2.6.32-4-pve, the guests run kernel 2.6.32-bpo.5-amd64.
Three of the guests are configured as PHP5 application servers; they receive requests on :80 from a reverse proxy (which is also a KVM guest on the same cluster). The application is pretty bandwidth-heavy as we're running a file-transfer solution; sustained data rates of 30-60 Mbit/s for long parts of the day are normal.
Everything works great, but about once a week, one of the appservers becomes unresponsive (a different one each time):
- PING works
- Telnet to port 80 works (connection is established), but there is no response
- SSH login takes forever, but finally login is established
- console login takes very very very long
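The symptoms above (PING answers, port 80 accepts the connection, but nothing comes back) could be checked automatically from another machine. The following is only a minimal sketch: the function name `probe_http`, the `HEAD /` request and the timeout are assumptions for illustration, not part of the actual setup.

```python
import socket

def probe_http(host, port=80, timeout=5.0):
    """Classify a host as 'no_connect', 'connect_no_response' or 'responding'."""
    try:
        # Same check as "telnet host 80": can a TCP connection be established?
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "no_connect"
    try:
        sock.settimeout(timeout)
        request = "HEAD / HTTP/1.0\r\nHost: {}\r\n\r\n".format(host)
        sock.sendall(request.encode("ascii"))
        data = sock.recv(1024)
        # Empty data means the peer closed the connection without answering.
        return "responding" if data else "connect_no_response"
    except OSError:
        # Covers socket.timeout as well: connected, but no reply in time.
        return "connect_no_response"
    finally:
        sock.close()
```

A result of `connect_no_response` would match the hang described here, so a cron job calling this against each appserver could at least record when the hang starts.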
Once logged in, the shell is pretty responsive. No IO wait, load is low. It is possible to ping out, resolve DNS, etc. I notice that nothing new is written into /var/log/syslog from the moment the system "hangs".
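Since syslog stops receiving entries at the moment of the hang, one thing that could be tested next time is whether the local syslog socket still accepts messages. This is a hedged sketch only: it assumes `/dev/log` is a Unix datagram socket (stream-based setups would be reported as `unavailable`), and the function name and probe message are made up for illustration. It does not prove any link between syslog and the slow logins.

```python
import socket

def syslog_socket_status(path="/dev/log", timeout=2.0):
    """Return 'ok', 'blocked' or 'unavailable' for a datagram syslog socket."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        # <14> = facility user (1) * 8 + severity info (6)
        sock.send(b"<14>hangcheck: syslog socket probe")
        return "ok"
    except socket.timeout:
        # The send did not complete in time: nobody is draining the socket.
        return "blocked"
    except OSError:
        # Socket missing, refused, or of a different type.
        return "unavailable"
    finally:
        sock.close()

if __name__ == "__main__":
    print(syslog_socket_status())
```

Run from a shell on the hung guest, `blocked` would suggest the syslog daemon is no longer reading its socket, while `ok` would point elsewhere.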
dmesg shows a couple of segfaults:
Code:
[457508.135737] php5[16098] general protection ip:7fd417ba019f sp:7fff8e4c40f0 error:0 in memcache.so[7fd417b99000+14000]
(The appserver probably tries to connect to remote memcache instances, but dies in the process.)
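The dmesg line above contains enough information to work out where inside memcache.so the fault happened: the instruction pointer (`ip:`) minus the module's load address (the first number in the brackets) gives the offset into the library. A small sketch of that arithmetic follows; the regular expression and the helper name `fault_offset` are illustrative assumptions, not a general-purpose parser.

```python
import re

LINE = ("[457508.135737] php5[16098] general protection ip:7fd417ba019f "
        "sp:7fff8e4c40f0 error:0 in memcache.so[7fd417b99000+14000]")

PATTERN = re.compile(
    r"ip:(?P<ip>[0-9a-f]+) .* in (?P<lib>[^\s\[]+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def fault_offset(line):
    """Return (library name, offset of the faulting instruction in it)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised fault line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    offset = ip - base
    # Sanity check: the instruction pointer must lie inside the mapping.
    if not 0 <= offset < size:
        raise ValueError("instruction pointer outside the mapped library")
    return m.group("lib"), offset

lib, offset = fault_offset(LINE)
print("{} +{:#x}".format(lib, offset))  # memcache.so +0x719f
```

If the installed memcache.so has debug symbols, that offset could then be passed to a tool such as `addr2line -e` with the path of the module to find the function involved.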
After a reboot, all is good again (sigh...).
At the moment, it feels to me like a network driver issue, not a user-land issue. I have replaced virtio with e1000 for now and will observe for a week or so, but that's just fishing in the dark... It kills me that I don't really see what's going on/wrong! Also, I cannot (yet) reliably reproduce the issue.
Does anyone have similar experiences? Any suggestions where to look next time?