Proxmox 1.8 cluster .. one side not working properly

lonegroover (Guest)
Hi.

First - I appreciate that v1.8 is outdated now and I do plan to upgrade when I can - but I have a more pressing problem at the moment.

I moved our cluster - consisting essentially of two physical servers, two Cisco NAS units where the (KVM) VM images live and two switches, to a new data centre where they now have new IP addresses. I reconfigured basic networking on the two servers, updated the IP addresses in /etc/pve/cluster.cfg and rebooted the boxes, master node first.

The storage is set up as /dev/drbdvg0 and /dev/drbdvg1. I didn't install this myself and I'm not that familiar with DRBD or indeed iSCSI.

Everything looked fine until I attempted to start a VM on the second (slave) node. It took ages to start, hanging for thirty seconds at a time, and was clearly struggling to communicate with the NAS.

Furthermore, any attempt to view the 'hardware' tab of a VM config on the second node results in an Embperl error page that reads, after the usual Apache "Internal Server Error" message,

Code:
  [3016]ERR:  24:  Error in Perl code: 500 read timeout

All of the images, including those set up on the second node, will run fine on the first (and that's what I'm doing for now).

On the first box, /proc/drbd looks like this:

Code:
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757 
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:0 dw:27568823 dr:156762105 al:309656 bm:309639 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:10184632
 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:0 dw:2451648 dr:14918745 al:1244 bm:1211 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:1152564

.. and very similar on the second:

Code:
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757 
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
    ns:0 nr:0 dw:0 dr:1705944 al:0 bm:107 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:954596
 1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
    ns:0 nr:0 dw:0 dr:1821288 al:0 bm:107 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:520192

So it looks like at some level they aren't talking to each other - I don't see the usual "UpToDate/UpToDate".
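
For what it's worth, the same states can be queried directly with drbdadm (assuming the usual cstate/dstate subcommands, which I believe just report what /proc/drbd shows):

Code:
# connection state of each resource (Connected / WFConnection / StandAlone)
drbdadm cstate all
# disk state (should read UpToDate/UpToDate once they resync)
drbdadm dstate all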

I'm also seeing lots of messages like this on the second node:

Code:
connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4329026692, last ping 4329027942, now 4329029192
 connection1:0: detected conn error (1011)
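
(I assume those messages come from the open-iscsi initiator - if so, the sessions to the NAS should be listable with iscsiadm:)

Code:
# list active iSCSI sessions to the NAS units
iscsiadm -m session
# more detail per session
iscsiadm -m session -P 1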

Can anyone suggest what might have gone wrong here? A cabling issue, maybe? Or how to fix it? I'm particularly anxious to avoid losing the updates to the images as seen by the first node if they do manage to sync up - I don't want to lose or corrupt the VM images!

Very grateful for any advice.
 
Hi,
Have you also updated /etc/hosts with the new IPs?
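
For example (just a sketch - node names and addresses here are placeholders, use your real ones):

Code:
# /etc/hosts on both nodes - the cluster hostnames must resolve to the NEW addresses
192.0.2.11   proxmox1.example.com proxmox1
192.0.2.12   proxmox2.example.com proxmox2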
And the right ones for DRBD? Look in /etc/drbd.d/r0.res and /etc/drbd.d/r1.res (if the config is split into resource files).
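
The peer addresses are hard-coded in the resource definition, so they must be updated there as well if that network changed. Roughly like this (only a sketch - resource name, node names, disks and ports depend on your setup):

Code:
resource r0 {
  on proxmox1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.0.2.11:7788;   # replace if this interface got a new IP
    meta-disk internal;
  }
  on proxmox2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.0.2.12:7788;
    meta-disk internal;
  }
}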
Can you ping the other DRBD node?

You have two DRBD resources - one for each server? In that case, once the network is configured correctly again, you can resync without data loss - roughly along the lines of the sketch below.
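
A possible sequence, assuming the resources are named r0/r1, the first node holds the data you want to keep (the VMs have only run there since the move) and nothing on the second node is still using the volumes - double-check all of that before discarding anything:

Code:
# on the SECOND node (the one showing StandAlone) - give up its side of the changes
drbdadm secondary r0     # if this fails because LVM holds the device: vgchange -an drbdvg0
drbdadm -- --discard-my-data connect r0

# on the FIRST node - it sits in WFConnection and should reconnect on its own; if not:
drbdadm connect r0

# repeat for r1, then watch the resync progress
watch cat /proc/drbd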

Udo
 
