OK, I have another question, but I won't be able to work on this until later.
on the wiki:
auto vmbr1
iface vmbr1 inet static
        address 10.10.11.1
        netmask 255.255.255.0
        bridge_ports none
        bridge_stp off
        bridge_fd 0
        post-up echo 1 >...
The PVE host is 192.168.1.10.
The PVE host has 2 NICs, only one in use now.
The gateway is 192.168.1.2.
Is it possible for CTs to use the 10.100.0.0 network and access the WAN?
If so, how do I set that up?
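My guess, going by the wiki snippet above, would be to give vmbr1 an address on that network and masquerade it out vmbr0, something like this in /etc/network/interfaces (10.100.0.1 and the /24 netmask are just my assumptions, not tested):

auto vmbr1
iface vmbr1 inet static
        address 10.100.0.1
        netmask 255.255.255.0
        bridge_ports none
        bridge_stp off
        bridge_fd 0
        post-up echo 1 > /proc/sys/net/ipv4/ip_forward
        post-up iptables -t nat -A POSTROUTING -s 10.100.0.0/24 -o vmbr0 -j MASQUERADE
        post-down iptables -t nat -D POSTROUTING -s 10.100.0.0/24 -o vmbr0 -j MASQUERADE

The CTs would then get addresses in 10.100.0.0/24 with 10.100.0.1 as their gateway. Is that roughly right?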
I'll put my rsnapshot setup on the wiki later if you want.
What I back up is:
/etc
/etc/pve
dumps
The backups go to a partition of 500GB or more on each node.
this is what the backups look like:
ls /bkup/rsnapshot-pve
daily.0 daily.1 daily.2 daily.3 daily.4 daily.5 daily.6 weekly.0...
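The relevant part of my /etc/rsnapshot.conf looks roughly like this (from memory, so treat it as a sketch; rsnapshot requires tabs between the fields, and I'm assuming here that the vzdump files live under /var/lib/vz/dump):

snapshot_root   /bkup/rsnapshot-pve/
retain  daily   7
retain  weekly  4
retain  monthly 3
backup  /etc/   localhost/
backup  /etc/pve/       localhost/
backup  /var/lib/vz/dump/       localhost/

Cron then runs rsnapshot daily, rsnapshot weekly and rsnapshot monthly at staggered times.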
To add more info:
On node fbc1 I set up a workstation and did the following; it could be that something installed there interfered with the cluster:
history|grep install
415 Tuesday 2011-12-20 [09:00:07 -0500] aptitude install gnome-core gdm3 libcurl3 xdg-utils
463 Tuesday 2011-12-20...
The thing I still need to figure out is how to recover a node that has been kicked off the cluster due to an extended network outage.
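When it happens again, what I plan to try on the dropped node, roughly (I have not verified this is the right order), is to check membership and quorum and then restart the cluster stack:

pvecm status        # quorum and membership as this node sees it
pvecm nodes         # which nodes this node thinks are in the cluster
/etc/init.d/cman restart
/etc/init.d/pve-cluster restart
/etc/init.d/pvedaemon restart
/etc/init.d/pvestatd restart

If someone knows a better sequence, please correct me.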
I'll set up another cluster after the holidays.
rsnapshot is a Debian package that uses rsync to back up data; see aptitude show rsnapshot.
Using it I keep daily/weekly/monthly backups on any server with extra space.
I know it is hard to debug this, but we are in different time zones, so I post issues as they come up and then try to fix them...
So a disconnected network cable, or a network flooded during a backup, can break the cluster.
The question now is: can it be fixed?
rsync can use --bwlimit=KBPS, which I use for off-site backups.
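For the cluster backups the same limit could probably be passed through rsnapshot; if I read the man page right it goes in rsync_long_args in rsnapshot.conf, e.g. (the 5000 KB/s figure is just an example, the other arguments are the rsnapshot defaults):

rsync_long_args --delete --numeric-ids --relative --delete-excluded --bwlimit=5000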
For the cluster network, it seems that, as with drbd/heartbeat, dedicated network hardware should...
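Since each host has a second, currently unused NIC, my rough idea is to give it its own subnet on every node and move the cluster traffic there, something like this (addresses are made up, and I still have to look up how to point corosync at that interface):

auto eth1
iface eth1 inet static
        address 10.10.12.207
        netmask 255.255.255.0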
The issue started when rsnapshot was rsyncing a lot of data over the network.
Dec 21 04:07:10 fbc207 rrdcached[1485]: flushing old values
Dec 21 04:07:10 fbc207 rrdcached[1485]: rotating journals
Dec 21 04:07:10 fbc207 rrdcached[1485]: started new journal...
I had already rebooted a few times.
I did not set up or install clvmd or rgmanager.
This morning another node, fbc207, has this:
df: `/etc/pve': Transport endpoint is not connected
and the last part of syslog:
Dec 21 05:37:24 fbc207 pvestatd[470390]: WARNING: ipcc_send_rec failed...
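If that just means the pmxcfs fuse mount died, my guess (untested) is that clearing the stale mount and restarting the services would bring /etc/pve back:

fusermount -u /etc/pve        # or umount -l /etc/pve if that refuses
/etc/init.d/pve-cluster restart
/etc/init.d/pvedaemon restart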
here is info from fbc207's syslog:
Dec 20 09:44:16 fbc207 corosync[2033]: [TOTEM ] A processor failed, forming new configuration.
Dec 20 09:44:18 fbc207 corosync[2033]: [CLM ] CLM CONFIGURATION CHANGE
Dec 20 09:44:18 fbc207 corosync[2033]: [CLM ] New Configuration:
Dec 20 09:44:18...
Now this is getting written to /var/log/syslog nonstop:
Dec 20 21:01:02 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:02 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec failed: Connection refused
Dec 20 21:01:02 fbc208 pvestatd[2033]: WARNING: ipcc_send_rec...
it looks like the link was down from
Dec 20 11:27:53 fbc208 kernel: e1000e: eth0 NIC Link is Down
Dec 20 11:27:53 fbc208 kernel: vmbr0: port 1(eth0) entering disabled state
to
Dec 20 12:24:37 fbc208 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 20 12:24:37...
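I pulled those lines out of fbc208's syslog with something like:

grep -E 'NIC Link|entering disabled state' /var/log/syslog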
OK, this is from earlier, before we moved the node to the rack.
The network was down for a short time; here is what seems to be the first unusual activity in syslog:
Dec 20 12:24:37 fbc208 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 20 12:24:37 fbc208 kernel: vmbr0...
Node cannot rejoin cluster after network outage
See my replies, as the first post was not the first issue; I looked further back in the logs, so I deleted a lot from this post.