I am pasting an excerpt of the log that shows the "error". I marked in red the part where the interfaces go down and come back up (the vmbr0, eth1 and eth0 lines at 10:20:37 and 10:20:41). As you can see, this happens in the middle of DRBD operations:
Jan 24 10:20:28 xxxxxx kernel: block drbd0: updated UUIDs FD2A5B0876C42B79:0000000000000000:C74C3384CF2F565B:C74B3384CF2F565B
Jan 24 10:20:28 xxxxxx kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Jan 24 10:20:28 xxxxxx kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
Jan 24 10:20:28 xxxxxx kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
Jan 24 10:20:29 xxxxxx kernel: block drbd1: bitmap WRITE of 7451 pages took 1297 jiffies
Jan 24 10:20:29 xxxxxx kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 24 10:20:29 xxxxxx kernel: block drbd0: bitmap WRITE of 7439 pages took 477 jiffies
Jan 24 10:20:29 xxxxxx kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] CLM CONFIGURATION CHANGE
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] New Configuration:
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] Members Left:
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] Members Joined:
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] CLM CONFIGURATION CHANGE
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] New Configuration:
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] Members Left:
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] Members Joined:
Jan 24 10:20:30 xxxxxx corosync[2339]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 24 10:20:30 xxxxxx corosync[2339]: [CPG ] chosen downlist: sender r(0) ip(192.168.0.3) ; members(old:1 left:0)
Jan 24 10:20:30 xxxxxx corosync[2339]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 24 10:20:31 xxxxxx fenced[2500]: fenced 1352871249 started
Jan 24 10:20:31 xxxxxx dlm_controld[2513]: dlm_controld 1352871249 started
Jan 24 10:20:32 xxxxxx kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Jan 24 10:20:32 xxxxxx kernel: kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround
Jan 24 10:20:32 xxxxxx kernel: tun: Universal TUN/TAP device driver, 1.6
Jan 24 10:20:32 xxxxxx kernel: tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
Jan 24 10:20:32 xxxxxx kernel: ip6_tables: (C) 2000-2006 Netfilter Core Team
Jan 24 10:20:32 xxxxxx kernel: Enabling conntracks and NAT for ve0
Jan 24 10:20:32 xxxxxx kernel: nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Jan 24 10:20:32 xxxxxx kernel: RPC: Registered named UNIX socket transport module.
Jan 24 10:20:32 xxxxxx kernel: RPC: Registered udp transport module.
Jan 24 10:20:32 xxxxxx kernel: RPC: Registered tcp transport module.
Jan 24 10:20:32 xxxxxx kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Jan 24 10:20:32 xxxxxx kernel: FS-Cache: Loaded
Jan 24 10:20:32 xxxxxx kernel: Registering the id_resolver key type
Jan 24 10:20:32 xxxxxx kernel: FS-Cache: Netfs 'nfs' registered for caching
Jan 24 10:20:33 xxxxxx kernel: ploop_dev: module loaded
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] CLM CONFIGURATION CHANGE
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] New Configuration:
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] Members Left:
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] Members Joined:
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] CLM CONFIGURATION CHANGE
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] New Configuration:
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] Members Left:
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] Members Joined:
Jan 24 10:20:34 xxxxxx corosync[2339]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 24 10:20:34 xxxxxx corosync[2339]: [CPG ] chosen downlist: sender r(0) ip(192.168.0.3) ; members(old:1 left:0)
Jan 24 10:20:34 xxxxxx corosync[2339]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 24 10:20:37 xxxxxx pvesh: <root@pam> starting task UPID:xxxxxx:00000AF6:00000CE7:5100FCE5:startall::root@pam:
Jan 24 10:20:37 xxxxxx pvesh: <root@pam> end task UPID:xxxxxx:00000AF6:00000CE7:5100FCE5:startall::root@pam: OK
Jan 24 10:20:37 xxxxxx kernel: vmbr0: port 1(eth2) entering disabled state
Jan 24 10:20:37 xxxxxx kernel: tg3 0000:03:04.1: eth1: Link is down
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] CLM CONFIGURATION CHANGE
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] New Configuration:
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] Members Left:
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] Members Joined:
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] CLM CONFIGURATION CHANGE
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] New Configuration:
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] Members Left:
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] Members Joined:
Jan 24 10:20:37 xxxxxx corosync[2339]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 24 10:20:37 xxxxxx corosync[2339]: [CPG ] chosen downlist: sender r(0) ip(192.168.0.3) ; members(old:1 left:0)
Jan 24 10:20:37 xxxxxx corosync[2339]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 24 10:20:37 xxxxxx kernel: tg3 0000:03:04.0: eth0: Link is down
Jan 24 10:20:40 xxxxxx kernel: block drbd1: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jan 24 10:20:40 xxxxxx kernel: block drbd1: new current UUID 59AC650D269C2BEB:8A32918CA1F5AF2F:4230DD55C83291C3:422FDD55C83291C3
Jan 24 10:20:40 xxxxxx kernel: block drbd1: asender terminated
Jan 24 10:20:40 xxxxxx kernel: block drbd1: Terminating asender thread
Jan 24 10:20:40 xxxxxx kernel: block drbd1: Connection closed
Jan 24 10:20:40 xxxxxx kernel: block drbd1: conn( NetworkFailure -> Unconnected )
Jan 24 10:20:40 xxxxxx kernel: block drbd1: receiver terminated
Jan 24 10:20:40 xxxxxx kernel: block drbd1: Restarting receiver thread
Jan 24 10:20:40 xxxxxx kernel: block drbd1: receiver (re)started
Jan 24 10:20:40 xxxxxx kernel: block drbd1: conn( Unconnected -> WFConnection )
Jan 24 10:20:41 xxxxxx kernel: tg3 0000:03:04.0: eth0: Link is up at 1000 Mbps,
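To make the sequence easier to follow in a long log, the link changes and DRBD connection-state changes can be filtered out with a few lines of Python. This is only a sketch: the regular expressions are modelled on the lines in the excerpt above and may need adjusting for other drivers or DRBD versions, and the function name `network_events` is made up for illustration.

```python
import re

# Patterns modelled on the excerpt above (assumption: other NIC drivers or
# DRBD versions may word these messages differently).
LINK_RE = re.compile(r"(eth\d+): Link is (up|down)")
DRBD_CONN_RE = re.compile(r"block (drbd\d+): .*conn\( (\w+) -> (\w+) \)")


def network_events(lines):
    """Return (timestamp, device, event) tuples for link and DRBD conn changes."""
    events = []
    for line in lines:
        timestamp = line[:15]  # syslog prefix, e.g. "Jan 24 10:20:37"
        match = LINK_RE.search(line)
        if match:
            events.append((timestamp, match.group(1), "link " + match.group(2)))
            continue
        match = DRBD_CONN_RE.search(line)
        if match:
            change = match.group(2) + " -> " + match.group(3)
            events.append((timestamp, match.group(1), change))
    return events


# Three lines copied from the excerpt; the corosync line is ignored.
sample = [
    "Jan 24 10:20:37 xxxxxx kernel: tg3 0000:03:04.0: eth0: Link is down",
    "Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] Members Left:",
    "Jan 24 10:20:40 xxxxxx kernel: block drbd1: conn( NetworkFailure -> Unconnected )",
]

for event in network_events(sample):
    print(event)
```

Run against the full syslog (for example by passing an open file object as `lines`), this lists the link-down events right next to the DRBD transitions that follow them.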
Thanks,
Jose.
Hi,
After rebooting one of my machines (in a two-node cluster), the DRBD resync seems to complete correctly. However, after a network loss of a couple of seconds (noticeable through ping), one of the DRBD partitions, which had already appeared as UpToDate, loses sync and I get a split brain. According to the logs, the network interfaces come up and DRBD starts working. Then, in the middle of the log, the network interfaces are dropped and brought up again. This causes the split brain, because both machines were running as Primary during the network loss.
So I was wondering whether any of the PVE services (or corosync, or something else) is to blame for this temporary network loss, and whether I should hold off starting DRBD until everything network-related is up and stable.
The init scripts in my runlevel directory are ordered like this:
README S17vzeventd S18postfix S19cman S20rgmanager S25nxserver
S14portmap S18acpid S18proftpd S19cpufrequtils S20saned S25pvebanner
S15nfs-common S18anacron S18pve-cluster S19cron S21pvedaemon S25rc.local
S17binfmt-support S18atd S18rsync S19drbd S21qemu-server S25rmnologin
S17fancontrol S18dbus S18snmpd S19gdm3 S21vz S25stop-bootlogd
S17ksmtuned S18exim4 S18ssh S20bootlogs S22apache2
S17rrdcached S18kerneloops S18sysstat S20clvm S23pvestatd
S17rsyslog S18loadcpufreq S19avahi-daemon S20cups S24pve-manager
S17sudo S18ntp S19bluetooth S20netperf S25nxsensor
Would you recommend moving drbd to S26 (and stop-bootlogd to S27, so that boot logging still covers it)? Could this do any harm? Or perhaps a less elegant approach, such as adding a delay to the drbd init script?
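As an illustration of the "delay" idea, a fixed sleep could be replaced by a wait that only returns once the peer has answered several pings in a row. The following is a minimal sketch under stated assumptions, not a tested fix: the peer address `192.168.0.2` is a placeholder (only 192.168.0.3 appears in the log), the function names are made up, and the thresholds are arbitrary. It also does not explain why the interfaces bounce in the first place.

```python
import subprocess
import sys
import time


def peer_reachable(peer_ip):
    """Send one ICMP echo with a 1-second timeout; True if the peer answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", peer_ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_stable(probe, needed=10, interval=1.0, max_tries=120,
                      sleep=time.sleep):
    """Return True once probe() has succeeded `needed` times in a row.

    Any failed probe resets the streak, so a link that bounces (as in the
    log above) does not count as stable. Gives up after `max_tries` probes.
    """
    streak = 0
    for _ in range(max_tries):
        if probe():
            streak += 1
            if streak >= needed:
                return True
        else:
            streak = 0
        sleep(interval)
    return False


if __name__ == "__main__":
    PEER_IP = "192.168.0.2"  # hypothetical peer address, adjust to your setup
    ok = wait_until_stable(lambda: peer_reachable(PEER_IP))
    sys.exit(0 if ok else 1)
```

A script like this could be called at the top of the drbd init script's start action, with DRBD started only if it exits with 0. Whether that is preferable to reordering the S numbers depends on what is causing the link to drop.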
Thanks in advance.
Jose