DRBD, network and runlevel

kankamuso

Active Member
Oct 19, 2011
Hi,

Upon rebooting one of my machines (a two-node cluster), everything seems to be right with the DRBD re-sync. Nevertheless, after a noticeable (through ping) network loss of a couple of seconds, one of the DRBD partitions (which already appeared as UpToDate) loses sync and I get a split brain. Following the logs, it seems that the network interfaces come up and DRBD starts working. Then, in the middle of the log, the network interfaces are dropped and brought up again. This causes a split brain, because both machines have been running as Primary during the network loss.

So I was wondering whether any of the PVE services (or corosync, or something else) is to blame for this temporary network loss, and whether I should delay starting DRBD until everything related to networking is up and stable.

The contents of my runlevel directory (rc2.d) are these:

README S17vzeventd S18postfix S19cman S20rgmanager S25nxserver
S14portmap S18acpid S18proftpd S19cpufrequtils S20saned S25pvebanner
S15nfs-common S18anacron S18pve-cluster S19cron S21pvedaemon S25rc.local
S17binfmt-support S18atd S18rsync S19drbd S21qemu-server S25rmnologin
S17fancontrol S18dbus S18snmpd S19gdm3 S21vz S25stop-bootlogd
S17ksmtuned S18exim4 S18ssh S20bootlogs S22apache2
S17rrdcached S18kerneloops S18sysstat S20clvm S23pvestatd
S17rsyslog S18loadcpufreq S19avahi-daemon S20cups S24pve-manager
S17sudo S18ntp S19bluetooth S20netperf S25nxsensor

Would you recommend moving drbd to S26 (and stop-bootlogd to S27, to keep the logs)? Could this do any harm? Or perhaps a less elegant approach, like adding some delay to the drbd init script?
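In case it helps to be concrete, this is roughly what I have in mind (sysvinit-era update-rc.d syntax; untested, and the S26/S27 numbers are just the ones from my question above):

```shell
# Hypothetical reordering sketch -- remove the existing rc?.d links
# for drbd and re-create them later in the start sequence:
update-rc.d -f drbd remove
update-rc.d drbd start 26 2 3 4 5 . stop 08 0 1 6 .

# Keep stop-bootlogd after drbd so the boot log still covers it:
update-rc.d -f stop-bootlogd remove
update-rc.d stop-bootlogd start 27 2 3 4 5 .
```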

Thanks in advance.

Jose
 
I am pasting an excerpt of the log with the "error". The relevant part, where the interfaces are downed and restarted, is the block of vmbr0/eth0/eth1 lines starting at 10:20:37. As you can see, this happens in the middle of DRBD operations:


Jan 24 10:20:28 xxxxxx kernel: block drbd0: updated UUIDs FD2A5B0876C42B79:0000000000000000:C74C3384CF2F565B:C74B3384CF2F565B
Jan 24 10:20:28 xxxxxx kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Jan 24 10:20:28 xxxxxx kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
Jan 24 10:20:28 xxxxxx kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
Jan 24 10:20:29 xxxxxx kernel: block drbd1: bitmap WRITE of 7451 pages took 1297 jiffies
Jan 24 10:20:29 xxxxxx kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 24 10:20:29 xxxxxx kernel: block drbd0: bitmap WRITE of 7439 pages took 477 jiffies
Jan 24 10:20:29 xxxxxx kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] CLM CONFIGURATION CHANGE
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] New Configuration:
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] Members Left:
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] Members Joined:
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] CLM CONFIGURATION CHANGE
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] New Configuration:
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] Members Left:
Jan 24 10:20:30 xxxxxx corosync[2339]: [CLM ] Members Joined:
Jan 24 10:20:30 xxxxxx corosync[2339]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 24 10:20:30 xxxxxx corosync[2339]: [CPG ] chosen downlist: sender r(0) ip(192.168.0.3) ; members(old:1 left:0)
Jan 24 10:20:30 xxxxxx corosync[2339]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 24 10:20:31 xxxxxx fenced[2500]: fenced 1352871249 started
Jan 24 10:20:31 xxxxxx dlm_controld[2513]: dlm_controld 1352871249 started
Jan 24 10:20:32 xxxxxx kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Jan 24 10:20:32 xxxxxx kernel: kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround
Jan 24 10:20:32 xxxxxx kernel: tun: Universal TUN/TAP device driver, 1.6
Jan 24 10:20:32 xxxxxx kernel: tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
Jan 24 10:20:32 xxxxxx kernel: ip6_tables: (C) 2000-2006 Netfilter Core Team
Jan 24 10:20:32 xxxxxx kernel: Enabling conntracks and NAT for ve0
Jan 24 10:20:32 xxxxxx kernel: nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Jan 24 10:20:32 xxxxxx kernel: RPC: Registered named UNIX socket transport module.
Jan 24 10:20:32 xxxxxx kernel: RPC: Registered udp transport module.
Jan 24 10:20:32 xxxxxx kernel: RPC: Registered tcp transport module.
Jan 24 10:20:32 xxxxxx kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Jan 24 10:20:32 xxxxxx kernel: FS-Cache: Loaded
Jan 24 10:20:32 xxxxxx kernel: Registering the id_resolver key type
Jan 24 10:20:32 xxxxxx kernel: FS-Cache: Netfs 'nfs' registered for caching
Jan 24 10:20:33 xxxxxx kernel: ploop_dev: module loaded
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] CLM CONFIGURATION CHANGE
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] New Configuration:
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] Members Left:
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] Members Joined:
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] CLM CONFIGURATION CHANGE
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] New Configuration:
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] Members Left:
Jan 24 10:20:34 xxxxxx corosync[2339]: [CLM ] Members Joined:
Jan 24 10:20:34 xxxxxx corosync[2339]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 24 10:20:34 xxxxxx corosync[2339]: [CPG ] chosen downlist: sender r(0) ip(192.168.0.3) ; members(old:1 left:0)
Jan 24 10:20:34 xxxxxx corosync[2339]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 24 10:20:37 xxxxxx pvesh: <root@pam> starting task UPID:xxxxxx:00000AF6:00000CE7:5100FCE5:startall::root@pam:
Jan 24 10:20:37 xxxxxx pvesh: <root@pam> end task UPID:xxxxxx:00000AF6:00000CE7:5100FCE5:startall::root@pam: OK
Jan 24 10:20:37 xxxxxx kernel: vmbr0: port 1(eth2) entering disabled state
Jan 24 10:20:37 xxxxxx kernel: tg3 0000:03:04.1: eth1: Link is down
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] CLM CONFIGURATION CHANGE
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] New Configuration:
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] Members Left:
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] Members Joined:
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] CLM CONFIGURATION CHANGE
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] New Configuration:
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] #011r(0) ip(192.168.0.3)
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] Members Left:
Jan 24 10:20:37 xxxxxx corosync[2339]: [CLM ] Members Joined:
Jan 24 10:20:37 xxxxxx corosync[2339]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 24 10:20:37 xxxxxx corosync[2339]: [CPG ] chosen downlist: sender r(0) ip(192.168.0.3) ; members(old:1 left:0)
Jan 24 10:20:37 xxxxxx corosync[2339]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 24 10:20:37 xxxxxx kernel: tg3 0000:03:04.0: eth0: Link is down
Jan 24 10:20:40 xxxxxx kernel: block drbd1: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jan 24 10:20:40 xxxxxx kernel: block drbd1: new current UUID 59AC650D269C2BEB:8A32918CA1F5AF2F:4230DD55C83291C3:422FDD55C83291C3
Jan 24 10:20:40 xxxxxx kernel: block drbd1: asender terminated
Jan 24 10:20:40 xxxxxx kernel: block drbd1: Terminating asender thread
Jan 24 10:20:40 xxxxxx kernel: block drbd1: Connection closed
Jan 24 10:20:40 xxxxxx kernel: block drbd1: conn( NetworkFailure -> Unconnected )
Jan 24 10:20:40 xxxxxx kernel: block drbd1: receiver terminated
Jan 24 10:20:40 xxxxxx kernel: block drbd1: Restarting receiver thread
Jan 24 10:20:40 xxxxxx kernel: block drbd1: receiver (re)started
Jan 24 10:20:40 xxxxxx kernel: block drbd1: conn( Unconnected -> WFConnection )
Jan 24 10:20:41 xxxxxx kernel: tg3 0000:03:04.0: eth0: Link is up at 1000 Mbps,
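For completeness, the standard manual way out of such a split brain (assuming the resource is named r0, and that the changes on one node can be safely discarded) is:

```shell
# On the node whose changes will be thrown away (the split-brain "victim"):
drbdadm disconnect r0
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# On the surviving node (only needed if it dropped to StandAlone):
drbdadm connect r0
```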


Thanks,

Jose.

 
This may not solve your issue, but I noticed that when a DRBD PVE system reboots, there is a warning that drbd cannot be stopped because it is in use.

Do not change the start number in rc2.d, but the kill order in /etc/rc6.d needs to be changed. So we did this:

Edit /etc/init.d/drbd and change the following line. Note there can only be one space before pve-cluster:
Code:
# X-Stop-After:   heartbeat corosync pve-cluster
then run this to regenerate rc*.d entries:
Code:
update-rc.d drbd defaults
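You can then check that the kill links were actually renumbered, for example with:

```shell
# If the X-Stop-After header took effect, drbd's K-link in rc0.d/rc6.d
# should now be ordered after pve-cluster's (higher K number = stops later):
ls -l /etc/rc0.d/ /etc/rc6.d/ | grep -E 'drbd|pve-cluster'
```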
 

Thanks a lot for your reply. Nevertheless, I think the problem I describe is quite important for many people, as DRBD is widely used. I hope someone will be able to clarify...

Regards,

Jose.
 
DRBD should start early. Of course the network connection DRBD uses needs to come up first, but DRBD needs to start before most other services, in case those services need the DRBD storage.

I mentioned the rc6.d/*drbd kill link because it is related to init.d, but I can see now that it does not solve your issue.

Can you post your /etc/drbd.d/r0.res and /etc/network/interfaces?
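For comparison, a typical r0.res looks something like this (the hostnames, disks and 10.0.0.x addresses here are just placeholders; the after-sb policies are what govern automatic split-brain handling):

```
resource r0 {
        protocol C;
        net {
                # optional automatic split-brain recovery policies
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
        on nodeA {
                device    /dev/drbd0;
                disk      /dev/sdb1;
                address   10.0.0.1:7788;
                meta-disk internal;
        }
        on nodeB {
                device    /dev/drbd0;
                disk      /dev/sdb1;
                address   10.0.0.2:7788;
                meta-disk internal;
        }
}
```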
 

Just by chance... what are the contents of your /etc/rc.local? I think my problems came from calls to ethtool that someone had included there...
 

On the two DRBD systems, we just have:
Code:
dmesg | mail -s "system241 local system startup "  root

In your first or second post there is an error about eth1 going down. Check whether drbd uses eth1; that is why I wanted to see the two files mentioned in my previous post.
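A quick way to check which interface DRBD replicates over (the resource file path is the usual Debian location; adjust if yours differs):

```shell
# Which address does each DRBD resource bind to?
grep -h address /etc/drbd.d/*.res

# Which interface carries that address?
ip -o addr show
```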
 
rc.local has no commands in a fresh PVE or Debian install.

So: comment out everything except the 'exit 0' at the end.

Also, it may not be the best thing in the world to have more than one person making configuration changes to a production DRBD / PVE system.
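A cleaned-up rc.local would then be nothing more than this (the stock Debian file minus any added commands):

```shell
#!/bin/sh -e
#
# rc.local -- kept empty so nothing (e.g. ethtool) touches the NICs
# at the end of boot; it must still exit 0 on success.
exit 0
```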
 
